Update README with actual dashboard interface and implementation details
parent
a08670071c
commit
0417e2c1f1
763
README.md
@ -1,544 +1,405 @@
|
||||
# CM Dashboard - Infrastructure Monitoring TUI
|
||||
# CM Dashboard
|
||||
|
||||
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for specific monitoring needs and API integrations. Features real-time monitoring of all infrastructure components with intelligent email notifications and automatic status calculation.
|
||||
A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.
|
||||
|
||||
## Current Implementation

This is a complete rewrite implementing an **individual metrics architecture** where:
- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart

## Dashboard Interface
|
||||
|
||||
### System Widget
|
||||
```
|
||||
┌System───────────────────────────────────────────────────────┐
|
||||
│ Memory usage │
|
||||
│✔ 3.0 / 7.8 GB │
|
||||
│ CPU load CPU temp │
|
||||
│✔ 1.05 • 0.96 • 0.58 64.0°C │
|
||||
│ C1E C3 C6 C8 C9 C10 │
|
||||
│✔ 0.5% 0.5% 10.4% 10.2% 0.4% 77.9% │
|
||||
│ GPU load GPU temp │
|
||||
│✔ — — │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
|
||||
┌system───────────────────────────────────────────┐┌services────────────────────────────────────────────────────┐
|
||||
│CPU: ││Service: Status: RAM: Disk: │
|
||||
│● Load: 0.10 0.52 0.88 • 400.0 MHz ││● docker active 27M 496MB │
|
||||
│RAM: ││● docker-registry active 19M 496MB │
|
||||
│● Used: 30% 2.3GB/7.6GB ││● gitea active 579M 2.6GB │
|
||||
│● tmp: 0.0% 0B/2.0GB ││● gitea-runner-default active 11M 2.6GB │
|
||||
│Disk nvme0n1: ││● haasp-core active 9M 1MB │
|
||||
│● Health: PASSED ││● haasp-mqtt active 3M 1MB │
|
||||
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid active 10M 1MB │
|
||||
│● Usage @boot: 5.9% • 0.1/1.0 GB ││● immich-server active 240M 45.1GB │
|
||||
│ ││● mosquitto active 1M 1MB │
|
||||
│ ││● mysql active 38M 225MB │
|
||||
│ ││● nginx active 28M 24MB │
|
||||
│ ││ ├─ ● gitea.cmtec.se 51ms │
|
||||
│ ││ ├─ ● haasp.cmtec.se 43ms │
|
||||
│ ││ ├─ ● haasp.net 43ms │
|
||||
│ ││ ├─ ● pages.cmtec.se 45ms │
|
||||
└─────────────────────────────────────────────────┘│ ├─ ● photos.cmtec.se 41ms │
|
||||
┌backup───────────────────────────────────────────┐│ ├─ ● unifi.cmtec.se 46ms │
|
||||
│Latest backup: ││ ├─ ● vault.cmtec.se 47ms │
|
||||
│● Status: OK ││ ├─ ● www.kryddorten.se 81ms │
|
||||
│Duration: 54s • Last: 4h ago ││ ├─ ● www.mariehall2.se 86ms │
|
||||
│Disk usage: 48.2GB/915.8GB ││● postgresql active 112M 357MB │
|
||||
│P/N: Samsung SSD 870 QVO 1TB ││● redis-immich active 8M 45.1GB │
|
||||
│S/N: S5RRNF0W800639Y ││● sshd active 2M 0 │
|
||||
│● gitea 2 archives 2.7GB ││● unifi active 594M 495MB │
|
||||
│● immich 2 archives 45.0GB ││● vaultwarden active 12M 1MB │
|
||||
│● kryddorten 2 archives 67.6MB ││ │
|
||||
│● mariehall2 2 archives 321.8MB ││ │
|
||||
│● nixosbox 2 archives 4.5MB ││ │
|
||||
│● unifi 2 archives 2.9MB ││ │
|
||||
│● vaultwarden 2 archives 305kB ││ │
|
||||
└─────────────────────────────────────────────────┘└────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Services Widget (Enhanced)
|
||||
```
|
||||
┌Services────────────────────────────────────────────────────┐
|
||||
│ Service Memory (GB) CPU Disk │
|
||||
│✔ Service Memory 7.1/23899.7 MiB — │
|
||||
│✔ Disk Usage — — 45/100 GB │
|
||||
│⚠ CPU Load — 2.18 — │
|
||||
│✔ CPU Temperature — 47.0°C — │
|
||||
│✔ docker-registry 0.0 GB 0.0% <1 MB │
|
||||
│✔ gitea 0.4/4.1 GB 0.2% 970 MB │
|
||||
│ 1 active connections │
|
||||
│✔ nginx 0.0/1.0 GB 0.0% <1 MB │
|
||||
│✔ ├─ docker.cmtec.se │
|
||||
│✔ ├─ git.cmtec.se │
|
||||
│✔ ├─ gitea.cmtec.se │
|
||||
│✔ ├─ haasp.cmtec.se │
|
||||
│✔ ├─ pages.cmtec.se │
|
||||
│✔ └─ www.kryddorten.se │
|
||||
│✔ postgresql 0.1 GB 0.0% 378 MB │
|
||||
│ 1 active connections │
|
||||
│✔ redis-immich 0.0 GB 0.4% <1 MB │
|
||||
│✔ sshd 0.0 GB 0.0% <1 MB │
|
||||
│ 1 SSH connection │
|
||||
│✔ unifi 0.9/2.0 GB 0.4% 391 MB │
|
||||
└────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
**Navigation**: `←→` switch hosts, `r` refresh, `q` quit
|
||||
|
||||
### Storage Widget
|
||||
```
|
||||
┌Storage──────────────────────────────────────────────────────┐
|
||||
│ Drive Temp Wear Spare Hours Capacity Usage │
|
||||
│✔ nvme0n1 57°C 4% 100% 11463 932G 23G (2%) │
|
||||
│ │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
## Features
|
||||
|
||||
### Backups Widget
|
||||
```
|
||||
┌Backups──────────────────────────────────────────────────────┐
|
||||
│ Backup Status Details │
|
||||
│✔ Latest 3h ago 1.4 GiB │
|
||||
│ 8 archives, 2.4 GiB total │
|
||||
│✔ Disk ok 2.4/468 GB (1%) │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Hosts Widget
|
||||
```
|
||||
┌Hosts────────────────────────────────────────────────────────┐
|
||||
│ Host Status Timestamp │
|
||||
│✔ cmbox ok 2025-10-13 05:45:28 │
|
||||
│✔ srv01 ok 2025-10-13 05:45:28 │
|
||||
│? labbox No data received — │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
**Navigation**: `←→` hosts, `r` refresh, `q` quit
|
||||
|
||||
## Key Features
|
||||
|
||||
### Real-time Monitoring
|
||||
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
|
||||
- **Performance-focused** with minimal resource usage
|
||||
- **Keyboard-driven interface** for power users
|
||||
- **ZMQ gossip network** for efficient data distribution
|
||||
|
||||
### Infrastructure Monitoring
|
||||
- **NVMe health monitoring** with wear prediction and temperature tracking
|
||||
- **CPU/Memory/GPU telemetry** with automatic thresholding
|
||||
- **Service resource monitoring** with per-service CPU and RAM usage
|
||||
- **Disk usage overview** for root filesystems
|
||||
- **Backup status** with detailed metrics and history
|
||||
- **C-state monitoring** for CPU power management analysis
|
||||
|
||||
### Intelligent Alerting
|
||||
- **Agent-calculated status** with predefined thresholds
|
||||
- **Email notifications** via SMTP with rate limiting
|
||||
- **Recovery notifications** with context about original issues
|
||||
- **Stockholm timezone** support for email timestamps
|
||||
- **Unified alert pipeline** summarizing host health
|
||||
- **Real-time monitoring** - Dashboard updates every 1-2 seconds
|
||||
- **Individual metric collection** - Granular data for flexible dashboard composition
|
||||
- **Intelligent status aggregation** - Host-level status calculated from all services
|
||||
- **Smart email notifications** - Batched, detailed alerts with service groupings
|
||||
- **Persistent state** - Prevents false notifications on restarts
|
||||
- **ZMQ communication** - Efficient agent-to-dashboard messaging
|
||||
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators
|
||||
|
||||
## Architecture
|
||||
|
||||
### Agent-Dashboard Separation
|
||||
The system follows a strict separation of concerns:
|
||||
### Core Components
|
||||
|
||||
- **Agent**: Single source of truth for all status calculations using defined thresholds
|
||||
- **Dashboard**: Display-only interface that shows agent-provided status
|
||||
- **Data Flow**: Agent (calculations) → Status → Dashboard (display) → Colors
|
||||
- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
|
||||
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
|
||||
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
|
||||
- **Status Aggregation** - Intelligent batching and notification management
|
||||
- **Persistent Cache** - Maintains state across restarts
|
||||
|
||||
### Agent Thresholds (Production)
|
||||
- **CPU Load**: Warning ≥ 5.0, Critical ≥ 8.0
|
||||
- **Memory Usage**: Warning ≥ 80%, Critical ≥ 95%
|
||||
- **CPU Temperature**: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
|
||||
### Status Levels
|
||||
|
||||
### Email Notification System
|
||||
- **From**: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
|
||||
- **To**: `cm@cmtec.se`
|
||||
- **SMTP**: localhost:25 (postfix)
|
||||
- **Rate Limiting**: 30 minutes (configurable)
|
||||
- **Triggers**: Status degradation and recovery with detailed context
|
||||
|
||||
## Installation
|
||||
|
||||
### Requirements
|
||||
- Rust toolchain 1.75+ (install via [`rustup`](https://rustup.rs))
|
||||
- Root privileges for agent (hardware monitoring access)
|
||||
- Network access for ZMQ communication (default port 6130)
|
||||
- SMTP server for notifications (postfix recommended)
|
||||
|
||||
### Build from Source
|
||||
```bash
|
||||
git clone https://github.com/cmtec/cm-dashboard.git
|
||||
cd cm-dashboard
|
||||
cargo build --release
|
||||
```
|
||||
|
||||
Optimized binaries are available at:
|
||||
- Dashboard: `target/release/cm-dashboard`
|
||||
- Agent: `target/release/cm-dashboard-agent`
|
||||
|
||||
### Installation
|
||||
```bash
|
||||
# Install dashboard
|
||||
cargo install --path dashboard
|
||||
|
||||
# Install agent (requires root for hardware access)
|
||||
sudo cargo install --path agent
|
||||
```
|
||||
- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined
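
The status levels above feed the host-level aggregation (the configuration names the method `worst_case`). The following is a minimal sketch of how such an aggregation could look. The `Status` enum mirrors the list above, but the severity ranking, the `severity` helper, and the `aggregate_worst_case` function are assumptions made for illustration, not the project's actual code.

```rust
/// Status levels taken from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

impl Status {
    /// Hypothetical severity rank (higher is worse). The real ordering,
    /// in particular where `Unknown` and `Pending` sit, is an assumption.
    fn severity(self) -> u8 {
        match self {
            Status::Ok => 0,
            Status::Pending => 1,
            Status::Unknown => 2,
            Status::Warning => 3,
            Status::Critical => 4,
        }
    }
}

/// Returns the worst status among `statuses`.
/// An empty slice yields `Unknown` rather than `Ok`.
pub fn aggregate_worst_case(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .max_by_key(|s| s.severity())
        .unwrap_or(Status::Unknown)
}
```

With this ranking, a host whose services report `Ok`, `Pending`, and `Warning` aggregates to `Warning`, and a single `Critical` service dominates everything else.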
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Dashboard
|
||||
```bash
|
||||
# Run with default configuration
|
||||
cm-dashboard
|
||||
|
||||
# Specify host to monitor
|
||||
cm-dashboard --host cmbox
|
||||
|
||||
# Override ZMQ endpoints
|
||||
cm-dashboard --zmq-endpoint tcp://srv01:6130,tcp://labbox:6130
|
||||
|
||||
# Increase logging verbosity
|
||||
cm-dashboard -v
|
||||
```
|
||||
|
||||
### Agent (Pure Auto-Discovery)
|
||||
The agent requires **no configuration files** and auto-discovers all system components:
|
||||
### Build
|
||||
|
||||
```bash
|
||||
# Basic agent startup (auto-detects everything)
|
||||
sudo cm-dashboard-agent
|
||||
# With Nix (recommended)
|
||||
nix-shell -p openssl pkg-config --run "cargo build --workspace"
|
||||
|
||||
# With verbose logging for troubleshooting
|
||||
sudo cm-dashboard-agent -v
|
||||
# Or with system dependencies
|
||||
sudo apt install libssl-dev pkg-config # Ubuntu/Debian
|
||||
cargo build --workspace
|
||||
```
|
||||
|
||||
The agent automatically:
|
||||
- **Discovers storage devices** for SMART monitoring
|
||||
- **Detects running systemd services** for resource tracking
|
||||
- **Configures collection intervals** based on system capabilities
|
||||
- **Sets up email notifications** using hostname@cmtec.se
|
||||
### Run
|
||||
|
||||
```bash
|
||||
# Start agent (requires configuration file)
|
||||
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml
|
||||
|
||||
# Start dashboard
|
||||
./target/debug/cm-dashboard --config /path/to/dashboard.toml
|
||||
```
|
||||
|
||||
## Configuration
|
||||
|
||||
### Dashboard Configuration
|
||||
The dashboard creates `config/dashboard.toml` on first run:
|
||||
### Agent Configuration (`agent.toml`)
|
||||
|
||||
The agent requires a comprehensive TOML configuration file:
|
||||
|
||||
```toml
|
||||
[hosts]
|
||||
default_host = "srv01"
|
||||
collection_interval_seconds = 2
|
||||
|
||||
[[hosts.hosts]]
|
||||
name = "srv01"
|
||||
[zmq]
|
||||
publisher_port = 6130
|
||||
command_port = 6131
|
||||
bind_address = "0.0.0.0"
|
||||
timeout_ms = 5000
|
||||
heartbeat_interval_ms = 30000
|
||||
|
||||
[collectors.cpu]
|
||||
enabled = true
|
||||
interval_seconds = 2
|
||||
load_warning_threshold = 9.0
|
||||
load_critical_threshold = 10.0
|
||||
temperature_warning_threshold = 100.0
|
||||
temperature_critical_threshold = 110.0
|
||||
|
||||
[[hosts.hosts]]
|
||||
name = "cmbox"
|
||||
[collectors.memory]
|
||||
enabled = true
|
||||
interval_seconds = 2
|
||||
usage_warning_percent = 80.0
|
||||
usage_critical_percent = 95.0
|
||||
|
||||
[dashboard]
|
||||
tick_rate_ms = 250
|
||||
history_duration_minutes = 60
|
||||
[collectors.disk]
|
||||
enabled = true
|
||||
interval_seconds = 300
|
||||
usage_warning_percent = 80.0
|
||||
usage_critical_percent = 90.0
|
||||
|
||||
[data_source]
|
||||
kind = "zmq"
|
||||
[[collectors.disk.filesystems]]
|
||||
name = "root"
|
||||
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
|
||||
mount_point = "/"
|
||||
fs_type = "ext4"
|
||||
monitor = true
|
||||
|
||||
[data_source.zmq]
|
||||
endpoints = ["tcp://127.0.0.1:6130"]
|
||||
[collectors.systemd]
|
||||
enabled = true
|
||||
interval_seconds = 10
|
||||
memory_warning_mb = 1000.0
|
||||
memory_critical_mb = 2000.0
|
||||
service_name_filters = [
|
||||
"nginx", "postgresql", "redis", "docker", "sshd"
|
||||
]
|
||||
excluded_services = [
|
||||
"nginx-config-reload", "sshd-keygen"
|
||||
]
|
||||
|
||||
[notifications]
|
||||
enabled = true
|
||||
smtp_host = "localhost"
|
||||
smtp_port = 25
|
||||
from_email = "{hostname}@example.com"
|
||||
to_email = "admin@example.com"
|
||||
rate_limit_minutes = 0
|
||||
trigger_on_warnings = true
|
||||
trigger_on_failures = true
|
||||
recovery_requires_all_ok = true
|
||||
suppress_individual_recoveries = true
|
||||
|
||||
[status_aggregation]
|
||||
enabled = true
|
||||
aggregation_method = "worst_case"
|
||||
notification_interval_seconds = 30
|
||||
|
||||
[cache]
|
||||
persist_path = "/var/lib/cm-dashboard/cache.json"
|
||||
```
|
||||
|
||||
### Agent Configuration (Optional)
|
||||
The agent works without configuration but supports optional settings:
|
||||
### Dashboard Configuration (`dashboard.toml`)
|
||||
|
||||
```bash
|
||||
# Generate example configuration
|
||||
cm-dashboard-agent --help
|
||||
```toml
|
||||
[zmq]
|
||||
hosts = [
|
||||
{ name = "server1", address = "192.168.1.100", port = 6130 },
|
||||
{ name = "server2", address = "192.168.1.101", port = 6130 }
|
||||
]
|
||||
connection_timeout_ms = 5000
|
||||
reconnect_interval_ms = 10000
|
||||
|
||||
# Override specific settings
|
||||
sudo cm-dashboard-agent \
|
||||
--hostname cmbox \
|
||||
--bind tcp://*:6130 \
|
||||
--interval 5000
|
||||
[ui]
|
||||
refresh_interval_ms = 1000
|
||||
theme = "dark"
|
||||
```
|
||||
|
||||
## Widget Layout
|
||||
## Collectors
|
||||
|
||||
### Services Widget Structure
|
||||
The Services widget now displays both system metrics and services in a unified table:
|
||||
The agent implements several specialized collectors:
|
||||
|
||||
```
|
||||
┌Services────────────────────────────────────────────────────┐
|
||||
│ Service Memory (GB) CPU Disk │
|
||||
│✔ Service Memory 7.1/23899.7 MiB — │ ← System metric as service row
|
||||
│✔ Disk Usage — — 45/100 GB │ ← System metric as service row
|
||||
│⚠ CPU Load — 2.18 — │ ← System metric as service row
|
||||
│✔ CPU Temperature — 47.0°C — │ ← System metric as service row
|
||||
│✔ docker-registry 0.0 GB 0.0% <1 MB │ ← Regular service
|
||||
│✔ nginx 0.0/1.0 GB 0.0% <1 MB │ ← Regular service
|
||||
│✔ ├─ docker.cmtec.se │ ← Nginx site (sub-service)
|
||||
│✔ ├─ git.cmtec.se │ ← Nginx site (sub-service)
|
||||
│✔ └─ gitea.cmtec.se │ ← Nginx site (sub-service)
|
||||
│✔ sshd 0.0 GB 0.0% <1 MB │ ← Regular service
|
||||
│ 1 SSH connection │ ← Service description
|
||||
└────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
### CPU Collector (`cpu.rs`)
- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds
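
To illustrate the threshold-based status calculation, here is a small sketch. The `Thresholds` struct and `status_for` function are hypothetical names; only the idea of separate warning and critical thresholds (such as `load_warning_threshold` and `load_critical_threshold` in `agent.toml`) comes from this README. Whether the agent compares with `>=` or `>` is also an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical pair of thresholds, e.g. loaded from `[collectors.cpu]`.
pub struct Thresholds {
    pub warning: f32,
    pub critical: f32,
}

/// Maps a measured value to a status. The critical check comes first
/// so that a value above both thresholds reports `Critical`.
pub fn status_for(value: f32, thresholds: &Thresholds) -> Status {
    if value >= thresholds.critical {
        Status::Critical
    } else if value >= thresholds.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With `warning: 9.0` and `critical: 10.0` (the values shown in the example `agent.toml`), a 1-minute load of 2.5 maps to `Ok` and a load of 9.0 maps to `Warning`.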
|
||||
|
||||
**Row Types:**
|
||||
- **System Metrics**: CPU Load, Service Memory, Disk Usage, CPU Temperature with status indicators
|
||||
- **Regular Services**: Full resource data (memory, CPU, disk) with optional description lines
|
||||
- **Sub-services**: Nginx sites with tree structure, status indicators only (no resource columns)
|
||||
- **Description Lines**: Connection counts and service-specific info without status indicators
|
||||
### Memory Collector (`memory.rs`)
|
||||
- RAM usage (total, used, available)
|
||||
- Swap monitoring
|
||||
- Real-time process monitoring (top RAM consumers)
|
||||
- Memory pressure detection
|
||||
|
||||
### Hosts Widget (formerly Alerts)
|
||||
The Hosts widget provides a summary view of all monitored hosts:
|
||||
### Disk Collector (`disk.rs`)
|
||||
- Filesystem usage per mount point
|
||||
- SMART health monitoring
|
||||
- Temperature and wear tracking
|
||||
- Configurable filesystem monitoring
|
||||
|
||||
```
|
||||
┌Hosts────────────────────────────────────────────────────────┐
|
||||
│ Host Status Timestamp │
|
||||
│✔ cmbox ok 2025-10-13 05:45:28 │
|
||||
│✔ srv01 ok 2025-10-13 05:45:28 │
|
||||
│? labbox No data received — │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
### Systemd Collector (`systemd.rs`)
- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)
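
As a rough illustration of how systemd states could map onto the status levels, consider the sketch below. The state strings are systemd's standard `ActiveState` values; the function name `status_from_active_state` and the choice to treat `inactive` as `Critical` are assumptions, and the actual collector may classify states differently.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Pending,
    Critical,
    Unknown,
}

/// Maps a systemd `ActiveState` string to a status.
/// Transitional states become `Pending` so that a restart in progress
/// is not reported as a failure.
pub fn status_from_active_state(state: &str) -> Status {
    match state {
        "active" => Status::Ok,
        "activating" | "deactivating" | "reloading" => Status::Pending,
        // Assumption: a monitored service that is not running is critical.
        "inactive" | "failed" => Status::Critical,
        _ => Status::Unknown,
    }
}
```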
|
||||
|
||||
## Monitoring Components
|
||||
|
||||
### System Collector
|
||||
- **CPU Load**: 1/5/15 minute averages with warning/critical thresholds
|
||||
- **Memory Usage**: Used/total with percentage calculation
|
||||
- **CPU Temperature**: x86_pkg_temp prioritized for accuracy
|
||||
- **C-States**: Power management state distribution (C0-C10)
|
||||
|
||||
### Service Collector
|
||||
- **System Metrics as Services**: CPU Load, Service Memory, Disk Usage, CPU Temperature displayed as individual service rows
|
||||
- **Systemd Services**: Auto-discovery of interesting services with resource monitoring
|
||||
- **Nginx Site Monitoring**: Individual rows for each nginx virtual host with tree structure (`├─` and `└─`)
|
||||
- **Resource Usage**: Per-service memory, CPU, and disk consumption
|
||||
- **Service Health**: Running/stopped/degraded status with detailed failure info
|
||||
- **Connection Tracking**: SSH connections, database connections as description lines
|
||||
|
||||
### SMART Collector
|
||||
- **NVMe Health**: Temperature, wear leveling, spare blocks
|
||||
- **Drive Capacity**: Total/used space with percentage
|
||||
- **SMART Attributes**: Critical health indicators
|
||||
|
||||
### Backup Collector
|
||||
- **Restic Integration**: Backup status and history
|
||||
- **Health Monitoring**: Success/failure tracking
|
||||
- **Storage Metrics**: Backup size and retention
|
||||
|
||||
## Keyboard Controls
|
||||
|
||||
| Key | Action |
|
||||
|-----|--------|
|
||||
| `←` / `h` | Previous host |
|
||||
| `→` / `l` / `Tab` | Next host |
|
||||
| `?` | Toggle help overlay |
|
||||
| `r` | Force refresh |
|
||||
| `q` / `Esc` | Quit |
|
||||
### Backup Collector (`backup.rs`)
|
||||
- Reads TOML status files from backup systems
|
||||
- Archive age verification
|
||||
- Disk usage tracking
|
||||
- Repository health monitoring
|
||||
|
||||
## Email Notifications
|
||||
|
||||
### Notification Triggers
|
||||
- **Status Degradation**: Any status change to warning/critical
|
||||
- **Recovery**: Warning/critical status returning to ok
|
||||
- **Service Failures**: Individual service stop/start events
|
||||
### Intelligent Batching

The system implements smart notification batching to prevent email spam:

- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity
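
The batching behaviour described above can be sketched as follows. This is a minimal illustration, not the agent's implementation: `StatusChange` and `NotificationBatcher` are hypothetical types, and starting the 30-second window at the first queued change (rather than on a fixed tick) is an assumption. Time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of one status transition.
#[derive(Debug, Clone, PartialEq)]
pub struct StatusChange {
    pub service: String,
    pub from: String,
    pub to: String,
}

/// Collects status changes and releases them as one batch per interval.
pub struct NotificationBatcher {
    interval: Duration,
    window_start: Option<Instant>,
    pending: Vec<StatusChange>,
}

impl NotificationBatcher {
    pub fn new(interval: Duration) -> Self {
        Self { interval, window_start: None, pending: Vec::new() }
    }

    /// Queues a change; the first change in an empty batch starts the window.
    pub fn record(&mut self, change: StatusChange, now: Instant) {
        if self.pending.is_empty() {
            self.window_start = Some(now);
        }
        self.pending.push(change);
    }

    /// Returns the queued changes once the interval has elapsed,
    /// or `None` if nothing is queued or the window is still open.
    pub fn take_due(&mut self, now: Instant) -> Option<Vec<StatusChange>> {
        let start = self.window_start?;
        if now.duration_since(start) < self.interval {
            return None;
        }
        self.window_start = None;
        Some(std::mem::take(&mut self.pending))
    }
}
```

A caller would record every status change as it happens and poll `take_due` from its main loop, sending one email per returned batch.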
|
||||
|
||||
### Example Alert Email
|
||||
|
||||
### Example Recovery Email
|
||||
```
|
||||
✅ RESOLVED: system cpu on cmbox
|
||||
Subject: Status Alert: 2 critical, 1 warning, 15 started
|
||||
|
||||
Status Change Alert
|
||||
Status Summary (30s duration)
|
||||
Host Status: Ok → Warning
|
||||
|
||||
Host: cmbox
|
||||
Component: system
|
||||
Metric: cpu
|
||||
Status Change: warning → ok
|
||||
Time: 2025-10-12 22:15:30 CET
|
||||
🔴 CRITICAL ISSUES (2):
|
||||
postgresql: Ok → Critical
|
||||
nginx: Warning → Critical
|
||||
|
||||
Details:
|
||||
Recovered from: CPU load (1/5/15min): 6.20 / 5.80 / 4.50
|
||||
Current status: CPU load (1/5/15min): 3.30 / 3.17 / 2.84
|
||||
🟡 WARNINGS (1):
|
||||
redis: Ok → Warning (memory usage 85%)
|
||||
|
||||
✅ RECOVERIES (0):
|
||||
|
||||
🟢 SERVICE STARTUPS (15):
|
||||
docker: Unknown → Ok
|
||||
sshd: Unknown → Ok
|
||||
...
|
||||
|
||||
--
|
||||
CM Dashboard Agent
|
||||
Generated at 2025-10-12 22:15:30 CET
|
||||
Generated at 2025-10-21 19:42:42 CET
|
||||
```
|
||||
|
||||
### Rate Limiting
|
||||
- **Default**: 30 minutes between notifications per component
|
||||
- **Testing**: Set to 0 for immediate notifications
|
||||
- **Configurable**: Adjustable per deployment needs
|
||||
## Individual Metrics Architecture
|
||||
|
||||
The system follows a **metrics-first architecture**:
|
||||
|
||||
### Agent Side
|
||||
```rust
|
||||
// Agent collects individual metrics
|
||||
vec![
|
||||
Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
|
||||
Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
|
||||
Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
|
||||
]
|
||||
```
|
||||
|
||||
### Dashboard Side
|
||||
```rust
|
||||
// Widgets subscribe to specific metrics
|
||||
impl Widget for CpuWidget {
|
||||
fn update_from_metrics(&mut self, metrics: &[&Metric]) {
|
||||
for metric in metrics {
|
||||
match metric.name.as_str() {
|
||||
"cpu_load_1min" => self.load_1min = metric.value.as_f32(),
|
||||
"cpu_load_5min" => self.load_5min = metric.value.as_f32(),
|
||||
"cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Persistent Cache

The cache system prevents false notifications:

- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications
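
The idea behind the cache can be sketched as below. This is an assumption-laden illustration: `StatusCache`, `restore`, and `observe` are hypothetical names, statuses are plain strings for brevity, and reading or writing the cache file itself (for example the `persist_path` from `agent.toml`) is left out.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of the persisted status cache.
pub struct StatusCache {
    last_seen: HashMap<String, String>,
}

impl StatusCache {
    /// Rebuilds the cache from previously persisted (service, status) pairs.
    pub fn restore(entries: &[(&str, &str)]) -> Self {
        let last_seen = entries
            .iter()
            .map(|(service, status)| (service.to_string(), status.to_string()))
            .collect();
        Self { last_seen }
    }

    /// Records the latest status and returns `true` only if it differs
    /// from the cached one, i.e. only if a notification is warranted.
    pub fn observe(&mut self, service: &str, status: &str) -> bool {
        let unchanged = self.last_seen.get(service).map(String::as_str) == Some(status);
        if !unchanged {
            self.last_seen.insert(service.to_string(), status.to_string());
        }
        !unchanged
    }
}
```

Because the cache is restored before the first collection cycle, a service that was `ok` before the restart and is still `ok` afterwards does not count as a status change.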
|
||||
|
||||
## Development
|
||||
|
||||
### Project Structure
|
||||
|
||||
```
|
||||
cm-dashboard/
|
||||
├── agent/ # Monitoring agent
|
||||
├── agent/ # Metrics collection agent
|
||||
│ ├── src/
|
||||
│ │ ├── collectors/ # Data collection modules
|
||||
│ │ ├── notifications.rs # Email notification system
|
||||
│ │ └── simple_agent.rs # Main agent logic
|
||||
├── dashboard/ # TUI dashboard
|
||||
│ │ ├── collectors/ # CPU, memory, disk, systemd, backup
|
||||
│ │ ├── status/ # Status aggregation and notifications
|
||||
│ │ ├── cache/ # Persistent metric caching
|
||||
│ │ ├── config/ # TOML configuration loading
|
||||
│ │ └── notifications/ # Email notification system
|
||||
├── dashboard/ # TUI dashboard application
|
||||
│ ├── src/
|
||||
│ │ ├── ui/ # Widget implementations
|
||||
│ │ ├── data/ # Data structures
|
||||
│ │ └── app.rs # Application state
|
||||
├── shared/ # Common data structures
|
||||
└── config/ # Configuration files
|
||||
│ │ ├── ui/widgets/ # CPU, memory, services, backup widgets
|
||||
│ │ ├── metrics/ # Metric storage and filtering
|
||||
│ │ └── communication/ # ZMQ metric consumption
|
||||
├── shared/ # Shared types and utilities
|
||||
│ └── src/
|
||||
│ ├── metrics.rs # Metric, Status, and Value types
|
||||
│ ├── protocol.rs # ZMQ message format
|
||||
│ └── cache.rs # Cache configuration
|
||||
└── README.md # This file
|
||||
```
|
||||
|
||||
### Development Commands
|
||||
```bash
|
||||
# Format code
|
||||
cargo fmt
|
||||
### Building
|
||||
|
||||
# Check all packages
|
||||
cargo check
|
||||
```bash
|
||||
# Debug build
|
||||
cargo build --workspace
|
||||
|
||||
# Release build
|
||||
cargo build --workspace --release
|
||||
|
||||
# Run tests
|
||||
cargo test
|
||||
cargo test --workspace
|
||||
|
||||
# Build release
|
||||
cargo build --release
|
||||
# Check code formatting
|
||||
cargo fmt --all -- --check
|
||||
|
||||
# Run with logging
|
||||
RUST_LOG=debug cargo run -p cm-dashboard-agent
|
||||
# Run clippy linter
|
||||
cargo clippy --workspace -- -D warnings
|
||||
```
|
||||
|
||||
### Architecture Principles
|
||||
### Dependencies
|
||||
|
||||
#### Status Calculation Rules
- **Agent calculates all status** using predefined thresholds
- **Dashboard never calculates status** - only displays agent data
- **No hardcoded thresholds in dashboard** widgets
- **Use "unknown" when agent status missing** (never default to "ok")
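
The last rule can be illustrated with a short sketch. The `display_status` function and the map of agent-reported statuses are hypothetical; the point is only that a missing entry resolves to `Unknown` and never to `Ok`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Looks up the agent-provided status for a metric.
/// The dashboard performs no calculation of its own: if the agent has
/// not reported a status, the result is `Unknown`.
pub fn display_status(agent_statuses: &HashMap<String, Status>, metric: &str) -> Status {
    agent_statuses.get(metric).copied().unwrap_or(Status::Unknown)
}
```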
|
||||
|
||||
#### Data Flow
|
||||
```
|
||||
System Metrics → Agent Collectors → Status Calculation → ZMQ → Dashboard → Display
|
||||
↓
|
||||
Email Notifications
|
||||
```
|
||||
|
||||
#### Pure Auto-Discovery
|
||||
- **No config files required** for basic operation
|
||||
- **Runtime discovery** of system capabilities
|
||||
- **Service auto-detection** via systemd patterns
|
||||
- **Storage device enumeration** via /sys filesystem
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
#### Agent Won't Start
|
||||
```bash
|
||||
# Check permissions (agent requires root)
|
||||
sudo cm-dashboard-agent -v
|
||||
|
||||
# Verify ZMQ binding
|
||||
sudo netstat -tulpn | grep 6130
|
||||
|
||||
# Check system access
|
||||
sudo smartctl --scan
|
||||
```
|
||||
|
||||
#### Dashboard Connection Issues
|
||||
```bash
|
||||
# Test ZMQ connectivity
|
||||
cm-dashboard --zmq-endpoint tcp://target-host:6130 -v
|
||||
|
||||
# Check network connectivity
|
||||
telnet target-host 6130
|
||||
```
|
||||
|
||||
#### Email Notifications Not Working
|
||||
```bash
|
||||
# Check postfix status
|
||||
sudo systemctl status postfix
|
||||
|
||||
# Test SMTP manually
|
||||
telnet localhost 25
|
||||
|
||||
# Verify notification settings
|
||||
sudo cm-dashboard-agent -v | grep notification
|
||||
```
|
||||
|
||||
### Logging
|
||||
Set `RUST_LOG=debug` for detailed logging:
|
||||
```bash
|
||||
RUST_LOG=debug sudo cm-dashboard-agent
|
||||
RUST_LOG=debug cm-dashboard
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
MIT License - see LICENSE file for details.
|
||||
|
||||
## Contributing
|
||||
|
||||
1. Fork the repository
|
||||
2. Create feature branch (`git checkout -b feature/amazing-feature`)
|
||||
3. Commit changes (`git commit -m 'Add amazing feature'`)
|
||||
4. Push to branch (`git push origin feature/amazing-feature`)
|
||||
5. Open Pull Request
|
||||
|
||||
For bugs and feature requests, please use GitHub Issues.
|
||||
- **tokio** - Async runtime
|
||||
- **zmq** - Message passing between agent and dashboard
|
||||
- **ratatui** - Terminal user interface
|
||||
- **serde** - Serialization for metrics and config
|
||||
- **anyhow/thiserror** - Error handling
|
||||
- **tracing** - Structured logging
|
||||
- **lettre** - SMTP email notifications
|
||||
- **clap** - Command-line argument parsing
|
||||
- **toml** - Configuration file parsing
|
||||
|
||||
## NixOS Integration
|
||||
|
||||
### Updating cm-dashboard in NixOS Configuration
|
||||
This project is designed for declarative deployment via NixOS:
|
||||
|
||||
When new code is pushed to the cm-dashboard repository, follow these steps to update the NixOS configuration:
|
||||
### Configuration Generation
|
||||
|
||||
#### 1. Get the Latest Commit Hash
|
||||
```bash
|
||||
# Get the latest commit from the API
|
||||
curl -s "https://gitea.cmtec.se/api/v1/repos/cm/cm-dashboard/commits?sha=main&limit=1" | head -20
|
||||
The NixOS module automatically generates the agent configuration:
|
||||
|
||||
# Or use git
|
||||
git log --oneline -1
|
||||
```
|
||||
|
||||
#### 2. Update the NixOS Configuration
|
||||
Edit `hosts/common/cm-dashboard.nix` and update the `rev` field:
|
||||
```nix
|
||||
src = pkgs.fetchFromGitea {
|
||||
domain = "gitea.cmtec.se";
|
||||
owner = "cm";
|
||||
repo = "cm-dashboard";
|
||||
rev = "f786d054f2ece80823f85e46933857af96e241b2"; # Update this
|
||||
hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Reset temporarily
|
||||
# hosts/common/cm-dashboard.nix
|
||||
services.cm-dashboard-agent = {
|
||||
enable = true;
|
||||
port = 6130;
|
||||
};
|
||||
```
|
||||
|
||||
#### 3. Get the Correct Hash
|
||||
Build with a placeholder hash to get the actual hash:
|
||||
```bash
|
||||
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchFromGitea {
|
||||
domain = "gitea.cmtec.se";
|
||||
owner = "cm";
|
||||
repo = "cm-dashboard";
|
||||
rev = "YOUR_COMMIT_HASH";
|
||||
hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
|
||||
}' 2>&1 | grep "got:"
|
||||
```
|
||||
|
||||
Example output:
|
||||
```
|
||||
error: hash mismatch in fixed-output derivation '/nix/store/...':
|
||||
specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
|
||||
got: sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
|
||||
```
|
||||
|
||||
#### 4. Update the Hash
|
||||
Replace the placeholder with the correct hash from the error message (the "got:" line):
|
||||
```nix
|
||||
hash = "sha256-vjy+j91iDCHUf0RE43anK4WZ+rKcyohP/3SykwZGof8="; # Use actual hash
|
||||
```
|
||||
|
||||
#### 5. Update Cargo Dependencies (if needed)
|
||||
If Cargo.lock has changed, you may need to update `cargoHash`:
|
||||
```bash
|
||||
# Build to get cargo hash error
|
||||
nix-build --no-out-link --expr 'with import <nixpkgs> {}; rustPlatform.buildRustPackage rec {
|
||||
pname = "cm-dashboard";
|
||||
version = "0.1.0";
|
||||
src = fetchFromGitea {
|
||||
domain = "gitea.cmtec.se";
|
||||
owner = "cm";
|
||||
repo = "cm-dashboard";
|
||||
rev = "YOUR_COMMIT_HASH";
|
||||
hash = "YOUR_SOURCE_HASH";
|
||||
};
|
||||
cargoHash = "";
|
||||
nativeBuildInputs = [ pkg-config ];
|
||||
buildInputs = [ openssl ];
|
||||
buildAndTestSubdir = ".";
|
||||
cargoBuildFlags = [ "--workspace" ];
|
||||
}' 2>&1 | grep "got:"
|
||||
```
|
||||
|
||||
Then update `cargoHash` in the configuration.
|
||||
|
||||
#### 6. Commit the Changes
|
||||
### Deployment
|
||||
|
||||
```bash
|
||||
# Update NixOS configuration
|
||||
git add hosts/common/cm-dashboard.nix
|
||||
git commit -m "Update cm-dashboard to latest version"
|
||||
git commit -m "Update cm-dashboard configuration"
|
||||
git push
|
||||
|
||||
# Rebuild system (user-performed)
|
||||
sudo nixos-rebuild switch --flake .
|
||||
```
|
||||
|
||||
### Example Update Process
|
||||
```bash
|
||||
# 1. Get latest commit
|
||||
LATEST_COMMIT=$(curl -s "https://gitea.cmtec.se/api/v1/repos/cm/cm-dashboard/commits?sha=main&limit=1" | grep '"sha"' | head -1 | cut -d'"' -f4)
|
||||
## Monitoring Intervals
|
||||
|
||||
# 2. Get source hash
|
||||
SOURCE_HASH=$(nix-build --no-out-link -E "with import <nixpkgs> {}; fetchFromGitea { domain = \"gitea.cmtec.se\"; owner = \"cm\"; repo = \"cm-dashboard\"; rev = \"$LATEST_COMMIT\"; hash = \"sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=\"; }" 2>&1 | grep "got:" | cut -d' ' -f12)
|
||||
- **CPU/Memory**: 2 seconds (real-time monitoring)
|
||||
- **Disk usage**: 300 seconds (5 minutes)
|
||||
- **Systemd services**: 10 seconds
|
||||
- **SMART health**: 600 seconds (10 minutes)
|
||||
- **Backup status**: 60 seconds (1 minute)
|
||||
- **Email notifications**: 30 seconds (batched)
|
||||
- **Dashboard updates**: 1 second (real-time display)
|
||||
|
||||
# 3. Update configuration and commit
|
||||
echo "Latest commit: $LATEST_COMMIT"
|
||||
echo "Source hash: $SOURCE_HASH"
|
||||
```
|
||||
## License
|
||||
|
||||
MIT License - see LICENSE file for details
|
||||