# CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

## Current Implementation

This is a complete rewrite implementing an **individual metrics architecture** where:

- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart

## Dashboard Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox

┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:               Status:  RAM:   Disk:     │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker               active   27M    496MB     │
│RAM:                                ││● docker-registry      active   19M    496MB     │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                active   579M   2.6GB     │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default active   11M    2.6GB     │
│Disk nvme0n1:                       ││● haasp-core           active   9M     1MB       │
│● Health: PASSED                    ││● haasp-mqtt           active   3M     1MB       │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid        active   10M    1MB       │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server        active   240M   45.1GB    │
│                                    ││● mosquitto            active   1M     1MB       │
│                                    ││● mysql                active   38M    225MB     │
│                                    ││● nginx                active   28M    24MB      │
│                                    ││ ├─ ● gitea.cmtec.se            51ms             │
│                                    ││ ├─ ● haasp.cmtec.se            43ms             │
│                                    ││ ├─ ● haasp.net                 43ms             │
│                                    ││ ├─ ● pages.cmtec.se            45ms             │
└────────────────────────────────────┘│ ├─ ● photos.cmtec.se           41ms             │
┌backup──────────────────────────────┐│ ├─ ● unifi.cmtec.se            46ms             │
│Latest backup:                      ││ ├─ ● vault.cmtec.se            47ms             │
│● Status: OK                        ││ ├─ ● www.kryddorten.se         81ms             │
│Duration: 54s • Last: 4h ago        ││ ├─ ● www.mariehall2.se         86ms             │
│Disk usage: 48.2GB/915.8GB          ││● postgresql           active   112M   357MB     │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich         active   8M     45.1GB    │
│S/N: S5RRNF0W800639Y                ││● sshd                 active   2M     0         │
│● gitea       2 archives       2.7GB││● unifi                active   594M   495MB     │
│● immich      2 archives      45.0GB││● vaultwarden          active   12M    1MB       │
│● kryddorten  2 archives      67.6MB││                                                 │
│● mariehall2  2 archives     321.8MB││                                                 │
│● nixosbox    2 archives       4.5MB││                                                 │
│● unifi       2 archives       2.9MB││                                                 │
│● vaultwarden 2 archives       305kB││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

**Navigation**: `←→` switch hosts, `r` refresh, `q` quit

## Features

- **Real-time monitoring** - Dashboard updates every 1-2 seconds
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
- **ZMQ communication** - Efficient agent-to-dashboard messaging
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators

## Architecture

### Core Components

- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends them via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts

### Status Levels

- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined

## Quick Start

### Build

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace
```

### Run

```bash
# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml
```
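The five status levels above are what the agent's worst-case aggregation (`aggregation_method = "worst_case"` in the agent config) reduces to a single host status. A minimal Rust sketch of that idea — illustrative only, not the actual `cm-dashboard-shared` types, and the severity ordering (with `Unknown` ranked worst) is an assumption:

```rust
// Hypothetical sketch: status levels ordered by severity via discriminant order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,       // service running normally
    Pending,  // starting/stopping/reloading
    Warning,  // high load, memory, or disk usage
    Critical, // failed or critical threshold exceeded
    Unknown,  // state cannot be determined (ranked worst here; an assumption)
}

/// Worst-case aggregation: the host is only as healthy as its worst metric.
fn aggregate_worst_case(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Unknown)
}

fn main() {
    let host = aggregate_worst_case(&[Status::Ok, Status::Warning, Status::Ok]);
    assert_eq!(host, Status::Warning);
    println!("{host:?}"); // prints "Warning"
}
```

Deriving `Ord` on a fieldless enum orders variants by declaration, so `max()` picks the most severe status seen.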
## Configuration

### Agent Configuration (`agent.toml`)

The agent requires a comprehensive TOML configuration file:

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
    "nginx",
    "postgresql",
    "redis",
    "docker",
    "sshd"
]
excluded_services = [
    "nginx-config-reload",
    "sshd-keygen"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"
```

### Dashboard Configuration (`dashboard.toml`)

```toml
[zmq]
hosts = [
    { name = "server1", address = "192.168.1.100", port = 6130 },
    { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"
```
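Given threshold pairs like `load_warning_threshold = 9.0` / `load_critical_threshold = 10.0` above, a collector's per-metric status calculation can be sketched as follows. This is an illustrative simplification, not the agent's actual code:

```rust
// Minimal stand-in for the shared status type (illustrative only).
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Map a sampled value against the warning/critical thresholds configured
/// in agent.toml (e.g. load_warning_threshold, load_critical_threshold).
fn status_from_thresholds(value: f64, warning: f64, critical: f64) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    // With the example CPU thresholds: warn at 9.0, go critical at 10.0.
    assert_eq!(status_from_thresholds(2.5, 9.0, 10.0), Status::Ok);
    assert_eq!(status_from_thresholds(9.4, 9.0, 10.0), Status::Warning);
    assert_eq!(status_from_thresholds(11.0, 9.0, 10.0), Status::Critical);
    println!("thresholds behave as configured");
}
```

Whether thresholds are inclusive (`>=`) or exclusive is an assumption here; check the collector source for the exact comparison.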
## Collectors

The agent implements several specialized collectors:

### CPU Collector (`cpu.rs`)

- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds

### Memory Collector (`memory.rs`)

- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection

### Disk Collector (`disk.rs`)

- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring

### Systemd Collector (`systemd.rs`)

- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)

### Backup Collector (`backup.rs`)

- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
- Repository health monitoring

## Email Notifications

### Intelligent Batching

The system implements smart notification batching to prevent email spam:

- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity

### Example Alert Email

```
Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET
```
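The batching behind a subject line like `Status Alert: 2 critical, 1 warning` amounts to counting queued transitions per severity before the 30-second flush. A hedged sketch — the names and shapes here are assumptions, not the agent's notification API:

```rust
// Minimal stand-in for the shared status type (illustrative only).
#[derive(Clone, Copy, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Build the subject line from a batch of (service, old, new) transitions,
/// counting how many services ended up critical or in warning.
fn subject_line(batch: &[(&str, Status, Status)]) -> String {
    let critical = batch.iter().filter(|(_, _, new)| *new == Status::Critical).count();
    let warning = batch.iter().filter(|(_, _, new)| *new == Status::Warning).count();
    format!("Status Alert: {critical} critical, {warning} warning")
}

fn main() {
    // Matches the example email above: postgresql and nginx went critical,
    // redis went to warning, within one 30-second batch window.
    let batch = [
        ("postgresql", Status::Ok, Status::Critical),
        ("nginx", Status::Warning, Status::Critical),
        ("redis", Status::Ok, Status::Warning),
    ];
    assert_eq!(subject_line(&batch), "Status Alert: 2 critical, 1 warning");
    println!("{}", subject_line(&batch));
}
```

The real subject also counts startups; that field is omitted here for brevity.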
## Individual Metrics Architecture

The system follows a **metrics-first architecture**:

### Agent Side

```rust
// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]
```

### Dashboard Side

```rust
// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
```

## Persistent Cache

The cache system prevents false notifications:

- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications

## Development

### Project Structure

```
cm-dashboard/
├── agent/                  # Metrics collection agent
│   └── src/
│       ├── collectors/     # CPU, memory, disk, systemd, backup
│       ├── status/         # Status aggregation and notifications
│       ├── cache/          # Persistent metric caching
│       ├── config/         # TOML configuration loading
│       └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   └── src/
│       ├── ui/widgets/     # CPU, memory, services, backup widgets
│       ├── metrics/        # Metric storage and filtering
│       └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md               # This file
```
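The `metrics/` layer in the dashboard tree above stores and filters incoming metrics before widgets consume them. Routing metrics to a widget can be as simple as a name-prefix match — a sketch under assumed types (the real `Metric` lives in `shared/src/metrics.rs`):

```rust
// Minimal stand-in for the shared Metric type (illustrative only).
struct Metric {
    name: String,
    value: f32,
}

/// Select the metrics a widget subscribes to by name prefix, e.g. "cpu_"
/// for the CPU widget ("cpu_load_1min", "cpu_temperature_celsius", ...).
fn metrics_for_widget<'a>(metrics: &'a [Metric], prefix: &str) -> Vec<&'a Metric> {
    metrics.iter().filter(|m| m.name.starts_with(prefix)).collect()
}

fn main() {
    let metrics = vec![
        Metric { name: "cpu_load_1min".to_string(), value: 2.5 },
        Metric { name: "memory_usage_percent".to_string(), value: 78.5 },
        Metric { name: "cpu_load_5min".to_string(), value: 1.9 },
    ];
    let cpu = metrics_for_widget(&metrics, "cpu_");
    assert_eq!(cpu.len(), 2);
    assert_eq!(cpu[0].name, "cpu_load_1min");
    println!("cpu metrics: {}", cpu.len());
}
```

Whether the dashboard actually matches on prefixes or on an explicit subscription list is an assumption; the `update_from_metrics` example above suggests exact-name matching inside each widget.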
### Building

```bash
# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings
```

### Dependencies

- **tokio** - Async runtime
- **zmq** - Message passing between agent and dashboard
- **ratatui** - Terminal user interface
- **serde** - Serialization for metrics and config
- **anyhow/thiserror** - Error handling
- **tracing** - Structured logging
- **lettre** - SMTP email notifications
- **clap** - Command-line argument parsing
- **toml** - Configuration file parsing

## NixOS Integration

This project is designed for declarative deployment via NixOS:

### Configuration Generation

The NixOS module automatically generates the agent configuration:

```nix
# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};
```

### Deployment

```bash
# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .
```

## Monitoring Intervals

- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)

## License

MIT License - see LICENSE file for details