# CM Dashboard

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and an individual-metrics architecture.

## Features

### Core Monitoring

- **Real-time metrics**: CPU, RAM, Storage, and Service status
- **Multi-host support**: Monitor multiple servers from a single dashboard
- **Service management**: Start/stop services with intelligent status tracking
- **NixOS integration**: System rebuild via SSH + tmux popup
- **Backup monitoring**: Borgbackup status and scheduling
- **Email notifications**: Intelligent batching prevents spam

### User-Stopped Service Tracking

Services stopped via the dashboard are tracked so that they do not raise false alerts:

- **Smart status reporting**: User-stopped services show as Status::OK instead of Warning
- **Persistent storage**: Tracking survives agent restarts via JSON storage
- **Automatic management**: Flags are cleared when services are restarted via the dashboard
- **Maintenance friendly**: No false alerts during intentional service operations

## Architecture

### Individual Metrics Philosophy

- **Agent**: Collects individual metrics, calculates status using thresholds
- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
- **ZMQ Communication**: Efficient real-time metric transmission
- **Status Aggregation**: Host-level status calculated from all service metrics

### Components

```
┌─────────────────┐    ZMQ     ┌─────────────────┐
│                 │◄──────────►│                 │
│     Agent       │  Metrics   │    Dashboard    │
│  - Collectors   │            │  - TUI          │
│  - Status       │            │  - Widgets      │
│  - Tracking     │            │  - Commands     │
│                 │            │                 │
└─────────────────┘            └─────────────────┘
         │                              │
         ▼                              ▼
┌─────────────────┐            ┌─────────────────┐
│  JSON Storage   │            │   SSH + tmux    │
│  - User-stopped │            │ - Remote rebuild│
│  - Cache        │            │  - Process      │
│  - State        │            │    isolation    │
└─────────────────┘            └─────────────────┘
```

### Service Control Flow

1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
2. **Agent Processing**:
   - Marks service as user-stopped (if stopping)
   - Executes `systemctl start/stop service`
   - Syncs state to global tracker
3. **Status Calculation**:
   - Systemd collector checks user-stopped flag
   - Reports Status::OK for user-stopped inactive services
   - Normal Warning status for system failures

## Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│NixOS:                              ││Service:                Status:      RAM:  Disk: │
│Build: 25.05.20251004.3bcc93c       ││● docker                active       27M   496MB │
│Agent: v0.1.43                      ││● gitea                 active       579M  2.6GB │
│Active users: cm, simon             ││● nginx                 active       28M   24MB  │
│CPU:                                ││  ├─ ● gitea.cmtec.se   51ms                     │
│● Load: 0.10 0.52 0.88 • 3000MHz    ││  ├─ ● photos.cmtec.se  41ms                     │
│RAM:                                ││● postgresql            active       112M  357MB │
│● Usage: 33% 2.6GB/7.6GB            ││● redis-immich          user-stopped             │
│● /tmp: 0% 0B/2.0GB                 ││● sshd                  active       2M    0     │
│Storage:                            ││● unifi                 active       594M  495MB │
│● root (Single):                    ││                                                 │
│  ├─ ● nvme0n1 W: 1%                ││                                                 │
│  └─ ● 18% 167.4GB/928.2GB          ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Navigate services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl in tmux popup)
- **L**: Show custom log files (tail -f custom paths in tmux popup)
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit

### Status Indicators

- **Green ●**: Active service
- **Yellow ◐**: Inactive service (system issue)
- **Red ◯**: Failed service
- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
- **"user-stopped"**: Service stopped via dashboard (Status::OK)

## Quick Start

### Building

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace
```

### Running

```bash
# Start agent (requires configuration)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard (inside tmux session)
tmux
./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
```

## Configuration

### Agent Configuration

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
transmission_interval_seconds = 2

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 5.0
load_critical_threshold = 10.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[collectors.systemd]
enabled = true
interval_seconds = 10
service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
nginx_latency_critical_ms = 1000.0
http_timeout_seconds = 10

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
aggregation_interval_seconds = 30
```

### Dashboard Configuration

```toml
[zmq]
subscriber_ports = [6130]

[hosts]
predefined_hosts = ["cmbox", "srv01", "srv02"]

[ssh]
rebuild_user = "cm"
rebuild_alias = "nixos-rebuild-cmtec"
backup_alias = "cm-backup-run"
```

## Technical Implementation

### Collectors

#### Systemd Collector

- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
- **Status Calculation**: Checks user-stopped flag before assigning Warning status
- **Memory Tracking**: Per-service memory usage via `systemctl show`
- **Sub-services**: Nginx site latency, Docker containers
- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`

#### User-Stopped Service Tracker

- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
- **Thread Safety**: Global singleton shared through an `Arc`
- **Persistence**: Automatic save on state changes
- **Global Access**: Static methods for collector integration

#### Other Collectors

- **CPU**: Load average, temperature, frequency monitoring
- **Memory**: RAM/swap usage, tmpfs monitoring
- **Disk**: Filesystem usage, SMART health data
- **NixOS**: Build version, active users, agent version
- **Backup**: Borgbackup repository status and metrics

### ZMQ Protocol

```rust
use serde::{Deserialize, Serialize};

// Metric Message
#[derive(Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

// Service Commands
pub enum AgentCommand {
    ServiceControl {
        service_name: String,
        action: ServiceAction,
    },
    SystemRebuild { /* SSH config */ },
    CollectNow,
}

pub enum ServiceAction {
    Start,     // System-initiated
    Stop,      // System-initiated
    UserStart, // User via dashboard (clears user-stopped)
    UserStop,  // User via dashboard (marks user-stopped)
    Status,
}
```

### Maintenance Mode

Suppress notifications during planned maintenance:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Perform maintenance
systemctl stop service
# ... work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Email Notifications

### Intelligent Batching

- **Real-time dashboard**: Immediate status updates
- **Batched emails**: Aggregated every 30 seconds
- **Smart grouping**: Services organized by severity
- **Recovery suppression**: Reduces notification spam

### Example Alert

```
Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (1):
  postgresql: Ok → Critical (memory usage 95%)

🟡 WARNINGS (2):
  nginx: Ok → Warning (high load 8.5)
  redis: user-stopped → Warning (restarted by system)

✅ RECOVERIES (0):

--
CM Dashboard Agent v0.1.43
```

## Development

### Project Structure

```
cm-dashboard/
├── agent/                       # Metrics collection agent
│   ├── src/
│   │   ├── collectors/          # CPU, memory, disk, systemd, backup, nixos
│   │   ├── service_tracker.rs   # User-stopped service tracking
│   │   ├── status/              # Status aggregation and notifications
│   │   ├── config/              # TOML configuration loading
│   │   └── communication/       # ZMQ message handling
├── dashboard/                   # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/          # CPU, memory, services, backup, system
│   │   ├── communication/       # ZMQ consumption and commands
│   │   └── app.rs               # Main application loop
├── shared/                      # Shared types and utilities
│   └── src/
│       ├── metrics.rs           # Metric, Status, StatusTracker types
│       ├── protocol.rs          # ZMQ message format
│       └── cache.rs             # Cache configuration
└── CLAUDE.md                    # Development guidelines and rules
```

### Testing

```bash
# Build and test
nix-shell -p openssl pkg-config --run "cargo build --workspace"
nix-shell -p openssl pkg-config --run "cargo test --workspace"

# Code quality
cargo fmt --all
cargo clippy --workspace -- -D warnings
```

## Deployment

### Automated Binary Releases

```bash
# Create new release
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

Pushing the tag triggers an automated pipeline:

- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
- GitHub-style release creation
- Tarball upload to Gitea

### NixOS Integration

Update `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.43";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-HASH";
};
```

Get the hash by building with a placeholder hash and reading the `got:` line from the mismatch error:

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "URL_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

## Monitoring Intervals

- **Metrics Collection**: 2 seconds (CPU, memory, services)
- **Metric Transmission**: 2 seconds (ZMQ publish)
- **Dashboard Updates**: 1 second (UI refresh)
- **Email Notifications**: 30 seconds (batched)
- **Disk Monitoring**: 300 seconds (5 minutes)
- **Service Discovery**: 300 seconds (5 minutes cache)

## License

MIT License - see LICENSE file for details.
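
## Example: User-Stopped Status Rule

The status rule described under Service Control Flow (a user-stopped inactive service reports Status::OK, any other inactive service reports Warning) can be sketched roughly as below. This is an illustration only, not the agent's actual code: `ActiveState`, `UserStoppedTracker`, `apply`, and `service_status` are hypothetical names, the tracker is in-memory with no JSON persistence, and mapping a failed unit to `Critical` is an assumption based on the red indicator in the interface.

```rust
use std::collections::HashSet;

/// Status levels named in this README (Ok, Warning, Critical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical simplification of a systemd unit's active state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActiveState {
    Active,
    Inactive,
    Failed,
}

/// Mirrors the `ServiceAction` variants from the ZMQ Protocol section.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceAction {
    Start,
    Stop,
    UserStart,
    UserStop,
    Status,
}

/// Hypothetical in-memory stand-in for the user-stopped tracker.
#[derive(Debug, Default)]
pub struct UserStoppedTracker {
    stopped: HashSet<String>,
}

impl UserStoppedTracker {
    /// `UserStop` sets the flag, `UserStart` clears it; system-initiated
    /// actions leave the flag untouched.
    pub fn apply(&mut self, service: &str, action: ServiceAction) {
        match action {
            ServiceAction::UserStop => {
                self.stopped.insert(service.to_string());
            }
            ServiceAction::UserStart => {
                self.stopped.remove(service);
            }
            ServiceAction::Start | ServiceAction::Stop | ServiceAction::Status => {}
        }
    }

    pub fn is_user_stopped(&self, service: &str) -> bool {
        self.stopped.contains(service)
    }
}

/// Status rule: an inactive service is Ok only when the user stopped it.
/// Treating `Failed` as Critical is an assumption, not confirmed by the README.
pub fn service_status(state: ActiveState, user_stopped: bool) -> Status {
    match (state, user_stopped) {
        (ActiveState::Active, _) => Status::Ok,
        (ActiveState::Inactive, true) => Status::Ok,
        (ActiveState::Inactive, false) => Status::Warning,
        (ActiveState::Failed, _) => Status::Critical,
    }
}
```

Keeping the rule in a pure function that takes the flag as an argument makes it easy to unit-test separately from systemd queries and from whatever storage backs the tracker.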