# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure, built on ZMQ-based metric collection and an individual-metrics architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, storage, and service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Flags are cleared automatically when services are restarted via the dashboard

### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config (a deserialization sketch follows the block):

```toml
[service_logs]
hostname1 = [
    { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
    { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
    { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
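For reference, a minimal sketch of how this table could be deserialized on the dashboard side, assuming the `serde` and `toml` crates and hypothetical struct names (`DashboardConfig`, `ServiceLogEntry`); the real config types may differ:

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// One custom log mapping for a service (field names mirror the TOML keys above).
#[derive(Debug, Deserialize)]
struct ServiceLogEntry {
    service_name: String,
    log_file_path: String,
}

/// Hypothetical top-level config: hostname -> list of custom log entries.
#[derive(Debug, Deserialize)]
struct DashboardConfig {
    #[serde(default)]
    service_logs: HashMap<String, Vec<ServiceLogEntry>>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("dashboard.toml")?;
    let config: DashboardConfig = toml::from_str(&raw)?;

    // Look up the custom log path for the service selected in the TUI.
    if let Some(entries) = config.service_logs.get("hostname1") {
        for entry in entries {
            println!("{} -> tail -f {}", entry.service_name, entry.log_file_path);
        }
    }
    Ok(())
}
```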
### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        { "mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0 }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            { "mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2 }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
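For illustration, a trimmed sketch of how this payload might be modeled with `serde` on the dashboard side. Field names mirror the JSON keys above, but the struct names are hypothetical and only a subset of the schema is shown:

```rust
use serde::Deserialize;

// Only a slice of the full schema is modeled here; serde ignores unknown JSON fields.
#[derive(Debug, Deserialize)]
struct AgentPayload {
    hostname: String,
    agent_version: String,
    timestamp: u64,
    system: SystemData,
}

#[derive(Debug, Deserialize)]
struct SystemData {
    cpu: CpuData,
    memory: MemoryData,
}

#[derive(Debug, Deserialize)]
struct CpuData {
    load_1min: f64,
    load_5min: f64,
    load_15min: f64,
    frequency_mhz: u32,
    temperature_celsius: f64,
}

#[derive(Debug, Deserialize)]
struct MemoryData {
    usage_percent: f64,
    total_gb: f64,
    used_gb: f64,
}

/// Parse one ZMQ message body into typed data.
fn parse(json: &str) -> Result<AgentPayload, serde_json::Error> {
    serde_json::from_str(json)
}
```

With structs like these, access stays type-safe (`payload.system.cpu.load_1min`) and no metric-name string parsing is needed.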
### Cached Collector Architecture (✅ IMPLEMENTED)

**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.

**Previous (Sequential Blocking):**

```
Every 1 second:
└─ collect_all_data()  [BLOCKS for 2-10+ seconds]
   ├─ CPU (fast: 10ms)
   ├─ Memory (fast: 20ms)
   ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
   ├─ Service disk usage (slow: 2-8s per service)
   └─ Docker (medium: 500ms)
└─ send_via_zmq()  [Only after ALL collection completes]

Result: If any collector takes >10s → "host offline" false alert
```

**New (Cached Independent Collectors):**

```
Shared Cache: Arc<RwLock<...>>

Background Collectors (independent async tasks):
├─ Fast collectors (CPU, RAM, Network)
│  └─ Update cache every 1 second
├─ Medium collectors (Services, Docker)
│  └─ Update cache every 5 seconds
└─ Slow collectors (Disk usage, SMART data)
   └─ Update cache every 60 seconds

ZMQ Sender (separate async task):
Every 1 second:
└─ Read current cache
└─ Send via ZMQ  [Always instant, never blocked]
```

**Benefits:**

- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking

**Implementation Details:**

- **Shared cache**: `Arc<RwLock<...>>` initialized at agent startup
- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
  - Recommended: Fast (1-10s): CPU, Memory, Network
  - Recommended: Medium (30-60s): Backup, NixOS
  - Recommended: Slow (60-300s): Disk, Systemd
- **Independent tasks**: Each collector spawned as a separate tokio task in `Agent::new()`
- **Cache updates**: Collectors acquire write lock → update → release immediately
- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
- **Notification check**: Runs every `notifications.check_interval_seconds`
- **Lock strategy**: Short-lived write locks prevent blocking; read locks for transmission
- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)

**Configuration (NixOS):**

All intervals and timeouts are configurable in `services/cm-dashboard.nix`:

Collection intervals:

- `collectors.cpu.interval_seconds` (default: 10s)
- `collectors.memory.interval_seconds` (default: 2s)
- `collectors.disk.interval_seconds` (default: 300s)
- `collectors.systemd.interval_seconds` (default: 10s)
- `collectors.backup.interval_seconds` (default: 60s)
- `collectors.network.interval_seconds` (default: 10s)
- `collectors.nixos.interval_seconds` (default: 60s)
- `notifications.check_interval_seconds` (default: 30s)
- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)

Command timeouts (prevent resource leaks from hung commands):

- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr

**Code Locations:**

- agent/src/agent.rs:59-133 - Collector task spawning
- agent/src/agent.rs:151-179 - Independent collector task runner
- agent/src/agent.rs:199-207 - ZMQ sender in main loop
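A condensed sketch of the pattern: a shared `Arc<RwLock<...>>` cache, independently spawned tokio tasks with different intervals, and a sender loop that only ever reads the cache. The type names and the `send_via_zmq` stub are hypothetical; the real agent wires this up in `Agent::new()` as noted in the code locations above.

```rust
use std::{sync::Arc, time::Duration};

use tokio::{sync::RwLock, time};

/// Hypothetical shared snapshot; the real cache holds the full structured payload.
#[derive(Clone, Default)]
struct MetricsCache {
    cpu_load_1min: f64,
    smart_ok: bool,
}

/// Stub standing in for the real ZMQ broadcast.
fn send_via_zmq(snapshot: &MetricsCache) {
    println!("load={} smart_ok={}", snapshot.cpu_load_1min, snapshot.smart_ok);
}

#[tokio::main]
async fn main() {
    let cache = Arc::new(RwLock::new(MetricsCache::default()));

    // Fast collector: updates every second, holding the write lock only briefly.
    let fast = Arc::clone(&cache);
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_secs(1));
        loop {
            tick.tick().await;
            let load = 0.42; // placeholder for the real CPU read
            fast.write().await.cpu_load_1min = load;
        }
    });

    // Slow collector: may take seconds, but only blocks its own task, never the sender.
    let slow = Arc::clone(&cache);
    tokio::spawn(async move {
        let mut tick = time::interval(Duration::from_secs(60));
        loop {
            tick.tick().await;
            let healthy = true; // placeholder for a smartctl call with its own timeout
            slow.write().await.smart_ok = healthy;
        }
    });

    // ZMQ sender: reads the cache on a fixed cadence and is never blocked by collectors.
    let mut tick = time::interval(Duration::from_secs(2));
    loop {
        tick.tick().await;
        let snapshot = (*cache.read().await).clone();
        send_via_zmq(&snapshot);
    }
}
```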
### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl { url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz"; sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; }' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes

## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools** (a detection sketch follows this list):

- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping
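An illustrative sketch of the fuse.mergerfs detection step, assuming the member branches appear colon-separated in the mount source as in the example above; the struct and function names are hypothetical, and the real collector applies additional heuristics (parity detection, exclusions):

```rust
use std::fs;

/// Hypothetical discovery record for one MergerFS pool.
#[derive(Debug)]
struct MergerfsPool {
    mount_point: String,
    branches: Vec<String>,
}

fn discover_mergerfs_pools() -> std::io::Result<Vec<MergerfsPool>> {
    let mounts = fs::read_to_string("/proc/mounts")?;
    let mut pools = Vec::new();

    for line in mounts.lines() {
        // /proc/mounts fields: source, mount point, fs type, options, dump, pass
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 3 || fields[2] != "fuse.mergerfs" {
            continue;
        }
        // A source like "/mnt/disk1:/mnt/disk2" lists the member branches.
        let branches: Vec<String> = fields[0].split(':').map(|s| s.to_string()).collect();
        pools.push(MergerfsPool {
            mount_point: fields[1].to_string(),
            branches,
        });
    }
    Ok(pools)
}

fn main() -> std::io::Result<()> {
    for pool in discover_mergerfs_pools()? {
        println!("{} <- {:?}", pool.mount_point, pool.branches);
    }
    Ok(())
}
```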
**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true

# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```

### Display Format

```
Network: ● eno1:
         ├─ ip: 192.168.30.105
         └─ tailscale0: 100.125.108.16
         ● eno2:
         └─ ip: 192.168.32.105

CPU:     ● Load: 0.23 0.21 0.13
         └─ Freq: 1048 MHz

RAM:     ● Usage: 25% 5.8GB/23.3GB
         ├─ ● /tmp: 2% 0.5GB/2GB
         └─ ● /var/tmp: 0% 0GB/1.0GB

Storage: ● 844B9A25 T: 25C W: 4%
         ├─ ● /: 55% 250.5GB/456.4GB
         └─ ● /boot: 26% 0.3GB/1.0GB
         ● mergerfs /srv/media:
         ├─ ● 63% 2355.2GB/3686.4GB
         ├─ ● Data_1: WDZQ8H8D T: 28°C
         ├─ ● Data_2: GGA04461 T: 28°C
         └─ ● Parity: WDZS8RY0 T: 29°C

Backup:  ● WD-WCC7K1234567 T: 32°C W: 12%
         ├─ Last: 2h ago (12.3GB)
         ├─ Next: in 22h
         └─ ● Usage: 45% 678GB/1.5TB
```

## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## Implementation Rules

1. **Agent Status Authority**: The agent calculates a status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status (a sketch follows the lists below)

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices
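As a purely hypothetical illustration of rules 2 and 3 (all names invented for the sketch): a widget declares the metrics it subscribes to in a const array and derives its widget status by aggregating agent-supplied statuses, never recomputing them from thresholds.

```rust
/// Agent-calculated status attached to each metric (rule 1).
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, PartialOrd, Debug)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Rule 2: widgets subscribe to metrics via a const array, never ad-hoc strings
/// (the metric names here are placeholders).
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_frequency_mhz", "cpu_temperature"];

/// Rule 3: the widget status is simply the worst status among its subscribed metrics.
fn aggregate_status(statuses: impl IntoIterator<Item = Status>) -> Status {
    statuses
        .into_iter()
        .fold(Status::Ok, |worst, s| if s > worst { s } else { worst })
}

fn main() {
    // The widget would look up CPU_WIDGET_METRICS in the shared payload and collect their statuses.
    let _ = CPU_WIDGET_METRICS;
    let widget_status = aggregate_status([Status::Ok, Status::Warning, Status::Ok]);
    assert!(widget_status == Status::Warning);
}
```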