# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance, Rust-based TUI dashboard for monitoring CMTEC infrastructure, built on ZMQ-based metric collection and a structured JSON data architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, storage, and service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Flags are cleared automatically when services are restarted via the dashboard

### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
    { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
    { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
    { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```

### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. This eliminates all string-parsing bugs and provides type-safe data access.
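In practice, type-safe access means the dashboard deserializes the agent payload into plain Rust structs instead of parsing metric-name strings. A minimal sketch, assuming `serde`/`serde_json`; the struct and field names below are an illustrative subset, not the actual codebase types (the full payload appears in the example further down):

```rust
use serde::Deserialize;

// Illustrative subset of the agent payload; unknown fields are ignored
// by serde's defaults, so a partial mirror of the JSON still deserializes.
#[derive(Debug, Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    timestamp: u64,
    system: SystemData,
}

#[derive(Debug, Deserialize)]
struct SystemData {
    cpu: CpuData,
    memory: MemoryData,
}

#[derive(Debug, Deserialize)]
struct CpuData {
    load_1min: f64,
    load_5min: f64,
    load_15min: f64,
    frequency_mhz: u64,
    temperature_celsius: f64,
}

#[derive(Debug, Deserialize)]
struct MemoryData {
    usage_percent: f64,
    total_gb: f64,
    used_gb: f64,
}

// One line of typed access replaces name parsing such as
// splitting "disk_nvme0n1_temperature" on underscores.
fn parse_payload(json: &str) -> Result<AgentData, serde_json::Error> {
    serde_json::from_str(json)
}
```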
**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        { "mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0 }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            { "mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2 }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated

### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - the agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.
### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes

## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detected from `fuse.mergerfs` entries in `/proc/mounts`
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true

# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```

### Display Format

```
CPU: ● Load: 0.23 0.21 0.13
     └─ Freq: 1048 MHz
RAM: ● Usage: 25% 5.8GB/23.3GB
     ├─ ● /tmp: 2% 0.5GB/2GB
     └─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│  ├─ ● sdb T: 24°C W: 5%
│  └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
```

## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## Completed Architecture Migration (v0.1.131)

## Complete Fix Plan (v0.1.140)

**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**

### Current Broken State (v0.1.139)

**❌ What's Broken:**

```
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
```

**Root Cause:** During the atomic migration, core monitoring functionality was removed and only data collection was fixed, making the dashboard useless as a monitoring tool.

### Complete Fix Plan - Do Everything Right

#### Phase 1: Fix Storage Display (CURRENT)

- ✅ Use `lsblk` instead of `findmnt` (eliminates the `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`)
- 🔄 Test that the dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly

#### Phase 2: Restore Status Evaluation System

- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical
- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- **Service Status**: Evaluate service states → Status::Warning if inactive
- **Overall Host Status**: Aggregate component statuses → host-level status

#### Phase 3: Restore Notification System

- **Status Change Detection**: Track when a component status changes from OK→Warning/Critical
- **Email Notifications**: Send alerts when status degrades
- **Notification Rate Limiting**: Prevent spam (existing logic)
- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
- **Batched Notifications**: Group multiple alerts into a single email

#### Phase 4: Integration & Testing

- **AgentData Status Fields**: Add status fields to structured data
- **Dashboard Status Display**: Show colored indicators based on actual status
- **End-to-End Testing**: Verify alerts fire when thresholds are exceeded
- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states

### Target Architecture (CORRECT)

**Complete Flow:**

```
Collectors → AgentData → StatusEvaluator → Notifications
                 ↘              ↗
                  ZMQ → Dashboard → Status Display
```
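A rough sketch of the agent-side half of this flow (collect → evaluate → notify → publish). The type and function names here are assumptions for illustration, not the actual codebase API, and the threshold values are placeholders:

```rust
use std::path::Path;

// Hypothetical names for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

struct Thresholds {
    warning: f64,
    critical: f64,
}

impl Thresholds {
    // Map a raw collector value onto a Status, e.g. load_1min against
    // { warning: 4.0, critical: 8.0 }.
    fn evaluate(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

// One agent cycle: the collector value feeds the evaluator, status
// transitions drive notifications, and the structured payload (with its
// status fields) is then serialized and published over ZMQ.
fn run_cycle(previous: Status, load_1min: f64) -> Status {
    let cpu_thresholds = Thresholds { warning: 4.0, critical: 8.0 };
    let current = cpu_thresholds.evaluate(load_1min);

    // Notify only on OK -> Warning/Critical transitions, and only when
    // the maintenance flag file is absent.
    let maintenance = Path::new("/tmp/cm-maintenance").exists();
    if previous == Status::Ok && current != Status::Ok && !maintenance {
        // An email alert would be sent here (hypothetical call).
    }

    current
}
```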
**Key Components:**

1. **Collectors**: Populate AgentData with raw metrics
2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values
3. **Notifications**: Send emails on status changes (OK→Warning/Critical)
4. **Dashboard**: Display data with correct status colors/indicators

### Implementation Rules

**MUST COMPLETE ALL:**

- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)

**NO SHORTCUTS:**

- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working

**Success Criteria:**

- Dashboard shows the `● nvme0n1 T: 28°C W: 1%` format
- High CPU load triggers Warning status and an email alert
- High memory usage triggers Warning status and an email alert
- High disk temperature triggers Warning status and an email alert
- Failed services trigger Warning status and an email alert
- Maintenance mode suppresses notifications as expected

## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status

A minimal code sketch of these rules appears at the end of this section.

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices
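To make rules 1-3 above concrete, here is a minimal sketch of agent-authoritative status plus widget-side aggregation. The metric names, types, and widget API shown are illustrative assumptions, not the actual codebase:

```rust
// Illustrative only; the real widget and metric types differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

// Rule: widgets declare the metrics they subscribe to once, in a
// const array, rather than hardcoding names throughout the code.
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_frequency_mhz"];

struct Metric {
    name: String,
    status: Status, // calculated by the agent (rule 1), never by the widget
}

// Rule 3: a widget's status is simply the worst status among the
// metrics it subscribes to; no thresholds are evaluated here.
fn widget_status(metrics: &[Metric]) -> Status {
    CPU_WIDGET_METRICS
        .iter()
        .filter_map(|name| metrics.iter().find(|m| m.name == *name))
        .map(|m| m.status)
        .max()
        .unwrap_or(Status::Ok)
}
```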