# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and an individual-metrics architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, storage, and service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Flags are cleared automatically when services are restarted via the dashboard

### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
    { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
    { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
    { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```

### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected
service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string-parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        { "mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0 }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            { "mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2 }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe
data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string-parsing bugs eliminated

### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - the agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.
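The maintenance-mode gate described earlier amounts to a file-existence check in the agent. A minimal sketch (the function name is illustrative, not taken from the actual codebase):

```rust
use std::path::Path;

/// Illustrative helper (name assumed): the agent suppresses email
/// notifications while the maintenance flag file exists, but keeps
/// collecting metrics and updating the dashboard as usual.
fn notifications_suppressed(flag_path: &str) -> bool {
    Path::new(flag_path).exists()
}

fn main() {
    if notifications_suppressed("/tmp/cm-maintenance") {
        println!("maintenance mode: notifications suppressed");
    } else {
        println!("normal operation: notifications enabled");
    }
}
```

Because only the notification path is gated, the dashboard keeps showing real status throughout maintenance.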
### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes

## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store discovered storage topology for continuous monitoring
5.
Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true

# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```

### Display Format

```
CPU: ● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz
RAM: ● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│  ├─ ● sdb T: 24°C W: 5%
│  └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
Backup: ● WD-WCC7K1234567 T: 32°C W: 12%
├─ Last: 2h ago (12.3GB)
├─ Next: in 22h
└─ ● Usage: 45% 678GB/1.5TB
```

## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
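To make the MergerFS auto-detection described above concrete, here is a minimal sketch of parsing one `/proc/mounts` entry: the filesystem type must be `fuse.mergerfs`, and the source field is a colon-separated list of member branch paths. Type and function names are illustrative, not from the actual codebase:

```rust
/// Illustrative pool summary extracted from one /proc/mounts line.
#[derive(Debug, PartialEq)]
struct MergerfsPool {
    mount: String,
    members: Vec<String>,
}

/// Detect a MergerFS pool from a /proc/mounts line such as:
/// "/mnt/disk1:/mnt/disk2 /srv/media fuse.mergerfs rw,relatime 0 0"
fn parse_mergerfs(line: &str) -> Option<MergerfsPool> {
    let mut fields = line.split_whitespace();
    let source = fields.next()?;   // colon-separated branch paths
    let mount = fields.next()?;    // pool mount point
    let fs_type = fields.next()?;  // filesystem type
    if fs_type != "fuse.mergerfs" {
        return None;
    }
    Some(MergerfsPool {
        mount: mount.to_string(),
        members: source.split(':').map(str::to_string).collect(),
    })
}

fn main() {
    let line = "/mnt/disk1:/mnt/disk2 /srv/media fuse.mergerfs rw,relatime 0 0";
    let pool = parse_mergerfs(line).expect("expected a mergerfs mount");
    println!("{} has {} member disks", pool.mount, pool.members.len());
}
```

Parity-disk detection would then apply the heuristics listed above (sequential device names, "parity" in the path) to the `members` list.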
## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## Completed Architecture Migration (v0.1.131)

## ✅ COMPLETE MONITORING SYSTEM RESTORATION (v0.1.141)

**🎉 SUCCESS: All Issues Fixed - Complete Functional Monitoring System**

### ✅ Completed Implementation (v0.1.141)

**All Major Issues Resolved:**

```
✅ Data Collection: Agent collects structured data correctly
✅ Storage Display: Correct format with correct mount points and temperature/wear
✅ Status Evaluation: All metrics properly evaluated against thresholds
✅ Notifications: Working email alerts on status changes
✅ Thresholds: All collectors using configured thresholds for status calculation
✅ Build Information: NixOS version displayed correctly
✅ Mount Point Consistency: Stable, sorted display order
```

### ✅ All Phases Completed Successfully

#### ✅ Phase 1: Storage Display - COMPLETED

- ✅ Use `lsblk` instead of `findmnt` (eliminated `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions (SMART data collection working)
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:` fields)
- ✅ Consistent filesystem/tmpfs sorting (no more random order swapping)
- ✅ **VERIFIED**: Dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly

#### ✅ Phase 2: Status Evaluation System - COMPLETED

- ✅ **CPU Status**: Load averages and temperature evaluated against `HysteresisThresholds`
- ✅ **Memory Status**:
Usage percentage evaluated against thresholds
- ✅ **Storage Status**: Drive temperature, health, and filesystem usage evaluated
- ✅ **Service Status**: Service states properly tracked and evaluated
- ✅ **Status Fields**: All AgentData structures include status information
- ✅ **Threshold Integration**: All collectors use their configured thresholds

#### ✅ Phase 3: Notification System - COMPLETED

- ✅ **Status Change Detection**: Agent tracks status between collection cycles
- ✅ **Email Notifications**: Alerts sent on degradation (OK→Warning/Critical, Warning→Critical)
- ✅ **Notification Content**: Detailed alerts with metric values and timestamps
- ✅ **NotificationManager Integration**: Fully restored and operational
- ✅ **Maintenance Mode**: `/tmp/cm-maintenance` file support maintained

#### ✅ Phase 4: Integration & Testing - COMPLETED

- ✅ **AgentData Status Fields**: All structured data includes status evaluation
- ✅ **Status Processing**: Agent applies thresholds at collection time
- ✅ **End-to-End Flow**: Collection → Evaluation → Notification → Display
- ✅ **Dynamic Versioning**: Agent version from `CARGO_PKG_VERSION`
- ✅ **Build Information**: NixOS generation display restored

### ✅ Final Architecture - WORKING

**Complete Operational Flow:**

```
Collectors → AgentData (with Status) → NotificationManager → Email Alerts
                   ↘              ↗
                    ZMQ → Dashboard → Display
```

**Operational Components:**

1. ✅ **Collectors**: Populate AgentData with metrics AND status evaluation
2. ✅ **Status Evaluation**: `HysteresisThresholds.evaluate()` applied per collector
3. ✅ **Notifications**: Email alerts on status change detection
4.
✅ **Display**: Correct mount points, temperature, wear, and build information

### ✅ Success Criteria - ALL MET

**Display Requirements:**

- ✅ Dashboard shows the `● nvme0n1 T: 28°C W: 1%` format correctly
- ✅ Mount points show `/` and `/boot` (not `root`/`boot`)
- ✅ Build information shows the actual NixOS version (not "unknown")
- ✅ Consistent sorting eliminates random order changes

**Monitoring Requirements:**

- ✅ High CPU load triggers Warning/Critical status and an email alert
- ✅ High memory usage triggers Warning/Critical status and an email alert
- ✅ High disk temperature triggers Warning/Critical status and an email alert
- ✅ Failed services trigger Warning/Critical status and an email alert
- ✅ Maintenance mode suppresses notifications as expected

### 🚀 Production Ready

**CM Dashboard v0.1.141 is a complete, functional infrastructure monitoring system:**

- **Real-time Monitoring**: All system components at 1-second intervals
- **Intelligent Alerting**: Email notifications on threshold violations
- **Accurate Display**: Correct mount points, temperatures, and system information
- **Status-Aware**: All metrics evaluated against configurable thresholds

**The monitoring system is fully operational and ready for production use.**

## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3.
**Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices
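The status-aggregation rule above can be sketched as taking the worst status across the metrics a widget subscribes to. A minimal illustration (the enum and function are assumptions for this sketch, not the actual codebase):

```rust
/// Illustrative status ordering: later variants are more severe,
/// so deriving Ord makes "worst status" a simple max().
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// A widget's status is the worst status among its subscribed metrics;
/// an empty metric set defaults to Ok.
fn widget_status(metric_statuses: &[Status]) -> Status {
    metric_statuses.iter().copied().max().unwrap_or(Status::Ok)
}

fn main() {
    let statuses = [Status::Ok, Status::Warning, Status::Ok];
    println!("{:?}", widget_status(&statuses)); // prints "Warning"
}
```

Keeping this as a pure function over agent-supplied statuses respects the rules above: the agent remains the status authority, and widgets only compose what they receive.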