# CM Dashboard - Infrastructure Monitoring TUI A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for specific monitoring needs and API integrations. Features real-time monitoring of all infrastructure components with intelligent email notifications and automatic status calculation. ``` ┌─────────────────────────────────────────────────────────────────────┐ │ CM Dashboard • cmbox │ ├─────────────────────────────────────────────────────────────────────┤ │ Storage • ok:1 warn:0 crit:0 │ Services • ok:1 warn:0 fail:0 │ │ ┌─────────────────────────────────┐ │ ┌─────────────────────────────── │ │ │ │Drive Temp Wear Spare Hours │ │ │Service memory: 7.1/23899.7 MiB│ │ │ │nvme0n1 28°C 1% 100% 14489 │ │ │Disk usage: — │ │ │ │ Capacity Usage │ │ │ Service Memory Disk │ │ │ │ 954G 77G (8%) │ │ │✔ sshd 7.1 MiB — │ │ │ └─────────────────────────────────┘ │ └─────────────────────────────── │ │ ├─────────────────────────────────────────────────────────────────────┤ │ CPU / Memory • warn │ Backups │ │ System memory: 5251.7/23899.7 MiB │ Host cmbox awaiting backup │ │ │ CPU load (1/5/15): 2.18 2.66 2.56 │ metrics │ │ │ CPU freq: 1100.1 MHz │ │ │ │ CPU temp: 47.0°C │ │ │ ├─────────────────────────────────────────────────────────────────────┤ │ Alerts • ok:0 warn:3 fail:0 │ Status • ZMQ connected │ │ cmbox: warning: CPU load 2.18 │ Monitoring • hosts: 3 │ │ │ srv01: pending: awaiting metrics │ Data source: ZMQ – connected │ │ │ labbox: pending: awaiting metrics │ Active host: cmbox (1/3) │ │ └─────────────────────────────────────────────────────────────────────┘ Keys: [←→] hosts [r]efresh [q]uit ``` ## Key Features ### Real-time Monitoring - **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01 - **Performance-focused** with minimal resource usage - **Keyboard-driven interface** for power users - **ZMQ gossip network** for efficient data distribution ### Infrastructure Monitoring - **NVMe health monitoring** with wear prediction and temperature tracking - **CPU/Memory/GPU telemetry** with automatic thresholding - **Service resource monitoring** with per-service CPU and RAM usage - **Disk usage overview** for root filesystems - **Backup status** with detailed metrics and history - **C-state monitoring** for CPU power management analysis ### Intelligent Alerting - **Agent-calculated status** with predefined thresholds - **Email notifications** via SMTP with rate limiting - **Recovery notifications** with context about original issues - **Stockholm timezone** support for email timestamps - **Unified alert pipeline** summarizing host health ## Architecture ### Agent-Dashboard Separation The system follows a strict separation of concerns: - **Agent**: Single source of truth for all status calculations using defined thresholds - **Dashboard**: Display-only interface that shows agent-provided status - **Data Flow**: Agent (calculations) → Status → Dashboard (display) → Colors ### Agent Thresholds (Production) - **CPU Load**: Warning ≥ 5.0, Critical ≥ 8.0 - **Memory Usage**: Warning ≥ 80%, Critical ≥ 95% - **CPU Temperature**: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled) ### Email Notification System - **From**: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se) - **To**: `cm@cmtec.se` - **SMTP**: localhost:25 (postfix) - **Rate Limiting**: 30 minutes (configurable) - **Triggers**: Status degradation and recovery with detailed context ## Installation ### Requirements - Rust toolchain 1.75+ (install via [`rustup`](https://rustup.rs)) - Root privileges for agent (hardware monitoring access) - Network access for ZMQ communication (default port 6130) - SMTP server for notifications (postfix recommended) ### Build from Source ```bash git clone https://github.com/cmtec/cm-dashboard.git cd cm-dashboard cargo build --release ``` Optimized binaries available at: - Dashboard: `target/release/cm-dashboard` - Agent: `target/release/cm-dashboard-agent` ### Installation ```bash # Install dashboard cargo install --path dashboard # Install agent (requires root for hardware access) sudo cargo install --path agent ``` ## Quick Start ### Dashboard ```bash # Run with default configuration cm-dashboard # Specify host to monitor cm-dashboard --host cmbox # Override ZMQ endpoints cm-dashboard --zmq-endpoint tcp://srv01:6130,tcp://labbox:6130 # Increase logging verbosity cm-dashboard -v ``` ### Agent (Pure Auto-Discovery) The agent requires **no configuration files** and auto-discovers all system components: ```bash # Basic agent startup (auto-detects everything) sudo cm-dashboard-agent # With verbose logging for troubleshooting sudo cm-dashboard-agent -v ``` The agent automatically: - **Discovers storage devices** for SMART monitoring - **Detects running systemd services** for resource tracking - **Configures collection intervals** based on system capabilities - **Sets up email notifications** using hostname@cmtec.se ## Configuration ### Dashboard Configuration The dashboard creates `config/dashboard.toml` on first run: ```toml [hosts] default_host = "srv01" [[hosts.hosts]] name = "srv01" enabled = true [[hosts.hosts]] name = "cmbox" enabled = true [dashboard] tick_rate_ms = 250 history_duration_minutes = 60 [data_source] kind = "zmq" [data_source.zmq] endpoints = ["tcp://127.0.0.1:6130"] ``` ### Agent Configuration (Optional) The agent works without configuration but supports optional settings: ```bash # Generate example configuration cm-dashboard-agent --help # Override specific settings sudo cm-dashboard-agent \ --hostname cmbox \ --bind tcp://*:6130 \ --interval 5000 ``` ## Monitoring Components ### System Collector - **CPU Load**: 1/5/15 minute averages with warning/critical thresholds - **Memory Usage**: Used/total with percentage calculation - **CPU Temperature**: x86_pkg_temp prioritized for accuracy - **C-States**: Power management state distribution (C0-C10) ### Service Collector - **Systemd Services**: Auto-discovery of interesting services - **Resource Usage**: Per-service memory and disk consumption - **Service Health**: Running/stopped status with detailed failure info ### SMART Collector - **NVMe Health**: Temperature, wear leveling, spare blocks - **Drive Capacity**: Total/used space with percentage - **SMART Attributes**: Critical health indicators ### Backup Collector - **Restic Integration**: Backup status and history - **Health Monitoring**: Success/failure tracking - **Storage Metrics**: Backup size and retention ## Keyboard Controls | Key | Action | |-----|--------| | `←` / `h` | Previous host | | `→` / `l` / `Tab` | Next host | | `?` | Toggle help overlay | | `r` | Force refresh | | `q` / `Esc` | Quit | ## Email Notifications ### Notification Triggers - **Status Degradation**: Any status change to warning/critical - **Recovery**: Warning/critical status returning to ok - **Service Failures**: Individual service stop/start events ### Example Recovery Email ``` ✅ RESOLVED: system cpu on cmbox Status Change Alert Host: cmbox Component: system Metric: cpu Status Change: warning → ok Time: 2025-10-12 22:15:30 CET Details: Recovered from: CPU load (1/5/15min): 6.20 / 5.80 / 4.50 Current status: CPU load (1/5/15min): 3.30 / 3.17 / 2.84 -- CM Dashboard Agent Generated at 2025-10-12 22:15:30 CET ``` ### Rate Limiting - **Default**: 30 minutes between notifications per component - **Testing**: Set to 0 for immediate notifications - **Configurable**: Adjustable per deployment needs ## Development ### Project Structure ``` cm-dashboard/ ├── agent/ # Monitoring agent │ ├── src/ │ │ ├── collectors/ # Data collection modules │ │ ├── notifications.rs # Email notification system │ │ └── simple_agent.rs # Main agent logic ├── dashboard/ # TUI dashboard │ ├── src/ │ │ ├── ui/ # Widget implementations │ │ ├── data/ # Data structures │ │ └── app.rs # Application state ├── shared/ # Common data structures └── config/ # Configuration files ``` ### Development Commands ```bash # Format code cargo fmt # Check all packages cargo check # Run tests cargo test # Build release cargo build --release # Run with logging RUST_LOG=debug cargo run -p cm-dashboard-agent ``` ### Architecture Principles #### Status Calculation Rules - **Agent calculates all status** using predefined thresholds - **Dashboard never calculates status** - only displays agent data - **No hardcoded thresholds in dashboard** widgets - **Use "unknown" when agent status missing** (never default to "ok") #### Data Flow ``` System Metrics → Agent Collectors → Status Calculation → ZMQ → Dashboard → Display ↓ Email Notifications ``` #### Pure Auto-Discovery - **No config files required** for basic operation - **Runtime discovery** of system capabilities - **Service auto-detection** via systemd patterns - **Storage device enumeration** via /sys filesystem ## Troubleshooting ### Common Issues #### Agent Won't Start ```bash # Check permissions (agent requires root) sudo cm-dashboard-agent -v # Verify ZMQ binding sudo netstat -tulpn | grep 6130 # Check system access sudo smartctl --scan ``` #### Dashboard Connection Issues ```bash # Test ZMQ connectivity cm-dashboard --zmq-endpoint tcp://target-host:6130 -v # Check network connectivity telnet target-host 6130 ``` #### Email Notifications Not Working ```bash # Check postfix status sudo systemctl status postfix # Test SMTP manually telnet localhost 25 # Verify notification settings sudo cm-dashboard-agent -v | grep notification ``` ### Logging Set `RUST_LOG=debug` for detailed logging: ```bash RUST_LOG=debug sudo cm-dashboard-agent RUST_LOG=debug cm-dashboard ``` ## License MIT License - see LICENSE file for details. ## Contributing 1. Fork the repository 2. Create feature branch (`git checkout -b feature/amazing-feature`) 3. Commit changes (`git commit -m 'Add amazing feature'`) 4. Push to branch (`git push origin feature/amazing-feature`) 5. Open Pull Request For bugs and feature requests, please use GitHub Issues.