# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

## Project Goals

### Core Objectives
- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)

### Key Features
- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis

## Technical Architecture

### Technology Stack
- **Language**: Rust πŸ¦€
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono

### Dependencies

```toml
[dependencies]
ratatui = "0.24"                                      # Modern TUI framework
crossterm = "0.27"                                    # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }      # Async runtime
reqwest = { version = "0.11", features = ["json"] }   # HTTP client
serde = { version = "1.0", features = ["derive"] }    # JSON parsing
clap = { version = "4.0", features = ["derive"] }     # CLI args
anyhow = "1.0"                                        # Error handling
chrono = "0.4"                                        # Time handling
```

## Project Structure

```
cm-dashboard/
β”œβ”€β”€ Cargo.toml
β”œβ”€β”€ README.md
β”œβ”€β”€ CLAUDE.md                # This file
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ main.rs              # Entry point & CLI
β”‚   β”œβ”€β”€ app.rs               # Main application state
β”‚   β”œβ”€β”€ ui/
β”‚   β”‚   β”œβ”€β”€ mod.rs
β”‚   β”‚   β”œβ”€β”€ dashboard.rs     # Main dashboard layout
β”‚   β”‚   β”œβ”€β”€ nvme.rs          # NVMe health widget
β”‚   β”‚   β”œβ”€β”€ services.rs      # Services status widget
β”‚   β”‚   β”œβ”€β”€ memory.rs        # RAM optimization widget
β”‚   β”‚   β”œβ”€β”€ backup.rs        # Backup status widget
β”‚   β”‚   └── alerts.rs        # Alerts/notifications widget
β”‚   β”œβ”€β”€ api/
β”‚   β”‚   β”œβ”€β”€ mod.rs
β”‚   β”‚   β”œβ”€β”€ client.rs        # HTTP client wrapper
β”‚   β”‚   β”œβ”€β”€ smart.rs         # Smart metrics API (port 6127)
β”‚   β”‚   β”œβ”€β”€ service.rs       # Service metrics API (port 6128)
β”‚   β”‚   └── backup.rs        # Backup metrics API (port 6129)
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ mod.rs
β”‚   β”‚   β”œβ”€β”€ metrics.rs       # Data structures
β”‚   β”‚   β”œβ”€β”€ history.rs       # Historical data storage
β”‚   β”‚   └── config.rs        # Host configuration
β”‚   └── config.rs            # Application configuration
β”œβ”€β”€ config/
β”‚   β”œβ”€β”€ hosts.toml           # Host definitions
β”‚   └── dashboard.toml       # Dashboard layout config
└── docs/
    β”œβ”€β”€ API.md               # API integration documentation
    └── WIDGETS.md           # Widget development guide
```

## API Integration

### Existing CMTEC APIs

1. **Smart Metrics API** (port 6127)
   - NVMe health status (wear, temperature, power-on hours)
   - Disk space information
   - SMART health indicators
2. **Service Metrics API** (port 6128)
   - Service status and resource usage
   - Service memory consumption vs limits
   - Host CPU load / frequency / temperature
   - Root disk utilisation snapshot
   - GPU utilisation and temperature (if available)
3.
   **Backup Metrics API** (port 6129)
   - Backup status and history
   - Repository statistics
   - Service integration status

### Data Structures

```rust
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```

## Dashboard Layout Design

### Main Dashboard View

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ πŸ“Š CMTEC Infrastructure Dashboard                              srv01 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ πŸ’Ύ NVMe Health               β”‚ 🐏 RAM Optimization                   β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Wear: 4% (β–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘) β”‚ β”‚ β”‚ Physical: 2.4G/7.6G (32%)         β”‚ β”‚
β”‚ β”‚ Temp: 56Β°C              β”‚ β”‚ β”‚ zram: 64B/1.9G (64:1 compression) β”‚ β”‚
β”‚ β”‚ Hours: 11419h (475d)    β”‚ β”‚ β”‚ tmpfs: /var/log 88K/512M          β”‚ β”‚
β”‚ β”‚ Status: βœ… PASSED        β”‚ β”‚ β”‚ Kernel: vm.dirty_ratio=5          β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ πŸ”§ Services Status                                                   β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ βœ… Gitea (256M/4G, 15G/100G)        βœ… smart-metrics-api          β”‚ β”‚
β”‚ β”‚ βœ… Immich (1.2G/4G, 45G/500G)       βœ… service-metrics-api        β”‚ β”‚
β”‚ β”‚ βœ… Vaultwarden (45M/1G, 512M/1G)    βœ… backup-metrics-api         β”‚ β”‚
β”‚ β”‚ βœ… UniFi (234M/2G, 1.2G/5G)         βœ… WordPress M2               β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ πŸ“§ Recent Alerts                    β”‚ πŸ’Ύ Backup Status               β”‚
β”‚ 10:15 NVMe wear OK β†’ 4%            β”‚ Last: βœ… Success (04:00)        β”‚
β”‚ 04:00 Backup completed successfully β”‚ Duration: 45m 32s             β”‚
β”‚ Yesterday: Email notification test  β”‚ Size: 15.2GB β†’ 4.1GB          β”‚
β”‚                                     β”‚ Next: Tomorrow 04:00          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit
```

### Multi-Host View

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ πŸ–₯️ CMTEC Host Overview                                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Host     β”‚ NVMe Wear β”‚ RAM Usage β”‚ Services β”‚ Last Alert             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ srv01    β”‚ 4% βœ…      β”‚ 32% βœ…     β”‚ 8/8 βœ…    β”‚ 04:00 Backup OK        β”‚
β”‚ cmbox    β”‚ 12% βœ…     β”‚ 45% βœ…     β”‚ 3/3 βœ…    β”‚ Yesterday Email test   β”‚
β”‚ labbox   β”‚ 8% βœ…      β”‚ 28% βœ…     β”‚ 2/2 βœ…    β”‚ 2h ago NVMe temp OK    β”‚
β”‚ simonbox β”‚ 15% βœ…     β”‚ 67% ⚠️     β”‚ 4/4 βœ…    β”‚ Gaming session active  β”‚
β”‚ steambox β”‚ 23% βœ…     β”‚ 78% ⚠️     β”‚ 2/2 βœ…    β”‚ High RAM usage         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```

## Development Phases

### Phase 1: Foundation (Week 1-2)
- [x] Project setup with Cargo.toml
- [ ] Basic TUI framework with ratatui
- [ ] HTTP client for API connections
- [ ] Data structures for metrics
- [ ] Simple single-host dashboard

**Deliverables:**
- Working TUI that connects to srv01
- Real-time display of basic metrics
- Keyboard navigation

### Phase 2: Core Features (Week 3-4)
- [ ] All widget implementations
- [ ] Multi-host configuration
- [ ] Historical data storage
- [ ] Alert system integration
- [ ] Configuration management

**Deliverables:**
- Full-featured dashboard
- Multi-host monitoring
- Historical trending
- Configuration file support

### Phase 3: Advanced Features (Week 5-6)
- [ ] Predictive analytics
- [ ] Custom alert rules
- [ ] Export capabilities
- [ ] Performance optimizations
- [ ] Error handling & resilience

**Deliverables:**
- Production-ready dashboard
- Advanced monitoring features
- Comprehensive error handling
- Performance benchmarks

### Phase 4: Polish & Documentation (Week 7-8)
- [ ] Code documentation
- [ ] User documentation
- [ ] Installation scripts
- [ ] Testing suite
- [ ] Release preparation

**Deliverables:**
- Complete documentation
- Installation packages
- Test coverage
- Release v1.0

## Configuration

### Host Configuration (config/hosts.toml)

```toml
[hosts]

[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"

[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"

[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
```

### Dashboard Configuration (config/dashboard.toml)

```toml
[dashboard]
refresh_interval = 5    # seconds
history_retention = 7   # days
theme = "dark"

[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90

[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
```

## Key Features

### Real-time Monitoring
- **Auto-refresh** at configurable intervals (1-60 seconds)
- **Async data fetching** from multiple hosts simultaneously
- **Connection status** indicators for each host
- **Graceful degradation**
when hosts are unreachable

### Historical Tracking
- **SQLite database** for local storage
- **Trend analysis** for wear levels and resource usage
- **Retention policies** configurable per metric type
- **Export capabilities** (CSV, JSON)

### Alert System
- **Threshold-based alerts** for all metrics
- **Email integration** with existing notification system
- **Alert acknowledgment** and history
- **Custom alert rules** with logical operators

### Multi-Host Management
- **Auto-discovery** of hosts on network
- **Host grouping** by role (server, workstation, lab)
- **Bulk operations** across multiple hosts
- **Host-specific configurations**

## Performance Requirements

### Resource Usage
- **Memory**: < 50MB runtime footprint
- **CPU**: < 1% average CPU usage
- **Network**: Minimal bandwidth (< 1KB/s per host)
- **Startup**: < 2 seconds cold start

### Responsiveness
- **UI updates**: 60 FPS smooth rendering
- **Data refresh**: < 500ms API response handling
- **Navigation**: Instant keyboard response
- **Error recovery**: < 5 seconds reconnection

## Security Considerations

### Network Security
- **Local network only** - no external connections
- **Authentication** for API access if implemented
- **Encrypted storage** for sensitive configuration
- **Audit logging** for administrative actions

### Data Privacy
- **Local storage only** - no cloud dependencies
- **Configurable retention** for historical data
- **Secure deletion** of expired data
- **No sensitive data logging**

## Testing Strategy

### Unit Tests
- API client modules
- Data parsing and validation
- Configuration management
- Alert logic

### Integration Tests
- Multi-host connectivity
- API error handling
- Database operations
- Alert delivery

### Performance Tests
- Memory usage under load
- Network timeout handling
- Large dataset rendering
- Extended runtime stability

## Deployment

### Installation

```bash
# Development build
cargo build --release

# Install from source
cargo install --path .

# Future: Package distribution
# Package for NixOS inclusion
```

### Usage

```bash
# Start dashboard
cm-dashboard

# Specify config
cm-dashboard --config /path/to/config

# Single host mode
cm-dashboard --host srv01

# Debug mode
cm-dashboard --verbose
```

## Maintenance

### Regular Tasks
- **Database cleanup** - automated retention policies
- **Log rotation** - configurable log levels and retention
- **Configuration validation** - startup configuration checks
- **Performance monitoring** - built-in metrics for the dashboard itself

### Updates
- **Auto-update checks** - optional feature
- **Configuration migration** - version compatibility
- **API compatibility** - backwards compatibility with monitoring APIs
- **Feature toggles** - enable/disable features without rebuild

## Future Enhancements

### Proposed: ZMQ Metrics Agent Architecture

#### **Current Limitations of HTTP-based APIs**
- **Performance overhead**: Python scripts with HTTP servers on each host
- **Network complexity**: Multiple firewall ports (6127-6129) per host
- **Polling inefficiency**: Manual refresh cycles instead of real-time streaming
- **Scalability concerns**: Resource usage grows linearly with hosts

#### **Proposed: Rust ZMQ Gossip Network**

**Core Concept**: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  cmbox  β”‚<---->β”‚ labbox  β”‚<---->β”‚  srv01  β”‚<---->β”‚steambox β”‚
β”‚  :6130  β”‚      β”‚  :6130  β”‚      β”‚  :6130  β”‚      β”‚  :6130  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
     ^                ^                ^
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”˜
                              v
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚simonbox β”‚
                         β”‚  :6130  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

**Architecture Benefits**:
- **No central router**: Peer-to-peer gossip eliminates single point of failure
- **Self-healing**: Network automatically routes around failed hosts
- **Real-time streaming**: Metrics pushed immediately on change
- **Performance**: Rust agents ~10-100x faster than Python
- **Simplified networking**: Single ZMQ port (6130) vs multiple HTTP ports
- **Lower resource usage**: Minimal memory/CPU footprint per agent

#### **Implementation Plan**

**Phase 1: Agent Development**

```rust
// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,                // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>,   // SMART, Service, Backup
    gossip_interval: Duration,             // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType,   // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8,           // Prevent infinite loops
}
```

**Phase 2: Dashboard Integration**
- **ZMQ Subscriber**: Dashboard subscribes to gossip stream on srv01
- **Real-time updates**: WebSocket connection to TUI for live streaming
- **Historical storage**: Optional persistence layer for trending

**Phase 3: Migration Strategy**
- **Parallel deployment**: Run ZMQ agents alongside existing HTTP APIs
- **A/B comparison**: Validate metrics accuracy and performance
- **Gradual cutover**: Switch dashboard to ZMQ, then remove HTTP services

#### **Configuration Integration**

**Agent Configuration** (per-host):

```toml
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"]   # Redundant connections
role = "agent"                             # or "dashboard" for srv01

[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 }    # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 }    # srv01 only
```

**Dashboard Configuration** (updated):

```toml
[data_source]
type = "zmq_gossip"        # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true

[legacy_support]
http_apis_enabled = true   # For migration period
fallback_to_http = true    # If ZMQ unavailable
```

#### **Performance Comparison**

| Metric | Current (HTTP) | Proposed (ZMQ) |
|--------|----------------|----------------|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |

#### **Development Roadmap**

**Week 1-2**: Basic ZMQ agent
- Rust binary with ZMQ gossip protocol
- SMART metrics collection
- Configuration management

**Week 3-4**: Dashboard integration
- ZMQ subscriber in cm-dashboard
- Real-time TUI updates
- Parallel HTTP/ZMQ operation

**Week 5-6**: Production readiness
- Service/backup metrics support
- Error handling and resilience
- Performance benchmarking

**Week 7-8**: Migration and cleanup
- Switch dashboard to ZMQ-only
- Remove legacy HTTP APIs
- Documentation and deployment

### Potential Features
- **Plugin system** for custom widgets
- **REST API** for external integrations
- **Mobile companion app** for alerts
- **Grafana integration** for advanced graphing
-
  **Prometheus metrics export**
- **Custom scripting** for automated responses
- **Machine learning** for predictive analytics
- **Clustering support** for high availability

### Integration Opportunities
- **Home Assistant** integration
- **Slack/Discord** notifications
- **SNMP support** for network equipment
- **Docker/Kubernetes** container monitoring
- **Cloud metrics** integration (if needed)

## Success Metrics

### Technical Success
- **Zero crashes** during normal operation
- **Sub-second response** times for all operations
- **99.9% uptime** for monitoring (excluding network issues)
- **Minimal resource usage** as specified

### User Success
- **Faster problem detection** compared to Glance
- **Reduced time to resolution** for issues
- **Improved infrastructure awareness**
- **Enhanced operational efficiency**

---

## Development Log

### Project Initialization
- Repository created: `/home/cm/projects/cm-dashboard`
- Initial planning: TUI dashboard to replace Glance
- Technology selected: Rust + ratatui
- Architecture designed: Multi-host monitoring with existing API integration

### Current Status (HTTP-based)
- **Functional TUI**: Basic dashboard rendering with ratatui
- **HTTP API integration**: Connects to ports 6127, 6128, 6129
- **Multi-host support**: Configurable host management
- **Async architecture**: Tokio-based concurrent metrics fetching
- **Configuration system**: TOML-based host and dashboard configuration

### Proposed Evolution: ZMQ Agent System

**Rationale for Change**: The current HTTP polling approach has fundamental limitations:

1. **Latency**: 5-second refresh cycles miss rapid changes
2. **Resource overhead**: Python HTTP servers consume unnecessary resources
3. **Network complexity**: Multiple ports per host complicate firewall management
4. **Scalability**: Linear resource growth with host count

**Solution**: A peer-to-peer ZMQ gossip network with Rust agents provides:

- **Real-time streaming**: Sub-second metric propagation
- **Fault tolerance**: Network self-heals around failed hosts
- **Performance**: Native Rust speed vs interpreted Python
- **Simplicity**: Single port per host, no central coordination

### ZMQ Agent Development Plan

**Component 1: cm-metrics-agent** (new Rust binary)

```toml
[package]
name = "cm-metrics-agent"
version = "0.1.0"

[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1"   # Or direct smartctl bindings
```

**Component 2: Dashboard Integration** (update cm-dashboard)
- Add ZMQ subscriber mode alongside HTTP client
- Implement real-time metric streaming
- Provide migration path from HTTP to ZMQ

**Migration Strategy**:
1. **Phase 1**: Deploy agents alongside existing APIs
2. **Phase 2**: Switch dashboard to ZMQ mode
3. **Phase 3**: Remove HTTP APIs from NixOS configurations

**Performance Targets**:
- **Agent footprint**: < 2MB RAM, < 1% CPU
- **Metric latency**: < 100ms propagation across network
- **Network efficiency**: < 1KB/s per host steady state