CM Dashboard - Infrastructure Monitoring TUI
Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
Project Goals
Core Objectives
- Real-time monitoring of all infrastructure components
- Multi-host support for cmbox, labbox, simonbox, steambox, srv01
- Performance-focused with minimal resource usage
- Keyboard-driven interface for power users
- Integration with existing monitoring APIs (ports 6127, 6128, 6129)
Key Features
- NVMe health monitoring with wear prediction
- CPU / memory / GPU telemetry with automatic thresholding
- Service resource monitoring with per-service CPU and RAM usage
- Disk usage overview for root filesystems
- Backup status with detailed metrics and history
- Unified alert pipeline summarising host health
- Historical data tracking and trend analysis
Technical Architecture
Technology Stack
- Language: Rust 🦀
- TUI Framework: ratatui (modern tui-rs fork)
- Async Runtime: tokio
- HTTP Client: reqwest
- Serialization: serde
- CLI: clap
- Error Handling: anyhow
- Time: chrono
Dependencies
[dependencies]
ratatui = "0.24" # Modern TUI framework
crossterm = "0.27" # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] } # Async runtime
reqwest = { version = "0.11", features = ["json"] } # HTTP client
serde = { version = "1.0", features = ["derive"] } # JSON parsing
clap = { version = "4.0", features = ["derive"] } # CLI args
anyhow = "1.0" # Error handling
chrono = "0.4" # Time handling
Project Structure
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md # This file
├── src/
│ ├── main.rs # Entry point & CLI
│ ├── app.rs # Main application state
│ ├── ui/
│ │ ├── mod.rs
│ │ ├── dashboard.rs # Main dashboard layout
│ │ ├── nvme.rs # NVMe health widget
│ │ ├── services.rs # Services status widget
│ │ ├── memory.rs # RAM optimization widget
│ │ ├── backup.rs # Backup status widget
│ │ └── alerts.rs # Alerts/notifications widget
│ ├── api/
│ │ ├── mod.rs
│ │ ├── client.rs # HTTP client wrapper
│ │ ├── smart.rs # Smart metrics API (port 6127)
│ │ ├── service.rs # Service metrics API (port 6128)
│ │ └── backup.rs # Backup metrics API (port 6129)
│ ├── data/
│ │ ├── mod.rs
│ │ ├── metrics.rs # Data structures
│ │ ├── history.rs # Historical data storage
│ │ └── config.rs # Host configuration
│ └── config.rs # Application configuration
├── config/
│ ├── hosts.toml # Host definitions
│ └── dashboard.toml # Dashboard layout config
└── docs/
├── API.md # API integration documentation
└── WIDGETS.md # Widget development guide
API Integration
Existing CMTEC APIs
- Smart Metrics API (port 6127)
  - NVMe health status (wear, temperature, power-on hours)
  - Disk space information
  - SMART health indicators
- Service Metrics API (port 6128)
  - Service status and resource usage
  - Service memory consumption vs limits
  - Host CPU load / frequency / temperature
  - Root disk utilisation snapshot
  - GPU utilisation and temperature (if available)
- Backup Metrics API (port 6129)
  - Backup status and history
  - Repository statistics
  - Service integration status
Data Structures
use serde::Deserialize;

#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
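A minimal sketch of how the HTTP client wrapper in src/api/client.rs might fetch and deserialize one of these payloads; the /metrics route is an assumption and may differ from the actual API paths:

use anyhow::Result;

// Hypothetical fetch helper for src/api/client.rs. The /metrics path is an
// assumption; substitute whatever route the API actually exposes.
pub async fn fetch_smart_metrics(host: &str, port: u16) -> Result<SmartMetrics> {
    let url = format!("http://{host}:{port}/metrics");
    let response = reqwest::get(&url).await?;              // GET via the default client
    let metrics = response.json::<SmartMetrics>().await?;  // serde-based JSON decode
    Ok(metrics)
}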
Dashboard Layout Design
Main Dashboard View
┌─────────────────────────────────────────────────────────────────────┐
│ 📊 CMTEC Infrastructure Dashboard srv01 │
├─────────────────────────────────────────────────────────────────────┤
│ 💾 NVMe Health │ 🐏 RAM Optimization │
│ ┌─────────────────────────┐ │ ┌─────────────────────────────────────┐ │
│ │ Wear: 4% (█░░░░░░░░░░) │ │ │ Physical: 2.4G/7.6G (32%) │ │
│ │ Temp: 56°C │ │ │ zram: 64B/1.9G (64:1 compression) │ │
│ │ Hours: 11419h (475d) │ │ │ tmpfs: /var/log 88K/512M │ │
│ │ Status: ✅ PASSED │ │ │ Kernel: vm.dirty_ratio=5 │ │
│ └─────────────────────────┘ │ └─────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 🔧 Services Status │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ Gitea (256M/4G, 15G/100G) ✅ smart-metrics-api │ │
│ │ ✅ Immich (1.2G/4G, 45G/500G) ✅ service-metrics-api │ │
│ │ ✅ Vaultwarden (45M/1G, 512M/1G) ✅ backup-metrics-api │ │
│ │ ✅ UniFi (234M/2G, 1.2G/5G) ✅ WordPress M2 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 📧 Recent Alerts │ 💾 Backup Status │
│ 10:15 NVMe wear OK → 4% │ Last: ✅ Success (04:00) │
│ 04:00 Backup completed successfully │ Duration: 45m 32s │
│ Yesterday: Email notification test │ Size: 15.2GB → 4.1GB │
│ │ Next: Tomorrow 04:00 │
└─────────────────────────────────────────────────────────────────────┘
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit
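A sketch of how this key legend could map to crossterm events in the main loop; the App type and its methods are illustrative, not existing code:

use crossterm::event::{self, Event, KeyCode};
use std::time::Duration;

// Poll for input between refresh ticks; returns true when the user quits.
fn handle_input(app: &mut App) -> anyhow::Result<bool> {
    if event::poll(Duration::from_millis(100))? {
        if let Event::Key(key) = event::read()? {
            match key.code {
                KeyCode::Char('q') => return Ok(true), // quit
                KeyCode::Char('h') => app.show_hosts(),
                KeyCode::Char('r') => app.request_refresh(),
                KeyCode::Char('s') => app.show_settings(),
                KeyCode::Char('a') => app.show_alerts(),
                KeyCode::Left => app.prev_pane(),
                KeyCode::Right => app.next_pane(),
                _ => {}
            }
        }
    }
    Ok(false)
}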
Multi-Host View
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview │
├─────────────────────────────────────────────────────────────────────┤
│ Host │ NVMe Wear │ RAM Usage │ Services │ Last Alert │
├─────────────────────────────────────────────────────────────────────┤
│ srv01 │ 4% ✅ │ 32% ✅ │ 8/8 ✅ │ 04:00 Backup OK │
│ cmbox │ 12% ✅ │ 45% ✅ │ 3/3 ✅ │ Yesterday Email test │
│ labbox │ 8% ✅ │ 28% ✅ │ 2/2 ✅ │ 2h ago NVMe temp OK │
│ simonbox │ 15% ✅ │ 67% ⚠️ │ 4/4 ✅ │ Gaming session active │
│ steambox │ 23% ✅ │ 78% ⚠️ │ 2/2 ✅ │ High RAM usage │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
Development Phases
Phase 1: Foundation (Week 1-2)
- Project setup with Cargo.toml
- Basic TUI framework with ratatui
- HTTP client for API connections
- Data structures for metrics
- Simple single-host dashboard
Deliverables:
- Working TUI that connects to srv01
- Real-time display of basic metrics
- Keyboard navigation
Phase 2: Core Features (Week 3-4)
- All widget implementations
- Multi-host configuration
- Historical data storage
- Alert system integration
- Configuration management
Deliverables:
- Full-featured dashboard
- Multi-host monitoring
- Historical trending
- Configuration file support
Phase 3: Advanced Features (Week 5-6)
- Predictive analytics
- Custom alert rules
- Export capabilities
- Performance optimizations
- Error handling & resilience
Deliverables:
- Production-ready dashboard
- Advanced monitoring features
- Comprehensive error handling
- Performance benchmarks
Phase 4: Polish & Documentation (Week 7-8)
- Code documentation
- User documentation
- Installation scripts
- Testing suite
- Release preparation
Deliverables:
- Complete documentation
- Installation packages
- Test coverage
- Release v1.0
Configuration
Host Configuration (config/hosts.toml)
[hosts]
[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"
[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"
[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
Dashboard Configuration (config/dashboard.toml)
[dashboard]
refresh_interval = 5 # seconds
history_retention = 7 # days
theme = "dark"
[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90
[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
Key Features
Real-time Monitoring
- Auto-refresh: configurable intervals (1-60 seconds)
- Async data fetching from multiple hosts simultaneously (see the sketch after this list)
- Connection status indicators for each host
- Graceful degradation when hosts are unreachable
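A sketch of concurrent fetching with per-host timeouts, reusing the hypothetical fetch_smart_metrics helper and HostConfig struct from the earlier sketches; the 500ms budget echoes the responsiveness target under Performance Requirements:

use std::time::Duration;
use tokio::task::JoinSet;

// Refresh every host concurrently; a slow or unreachable host yields an Err
// for that host instead of stalling the others.
pub async fn refresh_all(hosts: Vec<HostConfig>) -> Vec<(String, anyhow::Result<SmartMetrics>)> {
    let mut set = JoinSet::new();
    for host in hosts {
        set.spawn(async move {
            let result = match tokio::time::timeout(
                Duration::from_millis(500),
                fetch_smart_metrics(&host.address, host.smart_api),
            )
            .await
            {
                Ok(fetch_result) => fetch_result,            // finished within budget
                Err(_) => Err(anyhow::anyhow!("timed out")), // degrade gracefully
            };
            (host.name, result)
        });
    }
    let mut results = Vec::new();
    while let Some(joined) = set.join_next().await {
        if let Ok(pair) = joined {
            results.push(pair);
        }
    }
    results
}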
Historical Tracking
- SQLite database for local storage (sketched after this list)
- Trend analysis for wear levels and resource usage
- Retention policies configurable per metric type
- Export capabilities (CSV, JSON)
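A sketch of the persistence side using rusqlite, which is an assumed dependency not in the list above; table and column names are illustrative:

use rusqlite::{params, Connection};

// Persist one NVMe wear sample for later trend analysis. Creating the table
// on every call is wasteful; a real implementation would run migrations once.
pub fn record_wear(conn: &Connection, host: &str, wear_pct: f32, ts: u64) -> rusqlite::Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nvme_wear (host TEXT, wear_pct REAL, ts INTEGER)",
        [],
    )?;
    conn.execute(
        "INSERT INTO nvme_wear (host, wear_pct, ts) VALUES (?1, ?2, ?3)",
        params![host, wear_pct, ts],
    )?;
    Ok(())
}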
Alert System
- Threshold-based alerts for all metrics (see the sketch after this list)
- Email integration with existing notification system
- Alert acknowledgment and history
- Custom alert rules with logical operators
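A minimal sketch of threshold evaluation against the memory_warning_threshold / memory_critical_threshold values from config/dashboard.toml; the enum and function names are illustrative:

pub enum AlertLevel {
    Ok,
    Warning,
    Critical,
}

// Compare a usage percentage against the configured warning/critical bounds.
pub fn memory_alert(used_pct: f32, warning: f32, critical: f32) -> AlertLevel {
    if used_pct >= critical {
        AlertLevel::Critical
    } else if used_pct >= warning {
        AlertLevel::Warning
    } else {
        AlertLevel::Ok
    }
}

With the defaults above (80/90), memory_alert(82.0, 80.0, 90.0) returns AlertLevel::Warning.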
Multi-Host Management
- Auto-discovery of hosts on network
- Host grouping by role (server, workstation, lab)
- Bulk operations across multiple hosts
- Host-specific configurations
Performance Requirements
Resource Usage
- Memory: < 50MB runtime footprint
- CPU: < 1% average CPU usage
- Network: Minimal bandwidth (< 1KB/s per host)
- Startup: < 2 seconds cold start
Responsiveness
- UI updates: 60 FPS smooth rendering
- Data refresh: < 500ms API response handling
- Navigation: Instant keyboard response
- Error recovery: < 5 seconds reconnection
Security Considerations
Network Security
- Local network only - no external connections
- Authentication for API access if implemented
- Encrypted storage for sensitive configuration
- Audit logging for administrative actions
Data Privacy
- Local storage only - no cloud dependencies
- Configurable retention for historical data
- Secure deletion of expired data
- No sensitive data logging
Testing Strategy
Unit Tests
- API client modules
- Data parsing and validation (example after this list)
- Configuration management
- Alert logic
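A sketch of a parsing test against the ServiceSummary struct defined earlier; serde_json is an assumed dev-dependency and the sample values are made up:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_service_summary() {
        let json = r#"{
            "healthy": 8, "degraded": 0, "failed": 0,
            "memory_used_mb": 2457.6, "memory_quota_mb": 8192.0,
            "system_memory_used_mb": 2457.6, "system_memory_total_mb": 7782.4,
            "disk_used_gb": 15.0, "disk_total_gb": 100.0,
            "cpu_load_1": 0.42, "cpu_load_5": 0.38, "cpu_load_15": 0.31,
            "cpu_freq_mhz": null, "cpu_temp_c": 56.0,
            "gpu_load_percent": null, "gpu_temp_c": null
        }"#;
        let parsed: ServiceSummary = serde_json::from_str(json).unwrap();
        assert_eq!(parsed.healthy, 8);
        assert_eq!(parsed.cpu_temp_c, Some(56.0));
        assert!(parsed.gpu_load_percent.is_none());
    }
}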
Integration Tests
- Multi-host connectivity
- API error handling
- Database operations
- Alert delivery
Performance Tests
- Memory usage under load
- Network timeout handling
- Large dataset rendering
- Extended runtime stability
Deployment
Installation
# Development build
cargo build --release
# Install from source
cargo install --path .
# Future: Package distribution
# Package for NixOS inclusion
Usage
# Start dashboard
cm-dashboard
# Specify config
cm-dashboard --config /path/to/config
# Single host mode
cm-dashboard --host srv01
# Debug mode
cm-dashboard --verbose
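A sketch of the matching clap definition for these flags; defaults and help text are assumptions:

use clap::Parser;

// CLI surface mirroring the usage examples above.
#[derive(Parser, Debug)]
#[command(name = "cm-dashboard", about = "CMTEC infrastructure monitoring TUI")]
pub struct Cli {
    /// Path to the configuration directory
    #[arg(long)]
    pub config: Option<std::path::PathBuf>,

    /// Monitor a single host only
    #[arg(long)]
    pub host: Option<String>,

    /// Enable verbose/debug output
    #[arg(long)]
    pub verbose: bool,
}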
Maintenance
Regular Tasks
- Database cleanup - automated retention policies
- Log rotation - configurable log levels and retention
- Configuration validation - startup configuration checks
- Performance monitoring - built-in metrics for dashboard itself
Updates
- Auto-update checks - optional feature
- Configuration migration - version compatibility
- API compatibility - backwards compatibility with monitoring APIs
- Feature toggles - enable/disable features without rebuild
Future Enhancements
Proposed: ZMQ Metrics Agent Architecture
Current Limitations of HTTP-based APIs
- Performance overhead: Python scripts with HTTP servers on each host
- Network complexity: Multiple firewall ports (6127-6129) per host
- Polling inefficiency: Manual refresh cycles instead of real-time streaming
- Scalability concerns: Resource usage grows linearly with hosts
Proposed: Rust ZMQ Gossip Network
Core Concept: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  cmbox  │<-->│ labbox  │<-->│  srv01  │<-->│steambox │
│  :6130  │    │  :6130  │    │  :6130  │    │  :6130  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
     ^                             ^              ^
     └─────────────────────────────┼──────────────┘
                                   v
                              ┌─────────┐
                              │simonbox │
                              │  :6130  │
                              └─────────┘
Architecture Benefits:
- No central router: Peer-to-peer gossip eliminates single point of failure
- Self-healing: Network automatically routes around failed hosts
- Real-time streaming: Metrics pushed immediately on change
- Performance: Rust agents ~10-100x faster than Python
- Simplified networking: Single ZMQ port (6130) vs multiple HTTP ports
- Lower resource usage: Minimal memory/CPU footprint per agent
Implementation Plan
Phase 1: Agent Development
// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,              // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
    gossip_interval: Duration,           // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType, // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8, // Prevent infinite loops
}
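A sketch of the publishing side under these definitions; serde_json is an assumed addition to the agent's dependencies, and a compact binary codec could replace it later:

// Bind the PUB side of the gossip socket on the shared port.
fn start_publisher(ctx: &zmq::Context) -> anyhow::Result<zmq::Socket> {
    let socket = ctx.socket(zmq::PUB)?;
    socket.bind("tcp://*:6130")?; // single gossip port per host
    Ok(socket)
}

// Serialize one MetricsMessage and fan it out to subscribed neighbors.
fn broadcast(publisher: &zmq::Socket, msg: &MetricsMessage) -> anyhow::Result<()> {
    let payload = serde_json::to_vec(msg)?;
    publisher.send(payload.as_slice(), 0)?;
    Ok(())
}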
Phase 2: Dashboard Integration
- ZMQ Subscriber: Dashboard subscribes to the gossip stream on srv01 (sketched after this list)
- Real-time updates: WebSocket connection to TUI for live streaming
- Historical storage: Optional persistence layer for trending
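A sketch of the dashboard-side subscriber; the JSON codec matches the broadcast sketch above and the hostname is hard-coded for illustration:

// Connect to the gossip port on srv01 and decode incoming frames.
fn subscribe_loop() -> anyhow::Result<()> {
    let ctx = zmq::Context::new();
    let socket = ctx.socket(zmq::SUB)?;
    socket.connect("tcp://srv01:6130")?;
    socket.set_subscribe(b"")?; // receive every message
    loop {
        let frame = socket.recv_bytes(0)?;
        let msg: MetricsMessage = serde_json::from_slice(&frame)?;
        // Hand off to the TUI update channel here.
        println!("{} @ {}", msg.hostname, msg.timestamp);
    }
}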
Phase 3: Migration Strategy
- Parallel deployment: Run ZMQ agents alongside existing HTTP APIs
- A/B comparison: Validate metrics accuracy and performance
- Gradual cutover: Switch dashboard to ZMQ, then remove HTTP services
Configuration Integration
Agent Configuration (per-host):
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"] # Redundant connections
role = "agent" # or "dashboard" for srv01
[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 } # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 } # srv01 only
Dashboard Configuration (updated):
[data_source]
type = "zmq_gossip" # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true
[legacy_support]
http_apis_enabled = true # For migration period
fallback_to_http = true # If ZMQ unavailable
Performance Comparison
| Metric | Current (HTTP) | Proposed (ZMQ) |
|---|---|---|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |
Development Roadmap
Week 1-2: Basic ZMQ agent
- Rust binary with ZMQ gossip protocol
- SMART metrics collection
- Configuration management
Week 3-4: Dashboard integration
- ZMQ subscriber in cm-dashboard
- Real-time TUI updates
- Parallel HTTP/ZMQ operation
Week 5-6: Production readiness
- Service/backup metrics support
- Error handling and resilience
- Performance benchmarking
Week 7-8: Migration and cleanup
- Switch dashboard to ZMQ-only
- Remove legacy HTTP APIs
- Documentation and deployment
Potential Features
- Plugin system for custom widgets
- REST API for external integrations
- Mobile companion app for alerts
- Grafana integration for advanced graphing
- Prometheus metrics export
- Custom scripting for automated responses
- Machine learning for predictive analytics
- Clustering support for high availability
Integration Opportunities
- Home Assistant integration
- Slack/Discord notifications
- SNMP support for network equipment
- Docker/Kubernetes container monitoring
- Cloud metrics integration (if needed)
Success Metrics
Technical Success
- Zero crashes during normal operation
- Sub-second response times for all operations
- 99.9% uptime for monitoring (excluding network issues)
- Minimal resource usage as specified
User Success
- Faster problem detection compared to Glance
- Reduced time to resolution for issues
- Improved infrastructure awareness
- Enhanced operational efficiency
Development Log
Project Initialization
- Repository created: /home/cm/projects/cm-dashboard
- Initial planning: TUI dashboard to replace Glance
- Technology selected: Rust + ratatui
- Architecture designed: Multi-host monitoring with existing API integration
Current Status (HTTP-based)
- Functional TUI: Basic dashboard rendering with ratatui
- HTTP API integration: Connects to ports 6127, 6128, 6129
- Multi-host support: Configurable host management
- Async architecture: Tokio-based concurrent metrics fetching
- Configuration system: TOML-based host and dashboard configuration
Proposed Evolution: ZMQ Agent System
Rationale for Change: The current HTTP polling approach has fundamental limitations:
- Latency: 5-second refresh cycles miss rapid changes
- Resource overhead: Python HTTP servers consume unnecessary resources
- Network complexity: Multiple ports per host complicate firewall management
- Scalability: Linear resource growth with host count
Solution: Peer-to-peer ZMQ gossip network with Rust agents provides:
- Real-time streaming: Sub-second metric propagation
- Fault tolerance: Network self-heals around failed hosts
- Performance: Native Rust speed vs interpreted Python
- Simplicity: Single port per host, no central coordination
ZMQ Agent Development Plan
Component 1: cm-metrics-agent (New Rust binary)
[package]
name = "cm-metrics-agent"
version = "0.1.0"
[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1" # Or direct smartctl bindings
Component 2: Dashboard Integration (Update cm-dashboard)
- Add ZMQ subscriber mode alongside HTTP client
- Implement real-time metric streaming
- Provide migration path from HTTP to ZMQ
Migration Strategy:
- Phase 1: Deploy agents alongside existing APIs
- Phase 2: Switch dashboard to ZMQ mode
- Phase 3: Remove HTTP APIs from NixOS configurations
Performance Targets:
- Agent footprint: < 2MB RAM, < 1% CPU
- Metric latency: < 100ms propagation across network
- Network efficiency: < 1KB/s per host steady state