Christoffer Martinsson 2581435b10 Implement per-service disk usage monitoring
Replaced system-wide disk usage with accurate per-service tracking by scanning
service-specific directories. Services like sshd now correctly show minimal
disk usage instead of misleading system totals.

- Rename storage widget and add drive capacity/usage columns
- Move host display to main dashboard title for cleaner layout
- Replace separate alert displays with color-coded row highlighting
- Add per-service disk usage collection using du command
- Update services widget formatting to handle small disk values
- Restructure into workspace with dedicated agent and dashboard packages
2025-10-11 22:59:16 +02:00

CM Dashboard - Infrastructure Monitoring TUI

Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

Project Goals

Core Objectives

  • Real-time monitoring of all infrastructure components
  • Multi-host support for cmbox, labbox, simonbox, steambox, srv01
  • Performance-focused with minimal resource usage
  • Keyboard-driven interface for power users
  • Integration with existing monitoring APIs (ports 6127, 6128, 6129)

Key Features

  • NVMe health monitoring with wear prediction
  • CPU / memory / GPU telemetry with automatic thresholding
  • Service resource monitoring with per-service CPU and RAM usage
  • Disk usage overview for root filesystems
  • Backup status with detailed metrics and history
  • Unified alert pipeline summarising host health
  • Historical data tracking and trend analysis

Technical Architecture

Technology Stack

  • Language: Rust 🦀
  • TUI Framework: ratatui (modern tui-rs fork)
  • Async Runtime: tokio
  • HTTP Client: reqwest
  • Serialization: serde
  • CLI: clap
  • Error Handling: anyhow
  • Time: chrono

Dependencies

[dependencies]
ratatui = "0.24"           # Modern TUI framework
crossterm = "0.27"         # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }  # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client
serde = { version = "1.0", features = ["derive"] }   # JSON parsing
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"             # Error handling
chrono = "0.4"             # Time handling

Project Structure

cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md              # This file
├── src/
│   ├── main.rs            # Entry point & CLI
│   ├── app.rs             # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs   # Main dashboard layout
│   │   ├── nvme.rs        # NVMe health widget
│   │   ├── services.rs    # Services status widget
│   │   ├── memory.rs      # RAM optimization widget
│   │   ├── backup.rs      # Backup status widget
│   │   └── alerts.rs      # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs      # HTTP client wrapper
│   │   ├── smart.rs       # Smart metrics API (port 6127)
│   │   ├── service.rs     # Service metrics API (port 6128)
│   │   └── backup.rs      # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs     # Data structures
│   │   ├── history.rs     # Historical data storage
│   │   └── config.rs      # Host configuration
│   └── config.rs          # Application configuration
├── config/
│   ├── hosts.toml         # Host definitions
│   └── dashboard.toml     # Dashboard layout config
└── docs/
    ├── API.md             # API integration documentation
    └── WIDGETS.md         # Widget development guide

API Integration

Existing CMTEC APIs

  1. Smart Metrics API (port 6127)

    • NVMe health status (wear, temperature, power-on hours)
    • Disk space information
    • SMART health indicators
  2. Service Metrics API (port 6128)

    • Service status and resource usage
    • Service memory consumption vs limits
    • Host CPU load / frequency / temperature
    • Root disk utilisation snapshot
    • GPU utilisation and temperature (if available)
  3. Backup Metrics API (port 6129)

    • Backup status and history
    • Repository statistics
    • Service integration status

Data Structures

#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
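
As a sketch of how these structures could feed the UI, a hypothetical rollup over the `ServiceSummary` counts might collapse them into a single host status. The `OverallStatus` name and the "any failed wins" precedence are illustrative assumptions, not something the APIs define:

```rust
// Hypothetical rollup of ServiceSummary counts into one host status.
// Precedence (failed > degraded > healthy) is an assumption.
#[derive(Debug, PartialEq)]
pub enum OverallStatus {
    Healthy,
    Degraded,
    Failed,
}

pub fn rollup(healthy: usize, degraded: usize, failed: usize) -> OverallStatus {
    if failed > 0 {
        OverallStatus::Failed
    } else if degraded > 0 {
        OverallStatus::Degraded
    } else {
        let _ = healthy; // all reported services are healthy (or none reported)
        OverallStatus::Healthy
    }
}

fn main() {
    assert_eq!(rollup(8, 0, 0), OverallStatus::Healthy);
    assert_eq!(rollup(7, 1, 0), OverallStatus::Degraded);
    assert_eq!(rollup(6, 1, 1), OverallStatus::Failed);
}
```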

Dashboard Layout Design

Main Dashboard View

┌─────────────────────────────────────────────────────────────────────┐
│ 📊 CMTEC Infrastructure Dashboard                          srv01     │
├─────────────────────────────────────────────────────────────────────┤
│ 💾 NVMe Health              │ 🐏 RAM Optimization                    │
│ ┌─────────────────────────┐  │ ┌─────────────────────────────────────┐ │
│ │ Wear: 4% (█░░░░░░░░░░)   │  │ │ Physical: 2.4G/7.6G (32%)          │ │
│ │ Temp: 56°C              │  │ │ zram: 64B/1.9G (64:1 compression)  │ │
│ │ Hours: 11419h (475d)    │  │ │ tmpfs: /var/log 88K/512M           │ │
│ │ Status: ✅ PASSED       │  │ │ Kernel: vm.dirty_ratio=5           │ │
│ └─────────────────────────┘  │ └─────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 🔧 Services Status                                                   │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ Gitea (256M/4G, 15G/100G)      ✅ smart-metrics-api         │ │
│ │ ✅ Immich (1.2G/4G, 45G/500G)     ✅ service-metrics-api       │ │
│ │ ✅ Vaultwarden (45M/1G, 512M/1G)  ✅ backup-metrics-api        │ │
│ │ ✅ UniFi (234M/2G, 1.2G/5G)       ✅ WordPress M2              │ │
│ └─────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 📧 Recent Alerts                     │ 💾 Backup Status             │
│ 10:15 NVMe wear OK → 4%             │ Last: ✅ Success (04:00)      │
│ 04:00 Backup completed successfully │ Duration: 45m 32s            │
│ Yesterday: Email notification test   │ Size: 15.2GB → 4.1GB        │
│                                     │ Next: Tomorrow 04:00         │
└─────────────────────────────────────────────────────────────────────┘
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit

Multi-Host View

┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️  CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host      │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01     │ 4%   ✅   │ 32%  ✅   │ 8/8  ✅  │ 04:00 Backup OK       │
│ cmbox     │ 12%  ✅   │ 45%  ✅   │ 3/3  ✅  │ Yesterday Email test  │
│ labbox    │ 8%   ✅   │ 28%  ✅   │ 2/2  ✅  │ 2h ago NVMe temp OK   │
│ simonbox  │ 15%  ✅   │ 67%  ⚠️   │ 4/4  ✅  │ Gaming session active │
│ steambox  │ 23%  ✅   │ 78%  ⚠️   │ 2/2  ✅  │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit

Development Phases

Phase 1: Foundation (Week 1-2)

  • Project setup with Cargo.toml
  • Basic TUI framework with ratatui
  • HTTP client for API connections
  • Data structures for metrics
  • Simple single-host dashboard

Deliverables:

  • Working TUI that connects to srv01
  • Real-time display of basic metrics
  • Keyboard navigation

Phase 2: Core Features (Week 3-4)

  • All widget implementations
  • Multi-host configuration
  • Historical data storage
  • Alert system integration
  • Configuration management

Deliverables:

  • Full-featured dashboard
  • Multi-host monitoring
  • Historical trending
  • Configuration file support

Phase 3: Advanced Features (Week 5-6)

  • Predictive analytics
  • Custom alert rules
  • Export capabilities
  • Performance optimizations
  • Error handling & resilience

Deliverables:

  • Production-ready dashboard
  • Advanced monitoring features
  • Comprehensive error handling
  • Performance benchmarks

Phase 4: Polish & Documentation (Week 7-8)

  • Code documentation
  • User documentation
  • Installation scripts
  • Testing suite
  • Release preparation

Deliverables:

  • Complete documentation
  • Installation packages
  • Test coverage
  • Release v1.0

Configuration

Host Configuration (config/hosts.toml)

[hosts]

[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"

[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"

[hosts.labbox]
name = "labbox" 
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"

Dashboard Configuration (config/dashboard.toml)

[dashboard]
refresh_interval = 5  # seconds
history_retention = 7  # days
theme = "dark"

[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90

[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
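
A minimal sketch of how the memory thresholds above might map to widget states, assuming a simple Ok/Warning/Critical classification (the state names are assumptions; the 80/90 values match dashboard.toml):

```rust
// Illustrative mapping of memory_warning_threshold / memory_critical_threshold
// to a widget state. State names are assumed, not taken from the codebase.
#[derive(Debug, PartialEq)]
enum MemState {
    Ok,
    Warning,
    Critical,
}

fn classify_memory(used_percent: f32, warn: f32, crit: f32) -> MemState {
    if used_percent >= crit {
        MemState::Critical
    } else if used_percent >= warn {
        MemState::Warning
    } else {
        MemState::Ok
    }
}

fn main() {
    assert_eq!(classify_memory(32.0, 80.0, 90.0), MemState::Ok);
    assert_eq!(classify_memory(85.5, 80.0, 90.0), MemState::Warning);
    assert_eq!(classify_memory(93.0, 80.0, 90.0), MemState::Critical);
}
```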

Key Features

Real-time Monitoring

  • Auto-refresh at configurable intervals (1-60 seconds)
  • Async data fetching from multiple hosts simultaneously
  • Connection status indicators for each host
  • Graceful degradation when hosts are unreachable
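
One way to sketch the graceful-degradation behaviour: keep the last successful sample per host and flag it stale once it exceeds a cutoff, so the UI can keep rendering stale data instead of blanking out. The `HostState` name and the 15-second cutoff are illustrative assumptions:

```rust
use std::time::{Duration, Instant};

// Sketch: remember the last successful sample per host, and report
// whether it has gone stale. Names and cutoffs are assumptions.
struct HostState<T> {
    last_sample: Option<(Instant, T)>,
}

impl<T> HostState<T> {
    fn new() -> Self {
        Self { last_sample: None }
    }

    fn record(&mut self, sample: T) {
        self.last_sample = Some((Instant::now(), sample));
    }

    // Returns the last sample plus a flag: true if older than max_age.
    fn current(&self, max_age: Duration) -> Option<(&T, bool)> {
        self.last_sample
            .as_ref()
            .map(|(taken, sample)| (sample, taken.elapsed() > max_age))
    }
}

fn main() {
    let mut host = HostState::new();
    assert!(host.current(Duration::from_secs(15)).is_none()); // never reached
    host.record(42u32);
    let (value, stale) = host.current(Duration::from_secs(15)).unwrap();
    assert_eq!(*value, 42);
    assert!(!stale); // just recorded, well inside the cutoff
}
```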

Historical Tracking

  • SQLite database for local storage
  • Trend analysis for wear levels and resource usage
  • Retention policies configurable per metric type
  • Export capabilities (CSV, JSON)
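
The retention policy could reduce to a sweep over timestamped samples, assuming unix-second timestamps like the `timestamp: u64` fields in the API structures (a sketch, not the actual storage layer):

```rust
// Sketch of a retention sweep: drop (timestamp, value) samples older
// than the retention window. Assumes unix-second timestamps.
fn prune(samples: &mut Vec<(u64, f32)>, now: u64, retention_secs: u64) {
    samples.retain(|(ts, _)| now.saturating_sub(*ts) <= retention_secs);
}

fn main() {
    let week = 7 * 24 * 3600; // matches history_retention = 7 days
    let now = 1_700_000_000u64;
    let mut samples = vec![
        (now - 2 * week, 4.0), // expired
        (now - 3600, 4.0),     // kept
        (now, 4.1),            // kept
    ];
    prune(&mut samples, now, week);
    assert_eq!(samples.len(), 2);
}
```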

Alert System

  • Threshold-based alerts for all metrics
  • Email integration with existing notification system
  • Alert acknowledgment and history
  • Custom alert rules with logical operators

Multi-Host Management

  • Auto-discovery of hosts on network
  • Host grouping by role (server, workstation, lab)
  • Bulk operations across multiple hosts
  • Host-specific configurations

Performance Requirements

Resource Usage

  • Memory: < 50MB runtime footprint
  • CPU: < 1% average CPU usage
  • Network: Minimal bandwidth (< 1KB/s per host)
  • Startup: < 2 seconds cold start

Responsiveness

  • UI updates: 60 FPS smooth rendering
  • Data refresh: < 500ms API response handling
  • Navigation: Instant keyboard response
  • Error recovery: < 5 seconds reconnection

Security Considerations

Network Security

  • Local network only - no external connections
  • Authentication for API access if implemented
  • Encrypted storage for sensitive configuration
  • Audit logging for administrative actions

Data Privacy

  • Local storage only - no cloud dependencies
  • Configurable retention for historical data
  • Secure deletion of expired data
  • No sensitive data logging

Testing Strategy

Unit Tests

  • API client modules
  • Data parsing and validation
  • Configuration management
  • Alert logic

Integration Tests

  • Multi-host connectivity
  • API error handling
  • Database operations
  • Alert delivery

Performance Tests

  • Memory usage under load
  • Network timeout handling
  • Large dataset rendering
  • Extended runtime stability

Deployment

Installation

# Development build
cargo build --release

# Install from source
cargo install --path .

# Future: Package distribution
# Package for NixOS inclusion

Usage

# Start dashboard
cm-dashboard

# Specify config
cm-dashboard --config /path/to/config

# Single host mode
cm-dashboard --host srv01

# Debug mode
cm-dashboard --verbose

Maintenance

Regular Tasks

  • Database cleanup - automated retention policies
  • Log rotation - configurable log levels and retention
  • Configuration validation - startup configuration checks
  • Performance monitoring - built-in metrics for dashboard itself

Updates

  • Auto-update checks - optional feature
  • Configuration migration - version compatibility
  • API compatibility - backwards compatibility with monitoring APIs
  • Feature toggles - enable/disable features without rebuild

Future Enhancements

Proposed: ZMQ Metrics Agent Architecture

Current Limitations of HTTP-based APIs

  • Performance overhead: Python scripts with HTTP servers on each host
  • Network complexity: Multiple firewall ports (6127-6129) per host
  • Polling inefficiency: Manual refresh cycles instead of real-time streaming
  • Scalability concerns: Resource usage grows linearly with hosts

Proposed: Rust ZMQ Gossip Network

Core Concept: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  cmbox  │<-->│ labbox  │<-->│ srv01   │<-->│steambox │
│ :6130   │    │ :6130   │    │ :6130   │    │ :6130   │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
      ^                            ^              ^
      └────────────────────────────┼──────────────┘
                                   v
                              ┌─────────┐
                              │simonbox │
                              │ :6130   │
                              └─────────┘

Architecture Benefits:

  • No central router: Peer-to-peer gossip eliminates single point of failure
  • Self-healing: Network automatically routes around failed hosts
  • Real-time streaming: Metrics pushed immediately on change
  • Performance: Rust agents ~10-100x faster than Python
  • Simplified networking: Single ZMQ port (6130) vs multiple HTTP ports
  • Lower resource usage: Minimal memory/CPU footprint per agent

Implementation Plan

Phase 1: Agent Development

// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,           // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
    gossip_interval: Duration,        // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType,     // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8,             // Prevent infinite loops
}
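
The `hop_count` loop guard above could work as follows: each relay decrements the counter and drops the message at zero, bounding how far a gossiped metric travels. A minimal sketch (field and function names assumed; the max-hop value is illustrative):

```rust
// Sketch of the hop_count loop guard from MetricsMessage: decrement on
// each relay, drop at zero. Names and the hop budget are assumptions.
struct Envelope {
    hop_count: u8,
}

// Returns Some(message to forward) or None when the message should die.
fn relay(mut msg: Envelope) -> Option<Envelope> {
    if msg.hop_count == 0 {
        return None;
    }
    msg.hop_count -= 1;
    Some(msg)
}

fn main() {
    let mut msg = Envelope { hop_count: 3 };
    let mut hops = 0;
    while let Some(next) = relay(msg) {
        hops += 1;
        msg = next;
    }
    assert_eq!(hops, 3); // the message is relayed exactly three times
}
```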

Phase 2: Dashboard Integration

  • ZMQ Subscriber: Dashboard subscribes to gossip stream on srv01
  • Real-time updates: WebSocket connection to TUI for live streaming
  • Historical storage: Optional persistence layer for trending

Phase 3: Migration Strategy

  • Parallel deployment: Run ZMQ agents alongside existing HTTP APIs
  • A/B comparison: Validate metrics accuracy and performance
  • Gradual cutover: Switch dashboard to ZMQ, then remove HTTP services

Configuration Integration

Agent Configuration (per-host):

[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"]  # Redundant connections
role = "agent"  # or "dashboard" for srv01

[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 }  # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 }  # srv01 only

Dashboard Configuration (updated):

[data_source]
type = "zmq_gossip"  # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true

[legacy_support]
http_apis_enabled = true  # For migration period
fallback_to_http = true   # If ZMQ unavailable
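
The `fallback_to_http` switch might reduce to a small selection function: prefer the ZMQ stream when reachable, fall back to HTTP polling only if the config allows it. A sketch under assumed names:

```rust
// Sketch of data-source selection driven by fallback_to_http.
// Enum and function names are assumptions, not the real codebase.
#[derive(Debug, PartialEq)]
enum DataSource {
    ZmqGossip,
    HttpPolling,
}

fn select_source(zmq_reachable: bool, fallback_to_http: bool) -> Option<DataSource> {
    if zmq_reachable {
        Some(DataSource::ZmqGossip)
    } else if fallback_to_http {
        Some(DataSource::HttpPolling)
    } else {
        None // no usable source; dashboard shows a disconnected state
    }
}

fn main() {
    assert_eq!(select_source(true, true), Some(DataSource::ZmqGossip));
    assert_eq!(select_source(false, true), Some(DataSource::HttpPolling));
    assert_eq!(select_source(false, false), None);
}
```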

Performance Comparison

Metric               Current (HTTP)         Proposed (ZMQ)
Collection latency   ~50ms                  ~1ms
Network overhead     HTTP headers + JSON    Binary ZMQ frames
Resource per host    ~5MB (Python + HTTP)   ~1MB (Rust agent)
Update frequency     5s polling             Real-time push
Network ports        3 per host             1 per host
Failure recovery     Manual retry           Auto-reconnect

Development Roadmap

Week 1-2: Basic ZMQ agent

  • Rust binary with ZMQ gossip protocol
  • SMART metrics collection
  • Configuration management

Week 3-4: Dashboard integration

  • ZMQ subscriber in cm-dashboard
  • Real-time TUI updates
  • Parallel HTTP/ZMQ operation

Week 5-6: Production readiness

  • Service/backup metrics support
  • Error handling and resilience
  • Performance benchmarking

Week 7-8: Migration and cleanup

  • Switch dashboard to ZMQ-only
  • Remove legacy HTTP APIs
  • Documentation and deployment

Potential Features

  • Plugin system for custom widgets
  • REST API for external integrations
  • Mobile companion app for alerts
  • Grafana integration for advanced graphing
  • Prometheus metrics export
  • Custom scripting for automated responses
  • Machine learning for predictive analytics
  • Clustering support for high availability

Integration Opportunities

  • Home Assistant integration
  • Slack/Discord notifications
  • SNMP support for network equipment
  • Docker/Kubernetes container monitoring
  • Cloud metrics integration (if needed)

Success Metrics

Technical Success

  • Zero crashes during normal operation
  • Sub-second response times for all operations
  • 99.9% uptime for monitoring (excluding network issues)
  • Minimal resource usage as specified

User Success

  • Faster problem detection compared to Glance
  • Reduced time to resolution for issues
  • Improved infrastructure awareness
  • Enhanced operational efficiency

Development Log

Project Initialization

  • Repository created: /home/cm/projects/cm-dashboard
  • Initial planning: TUI dashboard to replace Glance
  • Technology selected: Rust + ratatui
  • Architecture designed: Multi-host monitoring with existing API integration

Current Status (HTTP-based)

  • Functional TUI: Basic dashboard rendering with ratatui
  • HTTP API integration: Connects to ports 6127, 6128, 6129
  • Multi-host support: Configurable host management
  • Async architecture: Tokio-based concurrent metrics fetching
  • Configuration system: TOML-based host and dashboard configuration

Proposed Evolution: ZMQ Agent System

Rationale for Change: The current HTTP polling approach has fundamental limitations:

  1. Latency: 5-second refresh cycles miss rapid changes
  2. Resource overhead: Python HTTP servers consume unnecessary resources
  3. Network complexity: Multiple ports per host complicate firewall management
  4. Scalability: Linear resource growth with host count

Solution: Peer-to-peer ZMQ gossip network with Rust agents provides:

  • Real-time streaming: Sub-second metric propagation
  • Fault tolerance: Network self-heals around failed hosts
  • Performance: Native Rust speed vs interpreted Python
  • Simplicity: Single port per host, no central coordination

ZMQ Agent Development Plan

Component 1: cm-metrics-agent (New Rust binary)

[package]
name = "cm-metrics-agent"
version = "0.1.0"

[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1"  # Or direct smartctl bindings

Component 2: Dashboard Integration (Update cm-dashboard)

  • Add ZMQ subscriber mode alongside HTTP client
  • Implement real-time metric streaming
  • Provide migration path from HTTP to ZMQ

Migration Strategy:

  1. Phase 1: Deploy agents alongside existing APIs
  2. Phase 2: Switch dashboard to ZMQ mode
  3. Phase 3: Remove HTTP APIs from NixOS configurations

Performance Targets:

  • Agent footprint: < 2MB RAM, < 1% CPU
  • Metric latency: < 100ms propagation across network
  • Network efficiency: < 1KB/s per host steady state