Christoffer Martinsson 2581435b10 Implement per-service disk usage monitoring
Replaced system-wide disk usage with accurate per-service tracking by scanning
service-specific directories. Services like sshd now correctly show minimal
disk usage instead of misleading system totals.

- Rename storage widget and add drive capacity/usage columns
- Move host display to main dashboard title for cleaner layout
- Replace separate alert displays with color-coded row highlighting
- Add per-service disk usage collection using du command
- Update services widget formatting to handle small disk values
- Restructure into workspace with dedicated agent and dashboard packages
2025-10-11 22:59:16 +02:00

CM Dashboard - Infrastructure Monitoring TUI

Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

Project Goals

Core Objectives

  • Real-time monitoring of all infrastructure components
  • Multi-host support for cmbox, labbox, simonbox, steambox, srv01
  • Performance-focused with minimal resource usage
  • Keyboard-driven interface for power users
  • Integration with existing monitoring APIs (ports 6127, 6128, 6129)

Key Features

  • NVMe health monitoring with wear prediction
  • CPU / memory / GPU telemetry with automatic thresholding
  • Service resource monitoring with per-service CPU and RAM usage
  • Disk usage overview for root filesystems
  • Backup status with detailed metrics and history
  • Unified alert pipeline summarising host health
  • Historical data tracking and trend analysis

Technical Architecture

Technology Stack

  • Language: Rust 🦀
  • TUI Framework: ratatui (modern tui-rs fork)
  • Async Runtime: tokio
  • HTTP Client: reqwest
  • Serialization: serde
  • CLI: clap
  • Error Handling: anyhow
  • Time: chrono

Dependencies

[dependencies]
ratatui = "0.24"           # Modern TUI framework
crossterm = "0.27"         # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }  # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client
serde = { version = "1.0", features = ["derive"] }   # JSON parsing
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"             # Error handling
chrono = "0.4"             # Time handling

Project Structure

cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md              # This file
├── src/
│   ├── main.rs            # Entry point & CLI
│   ├── app.rs             # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs   # Main dashboard layout
│   │   ├── nvme.rs        # NVMe health widget
│   │   ├── services.rs    # Services status widget
│   │   ├── memory.rs      # RAM optimization widget
│   │   ├── backup.rs      # Backup status widget
│   │   └── alerts.rs      # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs      # HTTP client wrapper
│   │   ├── smart.rs       # Smart metrics API (port 6127)
│   │   ├── service.rs     # Service metrics API (port 6128)
│   │   └── backup.rs      # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs     # Data structures
│   │   ├── history.rs     # Historical data storage
│   │   └── config.rs      # Host configuration
│   └── config.rs          # Application configuration
├── config/
│   ├── hosts.toml         # Host definitions
│   └── dashboard.toml     # Dashboard layout config
└── docs/
    ├── API.md             # API integration documentation
    └── WIDGETS.md         # Widget development guide

API Integration

Existing CMTEC APIs

  1. Smart Metrics API (port 6127)

    • NVMe health status (wear, temperature, power-on hours)
    • Disk space information
    • SMART health indicators
  2. Service Metrics API (port 6128)

    • Service status and resource usage
    • Service memory consumption vs limits
    • Host CPU load / frequency / temperature
    • Root disk utilisation snapshot
    • GPU utilisation and temperature (if available)
  3. Backup Metrics API (port 6129)

    • Backup status and history
    • Repository statistics
    • Service integration status

Data Structures

#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
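
As a sketch of how these structures could feed the UI, a hypothetical rollup over the `ServiceSummary` counts might collapse them into a single host status. The `OverallStatus` name and the "any failed wins" precedence are illustrative assumptions, not something the APIs define:

```rust
// Hypothetical rollup of ServiceSummary counts into one host status.
// Precedence (failed > degraded > healthy) is an assumption.
#[derive(Debug, PartialEq)]
pub enum OverallStatus {
    Healthy,
    Degraded,
    Failed,
}

pub fn rollup(healthy: usize, degraded: usize, failed: usize) -> OverallStatus {
    if failed > 0 {
        OverallStatus::Failed
    } else if degraded > 0 {
        OverallStatus::Degraded
    } else {
        let _ = healthy; // all reported services are healthy (or none reported)
        OverallStatus::Healthy
    }
}

fn main() {
    assert_eq!(rollup(8, 0, 0), OverallStatus::Healthy);
    assert_eq!(rollup(7, 1, 0), OverallStatus::Degraded);
    assert_eq!(rollup(6, 1, 1), OverallStatus::Failed);
}
```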

Dashboard Layout Design

Main Dashboard View

┌─────────────────────────────────────────────────────────────────────┐
│ 📊 CMTEC Infrastructure Dashboard                          srv01     │
├─────────────────────────────────────────────────────────────────────┤
│ 💾 NVMe Health              │ 🐏 RAM Optimization                    │
│ ┌─────────────────────────┐  │ ┌─────────────────────────────────────┐ │
│ │ Wear: 4% (█░░░░░░░░░░)   │  │ │ Physical: 2.4G/7.6G (32%)          │ │
│ │ Temp: 56°C              │  │ │ zram: 64B/1.9G (64:1 compression)  │ │
│ │ Hours: 11419h (475d)    │  │ │ tmpfs: /var/log 88K/512M           │ │
│ │ Status: ✅ PASSED       │  │ │ Kernel: vm.dirty_ratio=5           │ │
│ └─────────────────────────┘  │ └─────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 🔧 Services Status                                                   │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ Gitea (256M/4G, 15G/100G)      ✅ smart-metrics-api         │ │
│ │ ✅ Immich (1.2G/4G, 45G/500G)     ✅ service-metrics-api       │ │
│ │ ✅ Vaultwarden (45M/1G, 512M/1G)  ✅ backup-metrics-api        │ │
│ │ ✅ UniFi (234M/2G, 1.2G/5G)       ✅ WordPress M2              │ │
│ └─────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 📧 Recent Alerts                     │ 💾 Backup Status             │
│ 10:15 NVMe wear OK → 4%             │ Last: ✅ Success (04:00)      │
│ 04:00 Backup completed successfully │ Duration: 45m 32s            │
│ Yesterday: Email notification test   │ Size: 15.2GB → 4.1GB        │
│                                     │ Next: Tomorrow 04:00         │
└─────────────────────────────────────────────────────────────────────┘
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit

Multi-Host View

┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️  CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host      │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01     │ 4%   ✅   │ 32%  ✅   │ 8/8  ✅  │ 04:00 Backup OK       │
│ cmbox     │ 12%  ✅   │ 45%  ✅   │ 3/3  ✅  │ Yesterday Email test  │
│ labbox    │ 8%   ✅   │ 28%  ✅   │ 2/2  ✅  │ 2h ago NVMe temp OK   │
│ simonbox  │ 15%  ✅   │ 67%  ⚠️   │ 4/4  ✅  │ Gaming session active │
│ steambox  │ 23%  ✅   │ 78%  ⚠️   │ 2/2  ✅  │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit

Development Phases

Phase 1: Foundation (Week 1-2)

  • Project setup with Cargo.toml
  • Basic TUI framework with ratatui
  • HTTP client for API connections
  • Data structures for metrics
  • Simple single-host dashboard

Deliverables:

  • Working TUI that connects to srv01
  • Real-time display of basic metrics
  • Keyboard navigation

Phase 2: Core Features (Week 3-4)

  • All widget implementations
  • Multi-host configuration
  • Historical data storage
  • Alert system integration
  • Configuration management

Deliverables:

  • Full-featured dashboard
  • Multi-host monitoring
  • Historical trending
  • Configuration file support

Phase 3: Advanced Features (Week 5-6)

  • Predictive analytics
  • Custom alert rules
  • Export capabilities
  • Performance optimizations
  • Error handling & resilience

Deliverables:

  • Production-ready dashboard
  • Advanced monitoring features
  • Comprehensive error handling
  • Performance benchmarks

Phase 4: Polish & Documentation (Week 7-8)

  • Code documentation
  • User documentation
  • Installation scripts
  • Testing suite
  • Release preparation

Deliverables:

  • Complete documentation
  • Installation packages
  • Test coverage
  • Release v1.0

Configuration

Host Configuration (config/hosts.toml)

[hosts]

[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"

[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"

[hosts.labbox]
name = "labbox" 
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"

Dashboard Configuration (config/dashboard.toml)

[dashboard]
refresh_interval = 5  # seconds
history_retention = 7  # days
theme = "dark"

[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90

[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
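
A minimal sketch of how the memory thresholds above might map to widget states, assuming a simple Ok/Warning/Critical classification (the state names are assumptions; the 80/90 values match dashboard.toml):

```rust
// Illustrative mapping of memory_warning_threshold / memory_critical_threshold
// to a widget state. State names are assumed, not taken from the codebase.
#[derive(Debug, PartialEq)]
enum MemState {
    Ok,
    Warning,
    Critical,
}

fn classify_memory(used_percent: f32, warn: f32, crit: f32) -> MemState {
    if used_percent >= crit {
        MemState::Critical
    } else if used_percent >= warn {
        MemState::Warning
    } else {
        MemState::Ok
    }
}

fn main() {
    assert_eq!(classify_memory(32.0, 80.0, 90.0), MemState::Ok);
    assert_eq!(classify_memory(85.5, 80.0, 90.0), MemState::Warning);
    assert_eq!(classify_memory(93.0, 80.0, 90.0), MemState::Critical);
}
```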

Key Features

Real-time Monitoring

  • Auto-refresh at configurable intervals (1-60 seconds)
  • Async data fetching from multiple hosts simultaneously
  • Connection status indicators for each host
  • Graceful degradation when hosts are unreachable
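
One way to sketch the graceful-degradation behaviour: keep the last successful sample per host and flag it stale once it exceeds a cutoff, so the UI can keep rendering stale data instead of blanking out. The `HostState` name and the 15-second cutoff are illustrative assumptions:

```rust
use std::time::{Duration, Instant};

// Sketch: remember the last successful sample per host, and report
// whether it has gone stale. Names and cutoffs are assumptions.
struct HostState<T> {
    last_sample: Option<(Instant, T)>,
}

impl<T> HostState<T> {
    fn new() -> Self {
        Self { last_sample: None }
    }

    fn record(&mut self, sample: T) {
        self.last_sample = Some((Instant::now(), sample));
    }

    // Returns the last sample plus a flag: true if older than max_age.
    fn current(&self, max_age: Duration) -> Option<(&T, bool)> {
        self.last_sample
            .as_ref()
            .map(|(taken, sample)| (sample, taken.elapsed() > max_age))
    }
}

fn main() {
    let mut host = HostState::new();
    assert!(host.current(Duration::from_secs(15)).is_none()); // never reached
    host.record(42u32);
    let (value, stale) = host.current(Duration::from_secs(15)).unwrap();
    assert_eq!(*value, 42);
    assert!(!stale); // just recorded, well inside the cutoff
}
```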

Historical Tracking

  • SQLite database for local storage
  • Trend analysis for wear levels and resource usage
  • Retention policies configurable per metric type
  • Export capabilities (CSV, JSON)
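
The retention policy could reduce to a sweep over timestamped samples, assuming unix-second timestamps like the `timestamp: u64` fields in the API structures (a sketch, not the actual storage layer):

```rust
// Sketch of a retention sweep: drop (timestamp, value) samples older
// than the retention window. Assumes unix-second timestamps.
fn prune(samples: &mut Vec<(u64, f32)>, now: u64, retention_secs: u64) {
    samples.retain(|(ts, _)| now.saturating_sub(*ts) <= retention_secs);
}

fn main() {
    let week = 7 * 24 * 3600; // matches history_retention = 7 days
    let now = 1_700_000_000u64;
    let mut samples = vec![
        (now - 2 * week, 4.0), // expired
        (now - 3600, 4.0),     // kept
        (now, 4.1),            // kept
    ];
    prune(&mut samples, now, week);
    assert_eq!(samples.len(), 2);
}
```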

Alert System

  • Threshold-based alerts for all metrics
  • Email integration with existing notification system
  • Alert acknowledgment and history
  • Custom alert rules with logical operators

Multi-Host Management

  • Auto-discovery of hosts on network
  • Host grouping by role (server, workstation, lab)
  • Bulk operations across multiple hosts
  • Host-specific configurations

Performance Requirements

Resource Usage

  • Memory: < 50MB runtime footprint
  • CPU: < 1% average CPU usage
  • Network: Minimal bandwidth (< 1KB/s per host)
  • Startup: < 2 seconds cold start

Responsiveness

  • UI updates: 60 FPS smooth rendering
  • Data refresh: < 500ms API response handling
  • Navigation: Instant keyboard response
  • Error recovery: < 5 seconds reconnection

Security Considerations

Network Security

  • Local network only - no external connections
  • Authentication for API access if implemented
  • Encrypted storage for sensitive configuration
  • Audit logging for administrative actions

Data Privacy

  • Local storage only - no cloud dependencies
  • Configurable retention for historical data
  • Secure deletion of expired data
  • No sensitive data logging

Testing Strategy

Unit Tests

  • API client modules
  • Data parsing and validation
  • Configuration management
  • Alert logic

Integration Tests

  • Multi-host connectivity
  • API error handling
  • Database operations
  • Alert delivery

Performance Tests

  • Memory usage under load
  • Network timeout handling
  • Large dataset rendering
  • Extended runtime stability

Deployment

Installation

# Development build
cargo build --release

# Install from source
cargo install --path .

# Future: Package distribution
# Package for NixOS inclusion

Usage

# Start dashboard
cm-dashboard

# Specify config
cm-dashboard --config /path/to/config

# Single host mode
cm-dashboard --host srv01

# Debug mode
cm-dashboard --verbose

Maintenance

Regular Tasks

  • Database cleanup - automated retention policies
  • Log rotation - configurable log levels and retention
  • Configuration validation - startup configuration checks
  • Performance monitoring - built-in metrics for dashboard itself

Updates

  • Auto-update checks - optional feature
  • Configuration migration - version compatibility
  • API compatibility - backwards compatibility with monitoring APIs
  • Feature toggles - enable/disable features without rebuild

Future Enhancements

Proposed: ZMQ Metrics Agent Architecture

Current Limitations of HTTP-based APIs

  • Performance overhead: Python scripts with HTTP servers on each host
  • Network complexity: Multiple firewall ports (6127-6129) per host
  • Polling inefficiency: Manual refresh cycles instead of real-time streaming
  • Scalability concerns: Resource usage grows linearly with hosts

Proposed: Rust ZMQ Gossip Network

Core Concept: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  cmbox  │<-->│ labbox  │<-->│ srv01   │<-->│steambox │
│ :6130   │    │ :6130   │    │ :6130   │    │ :6130   │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
      ^                            ^              ^
      └────────────────────────────┼──────────────┘
                                   v
                              ┌─────────┐
                              │simonbox │
                              │ :6130   │
                              └─────────┘

Architecture Benefits:

  • No central router: Peer-to-peer gossip eliminates single point of failure
  • Self-healing: Network automatically routes around failed hosts
  • Real-time streaming: Metrics pushed immediately on change
  • Performance: Rust agents ~10-100x faster than Python
  • Simplified networking: Single ZMQ port (6130) vs multiple HTTP ports
  • Lower resource usage: Minimal memory/CPU footprint per agent

Implementation Plan

Phase 1: Agent Development

// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,           // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
    gossip_interval: Duration,        // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType,     // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8,             // Prevent infinite loops
}
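
The `hop_count` loop guard above could work as follows: each relay decrements the counter and drops the message at zero, bounding how far a gossiped metric travels. A minimal sketch (field and function names assumed; the max-hop value is illustrative):

```rust
// Sketch of the hop_count loop guard from MetricsMessage: decrement on
// each relay, drop at zero. Names and the hop budget are assumptions.
struct Envelope {
    hop_count: u8,
}

// Returns Some(message to forward) or None when the message should die.
fn relay(mut msg: Envelope) -> Option<Envelope> {
    if msg.hop_count == 0 {
        return None;
    }
    msg.hop_count -= 1;
    Some(msg)
}

fn main() {
    let mut msg = Envelope { hop_count: 3 };
    let mut hops = 0;
    while let Some(next) = relay(msg) {
        hops += 1;
        msg = next;
    }
    assert_eq!(hops, 3); // the message is relayed exactly three times
}
```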

Phase 2: Dashboard Integration

  • ZMQ Subscriber: Dashboard subscribes to gossip stream on srv01
  • Real-time updates: WebSocket connection to TUI for live streaming
  • Historical storage: Optional persistence layer for trending

Phase 3: Migration Strategy

  • Parallel deployment: Run ZMQ agents alongside existing HTTP APIs
  • A/B comparison: Validate metrics accuracy and performance
  • Gradual cutover: Switch dashboard to ZMQ, then remove HTTP services

Configuration Integration

Agent Configuration (per-host):

[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"]  # Redundant connections
role = "agent"  # or "dashboard" for srv01

[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 }  # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 }  # srv01 only

Dashboard Configuration (updated):

[data_source]
type = "zmq_gossip"  # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true

[legacy_support]
http_apis_enabled = true  # For migration period
fallback_to_http = true   # If ZMQ unavailable
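
The `fallback_to_http` switch might reduce to a small selection function: prefer the ZMQ stream when reachable, fall back to HTTP polling only if the config allows it. A sketch under assumed names:

```rust
// Sketch of data-source selection driven by fallback_to_http.
// Enum and function names are assumptions, not the real codebase.
#[derive(Debug, PartialEq)]
enum DataSource {
    ZmqGossip,
    HttpPolling,
}

fn select_source(zmq_reachable: bool, fallback_to_http: bool) -> Option<DataSource> {
    if zmq_reachable {
        Some(DataSource::ZmqGossip)
    } else if fallback_to_http {
        Some(DataSource::HttpPolling)
    } else {
        None // no usable source; dashboard shows a disconnected state
    }
}

fn main() {
    assert_eq!(select_source(true, true), Some(DataSource::ZmqGossip));
    assert_eq!(select_source(false, true), Some(DataSource::HttpPolling));
    assert_eq!(select_source(false, false), None);
}
```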

Performance Comparison

Metric               Current (HTTP)         Proposed (ZMQ)
Collection latency   ~50ms                  ~1ms
Network overhead     HTTP headers + JSON    Binary ZMQ frames
Resource per host    ~5MB (Python + HTTP)   ~1MB (Rust agent)
Update frequency     5s polling             Real-time push
Network ports        3 per host             1 per host
Failure recovery     Manual retry           Auto-reconnect

Development Roadmap

Week 1-2: Basic ZMQ agent

  • Rust binary with ZMQ gossip protocol
  • SMART metrics collection
  • Configuration management

Week 3-4: Dashboard integration

  • ZMQ subscriber in cm-dashboard
  • Real-time TUI updates
  • Parallel HTTP/ZMQ operation

Week 5-6: Production readiness

  • Service/backup metrics support
  • Error handling and resilience
  • Performance benchmarking

Week 7-8: Migration and cleanup

  • Switch dashboard to ZMQ-only
  • Remove legacy HTTP APIs
  • Documentation and deployment

Potential Features

  • Plugin system for custom widgets
  • REST API for external integrations
  • Mobile companion app for alerts
  • Grafana integration for advanced graphing
  • Prometheus metrics export
  • Custom scripting for automated responses
  • Machine learning for predictive analytics
  • Clustering support for high availability

Integration Opportunities

  • Home Assistant integration
  • Slack/Discord notifications
  • SNMP support for network equipment
  • Docker/Kubernetes container monitoring
  • Cloud metrics integration (if needed)

Success Metrics

Technical Success

  • Zero crashes during normal operation
  • Sub-second response times for all operations
  • 99.9% uptime for monitoring (excluding network issues)
  • Minimal resource usage as specified

User Success

  • Faster problem detection compared to Glance
  • Reduced time to resolution for issues
  • Improved infrastructure awareness
  • Enhanced operational efficiency

Development Log

Project Initialization

  • Repository created: /home/cm/projects/cm-dashboard
  • Initial planning: TUI dashboard to replace Glance
  • Technology selected: Rust + ratatui
  • Architecture designed: Multi-host monitoring with existing API integration

Current Status (HTTP-based)

  • Functional TUI: Basic dashboard rendering with ratatui
  • HTTP API integration: Connects to ports 6127, 6128, 6129
  • Multi-host support: Configurable host management
  • Async architecture: Tokio-based concurrent metrics fetching
  • Configuration system: TOML-based host and dashboard configuration

Proposed Evolution: ZMQ Agent System

Rationale for Change: The current HTTP polling approach has fundamental limitations:

  1. Latency: 5-second refresh cycles miss rapid changes
  2. Resource overhead: Python HTTP servers consume unnecessary resources
  3. Network complexity: Multiple ports per host complicate firewall management
  4. Scalability: Linear resource growth with host count

Solution: Peer-to-peer ZMQ gossip network with Rust agents provides:

  • Real-time streaming: Sub-second metric propagation
  • Fault tolerance: Network self-heals around failed hosts
  • Performance: Native Rust speed vs interpreted Python
  • Simplicity: Single port per host, no central coordination

ZMQ Agent Development Plan

Component 1: cm-metrics-agent (New Rust binary)

[package]
name = "cm-metrics-agent"
version = "0.1.0"

[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1"  # Or direct smartctl bindings

Component 2: Dashboard Integration (Update cm-dashboard)

  • Add ZMQ subscriber mode alongside HTTP client
  • Implement real-time metric streaming
  • Provide migration path from HTTP to ZMQ

Migration Strategy:

  1. Phase 1: Deploy agents alongside existing APIs
  2. Phase 2: Switch dashboard to ZMQ mode
  3. Phase 3: Remove HTTP APIs from NixOS configurations

Performance Targets:

  • Agent footprint: < 2MB RAM, < 1% CPU
  • Metric latency: < 100ms propagation across network
  • Network efficiency: < 1KB/s per host steady state