# CM Dashboard - Infrastructure Monitoring TUI
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
## Project Goals
### Core Objectives
- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)
### Key Features
- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis
## Technical Architecture
### Technology Stack
- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono
### Dependencies
```toml
[dependencies]
ratatui = "0.24" # Modern TUI framework
crossterm = "0.27" # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] } # Async runtime
reqwest = { version = "0.11", features = ["json"] } # HTTP client
serde = { version = "1.0", features = ["derive"] } # JSON parsing
clap = { version = "4.0", features = ["derive"] } # CLI args
anyhow = "1.0" # Error handling
chrono = "0.4" # Time handling
```
## Project Structure
```
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md                # This file
├── src/
│   ├── main.rs              # Entry point & CLI
│   ├── app.rs               # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs     # Main dashboard layout
│   │   ├── nvme.rs          # NVMe health widget
│   │   ├── services.rs      # Services status widget
│   │   ├── memory.rs        # RAM optimization widget
│   │   ├── backup.rs        # Backup status widget
│   │   └── alerts.rs        # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs        # HTTP client wrapper
│   │   ├── smart.rs         # Smart metrics API (port 6127)
│   │   ├── service.rs       # Service metrics API (port 6128)
│   │   └── backup.rs        # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs       # Data structures
│   │   ├── history.rs       # Historical data storage
│   │   └── config.rs        # Host configuration
│   └── config.rs            # Application configuration
├── config/
│   ├── hosts.toml           # Host definitions
│   └── dashboard.toml       # Dashboard layout config
└── docs/
    ├── API.md               # API integration documentation
    └── WIDGETS.md           # Widget development guide
```
## API Integration
### Existing CMTEC APIs
1. **Smart Metrics API** (port 6127)
- NVMe health status (wear, temperature, power-on hours)
- Disk space information
- SMART health indicators
2. **Service Metrics API** (port 6128)
- Service status and resource usage
- Service memory consumption vs limits
- Host CPU load / frequency / temperature
- Root disk utilisation snapshot
- GPU utilisation and temperature (if available)
3. **Backup Metrics API** (port 6129)
- Backup status and history
- Repository statistics
- Service integration status
### Data Structures
```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```
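As a sketch of how the client side would consume these structures, fetching and deserializing one host's service metrics could look like the following. The `/metrics` path, plain-HTTP scheme, and 500 ms timeout are assumptions for illustration, not confirmed endpoint details:
```rust
use anyhow::Result;

/// Fetch and parse service metrics from one host.
/// The URL layout is an assumption; adjust to the actual API route.
pub async fn fetch_service_metrics(
    client: &reqwest::Client,
    address: &str,
    port: u16,
) -> Result<ServiceMetrics> {
    let url = format!("http://{address}:{port}/metrics");
    let metrics = client
        .get(&url)
        .timeout(std::time::Duration::from_millis(500))
        .send()
        .await?
        .error_for_status()?
        .json::<ServiceMetrics>()
        .await?;
    Ok(metrics)
}
```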
## Dashboard Layout Design
### Main Dashboard View
```
┌─────────────────────────────────────────────────────────────────────┐
│ 📊 CMTEC Infrastructure Dashboard srv01 │
├─────────────────────────────────────────────────────────────────────┤
│ 💾 NVMe Health │ 🐏 RAM Optimization │
│ ┌─────────────────────────┐ │ ┌─────────────────────────────────────┐ │
│ │ Wear: 4% (█░░░░░░░░░░) │ │ │ Physical: 2.4G/7.6G (32%) │ │
│ │ Temp: 56°C │ │ │ zram: 64B/1.9G (64:1 compression) │ │
│ │ Hours: 11419h (475d) │ │ │ tmpfs: /var/log 88K/512M │ │
│ │ Status: ✅ PASSED │ │ │ Kernel: vm.dirty_ratio=5 │ │
│ └─────────────────────────┘ │ └─────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 🔧 Services Status │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ✅ Gitea (256M/4G, 15G/100G) ✅ smart-metrics-api │ │
│ │ ✅ Immich (1.2G/4G, 45G/500G) ✅ service-metrics-api │ │
│ │ ✅ Vaultwarden (45M/1G, 512M/1G) ✅ backup-metrics-api │ │
│ │ ✅ UniFi (234M/2G, 1.2G/5G) ✅ WordPress M2 │ │
│ └─────────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ 📧 Recent Alerts │ 💾 Backup Status │
│ 10:15 NVMe wear OK → 4% │ Last: ✅ Success (04:00) │
│ 04:00 Backup completed successfully │ Duration: 45m 32s │
│ Yesterday: Email notification test │ Size: 15.2GB → 4.1GB │
│ │ Next: Tomorrow 04:00 │
└─────────────────────────────────────────────────────────────────────┘
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit
```
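A minimal ratatui sketch of how this view could be split into widget areas; the proportions are illustrative and the struct/function names are not existing code:
```rust
use ratatui::prelude::{Constraint, Direction, Layout, Rect};

/// Regions of the main dashboard view, matching the mock-up above.
pub struct DashboardAreas {
    pub title: Rect,
    pub nvme: Rect,
    pub memory: Rect,
    pub services: Rect,
    pub alerts: Rect,
    pub backup: Rect,
}

/// Split the terminal area into dashboard regions. Percentages are placeholders.
pub fn dashboard_areas(area: Rect) -> DashboardAreas {
    let rows = Layout::default()
        .direction(Direction::Vertical)
        .constraints([
            Constraint::Length(1),      // title bar
            Constraint::Percentage(30), // NVMe health / RAM optimization
            Constraint::Percentage(35), // services status
            Constraint::Percentage(35), // alerts / backup status
        ])
        .split(area);

    let top = Layout::default()
        .direction(Direction::Horizontal)
        .constraints([Constraint::Percentage(50), Constraint::Percentage(50)])
        .split(rows[1]);

    let bottom = Layout::default()
        .direction(Direction::Horizontal)
        .constraints([Constraint::Percentage(50), Constraint::Percentage(50)])
        .split(rows[3]);

    DashboardAreas {
        title: rows[0],
        nvme: top[0],
        memory: top[1],
        services: rows[2],
        alerts: bottom[0],
        backup: bottom[1],
    }
}
```
Each `Rect` is then handed to the matching widget module from the project structure (`ui/nvme.rs`, `ui/memory.rs`, and so on).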
### Multi-Host View
```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview │
├─────────────────────────────────────────────────────────────────────┤
│ Host │ NVMe Wear │ RAM Usage │ Services │ Last Alert │
├─────────────────────────────────────────────────────────────────────┤
│ srv01 │ 4% ✅ │ 32% ✅ │ 8/8 ✅ │ 04:00 Backup OK │
│ cmbox │ 12% ✅ │ 45% ✅ │ 3/3 ✅ │ Yesterday Email test │
│ labbox │ 8% ✅ │ 28% ✅ │ 2/2 ✅ │ 2h ago NVMe temp OK │
│ simonbox │ 15% ✅ │ 67% ⚠️ │ 4/4 ✅ │ Gaming session active │
│ steambox │ 23% ✅ │ 78% ⚠️ │ 2/2 ✅ │ High RAM usage │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```
## Development Phases
### Phase 1: Foundation (Week 1-2)
- [x] Project setup with Cargo.toml
- [ ] Basic TUI framework with ratatui
- [ ] HTTP client for API connections
- [ ] Data structures for metrics
- [ ] Simple single-host dashboard
**Deliverables:**
- Working TUI that connects to srv01
- Real-time display of basic metrics
- Keyboard navigation
### Phase 2: Core Features (Week 3-4)
- [ ] All widget implementations
- [ ] Multi-host configuration
- [ ] Historical data storage
- [ ] Alert system integration
- [ ] Configuration management
**Deliverables:**
- Full-featured dashboard
- Multi-host monitoring
- Historical trending
- Configuration file support
### Phase 3: Advanced Features (Week 5-6)
- [ ] Predictive analytics
- [ ] Custom alert rules
- [ ] Export capabilities
- [ ] Performance optimizations
- [ ] Error handling & resilience
**Deliverables:**
- Production-ready dashboard
- Advanced monitoring features
- Comprehensive error handling
- Performance benchmarks
### Phase 4: Polish & Documentation (Week 7-8)
- [ ] Code documentation
- [ ] User documentation
- [ ] Installation scripts
- [ ] Testing suite
- [ ] Release preparation
**Deliverables:**
- Complete documentation
- Installation packages
- Test coverage
- Release v1.0
## Configuration
### Host Configuration (config/hosts.toml)
```toml
[hosts]
[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"
[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"
[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
```
### Dashboard Configuration (config/dashboard.toml)
```toml
[dashboard]
refresh_interval = 5 # seconds
history_retention = 7 # days
theme = "dark"
[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90
[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
```
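A hedged sketch of loading `config/hosts.toml` with serde (this assumes adding a `toml` crate dependency on top of the list above; the same pattern applies to `dashboard.toml`):
```rust
use std::collections::HashMap;

use anyhow::{Context, Result};
use serde::Deserialize;

/// One [hosts.*] entry from hosts.toml; fields mirror the example above.
#[derive(Deserialize, Debug)]
pub struct HostConfig {
    pub name: String,
    pub address: String,
    pub smart_api: u16,
    pub service_api: u16,
    pub backup_api: u16,
    pub role: String,
}

/// The whole hosts.toml file: "srv01", "cmbox", ... mapped to their definitions.
#[derive(Deserialize, Debug)]
pub struct HostsFile {
    pub hosts: HashMap<String, HostConfig>,
}

pub fn load_hosts(path: &str) -> Result<HostsFile> {
    let raw = std::fs::read_to_string(path)
        .with_context(|| format!("reading {path}"))?;
    toml::from_str(&raw).with_context(|| format!("parsing {path}"))
}
```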
## Key Features
### Real-time Monitoring
- **Auto-refresh** with configurable intervals (1-60 seconds)
- **Async data fetching** from multiple hosts simultaneously
- **Connection status** indicators for each host
- **Graceful degradation** when hosts are unreachable
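A minimal sketch of that concurrent fetch with graceful degradation, reusing the illustrative `HostConfig` and `fetch_service_metrics` helpers sketched earlier; hosts that fail simply produce an `Err` entry and their widgets show a disconnected state:
```rust
use anyhow::Result;

/// Refresh every configured host concurrently. An unreachable host yields an
/// Err entry for that host instead of failing the whole refresh cycle.
pub async fn refresh_all(
    client: &reqwest::Client,
    hosts: &[HostConfig],
) -> Vec<(String, Result<ServiceMetrics>)> {
    let mut tasks = Vec::new();
    for host in hosts {
        let client = client.clone(); // reqwest::Client clones are cheap (shared pool)
        let address = host.address.clone();
        let port = host.service_api;
        let handle = tokio::spawn(async move {
            fetch_service_metrics(&client, &address, port).await
        });
        tasks.push((host.name.clone(), handle));
    }

    let mut results = Vec::new();
    for (name, handle) in tasks {
        // A panicked task is reported as an error for that host, not a crash.
        let outcome = match handle.await {
            Ok(result) => result,
            Err(join_err) => Err(join_err.into()),
        };
        results.push((name, outcome));
    }
    results
}
```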
### Historical Tracking
- **SQLite database** for local storage
- **Trend analysis** for wear levels and resource usage
- **Retention policies** configurable per metric type
- **Export capabilities** (CSV, JSON)
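A hedged sketch of the history layer described above (assumes adding a `rusqlite` dependency; the table schema and column choice are illustrative):
```rust
use anyhow::Result;
use rusqlite::{params, Connection};

/// Create the history table once at startup.
pub fn init_history(conn: &Connection) -> Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (
             host           TEXT NOT NULL,
             ts             INTEGER NOT NULL,
             memory_used_mb REAL NOT NULL,
             cpu_load_1     REAL NOT NULL
         )",
        params![],
    )?;
    Ok(())
}

/// Record one summary sample per host per refresh.
pub fn record_sample(conn: &Connection, host: &str, s: &ServiceSummary, ts: u64) -> Result<()> {
    conn.execute(
        "INSERT INTO samples (host, ts, memory_used_mb, cpu_load_1) VALUES (?1, ?2, ?3, ?4)",
        params![host, ts as i64, s.memory_used_mb as f64, s.cpu_load_1 as f64],
    )?;
    Ok(())
}

/// Enforce the retention policy: drop rows older than the cutoff timestamp.
pub fn prune_history(conn: &Connection, cutoff_ts: u64) -> Result<()> {
    conn.execute("DELETE FROM samples WHERE ts < ?1", params![cutoff_ts as i64])?;
    Ok(())
}
```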
### Alert System
- **Threshold-based alerts** for all metrics
- **Email integration** with existing notification system
- **Alert acknowledgment** and history
- **Custom alert rules** with logical operators
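As a sketch of the threshold evaluation against the `dashboard.toml` values above (the enum name and helper function are illustrative, not existing code):
```rust
/// Severity for a single metric reading, compared against configured thresholds.
#[derive(Debug, PartialEq)]
pub enum Severity {
    Ok,
    Warning,
    Critical,
}

/// Evaluate memory usage against memory_warning_threshold /
/// memory_critical_threshold from dashboard.toml (both in percent).
pub fn memory_severity(used_mb: f32, total_mb: f32, warn_pct: f32, crit_pct: f32) -> Severity {
    if total_mb <= 0.0 {
        return Severity::Ok; // no data yet; avoid dividing by zero
    }
    let pct = used_mb / total_mb * 100.0;
    if pct >= crit_pct {
        Severity::Critical
    } else if pct >= warn_pct {
        Severity::Warning
    } else {
        Severity::Ok
    }
}
```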
### Multi-Host Management
- **Auto-discovery** of hosts on network
- **Host grouping** by role (server, workstation, lab)
- **Bulk operations** across multiple hosts
- **Host-specific configurations**
## Performance Requirements
### Resource Usage
- **Memory**: < 50MB runtime footprint
- **CPU**: < 1% average CPU usage
- **Network**: Minimal bandwidth (< 1KB/s per host)
- **Startup**: < 2 seconds cold start
### Responsiveness
- **UI updates**: 60 FPS smooth rendering
- **Data refresh**: < 500ms API response handling
- **Navigation**: Instant keyboard response
- **Error recovery**: < 5 seconds reconnection
## Security Considerations
### Network Security
- **Local network only** - no external connections
- **Authentication** for API access, if the APIs implement it
- **Encrypted storage** for sensitive configuration
- **Audit logging** for administrative actions
### Data Privacy
- **Local storage** only - no cloud dependencies
- **Configurable retention** for historical data
- **Secure deletion** of expired data
- **No sensitive data logging**
## Testing Strategy
### Unit Tests
- API client modules
- Data parsing and validation
- Configuration management
- Alert logic
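A hedged example of a data-parsing unit test, placed alongside the structs in `src/data/metrics.rs` (assumes `serde_json` as a dev-dependency; the fixture values are made up):
```rust
#[cfg(test)]
mod tests {
    use super::ServiceSummary;

    #[test]
    fn service_summary_parses_optional_gpu_fields() {
        // Made-up fixture; a host without a GPU simply omits the gpu_* fields.
        let json = r#"{
            "healthy": 8, "degraded": 0, "failed": 0,
            "memory_used_mb": 2456.0, "memory_quota_mb": 7680.0,
            "system_memory_used_mb": 2456.0, "system_memory_total_mb": 7680.0,
            "disk_used_gb": 61.0, "disk_total_gb": 500.0,
            "cpu_load_1": 0.42, "cpu_load_5": 0.35, "cpu_load_15": 0.30,
            "cpu_freq_mhz": 2800.0, "cpu_temp_c": 54.0
        }"#;
        let summary: ServiceSummary = serde_json::from_str(json).expect("valid payload");
        assert_eq!(summary.healthy, 8);
        assert!(summary.gpu_temp_c.is_none());
    }
}
```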
### Integration Tests
- Multi-host connectivity
- API error handling
- Database operations
- Alert delivery
### Performance Tests
- Memory usage under load
- Network timeout handling
- Large dataset rendering
- Extended runtime stability
## Deployment
### Installation
```bash
# Development build
cargo build --release
# Install from source
cargo install --path .
# Future: Package distribution
# Package for NixOS inclusion
```
### Usage
```bash
# Start dashboard
cm-dashboard
# Specify config
cm-dashboard --config /path/to/config
# Single host mode
cm-dashboard --host srv01
# Debug mode
cm-dashboard --verbose
```
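A sketch of the corresponding clap definition for these flags (doc strings and defaults are illustrative):
```rust
use clap::Parser;

/// CLI surface matching the usage examples above.
#[derive(Parser, Debug)]
#[command(name = "cm-dashboard", about = "CMTEC infrastructure monitoring TUI")]
pub struct Cli {
    /// Path to the configuration file or directory.
    #[arg(long)]
    pub config: Option<std::path::PathBuf>,

    /// Restrict monitoring to a single host (e.g. "srv01").
    #[arg(long)]
    pub host: Option<String>,

    /// Enable verbose/debug logging.
    #[arg(long)]
    pub verbose: bool,
}

fn main() {
    let cli = Cli::parse();
    // Hand the parsed arguments to the config loader and app setup.
    println!("{cli:?}");
}
```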
## Maintenance
### Regular Tasks
- **Database cleanup** - automated retention policies
- **Log rotation** - configurable log levels and retention
- **Configuration validation** - startup configuration checks
- **Performance monitoring** - built-in metrics for dashboard itself
### Updates
- **Auto-update checks** - optional feature
- **Configuration migration** - version compatibility
- **API compatibility** - backwards compatibility with monitoring APIs
- **Feature toggles** - enable/disable features without rebuild
## Future Enhancements
### Proposed: ZMQ Metrics Agent Architecture
#### **Current Limitations of HTTP-based APIs**
- **Performance overhead**: Python scripts with HTTP servers on each host
- **Network complexity**: Multiple firewall ports (6127-6129) per host
- **Polling inefficiency**: Manual refresh cycles instead of real-time streaming
- **Scalability concerns**: Resource usage grows linearly with hosts
#### **Proposed: Rust ZMQ Gossip Network**
**Core Concept**: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
```
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  cmbox  │<-->│ labbox  │<-->│  srv01  │<-->│steambox │
│  :6130  │    │  :6130  │    │  :6130  │    │  :6130  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
     ^                             ^              ^
     └─────────────────────────────┼──────────────┘
                                   v
                              ┌─────────┐
                              │simonbox │
                              │  :6130  │
                              └─────────┘
```
**Architecture Benefits**:
- **No central router**: Peer-to-peer gossip eliminates single point of failure
- **Self-healing**: Network automatically routes around failed hosts
- **Real-time streaming**: Metrics pushed immediately on change
- **Performance**: Rust agents ~10-100x faster than Python
- **Simplified networking**: Single ZMQ port (6130) vs multiple HTTP ports
- **Lower resource usage**: Minimal memory/CPU footprint per agent
#### **Implementation Plan**
**Phase 1: Agent Development**
```rust
use std::time::Duration;

use serde::{Deserialize, Serialize};

// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,              // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
    gossip_interval: Duration,           // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType, // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8, // Prevent infinite loops
}
```
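A hedged sketch of the broadcast side using the `zmq` crate: JSON framing via `serde_json` stands in for whatever binary encoding the agents settle on, and `MAX_HOPS`, the bind address, and `anyhow` error handling are assumptions:
```rust
use anyhow::Result;

/// Broadcast a locally collected (or forwarded) metrics message to neighbors.
/// hop_count caps forwarding so gossip stays bounded on the small network.
pub fn broadcast(publisher: &zmq::Socket, mut msg: MetricsMessage) -> Result<()> {
    const MAX_HOPS: u8 = 3; // assumption: a few hops covers all five hosts
    if msg.hop_count >= MAX_HOPS {
        return Ok(()); // traveled far enough; drop instead of re-gossiping
    }
    msg.hop_count += 1;
    let frame = serde_json::to_vec(&msg)?;
    publisher.send(frame, 0)?;
    Ok(())
}

/// One-time socket setup on the agent: a PUB socket bound to the gossip port.
pub fn bind_publisher(ctx: &zmq::Context, port: u16) -> Result<zmq::Socket> {
    let socket = ctx.socket(zmq::PUB)?;
    socket.bind(&format!("tcp://0.0.0.0:{port}"))?;
    Ok(socket)
}
```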
**Phase 2: Dashboard Integration**
- **ZMQ Subscriber**: Dashboard subscribes to gossip stream on srv01
- **Real-time updates**: WebSocket connection to TUI for live streaming
- **Historical storage**: Optional persistence layer for trending
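A matching sketch of the dashboard-side subscriber; the endpoint, framing, and the mpsc hand-off to the TUI loop are assumptions mirroring the agent sketch above:
```rust
use anyhow::Result;

/// Connect to the gossip stream and forward each decoded message to the UI state.
pub fn run_subscriber(endpoint: &str, tx: std::sync::mpsc::Sender<MetricsMessage>) -> Result<()> {
    let ctx = zmq::Context::new();
    let socket = ctx.socket(zmq::SUB)?;
    socket.connect(endpoint)?;  // e.g. "tcp://srv01:6130"
    socket.set_subscribe(b"")?; // all topics

    loop {
        let frame = socket.recv_bytes(0)?;
        match serde_json::from_slice::<MetricsMessage>(&frame) {
            Ok(msg) => {
                // A closed receiver means the dashboard is shutting down.
                if tx.send(msg).is_err() {
                    return Ok(());
                }
            }
            Err(err) => eprintln!("discarding malformed gossip frame: {err}"),
        }
    }
}
```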
**Phase 3: Migration Strategy**
- **Parallel deployment**: Run ZMQ agents alongside existing HTTP APIs
- **A/B comparison**: Validate metrics accuracy and performance
- **Gradual cutover**: Switch dashboard to ZMQ, then remove HTTP services
#### **Configuration Integration**
**Agent Configuration** (per-host):
```toml
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"] # Redundant connections
role = "agent" # or "dashboard" for srv01
[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 } # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 } # srv01 only
```
**Dashboard Configuration** (updated):
```toml
[data_source]
type = "zmq_gossip" # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true
[legacy_support]
http_apis_enabled = true # For migration period
fallback_to_http = true # If ZMQ unavailable
```
#### **Performance Comparison**
| Metric | Current (HTTP) | Proposed (ZMQ) |
|--------|---------------|----------------|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |
#### **Development Roadmap**
**Week 1-2**: Basic ZMQ agent
- Rust binary with ZMQ gossip protocol
- SMART metrics collection
- Configuration management
**Week 3-4**: Dashboard integration
- ZMQ subscriber in cm-dashboard
- Real-time TUI updates
- Parallel HTTP/ZMQ operation
**Week 5-6**: Production readiness
- Service/backup metrics support
- Error handling and resilience
- Performance benchmarking
**Week 7-8**: Migration and cleanup
- Switch dashboard to ZMQ-only
- Remove legacy HTTP APIs
- Documentation and deployment
### Potential Features
- **Plugin system** for custom widgets
- **REST API** for external integrations
- **Mobile companion app** for alerts
- **Grafana integration** for advanced graphing
- **Prometheus metrics export**
- **Custom scripting** for automated responses
- **Machine learning** for predictive analytics
- **Clustering support** for high availability
### Integration Opportunities
- **Home Assistant** integration
- **Slack/Discord** notifications
- **SNMP support** for network equipment
- **Docker/Kubernetes** container monitoring
- **Cloud metrics** integration (if needed)
## Success Metrics
### Technical Success
- **Zero crashes** during normal operation
- **Sub-second response** times for all operations
- **99.9% uptime** for monitoring (excluding network issues)
- **Minimal resource usage** as specified
### User Success
- **Faster problem detection** compared to Glance
- **Reduced time to resolution** for issues
- **Improved infrastructure awareness**
- **Enhanced operational efficiency**
---
## Development Log
### Project Initialization
- Repository created: `/home/cm/projects/cm-dashboard`
- Initial planning: TUI dashboard to replace Glance
- Technology selected: Rust + ratatui
- Architecture designed: Multi-host monitoring with existing API integration
### Current Status (HTTP-based)
- **Functional TUI**: Basic dashboard rendering with ratatui
- **HTTP API integration**: Connects to ports 6127, 6128, 6129
- **Multi-host support**: Configurable host management
- **Async architecture**: Tokio-based concurrent metrics fetching
- **Configuration system**: TOML-based host and dashboard configuration
### Proposed Evolution: ZMQ Agent System
**Rationale for Change**: The current HTTP polling approach has fundamental limitations:
1. **Latency**: 5-second refresh cycles miss rapid changes
2. **Resource overhead**: Python HTTP servers consume unnecessary resources
3. **Network complexity**: Multiple ports per host complicate firewall management
4. **Scalability**: Linear resource growth with host count
**Solution**: Peer-to-peer ZMQ gossip network with Rust agents provides:
- **Real-time streaming**: Sub-second metric propagation
- **Fault tolerance**: Network self-heals around failed hosts
- **Performance**: Native Rust speed vs interpreted Python
- **Simplicity**: Single port per host, no central coordination
### ZMQ Agent Development Plan
**Component 1: cm-metrics-agent** (New Rust binary)
```toml
[package]
name = "cm-metrics-agent"
version = "0.1.0"
[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1" # Or direct smartctl bindings
```
**Component 2: Dashboard Integration** (Update cm-dashboard)
- Add ZMQ subscriber mode alongside HTTP client
- Implement real-time metric streaming
- Provide migration path from HTTP to ZMQ
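A small sketch of how the migration path could be modelled inside the dashboard; the names mirror the `[data_source]` / `[legacy_support]` settings proposed above and are illustrative:
```rust
/// Which transport feeds the dashboard during the migration period.
pub enum DataSource {
    HttpPolling,
    ZmqGossip { fallback_to_http: bool },
}

impl DataSource {
    /// Decide where the next refresh comes from. If the gossip stream has gone
    /// quiet and fallback is enabled, temporarily poll the legacy HTTP APIs.
    pub fn effective(&self, zmq_healthy: bool) -> &'static str {
        match self {
            DataSource::HttpPolling => "http",
            DataSource::ZmqGossip { fallback_to_http } => {
                if zmq_healthy {
                    "zmq"
                } else if *fallback_to_http {
                    "http"
                } else {
                    "zmq" // keep waiting for the stream to recover
                }
            }
        }
    }
}
```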
**Migration Strategy**:
1. **Phase 1**: Deploy agents alongside existing APIs
2. **Phase 2**: Switch dashboard to ZMQ mode
3. **Phase 3**: Remove HTTP APIs from NixOS configurations
**Performance Targets**:
- **Agent footprint**: < 2MB RAM, < 1% CPU
- **Metric latency**: < 100ms propagation across network
- **Network efficiency**: < 1KB/s per host steady state