2025-10-12 14:53:27 +02:00
parent 2581435b10
commit 2239badc8a
16 changed files with 1116 additions and 1414 deletions

CLAUDE.md

# CM Dashboard - Infrastructure Monitoring TUI
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
## Project Goals
### Core Objectives
- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)
### Key Features
- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
## Technical Architecture
### Technology Stack
- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **Time**: chrono
### Dependencies
```toml
[dependencies]
ratatui = "0.24"                                    # Modern TUI framework
tokio = { version = "1.0", features = ["full"] }    # Async runtime
serde = { version = "1.0", features = ["derive"] }  # Serialization
chrono = "0.4"                                      # Time handling
```
## API Integration
### Existing CMTEC APIs
1. **Smart Metrics API** (port 6127)
   - NVMe health status (wear, temperature, power-on hours)
   - Disk space information
   - SMART health indicators
2. **Service Metrics API** (port 6128)
   - Service status and resource usage
   - Service memory consumption vs limits
   - Host CPU load / frequency / temperature
   - Root disk utilisation snapshot
   - GPU utilisation and temperature (if available)
3. **Backup Metrics API** (port 6129)
   - Backup status and history
   - Repository statistics
   - Service integration status
### Data Structures
```rust
use serde::Deserialize;

// Representative shapes only; field names are illustrative and follow the
// API descriptions above rather than the exact wire format.
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub device: String,       // e.g. "nvme0n1"
    pub wear_percent: u8,     // NVMe wear level
    pub temperature_c: f32,   // Drive temperature
    pub power_on_hours: u64,  // SMART power-on hours
    pub health_passed: bool,  // Overall SMART verdict
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub last_status: String,  // e.g. "success"
    pub duration_s: u64,      // Last run duration
    pub repo_size_bytes: u64, // Repository size
}
```
## Dashboard Layout Design
### Main Dashboard View
```
┌───────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                  │
├────────────────────────────────────┬──────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0       │ Services • ok:1 warn:0 fail:0    │
│ Drive    Temp  Wear  Spare  Hours  │ Service memory: 7.1/23899.7 MiB  │
│ nvme0n1  28°C  1%    100%   14489  │ Disk usage: —                    │
│ Capacity: 954G  Used: 77G (8%)     │ ✔ sshd        7.1 MiB   —        │
├────────────────────────────────────┼──────────────────────────────────┤
│ CPU / Memory • warn                │ Backups                          │
│ System memory: 5251.7/23899.7 MiB  │ Host cmbox awaiting backup       │
│ CPU load (1/5/15): 2.18 2.66 2.56  │ metrics                          │
│ CPU freq: 1100.1 MHz               │                                  │
│ CPU temp: 47.0°C                   │                                  │
├────────────────────────────────────┼──────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0        │ Status • ZMQ connected           │
│ cmbox: warning: CPU load 2.18      │ Monitoring • hosts: 3            │
│ srv01: pending: awaiting metrics   │ Data source: ZMQ connected       │
│ labbox: pending: awaiting metrics  │ Active host: cmbox (1/3)         │
└────────────────────────────────────┴──────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```
### Multi-Host View
```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ …                                                                   │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```
## Development Status
### Immediate TODOs
- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
- Investigate why the backup metrics agent is not publishing data to the dashboard
- Resize the services widget so it can display more services without truncation
- Remove the dedicated status widget and redistribute the layout space
- Add responsive scaling within each widget so columns and content adapt dynamically
### Phase 1: Foundation (Week 1-2)
- [x] Project setup with Cargo.toml
- [ ] Basic TUI framework with ratatui
- [ ] HTTP client for API connections
- [ ] Data structures for metrics
- [ ] Simple single-host dashboard

**Deliverables:**
- Working TUI that connects to srv01
- Real-time display of basic metrics
- Keyboard navigation
### Phase 2: Core Features (Week 3-4)
- [ ] All widget implementations
- [ ] Multi-host configuration
- [ ] Historical data storage
- [ ] Alert system integration
- [ ] Configuration management

**Deliverables:**
- Full-featured dashboard
- Multi-host monitoring
- Historical trending
- Configuration file support
### Phase 3: Advanced Features 🚧 IN PROGRESS
- [x] ZMQ gossip network implementation
- [x] Comprehensive error handling
- [x] Performance optimizations
- [ ] Predictive analytics for wear levels
- [ ] Custom alert rules engine
- [ ] Historical data export capabilities

**Deliverables:**
- Production-ready dashboard
- Advanced monitoring features
- Comprehensive error handling
- Performance benchmarks
### Phase 4: Polish & Documentation (Week 7-8)
- [ ] Code documentation
- [ ] User documentation
- [ ] Installation scripts
- [ ] Testing suite
- [ ] Release preparation

**Deliverables:**
- Complete documentation
- Installation packages
- Test coverage
- Release v1.0
# Important Communication Guidelines
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
## Configuration
### Host Configuration (config/hosts.toml)
```toml
[hosts]
[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"
[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"
[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
```
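A host entry like the ones above maps naturally onto a small Rust struct. A hedged sketch (the struct and method names are illustrative, not the dashboard's actual types) showing how per-host API base URLs could be derived:

```rust
use std::collections::BTreeMap;

// Illustrative mirror of one [hosts.*] entry; field names follow hosts.toml.
#[derive(Debug, Clone)]
pub struct HostConfig {
    pub name: String,
    pub address: String,
    pub smart_api: u16,
    pub service_api: u16,
    pub backup_api: u16,
    pub role: String,
}

impl HostConfig {
    /// Base URLs for the three metrics APIs exposed by this host.
    pub fn api_urls(&self) -> BTreeMap<&'static str, String> {
        let mut urls = BTreeMap::new();
        urls.insert("smart", format!("http://{}:{}", self.address, self.smart_api));
        urls.insert("service", format!("http://{}:{}", self.address, self.service_api));
        urls.insert("backup", format!("http://{}:{}", self.address, self.backup_api));
        urls
    }
}
```

In the real dashboard these structs would be populated from `config/hosts.toml` via serde.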
### Dashboard Configuration (config/dashboard.toml)
```toml
[dashboard]
refresh_interval = 5 # seconds
history_retention = 7 # days
theme = "dark"
[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90
[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
```
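The `[widgets]` thresholds imply a simple status mapping. A minimal sketch, assuming percent-based thresholds and an illustrative `Status` enum:

```rust
// Illustrative mapping from the [widgets] thresholds to a widget status.
#[derive(Debug, PartialEq, Eq)]
pub enum Status { Ok, Warning, Critical }

pub struct MemoryThresholds {
    pub warning_percent: f64,  // memory_warning_threshold
    pub critical_percent: f64, // memory_critical_threshold
}

pub fn memory_status(used_mib: f64, total_mib: f64, t: &MemoryThresholds) -> Status {
    let percent = used_mib / total_mib * 100.0;
    if percent >= t.critical_percent {
        Status::Critical
    } else if percent >= t.warning_percent {
        Status::Warning
    } else {
        Status::Ok
    }
}
```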
## Key Features
### Real-time Monitoring
- **Auto-refresh** configurable intervals (1-60 seconds)
- **Async data fetching** from multiple hosts simultaneously
- **Connection status** indicators for each host
- **Graceful degradation** when hosts are unreachable
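The fetch-and-degrade behaviour above can be sketched without the real HTTP client; here std threads and a channel stand in for tokio tasks, and `fetch` stands in for the API call (all names are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Fetch from every host concurrently; hosts that fail report None instead of
// blocking the UI (graceful degradation).
pub fn fetch_all<F>(hosts: &[&str], fetch: F, timeout: Duration) -> Vec<(String, Option<String>)>
where
    F: Fn(&str) -> Option<String> + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for host in hosts {
        let host = host.to_string();
        let tx = tx.clone();
        let fetch = fetch.clone();
        thread::spawn(move || {
            let _ = tx.send((host.clone(), fetch(&host)));
        });
    }
    drop(tx); // Let the receiver observe completion once all workers finish.
    // Collect whatever arrives before the deadline; slow hosts are simply absent.
    let mut results = Vec::new();
    while let Ok(msg) = rx.recv_timeout(timeout) {
        results.push(msg);
    }
    results
}
```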
### Historical Tracking
- **SQLite database** for local storage
- **Trend analysis** for wear levels and resource usage
- **Retention policies** configurable per metric type
- **Export capabilities** (CSV, JSON)
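A retention policy reduces to dropping samples older than a cutoff. A sketch with an illustrative `Sample` type (the real schema would live in SQLite):

```rust
// Illustrative history sample; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub timestamp_s: u64, // Unix seconds
    pub value: f64,
}

// Keep only samples newer than `retention_days` before `now_s`.
pub fn prune(history: Vec<Sample>, now_s: u64, retention_days: u64) -> Vec<Sample> {
    let cutoff = now_s.saturating_sub(retention_days * 24 * 60 * 60);
    history.into_iter().filter(|s| s.timestamp_s >= cutoff).collect()
}
```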
### Alert System
- **Threshold-based alerts** for all metrics
- **Email integration** with existing notification system
- **Alert acknowledgment** and history
- **Custom alert rules** with logical operators
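Custom alert rules with logical operators can be modelled as a small expression tree; the rule shapes below are assumptions, since the document only names the feature:

```rust
// Minimal rule-engine sketch: threshold comparisons combined with And/Or.
pub enum Rule {
    Above(&'static str, f64), // metric > threshold
    Below(&'static str, f64), // metric < threshold
    And(Box<Rule>, Box<Rule>),
    Or(Box<Rule>, Box<Rule>),
}

// `get` resolves a metric name to its current value.
pub fn evaluate(rule: &Rule, get: &dyn Fn(&str) -> f64) -> bool {
    match rule {
        Rule::Above(metric, t) => get(metric) > *t,
        Rule::Below(metric, t) => get(metric) < *t,
        Rule::And(a, b) => evaluate(a, get) && evaluate(b, get),
        Rule::Or(a, b) => evaluate(a, get) || evaluate(b, get),
    }
}
```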
### Multi-Host Management
- **Auto-discovery** of hosts on network
- **Host grouping** by role (server, workstation, lab)
- **Bulk operations** across multiple hosts
- **Host-specific configurations**
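Host grouping by role is a straightforward bucketing step; a minimal sketch (the input shape is illustrative):

```rust
use std::collections::BTreeMap;

// Group (hostname, role) pairs into role -> hosts buckets for bulk operations.
pub fn group_by_role<'a>(hosts: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &(name, role) in hosts {
        groups.entry(role).or_default().push(name);
    }
    groups
}
```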
## Performance Requirements
### Resource Usage
- **Memory**: < 50MB runtime footprint
- **CPU**: < 1% average CPU usage
- **Network**: Minimal bandwidth (< 1KB/s per host)
- **Startup**: < 2 seconds cold start
### Responsiveness
- **UI updates**: 60 FPS smooth rendering
- **Data refresh**: < 500ms API response handling
- **Navigation**: Instant keyboard response
- **Error recovery**: < 5 seconds reconnection
## Security Considerations
### Network Security
- **Local network only** - no external connections
- **Authentication** for API access if implemented
- **Encrypted storage** for sensitive configuration
- **Audit logging** for administrative actions
### Data Privacy
- **Local storage** only - no cloud dependencies
- **Configurable retention** for historical data
- **Secure deletion** of expired data
- **No sensitive data logging**
## Testing Strategy
### Unit Tests
- API client modules
- Data parsing and validation
- Configuration management
- Alert logic
### Integration Tests
- Multi-host connectivity
- API error handling
- Database operations
- Alert delivery
### Performance Tests
- Memory usage under load
- Network timeout handling
- Large dataset rendering
- Extended runtime stability
## Deployment
### Installation
```bash
# Development build
cargo build --release
# Install from source
cargo install --path .
# Future: Package distribution
# Package for NixOS inclusion
```
### Usage
```bash
# Start dashboard
cm-dashboard
# Specify config
cm-dashboard --config /path/to/config
# Single host mode
cm-dashboard --host srv01
# Debug mode
cm-dashboard --verbose
```
## Maintenance
### Regular Tasks
- **Database cleanup** - automated retention policies
- **Log rotation** - configurable log levels and retention
- **Configuration validation** - startup configuration checks
- **Performance monitoring** - built-in metrics for dashboard itself
### Updates
- **Auto-update checks** - optional feature
- **Configuration migration** - version compatibility
- **API compatibility** - backwards compatibility with monitoring APIs
- **Feature toggles** - enable/disable features without rebuild
## Future Enhancements
### Proposed: ZMQ Metrics Agent Architecture
#### **Current Limitations of HTTP-based APIs**
- **Performance overhead**: Python scripts with HTTP servers on each host
- **Network complexity**: Multiple firewall ports (6127-6129) per host
- **Polling inefficiency**: Manual refresh cycles instead of real-time streaming
- **Scalability concerns**: Resource usage grows linearly with hosts
#### **Proposed: Rust ZMQ Gossip Network**
**Core Concept**: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
```
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ cmbox │<-->│ labbox │<-->│ srv01 │<-->│steambox │
│ :6130 │ │ :6130 │ │ :6130 │ │ :6130 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
^ ^ ^
└────────────────────────────┼──────────────┘
v
┌─────────┐
│simonbox │
│ :6130 │
└─────────┘
```
**Architecture Benefits**:
- **No central router**: Peer-to-peer gossip eliminates single point of failure
- **Self-healing**: Network automatically routes around failed hosts
- **Real-time streaming**: Metrics pushed immediately on change
- **Performance**: Rust agents ~10-100x faster than Python
- **Simplified networking**: Single ZMQ port (6130) vs multiple HTTP ports
- **Lower resource usage**: Minimal memory/CPU footprint per agent
#### **Implementation Plan**
**Phase 1: Agent Development**
```rust
// Lightweight agent on each host
pub struct MetricsAgent {
neighbors: Vec<String>, // ["srv01:6130", "cmbox:6130"]
collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
gossip_interval: Duration, // How often to broadcast
zmq_context: zmq::Context,
}
// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
hostname: String,
agent_type: AgentType, // Smart, Service, Backup
timestamp: u64,
metrics: MetricsData,
hop_count: u8, // Prevent infinite loops
}
```
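The `hop_count` field implies a forwarding rule: relay a message only if it is under the hop limit and has not already been seen. A hedged sketch (the dedup key and hop limit are assumptions, not the agent's actual protocol):

```rust
use std::collections::HashSet;

// Gossip relay state: tracks which (hostname, timestamp) messages were seen.
pub struct GossipState {
    seen: HashSet<(String, u64)>,
    max_hops: u8,
}

impl GossipState {
    pub fn new(max_hops: u8) -> Self {
        Self { seen: HashSet::new(), max_hops }
    }

    /// Returns true if the message should be relayed to neighbors.
    pub fn should_forward(&mut self, hostname: &str, timestamp: u64, hop_count: u8) -> bool {
        if hop_count >= self.max_hops {
            return false; // Hop limit reached: prevent infinite loops.
        }
        // insert() returns false when the key already existed (duplicate).
        self.seen.insert((hostname.to_string(), timestamp))
    }
}
```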
**Phase 2: Dashboard Integration**
- **ZMQ Subscriber**: Dashboard subscribes to gossip stream on srv01
- **Real-time updates**: WebSocket connection to TUI for live streaming
- **Historical storage**: Optional persistence layer for trending
**Phase 3: Migration Strategy**
- **Parallel deployment**: Run ZMQ agents alongside existing HTTP APIs
- **A/B comparison**: Validate metrics accuracy and performance
- **Gradual cutover**: Switch dashboard to ZMQ, then remove HTTP services
#### **Configuration Integration**
**Agent Configuration** (per-host):
```toml
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"] # Redundant connections
role = "agent" # or "dashboard" for srv01
[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 } # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 } # srv01 only
```
**Dashboard Configuration** (updated):
```toml
[data_source]
type = "zmq_gossip" # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true
[legacy_support]
http_apis_enabled = true # For migration period
fallback_to_http = true # If ZMQ unavailable
```
#### **Performance Comparison**
| Metric | Current (HTTP) | Proposed (ZMQ) |
|--------|---------------|----------------|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |
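The auto-reconnect entry suggests retrying with backoff; one common scheme is capped exponential backoff, shown here as an illustrative sketch (the parameters are assumptions):

```rust
use std::time::Duration;

// Capped exponential backoff: retry quickly at first, then at a steady rate.
pub fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    // Clamp the shift so large attempt counts cannot overflow the multiplier.
    let exp = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(exp.min(max_ms))
}
```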
#### **Development Roadmap**
**Week 1-2**: Basic ZMQ agent
- Rust binary with ZMQ gossip protocol
- SMART metrics collection
- Configuration management
**Week 3-4**: Dashboard integration
- ZMQ subscriber in cm-dashboard
- Real-time TUI updates
- Parallel HTTP/ZMQ operation
**Week 5-6**: Production readiness
- Service/backup metrics support
- Error handling and resilience
- Performance benchmarking
**Week 7-8**: Migration and cleanup
- Switch dashboard to ZMQ-only
- Remove legacy HTTP APIs
- Documentation and deployment
### Potential Features
- **Plugin system** for custom widgets
- **REST API** for external integrations
- **Mobile companion app** for alerts
- **Grafana integration** for advanced graphing
- **Prometheus metrics export**
- **Custom scripting** for automated responses
- **Machine learning** for predictive analytics
- **Clustering support** for high availability
### Integration Opportunities
- **Home Assistant** integration
- **Slack/Discord** notifications
- **SNMP support** for network equipment
- **Docker/Kubernetes** container monitoring
- **Cloud metrics** integration (if needed)
## Success Metrics
### Technical Success
- **Zero crashes** during normal operation
- **Sub-second response** times for all operations
- **99.9% uptime** for monitoring (excluding network issues)
- **Minimal resource usage** as specified
### User Success
- **Faster problem detection** compared to Glance
- **Reduced time to resolution** for issues
- **Improved infrastructure awareness**
- **Enhanced operational efficiency**
---
## Development Log
### Project Initialization
- Repository created: `/home/cm/projects/cm-dashboard`
- Initial planning: TUI dashboard to replace Glance
- Technology selected: Rust + ratatui
- Architecture designed: Multi-host monitoring with existing API integration
### Current Status (HTTP-based)
- **Functional TUI**: Basic dashboard rendering with ratatui
- **HTTP API integration**: Connects to ports 6127, 6128, 6129
- **Multi-host support**: Configurable host management
- **Async architecture**: Tokio-based concurrent metrics fetching
- **Configuration system**: TOML-based host and dashboard configuration
### Proposed Evolution: ZMQ Agent System
**Rationale for Change**: The current HTTP polling approach has fundamental limitations:
1. **Latency**: 5-second refresh cycles miss rapid changes
2. **Resource overhead**: Python HTTP servers consume unnecessary resources
3. **Network complexity**: Multiple ports per host complicate firewall management
4. **Scalability**: Linear resource growth with host count
**Solution**: Peer-to-peer ZMQ gossip network with Rust agents provides:
- **Real-time streaming**: Sub-second metric propagation
- **Fault tolerance**: Network self-heals around failed hosts
- **Performance**: Native Rust speed vs interpreted Python
- **Simplicity**: Single port per host, no central coordination
### ZMQ Agent Development Plan
**Component 1: cm-metrics-agent** (New Rust binary)
```toml
[package]
name = "cm-metrics-agent"
version = "0.1.0"
[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1" # Or direct smartctl bindings
```
**Component 2: Dashboard Integration** (Update cm-dashboard)
- Add ZMQ subscriber mode alongside HTTP client
- Implement real-time metric streaming
- Provide migration path from HTTP to ZMQ
**Migration Strategy**:
1. **Phase 1**: Deploy agents alongside existing APIs
2. **Phase 2**: Switch dashboard to ZMQ mode
3. **Phase 3**: Remove HTTP APIs from NixOS configurations
**Performance Targets**:
- **Agent footprint**: < 2MB RAM, < 1% CPU
- **Metric latency**: < 100ms propagation across network
- **Network efficiency**: < 1KB/s per host steady state
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.