2025-10-12 14:53:27 +02:00
parent 2581435b10
commit 2239badc8a
16 changed files with 1116 additions and 1414 deletions

CLAUDE.md

# CM Dashboard - Infrastructure Monitoring TUI
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
## Project Goals
### Core Objectives
- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)
### Key Features
- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
## Technical Architecture
### Technology Stack
- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **Time**: chrono
### Dependencies
```toml
[dependencies]
ratatui = "0.24"                                    # Modern TUI framework
tokio = { version = "1.0", features = ["full"] }    # Async runtime
serde = { version = "1.0", features = ["derive"] }  # Serialization
chrono = "0.4"                                      # Time handling
```
## API Integration
### Existing CMTEC APIs
1. **Smart Metrics API** (port 6127)
   - NVMe health status (wear, temperature, power-on hours)
   - Disk space information
   - SMART health indicators
2. **Service Metrics API** (port 6128)
   - Service status and resource usage
   - Service memory consumption vs limits
   - Host CPU load / frequency / temperature
   - Root disk utilisation snapshot
   - GPU utilisation and temperature (if available)
3. **Backup Metrics API** (port 6129)
   - Backup status and history
   - Repository statistics
   - Service integration status
### Data Structures
```rust
use serde::Deserialize;

// Representative shapes only; field names are illustrative and follow the
// API descriptions above rather than the exact wire format.
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub device: String,       // e.g. "nvme0n1"
    pub wear_percent: u8,     // NVMe wear level
    pub temperature_c: f32,   // Drive temperature
    pub power_on_hours: u64,  // SMART power-on hours
    pub health_passed: bool,  // Overall SMART verdict
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub last_status: String,  // e.g. "success"
    pub duration_s: u64,      // Last run duration
    pub repo_size_bytes: u64, // Repository size
}
```
## Dashboard Layout Design
### Main Dashboard View
```
┌───────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                  │
├────────────────────────────────────┬──────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0       │ Services • ok:1 warn:0 fail:0    │
│ Drive    Temp  Wear  Spare  Hours  │ Service memory: 7.1/23899.7 MiB  │
│ nvme0n1  28°C  1%    100%   14489  │ Disk usage: —                    │
│ Capacity: 954G  Used: 77G (8%)     │ ✔ sshd        7.1 MiB   —        │
├────────────────────────────────────┼──────────────────────────────────┤
│ CPU / Memory • warn                │ Backups                          │
│ System memory: 5251.7/23899.7 MiB  │ Host cmbox awaiting backup       │
│ CPU load (1/5/15): 2.18 2.66 2.56  │ metrics                          │
│ CPU freq: 1100.1 MHz               │                                  │
│ CPU temp: 47.0°C                   │                                  │
├────────────────────────────────────┼──────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0        │ Status • ZMQ connected           │
│ cmbox: warning: CPU load 2.18      │ Monitoring • hosts: 3            │
│ srv01: pending: awaiting metrics   │ Data source: ZMQ connected       │
│ labbox: pending: awaiting metrics  │ Active host: cmbox (1/3)         │
└────────────────────────────────────┴──────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```
### Multi-Host View
```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ …                                                                   │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```
## Development Status
### Immediate TODOs
- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
- Investigate why the backup metrics agent is not publishing data to the dashboard
- Resize the services widget so it can display more services without truncation
- Remove the dedicated status widget and redistribute the layout space
- Add responsive scaling within each widget so columns and content adapt dynamically
### Phase 1: Foundation (Week 1-2)
- [x] Project setup with Cargo.toml
- [ ] Basic TUI framework with ratatui
- [ ] HTTP client for API connections
- [ ] Data structures for metrics
- [ ] Simple single-host dashboard

**Deliverables:**
- Working TUI that connects to srv01
- Real-time display of basic metrics
- Keyboard navigation
### Phase 2: Core Features (Week 3-4)
- [ ] All widget implementations
- [ ] Multi-host configuration
- [ ] Historical data storage
- [ ] Alert system integration
- [ ] Configuration management

**Deliverables:**
- Full-featured dashboard
- Multi-host monitoring
- Historical trending
- Configuration file support
### Phase 3: Advanced Features 🚧 IN PROGRESS
- [x] ZMQ gossip network implementation
- [x] Comprehensive error handling
- [x] Performance optimizations
- [ ] Predictive analytics for wear levels
- [ ] Custom alert rules engine
- [ ] Historical data export capabilities

**Deliverables:**
- Production-ready dashboard
- Advanced monitoring features
- Comprehensive error handling
- Performance benchmarks
### Phase 4: Polish & Documentation (Week 7-8)
- [ ] Code documentation
- [ ] User documentation
- [ ] Installation scripts
- [ ] Testing suite
- [ ] Release preparation

**Deliverables:**
- Complete documentation
- Installation packages
- Test coverage
- Release v1.0
# Important Communication Guidelines
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
## Configuration
### Host Configuration (config/hosts.toml)
```toml
[hosts]
[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"
[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"
[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
```
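A host entry like the ones above maps naturally onto a small Rust struct. A hedged sketch (the struct and method names are illustrative, not the dashboard's actual types) showing how per-host API base URLs could be derived:

```rust
use std::collections::BTreeMap;

// Illustrative mirror of one [hosts.*] entry; field names follow hosts.toml.
#[derive(Debug, Clone)]
pub struct HostConfig {
    pub name: String,
    pub address: String,
    pub smart_api: u16,
    pub service_api: u16,
    pub backup_api: u16,
    pub role: String,
}

impl HostConfig {
    /// Base URLs for the three metrics APIs exposed by this host.
    pub fn api_urls(&self) -> BTreeMap<&'static str, String> {
        let mut urls = BTreeMap::new();
        urls.insert("smart", format!("http://{}:{}", self.address, self.smart_api));
        urls.insert("service", format!("http://{}:{}", self.address, self.service_api));
        urls.insert("backup", format!("http://{}:{}", self.address, self.backup_api));
        urls
    }
}
```

In the real dashboard these structs would be populated from `config/hosts.toml` via serde.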
### Dashboard Configuration (config/dashboard.toml)
```toml
[dashboard]
refresh_interval = 5 # seconds
history_retention = 7 # days
theme = "dark"
[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90
[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
```
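The `[widgets]` thresholds imply a simple status mapping. A minimal sketch, assuming percent-based thresholds and an illustrative `Status` enum:

```rust
// Illustrative mapping from the [widgets] thresholds to a widget status.
#[derive(Debug, PartialEq, Eq)]
pub enum Status { Ok, Warning, Critical }

pub struct MemoryThresholds {
    pub warning_percent: f64,  // memory_warning_threshold
    pub critical_percent: f64, // memory_critical_threshold
}

pub fn memory_status(used_mib: f64, total_mib: f64, t: &MemoryThresholds) -> Status {
    let percent = used_mib / total_mib * 100.0;
    if percent >= t.critical_percent {
        Status::Critical
    } else if percent >= t.warning_percent {
        Status::Warning
    } else {
        Status::Ok
    }
}
```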
## Key Features
### Real-time Monitoring
- **Auto-refresh** configurable intervals (1-60 seconds)
- **Async data fetching** from multiple hosts simultaneously
- **Connection status** indicators for each host
- **Graceful degradation** when hosts are unreachable
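The fetch-and-degrade behaviour above can be sketched without the real HTTP client; here std threads and a channel stand in for tokio tasks, and `fetch` stands in for the API call (all names are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Fetch from every host concurrently; hosts that fail report None instead of
// blocking the UI (graceful degradation).
pub fn fetch_all<F>(hosts: &[&str], fetch: F, timeout: Duration) -> Vec<(String, Option<String>)>
where
    F: Fn(&str) -> Option<String> + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for host in hosts {
        let host = host.to_string();
        let tx = tx.clone();
        let fetch = fetch.clone();
        thread::spawn(move || {
            let _ = tx.send((host.clone(), fetch(&host)));
        });
    }
    drop(tx); // Let the receiver observe completion once all workers finish.
    // Collect whatever arrives before the deadline; slow hosts are simply absent.
    let mut results = Vec::new();
    while let Ok(msg) = rx.recv_timeout(timeout) {
        results.push(msg);
    }
    results
}
```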
### Historical Tracking
- **SQLite database** for local storage
- **Trend analysis** for wear levels and resource usage
- **Retention policies** configurable per metric type
- **Export capabilities** (CSV, JSON)
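A retention policy reduces to dropping samples older than a cutoff. A sketch with an illustrative `Sample` type (the real schema would live in SQLite):

```rust
// Illustrative history sample; field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub timestamp_s: u64, // Unix seconds
    pub value: f64,
}

// Keep only samples newer than `retention_days` before `now_s`.
pub fn prune(history: Vec<Sample>, now_s: u64, retention_days: u64) -> Vec<Sample> {
    let cutoff = now_s.saturating_sub(retention_days * 24 * 60 * 60);
    history.into_iter().filter(|s| s.timestamp_s >= cutoff).collect()
}
```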
### Alert System
- **Threshold-based alerts** for all metrics
- **Email integration** with existing notification system
- **Alert acknowledgment** and history
- **Custom alert rules** with logical operators
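Custom alert rules with logical operators can be modelled as a small expression tree; the rule shapes below are assumptions, since the document only names the feature:

```rust
// Minimal rule-engine sketch: threshold comparisons combined with And/Or.
pub enum Rule {
    Above(&'static str, f64), // metric > threshold
    Below(&'static str, f64), // metric < threshold
    And(Box<Rule>, Box<Rule>),
    Or(Box<Rule>, Box<Rule>),
}

// `get` resolves a metric name to its current value.
pub fn evaluate(rule: &Rule, get: &dyn Fn(&str) -> f64) -> bool {
    match rule {
        Rule::Above(metric, t) => get(metric) > *t,
        Rule::Below(metric, t) => get(metric) < *t,
        Rule::And(a, b) => evaluate(a, get) && evaluate(b, get),
        Rule::Or(a, b) => evaluate(a, get) || evaluate(b, get),
    }
}
```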
### Multi-Host Management
- **Auto-discovery** of hosts on network
- **Host grouping** by role (server, workstation, lab)
- **Bulk operations** across multiple hosts
- **Host-specific configurations**
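Host grouping by role is a straightforward bucketing step; a minimal sketch (the input shape is illustrative):

```rust
use std::collections::BTreeMap;

// Group (hostname, role) pairs into role -> hosts buckets for bulk operations.
pub fn group_by_role<'a>(hosts: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &(name, role) in hosts {
        groups.entry(role).or_default().push(name);
    }
    groups
}
```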
## Performance Requirements
### Resource Usage
- **Memory**: < 50MB runtime footprint
- **CPU**: < 1% average CPU usage
- **Network**: Minimal bandwidth (< 1KB/s per host)
- **Startup**: < 2 seconds cold start
### Responsiveness
- **UI updates**: 60 FPS smooth rendering
- **Data refresh**: < 500ms API response handling
- **Navigation**: Instant keyboard response
- **Error recovery**: < 5 seconds reconnection
## Security Considerations
### Network Security
- **Local network only** - no external connections
- **Authentication** for API access if implemented
- **Encrypted storage** for sensitive configuration
- **Audit logging** for administrative actions
### Data Privacy
- **Local storage** only - no cloud dependencies
- **Configurable retention** for historical data
- **Secure deletion** of expired data
- **No sensitive data logging**
## Testing Strategy
### Unit Tests
- API client modules
- Data parsing and validation
- Configuration management
- Alert logic
### Integration Tests
- Multi-host connectivity
- API error handling
- Database operations
- Alert delivery
### Performance Tests
- Memory usage under load
- Network timeout handling
- Large dataset rendering
- Extended runtime stability
## Deployment
### Installation
```bash
# Development build
cargo build --release
# Install from source
cargo install --path .
# Future: Package distribution
# Package for NixOS inclusion
```
### Usage
```bash
# Start dashboard
cm-dashboard
# Specify config
cm-dashboard --config /path/to/config
# Single host mode
cm-dashboard --host srv01
# Debug mode
cm-dashboard --verbose
```
## Maintenance
### Regular Tasks
- **Database cleanup** - automated retention policies
- **Log rotation** - configurable log levels and retention
- **Configuration validation** - startup configuration checks
- **Performance monitoring** - built-in metrics for dashboard itself
### Updates
- **Auto-update checks** - optional feature
- **Configuration migration** - version compatibility
- **API compatibility** - backwards compatibility with monitoring APIs
- **Feature toggles** - enable/disable features without rebuild
## Future Enhancements
### Proposed: ZMQ Metrics Agent Architecture
#### **Current Limitations of HTTP-based APIs**
- **Performance overhead**: Python scripts with HTTP servers on each host
- **Network complexity**: Multiple firewall ports (6127-6129) per host
- **Polling inefficiency**: Manual refresh cycles instead of real-time streaming
- **Scalability concerns**: Resource usage grows linearly with hosts
#### **Proposed: Rust ZMQ Gossip Network**
**Core Concept**: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
```
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ cmbox │<-->│ labbox │<-->│ srv01 │<-->│steambox │
│ :6130 │ │ :6130 │ │ :6130 │ │ :6130 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
^ ^ ^
└────────────────────────────┼──────────────┘
v
┌─────────┐
│simonbox │
│ :6130 │
└─────────┘
```
**Architecture Benefits**:
- **No central router**: Peer-to-peer gossip eliminates single point of failure
- **Self-healing**: Network automatically routes around failed hosts
- **Real-time streaming**: Metrics pushed immediately on change
- **Performance**: Rust agents ~10-100x faster than Python
- **Simplified networking**: Single ZMQ port (6130) vs multiple HTTP ports
- **Lower resource usage**: Minimal memory/CPU footprint per agent
#### **Implementation Plan**
**Phase 1: Agent Development**
```rust
// Lightweight agent on each host
pub struct MetricsAgent {
neighbors: Vec<String>, // ["srv01:6130", "cmbox:6130"]
collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
gossip_interval: Duration, // How often to broadcast
zmq_context: zmq::Context,
}
// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
hostname: String,
agent_type: AgentType, // Smart, Service, Backup
timestamp: u64,
metrics: MetricsData,
hop_count: u8, // Prevent infinite loops
}
```
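The `hop_count` field implies a forwarding rule: relay a message only if it is under the hop limit and has not already been seen. A hedged sketch (the dedup key and hop limit are assumptions, not the agent's actual protocol):

```rust
use std::collections::HashSet;

// Gossip relay state: tracks which (hostname, timestamp) messages were seen.
pub struct GossipState {
    seen: HashSet<(String, u64)>,
    max_hops: u8,
}

impl GossipState {
    pub fn new(max_hops: u8) -> Self {
        Self { seen: HashSet::new(), max_hops }
    }

    /// Returns true if the message should be relayed to neighbors.
    pub fn should_forward(&mut self, hostname: &str, timestamp: u64, hop_count: u8) -> bool {
        if hop_count >= self.max_hops {
            return false; // Hop limit reached: prevent infinite loops.
        }
        // insert() returns false when the key already existed (duplicate).
        self.seen.insert((hostname.to_string(), timestamp))
    }
}
```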
**Phase 2: Dashboard Integration**
- **ZMQ Subscriber**: Dashboard subscribes to gossip stream on srv01
- **Real-time updates**: WebSocket connection to TUI for live streaming
- **Historical storage**: Optional persistence layer for trending
**Phase 3: Migration Strategy**
- **Parallel deployment**: Run ZMQ agents alongside existing HTTP APIs
- **A/B comparison**: Validate metrics accuracy and performance
- **Gradual cutover**: Switch dashboard to ZMQ, then remove HTTP services
#### **Configuration Integration**
**Agent Configuration** (per-host):
```toml
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"] # Redundant connections
role = "agent" # or "dashboard" for srv01
[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 } # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 } # srv01 only
```
**Dashboard Configuration** (updated):
```toml
[data_source]
type = "zmq_gossip" # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true
[legacy_support]
http_apis_enabled = true # For migration period
fallback_to_http = true # If ZMQ unavailable
```
#### **Performance Comparison**
| Metric | Current (HTTP) | Proposed (ZMQ) |
|--------|---------------|----------------|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |
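The auto-reconnect entry suggests retrying with backoff; one common scheme is capped exponential backoff, shown here as an illustrative sketch (the parameters are assumptions):

```rust
use std::time::Duration;

// Capped exponential backoff: retry quickly at first, then at a steady rate.
pub fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    // Clamp the shift so large attempt counts cannot overflow the multiplier.
    let exp = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(exp.min(max_ms))
}
```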
#### **Development Roadmap**
**Week 1-2**: Basic ZMQ agent
- Rust binary with ZMQ gossip protocol
- SMART metrics collection
- Configuration management
**Week 3-4**: Dashboard integration
- ZMQ subscriber in cm-dashboard
- Real-time TUI updates
- Parallel HTTP/ZMQ operation
**Week 5-6**: Production readiness
- Service/backup metrics support
- Error handling and resilience
- Performance benchmarking
**Week 7-8**: Migration and cleanup
- Switch dashboard to ZMQ-only
- Remove legacy HTTP APIs
- Documentation and deployment
### Potential Features
- **Plugin system** for custom widgets
- **REST API** for external integrations
- **Mobile companion app** for alerts
- **Grafana integration** for advanced graphing
- **Prometheus metrics export**
- **Custom scripting** for automated responses
- **Machine learning** for predictive analytics
- **Clustering support** for high availability
### Integration Opportunities
- **Home Assistant** integration
- **Slack/Discord** notifications
- **SNMP support** for network equipment
- **Docker/Kubernetes** container monitoring
- **Cloud metrics** integration (if needed)
## Success Metrics
### Technical Success
- **Zero crashes** during normal operation
- **Sub-second response** times for all operations
- **99.9% uptime** for monitoring (excluding network issues)
- **Minimal resource usage** as specified
### User Success
- **Faster problem detection** compared to Glance
- **Reduced time to resolution** for issues
- **Improved infrastructure awareness**
- **Enhanced operational efficiency**
---
## Development Log
### Project Initialization
- Repository created: `/home/cm/projects/cm-dashboard`
- Initial planning: TUI dashboard to replace Glance
- Technology selected: Rust + ratatui
- Architecture designed: Multi-host monitoring with existing API integration
### Current Status (HTTP-based)
- **Functional TUI**: Basic dashboard rendering with ratatui
- **HTTP API integration**: Connects to ports 6127, 6128, 6129
- **Multi-host support**: Configurable host management
- **Async architecture**: Tokio-based concurrent metrics fetching
- **Configuration system**: TOML-based host and dashboard configuration
### Proposed Evolution: ZMQ Agent System
**Rationale for Change**: The current HTTP polling approach has fundamental limitations:
1. **Latency**: 5-second refresh cycles miss rapid changes
2. **Resource overhead**: Python HTTP servers consume unnecessary resources
3. **Network complexity**: Multiple ports per host complicate firewall management
4. **Scalability**: Linear resource growth with host count
**Solution**: Peer-to-peer ZMQ gossip network with Rust agents provides:
- **Real-time streaming**: Sub-second metric propagation
- **Fault tolerance**: Network self-heals around failed hosts
- **Performance**: Native Rust speed vs interpreted Python
- **Simplicity**: Single port per host, no central coordination
### ZMQ Agent Development Plan
**Component 1: cm-metrics-agent** (New Rust binary)
```toml
[package]
name = "cm-metrics-agent"
version = "0.1.0"
[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1" # Or direct smartctl bindings
```
**Component 2: Dashboard Integration** (Update cm-dashboard)
- Add ZMQ subscriber mode alongside HTTP client
- Implement real-time metric streaming
- Provide migration path from HTTP to ZMQ
**Migration Strategy**:
1. **Phase 1**: Deploy agents alongside existing APIs
2. **Phase 2**: Switch dashboard to ZMQ mode
3. **Phase 3**: Remove HTTP APIs from NixOS configurations
**Performance Targets**:
- **Agent footprint**: < 2MB RAM, < 1% CPU
- **Metric latency**: < 100ms propagation across network
- **Network efficiency**: < 1KB/s per host steady state
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.