Testing
518
CLAUDE.md
@@ -1,11 +1,13 @@
|
||||
# CM Dashboard - Infrastructure Monitoring TUI
|
||||
|
||||
## Overview
|
||||
|
||||
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
|
||||
|
||||
## Project Goals
|
||||
|
||||
### Core Objectives
|
||||
|
||||
- **Real-time monitoring** of all infrastructure components
|
||||
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
|
||||
- **Performance-focused** with minimal resource usage
|
||||
@@ -13,6 +15,7 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
|
||||
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)
|
||||
|
||||
### Key Features
|
||||
|
||||
- **NVMe health monitoring** with wear prediction
|
||||
- **CPU / memory / GPU telemetry** with automatic thresholding
|
||||
- **Service resource monitoring** with per-service CPU and RAM usage
|
||||
@@ -24,6 +27,7 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
|
||||
## Technical Architecture
|
||||
|
||||
### Technology Stack
|
||||
|
||||
- **Language**: Rust 🦀
|
||||
- **TUI Framework**: ratatui (modern tui-rs fork)
|
||||
- **Async Runtime**: tokio
|
||||
@@ -34,6 +38,7 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
|
||||
- **Time**: chrono
|
||||
|
||||
### Dependencies
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
ratatui = "0.24" # Modern TUI framework
|
||||
@@ -84,27 +89,8 @@ cm-dashboard/
|
||||
└── WIDGETS.md # Widget development guide
|
||||
```
|
||||
|
||||
## API Integration
|
||||
|
||||
### Existing CMTEC APIs
|
||||
1. **Smart Metrics API** (port 6127)
|
||||
- NVMe health status (wear, temperature, power-on hours)
|
||||
- Disk space information
|
||||
- SMART health indicators
|
||||
|
||||
2. **Service Metrics API** (port 6128)
|
||||
- Service status and resource usage
|
||||
- Service memory consumption vs limits
|
||||
- Host CPU load / frequency / temperature
|
||||
- Root disk utilisation snapshot
|
||||
- GPU utilisation and temperature (if available)
|
||||
|
||||
3. **Backup Metrics API** (port 6129)
|
||||
- Backup status and history
|
||||
- Repository statistics
|
||||
- Service integration status
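
The three APIs listed above share a host address and differ only by port. A minimal sketch of how a client might derive the base URLs for one host is shown here. The `HostEndpoints` type and `endpoints_for` function are hypothetical names, and plain HTTP with no URL path is an assumption, since the source does not specify the scheme or routes.

```rust
/// Base URLs for the three per-host metrics APIs (hypothetical helper type).
#[derive(Debug, Clone, PartialEq)]
pub struct HostEndpoints {
    pub smart: String,
    pub service: String,
    pub backup: String,
}

// Port numbers as listed in the API Integration section.
pub const SMART_PORT: u16 = 6127;
pub const SERVICE_PORT: u16 = 6128;
pub const BACKUP_PORT: u16 = 6129;

/// Build the base URLs for one host.
/// Assumption: the APIs speak plain HTTP; request paths are not known here.
pub fn endpoints_for(address: &str) -> HostEndpoints {
    let url = |port: u16| format!("http://{address}:{port}");
    HostEndpoints {
        smart: url(SMART_PORT),
        service: url(SERVICE_PORT),
        backup: url(BACKUP_PORT),
    }
}
```

Calling `endpoints_for("192.168.30.100")` would yield one base URL per API for that address.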
|
||||
|
||||
### Data Structures
|
||||
|
||||
```rust
|
||||
#[derive(Deserialize, Debug)]
|
||||
pub struct SmartMetrics {
|
||||
@@ -154,36 +140,35 @@ pub struct BackupMetrics {
|
||||
## Dashboard Layout Design
|
||||
|
||||
### Main Dashboard View
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ 📊 CMTEC Infrastructure Dashboard srv01 │
|
||||
│ CM Dashboard • cmbox │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ 💾 NVMe Health │ 🐏 RAM Optimization │
|
||||
│ ┌─────────────────────────┐ │ ┌─────────────────────────────────────┐ │
|
||||
│ │ Wear: 4% (█░░░░░░░░░░) │ │ │ Physical: 2.4G/7.6G (32%) │ │
|
||||
│ │ Temp: 56°C │ │ │ zram: 64B/1.9G (64:1 compression) │ │
|
||||
│ │ Hours: 11419h (475d) │ │ │ tmpfs: /var/log 88K/512M │ │
|
||||
│ │ Status: ✅ PASSED │ │ │ Kernel: vm.dirty_ratio=5 │ │
|
||||
│ └─────────────────────────┘ │ └─────────────────────────────────────┘ │
|
||||
│ Storage • ok:1 warn:0 crit:0 │ Services • ok:1 warn:0 fail:0 │
|
||||
│ ┌─────────────────────────────────┐ │ ┌─────────────────────────────── │ │
|
||||
│ │Drive Temp Wear Spare Hours │ │ │Service memory: 7.1/23899.7 MiB│ │
|
||||
│ │nvme0n1 28°C 1% 100% 14489 │ │ │Disk usage: — │ │
|
||||
│ │ Capacity Usage │ │ │ Service Memory Disk │ │
|
||||
│ │ 954G 77G (8%) │ │ │✔ sshd 7.1 MiB — │ │
|
||||
│ └─────────────────────────────────┘ │ └─────────────────────────────── │ │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ 🔧 Services Status │
|
||||
│ ┌─────────────────────────────────────────────────────────────────┐ │
|
||||
│ │ ✅ Gitea (256M/4G, 15G/100G) ✅ smart-metrics-api │ │
|
||||
│ │ ✅ Immich (1.2G/4G, 45G/500G) ✅ service-metrics-api │ │
|
||||
│ │ ✅ Vaultwarden (45M/1G, 512M/1G) ✅ backup-metrics-api │ │
|
||||
│ │ ✅ UniFi (234M/2G, 1.2G/5G) ✅ WordPress M2 │ │
|
||||
│ └─────────────────────────────────────────────────────────────────┘ │
|
||||
│ CPU / Memory • warn │ Backups │
|
||||
│ System memory: 5251.7/23899.7 MiB │ Host cmbox awaiting backup │ │
|
||||
│ CPU load (1/5/15): 2.18 2.66 2.56 │ metrics │ │
|
||||
│ CPU freq: 1100.1 MHz │ │ │
|
||||
│ CPU temp: 47.0°C │ │ │
|
||||
├─────────────────────────────────────────────────────────────────────┤
|
||||
│ 📧 Recent Alerts │ 💾 Backup Status │
|
||||
│ 10:15 NVMe wear OK → 4% │ Last: ✅ Success (04:00) │
|
||||
│ 04:00 Backup completed successfully │ Duration: 45m 32s │
|
||||
│ Yesterday: Email notification test │ Size: 15.2GB → 4.1GB │
|
||||
│ │ Next: Tomorrow 04:00 │
|
||||
│ Alerts • ok:0 warn:3 fail:0 │ Status • ZMQ connected │
|
||||
│ cmbox: warning: CPU load 2.18 │ Monitoring • hosts: 3 │ │
|
||||
│ srv01: pending: awaiting metrics │ Data source: ZMQ – connected │ │
|
||||
│ labbox: pending: awaiting metrics │ Active host: cmbox (1/3) │ │
|
||||
└─────────────────────────────────────────────────────────────────────┘
|
||||
Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit
|
||||
Keys: [←→] hosts [r]efresh [q]uit
|
||||
```
|
||||
|
||||
### Multi-Host View
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ 🖥️ CMTEC Host Overview │
|
||||
@@ -199,445 +184,28 @@ Keys: [h]osts [r]efresh [s]ettings [a]lerts [←→] navigate [q]uit
|
||||
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
|
||||
```
|
||||
|
||||
## Development Phases
|
||||
## Development Status
|
||||
|
||||
### Phase 1: Foundation (Week 1-2)
|
||||
- [x] Project setup with Cargo.toml
|
||||
- [ ] Basic TUI framework with ratatui
|
||||
- [ ] HTTP client for API connections
|
||||
- [ ] Data structures for metrics
|
||||
- [ ] Simple single-host dashboard
|
||||
### Immediate TODOs
|
||||
|
||||
**Deliverables:**
|
||||
- Working TUI that connects to srv01
|
||||
- Real-time display of basic metrics
|
||||
- Keyboard navigation
|
||||
- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
|
||||
|
||||
### Phase 2: Core Features (Week 3-4)
|
||||
- [ ] All widget implementations
|
||||
- [ ] Multi-host configuration
|
||||
- [ ] Historical data storage
|
||||
- [ ] Alert system integration
|
||||
- [ ] Configuration management
|
||||
- Investigate why the backup metrics agent is not publishing data to the dashboard
|
||||
- Resize the services widget so it can display more services without truncation
|
||||
- Remove the dedicated status widget and redistribute the layout space
|
||||
- Add responsive scaling within each widget so columns and content adapt dynamically
|
||||
|
||||
**Deliverables:**
|
||||
- Full-featured dashboard
|
||||
- Multi-host monitoring
|
||||
- Historical trending
|
||||
- Configuration file support
|
||||
### Phase 3: Advanced Features 🚧 IN PROGRESS
|
||||
|
||||
### Phase 3: Advanced Features (Week 5-6)
|
||||
- [ ] Predictive analytics
|
||||
- [ ] Custom alert rules
|
||||
- [ ] Export capabilities
|
||||
- [ ] Performance optimizations
|
||||
- [ ] Error handling & resilience
|
||||
- [x] ZMQ gossip network implementation
|
||||
- [x] Comprehensive error handling
|
||||
- [x] Performance optimizations
|
||||
- [ ] Predictive analytics for wear levels
|
||||
- [ ] Custom alert rules engine
|
||||
- [ ] Historical data export capabilities
|
||||
|
||||
**Deliverables:**
|
||||
- Production-ready dashboard
|
||||
- Advanced monitoring features
|
||||
- Comprehensive error handling
|
||||
- Performance benchmarks
|
||||
# Important Communication Guidelines
|
||||
|
||||
### Phase 4: Polish & Documentation (Week 7-8)
|
||||
- [ ] Code documentation
|
||||
- [ ] User documentation
|
||||
- [ ] Installation scripts
|
||||
- [ ] Testing suite
|
||||
- [ ] Release preparation
|
||||
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
|
||||
|
||||
**Deliverables:**
|
||||
- Complete documentation
|
||||
- Installation packages
|
||||
- Test coverage
|
||||
- Release v1.0
|
||||
|
||||
## Configuration
|
||||
|
||||
### Host Configuration (config/hosts.toml)
|
||||
```toml
[hosts]

[hosts.srv01]
name = "srv01"
address = "192.168.30.100"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "server"

[hosts.cmbox]
name = "cmbox"
address = "192.168.30.101"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "workstation"

[hosts.labbox]
name = "labbox"
address = "192.168.30.102"
smart_api = 6127
service_api = 6128
backup_api = 6129
role = "lab"
```
|
||||
|
||||
### Dashboard Configuration (config/dashboard.toml)
|
||||
```toml
[dashboard]
refresh_interval = 5 # seconds
history_retention = 7 # days
theme = "dark"

[widgets]
nvme_wear_threshold = 70
temperature_threshold = 70
memory_warning_threshold = 80
memory_critical_threshold = 90

[alerts]
email_enabled = true
sound_enabled = false
desktop_notifications = true
```
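
As an illustration of how the two memory thresholds in `[widgets]` could drive a widget's status, here is a minimal sketch. The `Level` and `MemoryThresholds` types and the `classify_memory` function are hypothetical, and treating a value equal to a threshold as already exceeding it is an assumption.

```rust
/// Severity shown by a widget (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Ok,
    Warning,
    Critical,
}

/// Mirrors memory_warning_threshold / memory_critical_threshold, in percent.
pub struct MemoryThresholds {
    pub warning: f64,
    pub critical: f64,
}

impl Default for MemoryThresholds {
    fn default() -> Self {
        // Values taken from the example dashboard.toml.
        Self { warning: 80.0, critical: 90.0 }
    }
}

/// Classify memory usage (percent of total) against the thresholds.
/// Assumption: thresholds are inclusive, so exactly 80% is a warning.
pub fn classify_memory(used_percent: f64, t: &MemoryThresholds) -> Level {
    if used_percent >= t.critical {
        Level::Critical
    } else if used_percent >= t.warning {
        Level::Warning
    } else {
        Level::Ok
    }
}
```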
|
||||
|
||||
## Key Features
|
||||
|
||||
### Real-time Monitoring
|
||||
- **Auto-refresh** at configurable intervals (1-60 seconds)
|
||||
- **Async data fetching** from multiple hosts simultaneously
|
||||
- **Connection status** indicators for each host
|
||||
- **Graceful degradation** when hosts are unreachable
|
||||
|
||||
### Historical Tracking
|
||||
- **SQLite database** for local storage
|
||||
- **Trend analysis** for wear levels and resource usage
|
||||
- **Retention policies** configurable per metric type
|
||||
- **Export capabilities** (CSV, JSON)
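
A retention policy boils down to dropping samples older than a cutoff. The sketch below uses an in-memory `Vec` as a stand-in for the SQLite store, which is an assumption made only to keep the example self-contained. `Sample` and `prune_expired` are hypothetical names, and timestamps are assumed to be Unix seconds.

```rust
/// One stored data point (hypothetical shape; timestamp in Unix seconds).
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub timestamp: u64,
    pub value: f64,
}

const SECONDS_PER_DAY: u64 = 86_400;

/// Remove samples older than `retention_days` relative to `now`.
/// With history_retention = 7, anything older than seven days is dropped.
pub fn prune_expired(samples: &mut Vec<Sample>, now: u64, retention_days: u64) {
    // saturating_sub avoids underflow when `now` is smaller than the window.
    let cutoff = now.saturating_sub(retention_days * SECONDS_PER_DAY);
    samples.retain(|s| s.timestamp >= cutoff);
}
```

In a SQLite-backed store the same cutoff would typically become the bound of a `DELETE ... WHERE timestamp < ?` statement.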
|
||||
|
||||
### Alert System
|
||||
- **Threshold-based alerts** for all metrics
|
||||
- **Email integration** with existing notification system
|
||||
- **Alert acknowledgment** and history
|
||||
- **Custom alert rules** with logical operators
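
Custom rules with logical operators can be modelled as a small expression tree. This is a sketch under assumptions, not the project's rule engine: the `Rule` enum, its variants, and the metric names used in the usage note are all hypothetical, and a metric missing from the snapshot is assumed not to trigger a comparison.

```rust
use std::collections::HashMap;

/// A composable alert rule (hypothetical design).
pub enum Rule {
    /// Fires when the named metric is strictly above the threshold.
    Above { metric: String, threshold: f64 },
    /// Fires when the named metric is strictly below the threshold.
    Below { metric: String, threshold: f64 },
    And(Box<Rule>, Box<Rule>),
    Or(Box<Rule>, Box<Rule>),
    Not(Box<Rule>),
}

impl Rule {
    /// Convenience constructor for an `Above` comparison.
    pub fn above(metric: &str, threshold: f64) -> Rule {
        Rule::Above { metric: metric.to_string(), threshold }
    }

    /// Evaluate the rule against a snapshot of metric values.
    /// Assumption: a comparison on a missing metric evaluates to false.
    pub fn eval(&self, metrics: &HashMap<String, f64>) -> bool {
        match self {
            Rule::Above { metric, threshold } => {
                metrics.get(metric).map_or(false, |v| *v > *threshold)
            }
            Rule::Below { metric, threshold } => {
                metrics.get(metric).map_or(false, |v| *v < *threshold)
            }
            Rule::And(a, b) => a.eval(metrics) && b.eval(metrics),
            Rule::Or(a, b) => a.eval(metrics) || b.eval(metrics),
            Rule::Not(inner) => !inner.eval(metrics),
        }
    }
}
```

For example, `Rule::Or(Box::new(Rule::above("nvme_wear", 70.0)), Box::new(Rule::above("temperature", 70.0)))` would fire if either value exceeds 70.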
|
||||
|
||||
### Multi-Host Management
|
||||
- **Auto-discovery** of hosts on network
|
||||
- **Host grouping** by role (server, workstation, lab)
|
||||
- **Bulk operations** across multiple hosts
|
||||
- **Host-specific configurations**
|
||||
|
||||
## Performance Requirements
|
||||
|
||||
### Resource Usage
|
||||
- **Memory**: < 50MB runtime footprint
|
||||
- **CPU**: < 1% average CPU usage
|
||||
- **Network**: Minimal bandwidth (< 1KB/s per host)
|
||||
- **Startup**: < 2 seconds cold start
|
||||
|
||||
### Responsiveness
|
||||
- **UI updates**: 60 FPS smooth rendering
|
||||
- **Data refresh**: < 500ms API response handling
|
||||
- **Navigation**: Instant keyboard response
|
||||
- **Error recovery**: < 5 seconds reconnection
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Network Security
|
||||
- **Local network only** - no external connections
|
||||
- **Authentication** for API access if implemented
|
||||
- **Encrypted storage** for sensitive configuration
|
||||
- **Audit logging** for administrative actions
|
||||
|
||||
### Data Privacy
|
||||
- **Local storage** only - no cloud dependencies
|
||||
- **Configurable retention** for historical data
|
||||
- **Secure deletion** of expired data
|
||||
- **No sensitive data logging**
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- API client modules
|
||||
- Data parsing and validation
|
||||
- Configuration management
|
||||
- Alert logic
|
||||
|
||||
### Integration Tests
|
||||
- Multi-host connectivity
|
||||
- API error handling
|
||||
- Database operations
|
||||
- Alert delivery
|
||||
|
||||
### Performance Tests
|
||||
- Memory usage under load
|
||||
- Network timeout handling
|
||||
- Large dataset rendering
|
||||
- Extended runtime stability
|
||||
|
||||
## Deployment
|
||||
|
||||
### Installation
|
||||
```bash
# Release (optimized) build
cargo build --release

# Install from source
cargo install --path .

# Future: Package distribution
# Package for NixOS inclusion
```
|
||||
|
||||
### Usage
|
||||
```bash
# Start dashboard
cm-dashboard

# Specify config
cm-dashboard --config /path/to/config

# Single host mode
cm-dashboard --host srv01

# Debug mode
cm-dashboard --verbose
```
|
||||
|
||||
## Maintenance
|
||||
|
||||
### Regular Tasks
|
||||
- **Database cleanup** - automated retention policies
|
||||
- **Log rotation** - configurable log levels and retention
|
||||
- **Configuration validation** - startup configuration checks
|
||||
- **Performance monitoring** - built-in metrics for dashboard itself
|
||||
|
||||
### Updates
|
||||
- **Auto-update checks** - optional feature
|
||||
- **Configuration migration** - version compatibility
|
||||
- **API compatibility** - backwards compatibility with monitoring APIs
|
||||
- **Feature toggles** - enable/disable features without rebuild
|
||||
|
||||
## Future Enhancements
|
||||
|
||||
### Proposed: ZMQ Metrics Agent Architecture
|
||||
|
||||
#### **Current Limitations of HTTP-based APIs**
|
||||
- **Performance overhead**: Python scripts with HTTP servers on each host
|
||||
- **Network complexity**: Multiple firewall ports (6127-6129) per host
|
||||
- **Polling inefficiency**: Manual refresh cycles instead of real-time streaming
|
||||
- **Scalability concerns**: Resource usage grows linearly with hosts
|
||||
|
||||
#### **Proposed: Rust ZMQ Gossip Network**
|
||||
|
||||
**Core Concept**: Replace HTTP polling with a peer-to-peer ZMQ gossip network where lightweight Rust agents stream metrics in real-time.
|
||||
|
||||
```
┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│  cmbox  │<-->│ labbox  │<-->│  srv01  │<-->│steambox │
│  :6130  │    │  :6130  │    │  :6130  │    │  :6130  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘
     ^                            ^              ^
     └────────────────────────────┼──────────────┘
                                  v
                             ┌─────────┐
                             │simonbox │
                             │  :6130  │
                             └─────────┘
```
|
||||
|
||||
**Architecture Benefits**:
|
||||
- **No central router**: Peer-to-peer gossip eliminates single point of failure
|
||||
- **Self-healing**: Network automatically routes around failed hosts
|
||||
- **Real-time streaming**: Metrics pushed immediately on change
|
||||
- **Performance**: Rust agents ~10-100x faster than Python
|
||||
- **Simplified networking**: Single ZMQ port (6130) vs multiple HTTP ports
|
||||
- **Lower resource usage**: Minimal memory/CPU footprint per agent
|
||||
|
||||
#### **Implementation Plan**
|
||||
|
||||
**Phase 1: Agent Development**
|
||||
```rust
use std::time::Duration;

use serde::{Deserialize, Serialize};

// `Collector`, `AgentType`, and `MetricsData` are not defined in this excerpt.

// Lightweight agent on each host
pub struct MetricsAgent {
    neighbors: Vec<String>,              // ["srv01:6130", "cmbox:6130"]
    collectors: Vec<Box<dyn Collector>>, // SMART, Service, Backup
    gossip_interval: Duration,           // How often to broadcast
    zmq_context: zmq::Context,
}

// Message format for metrics
#[derive(Serialize, Deserialize)]
struct MetricsMessage {
    hostname: String,
    agent_type: AgentType, // Smart, Service, Backup
    timestamp: u64,
    metrics: MetricsData,
    hop_count: u8, // Prevent infinite loops
}
```
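
The `hop_count` field only prevents loops if every agent checks it before re-broadcasting. The following is a minimal sketch of that forwarding decision. `GossipEnvelope` is a trimmed, hypothetical stand-in for `MetricsMessage`, and the `MAX_HOPS` limit of 3 is an assumed value, not one taken from the design.

```rust
/// Trimmed stand-in for MetricsMessage, keeping only routing-relevant fields.
#[derive(Debug, Clone, PartialEq)]
pub struct GossipEnvelope {
    pub hostname: String, // originating host
    pub timestamp: u64,
    pub hop_count: u8,
}

/// Assumed upper bound on re-broadcasts; tune to the network diameter.
pub const MAX_HOPS: u8 = 3;

/// Decide whether a received message should be forwarded to neighbors.
/// Returns the copy to forward (hop count incremented), or None to drop it.
pub fn next_hop(msg: &GossipEnvelope, local_hostname: &str) -> Option<GossipEnvelope> {
    // A message that originated here has looped back: drop it.
    if msg.hostname == local_hostname {
        return None;
    }
    // The message has travelled far enough: stop propagating.
    if msg.hop_count >= MAX_HOPS {
        return None;
    }
    Some(GossipEnvelope {
        hop_count: msg.hop_count + 1,
        ..msg.clone()
    })
}
```

A production agent would likely also remember recently seen `(hostname, timestamp)` pairs so that duplicates arriving over different paths are not forwarded twice.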
|
||||
|
||||
**Phase 2: Dashboard Integration**
|
||||
- **ZMQ Subscriber**: Dashboard subscribes to gossip stream on srv01
|
||||
- **Real-time updates**: WebSocket connection to TUI for live streaming
|
||||
- **Historical storage**: Optional persistence layer for trending
|
||||
|
||||
**Phase 3: Migration Strategy**
|
||||
- **Parallel deployment**: Run ZMQ agents alongside existing HTTP APIs
|
||||
- **A/B comparison**: Validate metrics accuracy and performance
|
||||
- **Gradual cutover**: Switch dashboard to ZMQ, then remove HTTP services
|
||||
|
||||
#### **Configuration Integration**
|
||||
|
||||
**Agent Configuration** (per-host):
|
||||
```toml
[metrics_agent]
enabled = true
port = 6130
neighbors = ["srv01:6130", "cmbox:6130"] # Redundant connections
role = "agent" # or "dashboard" for srv01

[collectors]
smart_metrics = { enabled = true, interval_ms = 5000 }
service_metrics = { enabled = true, interval_ms = 2000 } # srv01 only
backup_metrics = { enabled = true, interval_ms = 30000 } # srv01 only
```
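
Each collector entry pairs an `enabled` flag with an `interval_ms`. One way an agent loop might decide which collectors to run on a given tick is sketched below. `CollectorSchedule` and `is_due` are hypothetical names, and timestamps are assumed to be milliseconds from a monotonic clock.

```rust
/// Scheduling state for one collector (hypothetical type).
pub struct CollectorSchedule {
    pub enabled: bool,
    pub interval_ms: u64,
    /// Time of the last completed run, or None if it has never run.
    pub last_run_ms: Option<u64>,
}

impl CollectorSchedule {
    /// True when the collector is enabled and its interval has elapsed.
    /// A collector that has never run is due immediately.
    pub fn is_due(&self, now_ms: u64) -> bool {
        if !self.enabled {
            return false;
        }
        match self.last_run_ms {
            None => true,
            Some(last) => now_ms.saturating_sub(last) >= self.interval_ms,
        }
    }
}
```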
|
||||
|
||||
**Dashboard Configuration** (updated):
|
||||
```toml
[data_source]
type = "zmq_gossip" # vs current "http_polling"
listen_port = 6130
buffer_size = 1000
real_time_updates = true

[legacy_support]
http_apis_enabled = true # For migration period
fallback_to_http = true # If ZMQ unavailable
```
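
The `[legacy_support]` flags imply a simple selection rule during the migration period: prefer ZMQ, and fall back to HTTP only when both flags allow it. A minimal sketch of that rule follows. `DataSource`, `LegacySupport`, and `select_source` are hypothetical names, and how ZMQ availability is detected is left outside the sketch.

```rust
/// Where the dashboard reads metrics from (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataSource {
    ZmqGossip,
    HttpPolling,
}

/// Mirrors the [legacy_support] table.
pub struct LegacySupport {
    pub http_apis_enabled: bool,
    pub fallback_to_http: bool,
}

/// Pick the data source for the current refresh cycle.
/// Returns None when ZMQ is unavailable and HTTP fallback is not permitted.
pub fn select_source(zmq_available: bool, legacy: &LegacySupport) -> Option<DataSource> {
    if zmq_available {
        Some(DataSource::ZmqGossip)
    } else if legacy.http_apis_enabled && legacy.fallback_to_http {
        Some(DataSource::HttpPolling)
    } else {
        None
    }
}
```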
|
||||
|
||||
#### **Performance Comparison**
|
||||
|
||||
| Metric | Current (HTTP) | Proposed (ZMQ) |
|--------|---------------|----------------|
| Collection latency | ~50ms | ~1ms |
| Network overhead | HTTP headers + JSON | Binary ZMQ frames |
| Resource per host | ~5MB (Python + HTTP) | ~1MB (Rust agent) |
| Update frequency | 5s polling | Real-time push |
| Network ports | 3 per host | 1 per host |
| Failure recovery | Manual retry | Auto-reconnect |
|
||||
|
||||
#### **Development Roadmap**
|
||||
|
||||
**Week 1-2**: Basic ZMQ agent
|
||||
- Rust binary with ZMQ gossip protocol
|
||||
- SMART metrics collection
|
||||
- Configuration management
|
||||
|
||||
**Week 3-4**: Dashboard integration
|
||||
- ZMQ subscriber in cm-dashboard
|
||||
- Real-time TUI updates
|
||||
- Parallel HTTP/ZMQ operation
|
||||
|
||||
**Week 5-6**: Production readiness
|
||||
- Service/backup metrics support
|
||||
- Error handling and resilience
|
||||
- Performance benchmarking
|
||||
|
||||
**Week 7-8**: Migration and cleanup
|
||||
- Switch dashboard to ZMQ-only
|
||||
- Remove legacy HTTP APIs
|
||||
- Documentation and deployment
|
||||
|
||||
### Potential Features
|
||||
- **Plugin system** for custom widgets
|
||||
- **REST API** for external integrations
|
||||
- **Mobile companion app** for alerts
|
||||
- **Grafana integration** for advanced graphing
|
||||
- **Prometheus metrics export**
|
||||
- **Custom scripting** for automated responses
|
||||
- **Machine learning** for predictive analytics
|
||||
- **Clustering support** for high availability
|
||||
|
||||
### Integration Opportunities
|
||||
- **Home Assistant** integration
|
||||
- **Slack/Discord** notifications
|
||||
- **SNMP support** for network equipment
|
||||
- **Docker/Kubernetes** container monitoring
|
||||
- **Cloud metrics** integration (if needed)
|
||||
|
||||
## Success Metrics
|
||||
|
||||
### Technical Success
|
||||
- **Zero crashes** during normal operation
|
||||
- **Sub-second response** times for all operations
|
||||
- **99.9% uptime** for monitoring (excluding network issues)
|
||||
- **Minimal resource usage** as specified
|
||||
|
||||
### User Success
|
||||
- **Faster problem detection** compared to Glance
|
||||
- **Reduced time to resolution** for issues
|
||||
- **Improved infrastructure awareness**
|
||||
- **Enhanced operational efficiency**
|
||||
|
||||
---
|
||||
|
||||
## Development Log
|
||||
|
||||
### Project Initialization
|
||||
- Repository created: `/home/cm/projects/cm-dashboard`
|
||||
- Initial planning: TUI dashboard to replace Glance
|
||||
- Technology selected: Rust + ratatui
|
||||
- Architecture designed: Multi-host monitoring with existing API integration
|
||||
|
||||
### Current Status (HTTP-based)
|
||||
- **Functional TUI**: Basic dashboard rendering with ratatui
|
||||
- **HTTP API integration**: Connects to ports 6127, 6128, 6129
|
||||
- **Multi-host support**: Configurable host management
|
||||
- **Async architecture**: Tokio-based concurrent metrics fetching
|
||||
- **Configuration system**: TOML-based host and dashboard configuration
|
||||
|
||||
### Proposed Evolution: ZMQ Agent System
|
||||
|
||||
**Rationale for Change**: The current HTTP polling approach has fundamental limitations:
|
||||
1. **Latency**: 5-second refresh cycles miss rapid changes
|
||||
2. **Resource overhead**: Python HTTP servers consume unnecessary resources
|
||||
3. **Network complexity**: Multiple ports per host complicate firewall management
|
||||
4. **Scalability**: Linear resource growth with host count
|
||||
|
||||
**Solution**: Peer-to-peer ZMQ gossip network with Rust agents provides:
|
||||
- **Real-time streaming**: Sub-second metric propagation
|
||||
- **Fault tolerance**: Network self-heals around failed hosts
|
||||
- **Performance**: Native Rust speed vs interpreted Python
|
||||
- **Simplicity**: Single port per host, no central coordination
|
||||
|
||||
### ZMQ Agent Development Plan
|
||||
|
||||
**Component 1: cm-metrics-agent** (New Rust binary)
|
||||
```toml
[package]
name = "cm-metrics-agent"
version = "0.1.0"

[dependencies]
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
tokio = { version = "1.0", features = ["full"] }
smartmontools-rs = "0.1" # Or direct smartctl bindings
```
|
||||
|
||||
**Component 2: Dashboard Integration** (Update cm-dashboard)
|
||||
- Add ZMQ subscriber mode alongside HTTP client
|
||||
- Implement real-time metric streaming
|
||||
- Provide migration path from HTTP to ZMQ
|
||||
|
||||
**Migration Strategy**:
|
||||
1. **Phase 1**: Deploy agents alongside existing APIs
|
||||
2. **Phase 2**: Switch dashboard to ZMQ mode
|
||||
3. **Phase 3**: Remove HTTP APIs from NixOS configurations
|
||||
|
||||
**Performance Targets**:
|
||||
- **Agent footprint**: < 2MB RAM, < 1% CPU
|
||||
- **Metric latency**: < 100ms propagation across network
|
||||
- **Network efficiency**: < 1KB/s per host steady state
|
||||
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.
|
||||
|
||||