# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

## Project Goals

### Core Objectives

- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)

### Key Features

- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis

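The thresholding and alert-summary features listed above could be sketched roughly as follows. This is a minimal, std-only illustration: the `Status` enum, the `classify` and `summarise` helpers, and every threshold value are assumptions made for the example, not the dashboard's actual code.

```rust
/// Health level for a single metric (hypothetical type for illustration).
/// Variant order matters: the derived `Ord` gives Ok < Warn < Crit.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warn,
    Crit,
}

/// Classify a reading against warn/crit thresholds.
/// The thresholds are supplied by the caller; none are taken from the project.
pub fn classify(value: f32, warn_at: f32, crit_at: f32) -> Status {
    if value >= crit_at {
        Status::Crit
    } else if value >= warn_at {
        Status::Warn
    } else {
        Status::Ok
    }
}

/// Roll several metric statuses up into one host status: the worst one wins.
/// An empty slice is treated as healthy.
pub fn summarise(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Ok)
}
```

With an assumed warn threshold of 2.0, a 1-minute CPU load of 2.18 would classify as `Warn`, and a host with one `Warn` metric and otherwise `Ok` metrics would summarise to `Warn`.
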
## Technical Architecture

### Technology Stack

- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono

### Dependencies

```toml
[dependencies]
ratatui = "0.24"                                     # Modern TUI framework
crossterm = "0.27"                                   # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }     # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client with JSON support
serde = { version = "1.0", features = ["derive"] }   # Serialization / deserialization
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"                                       # Error handling
chrono = "0.4"                                       # Time handling
```

## Project Structure

```
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md                 # This file
├── src/
│   ├── main.rs               # Entry point & CLI
│   ├── app.rs                # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs      # Main dashboard layout
│   │   ├── nvme.rs           # NVMe health widget
│   │   ├── services.rs       # Services status widget
│   │   ├── memory.rs         # RAM optimization widget
│   │   ├── backup.rs         # Backup status widget
│   │   └── alerts.rs         # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs         # HTTP client wrapper
│   │   ├── smart.rs          # Smart metrics API (port 6127)
│   │   ├── service.rs        # Service metrics API (port 6128)
│   │   └── backup.rs         # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs        # Data structures
│   │   ├── history.rs        # Historical data storage
│   │   └── config.rs         # Host configuration
│   └── config.rs             # Application configuration
├── config/
│   ├── hosts.toml            # Host definitions
│   └── dashboard.toml        # Dashboard layout config
└── docs/
    ├── API.md                # API integration documentation
    └── WIDGETS.md            # Widget development guide
```

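The three API modules in the tree above each target one port. One way to keep that mapping in a single place is a small enum, sketched below. The `MetricsApi` type and `base_url` helper are hypothetical, and the `http` scheme and the absence of a URL path are assumptions; only the port numbers come from the tree comments.

```rust
/// The three metrics APIs named in the project structure.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetricsApi {
    Smart,
    Service,
    Backup,
}

impl MetricsApi {
    /// Port for each API, as listed in the `src/api/` comments.
    pub fn port(self) -> u16 {
        match self {
            MetricsApi::Smart => 6127,
            MetricsApi::Service => 6128,
            MetricsApi::Backup => 6129,
        }
    }

    /// Build a base URL for a host.
    /// Assumption: plain HTTP and no path; the real endpoints may differ.
    pub fn base_url(self, host: &str) -> String {
        format!("http://{}:{}", host, self.port())
    }
}
```

For example, `MetricsApi::Smart.base_url("cmbox")` yields `http://cmbox:6127` under these assumptions.
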
### Data Structures

```rust
use serde::Deserialize;

// DriveInfo, DriveSummary, ServiceInfo, BackupInfo, and BackupServiceInfo
// are referenced below but not shown in this excerpt.

/// Payload of the Smart metrics API.
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

/// Payload of the Service metrics API.
#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

/// Host-level totals reported alongside the per-service list.
#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    // Optional readings: absent when the host does not report them.
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

/// Payload of the Backup metrics API.
#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```

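To show how fields like these might feed a widget, here is a small std-only sketch. `MemorySummary` is a hypothetical, trimmed-down stand-in for `ServiceSummary` (the serde derive is left out so the example compiles without external crates), and `format_optional` with its `"n/a"` placeholder is likewise an assumption.

```rust
/// Trimmed-down stand-in for `ServiceSummary` with only the fields used here.
#[derive(Debug)]
pub struct MemorySummary {
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
}

impl MemorySummary {
    /// Percentage of system memory in use, or `None` when the total is not positive.
    pub fn used_percent(&self) -> Option<f32> {
        if self.system_memory_total_mb > 0.0 {
            Some(self.system_memory_used_mb / self.system_memory_total_mb * 100.0)
        } else {
            None
        }
    }
}

/// Render an optional reading such as `cpu_temp_c` with one decimal place,
/// falling back to a placeholder when no value was reported.
pub fn format_optional(value: Option<f32>, unit: &str) -> String {
    match value {
        Some(v) => format!("{:.1}{}", v, unit),
        None => "n/a".to_string(),
    }
}
```

Keeping the `Option` all the way to the formatting step means a missing sensor shows up as a placeholder in the UI rather than as a misleading zero.
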
## Dashboard Layout Design

### Main Dashboard View

```
┌───────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                  │
├───────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0      │ Services • ok:1 warn:0 fail:0     │
│ ┌───────────────────────────────┐ │ ┌───────────────────────────────┐ │
│ │Drive   Temp Wear Spare Hours  │ │ │Service memory: 7.1/23899.7 MiB│ │
│ │nvme0n1 28°C   1%  100% 14489  │ │ │Disk usage: —                  │ │
│ │        Capacity Usage         │ │ │  Service       Memory     Disk│ │
│ │        954G     77G (8%)      │ │ │✔ sshd          7.1 MiB    —   │ │
│ └───────────────────────────────┘ │ └───────────────────────────────┘ │
├───────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn               │ Backups                           │
│ System memory: 5251.7/23899.7 MiB │ Host cmbox awaiting backup        │
│ CPU load (1/5/15): 2.18 2.66 2.56 │ metrics                           │
│ CPU freq: 1100.1 MHz              │                                   │
│ CPU temp: 47.0°C                  │                                   │
├───────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0       │ Status • ZMQ connected            │
│ cmbox: warning: CPU load 2.18     │ Monitoring • hosts: 3             │
│ srv01: pending: awaiting metrics  │ Data source: ZMQ – connected      │
│ labbox: pending: awaiting metrics │ Active host: cmbox (1/3)          │
└───────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```

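The `[←→] hosts` binding and the `Active host: cmbox (1/3)` status line in the mock-up suggest cycling through hosts with wrap-around. A minimal sketch of that state, assuming a hypothetical `HostSelector` helper (the real state presumably lives in `app.rs` and may look different):

```rust
/// Tracks which host is shown; left/right keys cycle with wrap-around.
pub struct HostSelector {
    hosts: Vec<String>,
    active: usize,
}

impl HostSelector {
    /// Returns `None` for an empty host list so `active` is always a valid index.
    pub fn new(hosts: Vec<String>) -> Option<Self> {
        if hosts.is_empty() {
            None
        } else {
            Some(Self { hosts, active: 0 })
        }
    }

    /// Move to the next host (right arrow), wrapping to the first.
    pub fn select_next(&mut self) {
        self.active = (self.active + 1) % self.hosts.len();
    }

    /// Move to the previous host (left arrow), wrapping to the last.
    /// Adding `len` before subtracting avoids unsigned underflow at index 0.
    pub fn select_previous(&mut self) {
        self.active = (self.active + self.hosts.len() - 1) % self.hosts.len();
    }

    /// Status-line text in the style of the mock-up.
    pub fn label(&self) -> String {
        format!(
            "Active host: {} ({}/{})",
            self.hosts[self.active],
            self.active + 1,
            self.hosts.len()
        )
    }
}
```

Rejecting an empty list in the constructor keeps the modulo arithmetic and the index in `label` safe without further checks.
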
### Multi-Host View

```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host     │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01    │ 4% ✅     │ 32% ✅    │ 8/8 ✅   │ 04:00 Backup OK       │
│ cmbox    │ 12% ✅    │ 45% ✅    │ 3/3 ✅   │ Yesterday Email test  │
│ labbox   │ 8% ✅     │ 28% ✅    │ 2/2 ✅   │ 2h ago NVMe temp OK   │
│ simonbox │ 15% ✅    │ 67% ⚠️    │ 4/4 ✅   │ Gaming session active │
│ steambox │ 23% ✅    │ 78% ⚠️    │ 2/2 ✅   │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```

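The `[s]ort` and `[f]ilter` keys in the mock-up are not specified further. As one possible reading, the sketch below sorts rows by RAM usage and filters on a RAM threshold. The `HostRow` struct and both helpers are hypothetical, and the choice of RAM usage as the sort and filter key is an assumption.

```rust
/// One row of the overview table; fields mirror some of the mock-up columns.
#[derive(Debug, Clone, PartialEq)]
pub struct HostRow {
    pub name: String,
    pub nvme_wear_percent: u8,
    pub ram_usage_percent: u8,
}

/// Sort rows in place by RAM usage, highest first.
pub fn sort_by_ram_desc(rows: &mut [HostRow]) {
    rows.sort_by(|a, b| b.ram_usage_percent.cmp(&a.ram_usage_percent));
}

/// Return copies of the rows whose RAM usage is at or above `min_percent`.
pub fn filter_high_ram(rows: &[HostRow], min_percent: u8) -> Vec<HostRow> {
    rows.iter()
        .filter(|row| row.ram_usage_percent >= min_percent)
        .cloned()
        .collect()
}
```

Applied to the sample rows, sorting would put steambox (78%) first, and a threshold of 60 would keep only simonbox and steambox.
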
## Development Status

### Immediate TODOs

- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
- Investigate why the backup metrics agent is not publishing data to the dashboard
- Resize the services widget so it can display more services without truncation
- Remove the dedicated status widget and redistribute the layout space
- Add responsive scaling within each widget so columns and content adapt dynamically

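For the responsive-scaling item, one simple approach is to divide the available width among columns by weight. The sketch below is std-only and hypothetical: `scale_columns` is not an existing helper in the project, and the weights are whatever the caller chooses.

```rust
/// Split `total_width` terminal columns among table columns in proportion to `weights`.
/// Each share is rounded down; the leftover goes to the last column so the
/// widths always sum to `total_width`. Returns an empty vector when there are
/// no columns or all weights are zero.
pub fn scale_columns(weights: &[u16], total_width: u16) -> Vec<u16> {
    let weight_sum: u32 = weights.iter().map(|&w| u32::from(w)).sum();
    if weights.is_empty() || weight_sum == 0 {
        return Vec::new();
    }
    // Widen to u32 for the multiplication so it cannot overflow.
    let mut widths: Vec<u16> = weights
        .iter()
        .map(|&w| (u32::from(w) * u32::from(total_width) / weight_sum) as u16)
        .collect();
    let used: u16 = widths.iter().sum();
    if let Some(last) = widths.last_mut() {
        *last += total_width - used;
    }
    widths
}
```

Calling this on every resize with the widget's current inner width would let a shared table helper recompute column widths instead of hard-coding them.
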
### Phase 3: Advanced Features 🚧 IN PROGRESS

- [x] ZMQ gossip network implementation
- [x] Comprehensive error handling
- [x] Performance optimizations
- [ ] Predictive analytics for wear levels
- [ ] Custom alert rules engine
- [ ] Historical data export capabilities

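For the open wear-prediction item, a first approximation could be a linear extrapolation from the current wear percentage and power-on hours. This is only a sketch under a strong assumption (wear grows linearly with power-on hours); `remaining_hours` is a hypothetical helper, not planned project code.

```rust
/// Estimate the power-on hours left until `wear_limit_percent` is reached,
/// assuming wear accumulates linearly with power-on hours.
/// Returns `None` when there is no observed wear or no runtime to derive a rate from.
pub fn remaining_hours(
    wear_percent: f64,
    power_on_hours: f64,
    wear_limit_percent: f64,
) -> Option<f64> {
    if wear_percent <= 0.0 || power_on_hours <= 0.0 {
        return None;
    }
    let rate = wear_percent / power_on_hours; // percent of wear per hour
    // Clamp at zero for drives already at or past the limit.
    Some(((wear_limit_percent - wear_percent) / rate).max(0.0))
}
```

With the mock-up's drive (1% wear after 14489 hours) and a 100% limit, this extrapolates to roughly 1.43 million hours. A number that large mostly shows how little a single low reading says about the trend, so a real predictor would likely fit the stored history instead.
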
# Important Communication Guidelines

NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

NEVER mention Claude or automation in commit messages. Keep commit messages focused on the technical changes only.