# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. It is built to replace Glance with a custom solution tailored to our monitoring needs and API integrations.

## Project Goals

### Core Objectives

- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, and srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)

### Key Features

- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis

## Technical Architecture

### Technology Stack

- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono

### Dependencies

```toml
[dependencies]
ratatui = "0.24"                                      # Modern TUI framework
crossterm = "0.27"                                    # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }      # Async runtime
reqwest = { version = "0.11", features = ["json"] }   # HTTP client
serde = { version = "1.0", features = ["derive"] }    # JSON parsing
clap = { version = "4.0", features = ["derive"] }     # CLI args
anyhow = "1.0"                                        # Error handling
chrono = "0.4"                                        # Time handling
```

## Project Structure

```
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md                    # This file
├── src/
│   ├── main.rs                  # Entry point & CLI
│   ├── app.rs                   # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs         # Main dashboard layout
│   │   ├── nvme.rs              # NVMe health widget
│   │   ├── services.rs          # Services status widget
│   │   ├── memory.rs            # RAM optimization widget
│   │   ├── backup.rs            # Backup status widget
│   │   └── alerts.rs            # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs            # HTTP client wrapper
│   │   ├── smart.rs             # Smart metrics API (port 6127)
│   │   ├── service.rs           # Service metrics API (port 6128)
│   │   └── backup.rs            # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs           # Data structures
│   │   ├── history.rs           # Historical data storage
│   │   └── config.rs            # Host configuration
│   └── config.rs                # Application configuration
├── config/
│   ├── hosts.toml               # Host definitions
│   └── dashboard.toml           # Dashboard layout config
└── docs/
    ├── API.md                   # API integration documentation
    └── WIDGETS.md               # Widget development guide
```

### Data Structures

```rust
use serde::Deserialize;

// DriveInfo, DriveSummary, ServiceInfo, BackupInfo, and BackupServiceInfo
// are defined elsewhere and are not shown here.

/// Payload of the smart metrics API (port 6127).
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

/// Payload of the service metrics API (port 6128).
#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

/// Aggregate service and host figures; optional fields may be absent on some hosts.
#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

/// Payload of the backup metrics API (port 6129).
#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```

## Dashboard Layout Design

### Main Dashboard View

```
┌─────────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                    │
├─────────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0        │ Services • ok:1 warn:0 fail:0     │
│ ┌─────────────────────────────────┐ │ ┌───────────────────────────────┐ │
│ │Drive    Temp  Wear  Spare  Hours│ │ │Service memory: 7.1/23899.7 MiB│ │
│ │nvme0n1  28°C  1%    100%   14489│ │ │Disk usage: —                  │ │
│ │         Capacity  Usage         │ │ │  Service     Memory     Disk  │ │
│ │         954G      77G (8%)      │ │ │✔ sshd        7.1 MiB    —     │ │
│ └─────────────────────────────────┘ │ └───────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn                 │ Backups                           │
│ System memory: 5251.7/23899.7 MiB   │ Host cmbox awaiting backup        │
│ CPU load (1/5/15): 2.18 2.66 2.56   │ metrics                           │
│ CPU freq: 1100.1 MHz                │                                   │
│ CPU temp: 47.0°C                    │                                   │
├─────────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0         │ Status • ZMQ connected            │
│ cmbox: warning: CPU load 2.18       │ Monitoring • hosts: 3             │
│ srv01: pending: awaiting metrics    │ Data source: ZMQ – connected      │
│ labbox: pending: awaiting metrics   │ Active host: cmbox (1/3)          │
└─────────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```

### Multi-Host View

```
┌──────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                               │
├──────────────────────────────────────────────────────────────────────┤
│ Host     │ NVMe Wear │ RAM Usage │ Services │ Last Alert             │
├──────────────────────────────────────────────────────────────────────┤
│ srv01    │ 4% ✅     │ 32% ✅    │ 8/8 ✅   │ 04:00 Backup OK        │
│ cmbox    │ 12% ✅    │ 45% ✅    │ 3/3 ✅   │ Yesterday Email test   │
│ labbox   │ 8% ✅     │ 28% ✅    │ 2/2 ✅   │ 2h ago NVMe temp OK    │
│ simonbox │ 15% ✅    │ 67% ⚠️    │ 4/4 ✅   │ Gaming session active  │
│ steambox │ 23% ✅    │ 78% ⚠️    │ 2/2 ✅   │ High RAM usage         │
└──────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```

## Architecture Principles - CRITICAL

### Agent-Dashboard Separation of Concerns

**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**
- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
- Agent sends status to dashboard via ZMQ
- Dashboard NEVER calculates status - only displays what agent provides

**Data Flow Architecture:**
```
Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
```

**Status Handling Rules:**
- Agent provides status → Dashboard uses agent status
- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
- Dashboard widgets NEVER contain hardcoded thresholds
- TableBuilder converts status to colors for display

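The status handling rules above can be sketched as a small mapping function. This is a minimal illustration, not the project's actual code: the `StatusLevel` enum and the function signature are assumptions, and only the name `status_level_from_agent_status()` comes from the development guidelines in this file.

```rust
/// Display-side severity levels (hypothetical enum, named for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusLevel {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Map the agent-provided status string to a display level.
/// No thresholds are involved: the dashboard only interprets the agent's string.
/// A missing or unrecognised status becomes `Unknown`, never `Ok`.
pub fn status_level_from_agent_status(status: Option<&str>) -> StatusLevel {
    match status {
        Some("ok") => StatusLevel::Ok,
        Some("warning") => StatusLevel::Warning,
        Some("critical") => StatusLevel::Critical,
        _ => StatusLevel::Unknown,
    }
}
```

Taking `Option<&str>` keeps the "agent sent nothing" case explicit at the call site, so a widget cannot silently fall back to "ok".
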
### Current Agent Thresholds (as of 2025-10-12)

**CPU Load (service.rs:392-400):**
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)

**CPU Temperature (service.rs:412-420):**
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C

**Memory Usage (service.rs:402-410):**
- Warning: ≥ 80%
- Critical: ≥ 95%

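As a rough sketch of how the agent might turn a reading into a status string with the thresholds listed above: the `Thresholds` struct, the `classify` function, and the constant names are hypothetical, and the numbers are copied from the list above (the CPU load pair is marked there as a testing value).

```rust
/// Warning/critical limits for one metric (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Thresholds {
    pub warning: f32,
    pub critical: f32,
}

// Values taken from the threshold list in this section.
pub const CPU_LOAD: Thresholds = Thresholds { warning: 2.0, critical: 4.0 };
pub const CPU_TEMP_C: Thresholds = Thresholds { warning: 70.0, critical: 80.0 };
pub const MEMORY_PERCENT: Thresholds = Thresholds { warning: 80.0, critical: 95.0 };

/// Agent-side classification. A missing or NaN reading yields "unknown".
pub fn classify(value: Option<f32>, limits: Thresholds) -> &'static str {
    match value {
        None => "unknown",
        Some(v) if v.is_nan() => "unknown",
        Some(v) if v >= limits.critical => "critical",
        Some(v) if v >= limits.warning => "warning",
        Some(_) => "ok",
    }
}
```

With these values, the 1-minute load of 2.18 shown in the dashboard mockup classifies as "warning", which matches the alert line in that mockup.
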
### Email Notifications

**System Configuration:**
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- To: `cm@cmtec.se`
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)

**Notification Triggers:**
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)

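The trigger rules above could be expressed roughly as follows. This is a sketch under stated assumptions: the `Notifier` type and its methods are hypothetical, an unchanged status never notifies, and the rate limit is applied to degradations and recoveries alike (the source does not say whether recoveries bypass it).

```rust
use std::time::{Duration, Instant};

/// Decides whether a status transition should produce an email (hypothetical type).
pub struct Notifier {
    min_interval: Duration,
    last_sent: Option<Instant>,
}

impl Notifier {
    /// `min_interval` of zero disables rate limiting (the testing setting).
    pub fn new(min_interval: Duration) -> Self {
        Notifier { min_interval, last_sent: None }
    }

    /// Returns true when the transition `previous -> current` should be mailed at `now`.
    pub fn should_notify(&mut self, previous: &str, current: &str, now: Instant) -> bool {
        // Degradation: any change that lands on "warning" or "critical".
        let degraded = previous != current && matches!(current, "warning" | "critical");
        // Recovery: leaving "warning"/"critical" for "ok".
        let recovered = matches!(previous, "warning" | "critical") && current == "ok";
        if !(degraded || recovered) {
            return false;
        }
        // Rate limiting: suppress if the last mail is more recent than min_interval.
        if let Some(last) = self.last_sent {
            if now.duration_since(last) < self.min_interval {
                return false;
            }
        }
        self.last_sent = Some(now);
        true
    }
}
```

Passing `now` in explicitly keeps the logic testable without sleeping; production code would pass `Instant::now()`.
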
**Monitored Components:**
- system.cpu (load status) - SystemCollector
- system.memory (usage status) - SystemCollector
- system.cpu_temp (temperature status) - SystemCollector (disabled)
- system.services (service health status) - ServiceCollector
- storage.smart (drive health) - SmartCollector
- backup.overall (backup status) - BackupCollector

### Pure Auto-Discovery Implementation

**Agent Configuration:**
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: `cm-dashboard-agent [-v]` (only verbose flag)

**Service Discovery:**
- Scans running systemd services
- Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
- No host-specific hardcoded service lists

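A minimal sketch of the pattern filter described above, assuming the list of running unit names has already been obtained. The function name, the constant, and the substring-match semantics are assumptions; only gitea, nginx, and docker are named in this file, and the real pattern list is longer.

```rust
/// Patterns named in this document; the actual list contains more entries ("etc.").
const INTERESTING_PATTERNS: &[&str] = &["gitea", "nginx", "docker"];

/// Keep only the unit names that contain one of the interesting patterns.
/// Substring matching is an assumption; the agent may match differently.
pub fn filter_interesting<'a>(units: &[&'a str]) -> Vec<&'a str> {
    units
        .iter()
        .copied()
        .filter(|unit| INTERESTING_PATTERNS.iter().any(|pattern| unit.contains(*pattern)))
        .collect()
}
```

Because the patterns are shared by every host, the same binary discovers a different service set on each machine without any host-specific list.
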
### Current Implementation Status

**Completed:**
- [x] Pure auto-discovery agent (no config files)
- [x] Agent-side status calculations with defined thresholds
- [x] Dashboard displays agent status (no dashboard calculations)
- [x] Email notifications with Stockholm timezone
- [x] CPU temperature monitoring and notifications
- [x] ZMQ message format standardization
- [x] Removed all hardcoded dashboard thresholds
- [x] CPU thresholds restored to production values (5.0/8.0)
- [x] All collectors output standardized status strings (ok/warning/critical/unknown)
- [x] Dashboard connection loss detection with 5-second keep-alive
- [x] Removed excessive logging from agent
- [x] Fixed all compiler warnings in both agent and dashboard
- [x] **SystemCollector architecture refactoring completed (2025-10-12)**
  - [x] Created SystemCollector for CPU load, memory, temperature, C-states
  - [x] Moved system metrics from ServiceCollector to SystemCollector
  - [x] Updated dashboard to parse and display SystemCollector data
  - [x] Enhanced service notifications to include specific failure details
  - [x] CPU temperature thresholds set to 100°C (effectively disabled)
- [x] **SystemCollector bug fixes completed (2025-10-12)**
  - [x] Fixed CPU load parsing for comma decimal separator locale (", " split)
  - [x] Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
  - [x] Fixed C-state collection to discover all available states (including C10)

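The comma-decimal fix in the list above can be illustrated with a small parser. This is only a sketch: it assumes the load figures come from `uptime`-style text such as `load average: 2,18, 2,66, 2,56`, which this file does not confirm, and `parse_load_averages` is a hypothetical name. Splitting on `", "` (comma plus space) keeps the decimal commas inside each field.

```rust
/// Parse the three load averages from `uptime`-style text.
/// Handles both "2.18, 2.66, 2.56" and the comma-decimal form "2,18, 2,66, 2,56".
pub fn parse_load_averages(text: &str) -> Option<(f32, f32, f32)> {
    // Everything after the label holds the three fields.
    let (_, tail) = text.rsplit_once("load average:")?;
    // Fields are separated by ", "; a bare "," inside a field is a decimal comma.
    let mut values = tail
        .trim()
        .split(", ")
        .map(|field| field.trim().replace(',', ".").parse::<f32>().ok());
    let one = values.next()??;
    let five = values.next()??;
    let fifteen = values.next()??;
    Some((one, five, fifteen))
}
```

Splitting on a bare `,` instead would break `2,18` into `2` and `18`, which is the kind of misparse the fix addresses.
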
**Production Configuration:**
- CPU load thresholds: Warning ≥ 5.0, Critical ≥ 8.0
- CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
- Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
- Connection timeout: 15 seconds (agents send data every 5 seconds)
- Email rate limiting: 30 minutes (set to 0 for testing)

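The connection timeout in the list above amounts to tolerating three missed sends before a host is treated as disconnected. A hedged sketch of that check, with a hypothetical `HostLink` type and constant names:

```rust
use std::time::{Duration, Instant};

/// Agents send data every 5 seconds (from the production configuration).
pub const SEND_INTERVAL: Duration = Duration::from_secs(5);
/// A host counts as disconnected after 15 seconds of silence.
pub const CONNECTION_TIMEOUT: Duration = Duration::from_secs(15);

/// Tracks when a host was last heard from (hypothetical type, for illustration).
pub struct HostLink {
    last_seen: Option<Instant>,
}

impl HostLink {
    pub fn new() -> Self {
        HostLink { last_seen: None }
    }

    /// Call whenever a message from the host arrives.
    pub fn record_message(&mut self, now: Instant) {
        self.last_seen = Some(now);
    }

    /// A host that has never sent anything is not connected.
    pub fn is_connected(&self, now: Instant) -> bool {
        match self.last_seen {
            Some(seen) => now.duration_since(seen) < CONNECTION_TIMEOUT,
            None => false,
        }
    }
}
```

A timeout of three send intervals means a single delayed message does not flip the host to disconnected.
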
### Development Guidelines

**When Adding New Metrics:**
1. Agent calculates status with thresholds
2. Agent adds `{metric}_status` field to JSON output
3. Dashboard data structure adds `{metric}_status: Option<String>`
4. Dashboard uses `status_level_from_agent_status()` for display
5. Agent adds notification monitoring for status changes

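Steps 2 to 4 above might look like this on the dashboard side for a single metric. The struct, its field pairing, and `render_cpu_temp` are hypothetical and use plain std types; the real structures derive `Deserialize` and are rendered through the project's own helpers.

```rust
/// Dashboard-side view of one metric: the raw value plus the agent's verdict.
/// (Hypothetical struct following the `{metric}_status: Option<String>` convention.)
pub struct CpuTempReading {
    pub cpu_temp_c: Option<f32>,
    pub cpu_temp_status: Option<String>,
}

/// Render the metric without applying any threshold on the dashboard side.
/// A missing status is shown as "unknown", never as "ok".
pub fn render_cpu_temp(reading: &CpuTempReading) -> String {
    let value = match reading.cpu_temp_c {
        Some(celsius) => format!("{:.1}°C", celsius),
        None => "n/a".to_string(),
    };
    let status = reading.cpu_temp_status.as_deref().unwrap_or("unknown");
    format!("CPU temp: {} [{}]", value, status)
}
```

The value and its status travel together, so a widget never needs to know which limits the agent applied.
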
**NEVER:**
- Add hardcoded thresholds to dashboard widgets
- Calculate status in dashboard with different thresholds than agent
- Use "ok" as default when agent status is missing (use "unknown")
- Calculate colors in widgets (TableBuilder's responsibility)

# Important Communication Guidelines

NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

NEVER mention Claude or automation in commit messages. Keep commit messages focused on the technical changes only.