# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

## Project Goals

### Core Objectives

- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)

### Key Features

- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis

## Technical Architecture

### Technology Stack

- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono

### Dependencies

```toml
[dependencies]
ratatui = "0.24"                                     # Modern TUI framework
crossterm = "0.27"                                   # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }     # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client
serde = { version = "1.0", features = ["derive"] }   # JSON parsing
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"                                       # Error handling
chrono = "0.4"                                       # Time handling
```

## Project Structure

```
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md                  # This file
├── src/
│   ├── main.rs                # Entry point & CLI
│   ├── app.rs                 # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs       # Main dashboard layout
│   │   ├── nvme.rs            # NVMe health widget
│   │   ├── services.rs        # Services status widget
│   │   ├── memory.rs          # RAM optimization widget
│   │   ├── backup.rs          # Backup status widget
│   │   └── alerts.rs          # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs          # HTTP client wrapper
│   │   ├── smart.rs           # Smart metrics API (port 6127)
│   │   ├── service.rs         # Service metrics API (port 6128)
│   │   └── backup.rs          # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs         # Data structures
│   │   ├── history.rs         # Historical data storage
│   │   └── config.rs          # Host configuration
│   └── config.rs              # Application configuration
├── config/
│   ├── hosts.toml             # Host definitions
│   └── dashboard.toml         # Dashboard layout config
└── docs/
    ├── API.md                 # API integration documentation
    └── WIDGETS.md             # Widget development guide
```

### Data Structures

```rust
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```

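
These summary fields are rendered, never interpreted: the dashboard only formats what the agent sent. A minimal display-only sketch (the helper name `format_memory` is illustrative, not the actual widget code):

```rust
// Display-only helper: format memory as "used/total MiB", the way the
// services widget shows it. The dashboard never derives status from
// these numbers; it only renders what the agent provided.
fn format_memory(used_mb: f32, total_mb: f32) -> String {
    format!("{:.1}/{:.1} MiB", used_mb, total_mb)
}

fn main() {
    // Matches the "Service memory: 7.1/23899.7 MiB" line in the layout below.
    println!("Service memory: {}", format_memory(7.1, 23899.7));
}
```
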
## Dashboard Layout Design

### Main Dashboard View

```
┌─────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                │
├─────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0        │ Services • ok:1 warn:0 fail:0 │
│ ┌─────────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │Drive    Temp  Wear  Spare Hours │ │ │Service memory: 7.1/23899.7 MiB│ │
│ │nvme0n1  28°C  1%    100%  14489 │ │ │Disk usage: —                │ │
│ │  Capacity  Usage                │ │ │  Service  Memory   Disk     │ │
│ │  954G      77G (8%)             │ │ │✔ sshd     7.1 MiB  —        │ │
│ └─────────────────────────────────┘ │ └─────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn                 │ Backups                       │
│ System memory: 5251.7/23899.7 MiB   │ Host cmbox awaiting backup    │
│ CPU load (1/5/15): 2.18 2.66 2.56   │ metrics                       │
│ CPU freq: 1100.1 MHz                │                               │
│ CPU temp: 47.0°C                    │                               │
├─────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0         │ Status • ZMQ connected        │
│ cmbox: warning: CPU load 2.18       │ Monitoring • hosts: 3         │
│ srv01: pending: awaiting metrics    │ Data source: ZMQ – connected  │
│ labbox: pending: awaiting metrics   │ Active host: cmbox (1/3)      │
└─────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```

### Multi-Host View

```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host     │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01    │ 4% ✅     │ 32% ✅    │ 8/8 ✅   │ 04:00 Backup OK       │
│ cmbox    │ 12% ✅    │ 45% ✅    │ 3/3 ✅   │ Yesterday Email test  │
│ labbox   │ 8% ✅     │ 28% ✅    │ 2/2 ✅   │ 2h ago NVMe temp OK   │
│ simonbox │ 15% ✅    │ 67% ⚠️    │ 4/4 ✅   │ Gaming session active │
│ steambox │ 23% ✅    │ 78% ⚠️    │ 2/2 ✅   │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```

## Architecture Principles - CRITICAL

### Agent-Dashboard Separation of Concerns

**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**

- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
- Agent sends status to dashboard via ZMQ
- Dashboard NEVER calculates status - only displays what the agent provides

**Data Flow Architecture:**

```
Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
```

**Status Handling Rules:**

- Agent provides status → Dashboard uses agent status
- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
- Dashboard widgets NEVER contain hardcoded thresholds
- TableBuilder converts status to colors for display
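
These rules reduce to a single mapping on the dashboard side. A sketch (the source names `status_level_from_agent_status()`, but the exact signature and the `StatusLevel` enum here are assumptions):

```rust
// Dashboard-side mapping of the agent's status string. An absent or
// unrecognized status maps to Unknown, never to Ok.
#[derive(Debug, PartialEq)]
enum StatusLevel {
    Ok,
    Warning,
    Critical,
    Unknown,
}

fn status_level_from_agent_status(status: Option<&str>) -> StatusLevel {
    match status {
        Some("ok") => StatusLevel::Ok,
        Some("warning") => StatusLevel::Warning,
        Some("critical") => StatusLevel::Critical,
        // Missing status is Unknown, NOT Ok.
        _ => StatusLevel::Unknown,
    }
}

fn main() {
    assert_eq!(status_level_from_agent_status(Some("warning")), StatusLevel::Warning);
    assert_eq!(status_level_from_agent_status(None), StatusLevel::Unknown);
}
```

TableBuilder then converts the `StatusLevel` to a color; the widget itself never inspects the underlying metric values.
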

### Current Agent Thresholds (as of 2025-10-12)

**CPU Load (service.rs:392-400):**
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)

**CPU Temperature (service.rs:412-420):**
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C

**Memory Usage (service.rs:402-410):**
- Warning: ≥ 80%
- Critical: ≥ 95%

### Email Notifications

**System Configuration:**
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- To: `cm@cmtec.se`
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)

**Notification Triggers:**
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)

**Monitored Components:**
- system.cpu (load status) - SystemCollector
- system.memory (usage status) - SystemCollector
- system.cpu_temp (temperature status) - SystemCollector (disabled)
- system.services (service health status) - ServiceCollector
- storage.smart (drive health) - SmartCollector
- backup.overall (backup status) - BackupCollector

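
The trigger rules above can be sketched as a comparison against the in-memory status map keyed by `"component.metric"` (the function name `should_notify` is illustrative; rate limiting is omitted for brevity):

```rust
use std::collections::HashMap;

// Decide whether a status change warrants an email: degradation
// (any -> warning/critical) or recovery (warning/critical -> ok).
fn should_notify(previous: &HashMap<String, String>, key: &str, new_status: &str) -> bool {
    let old = previous.get(key).map(String::as_str).unwrap_or("unknown");
    if old == new_status {
        return false; // no change, no email
    }
    let degraded = matches!(new_status, "warning" | "critical");
    let recovered = matches!(old, "warning" | "critical") && new_status == "ok";
    degraded || recovered
}

fn main() {
    let mut prev = HashMap::new();
    prev.insert("system.cpu".to_string(), "ok".to_string());
    assert!(should_notify(&prev, "system.cpu", "warning")); // degradation
    prev.insert("system.cpu".to_string(), "warning".to_string());
    assert!(should_notify(&prev, "system.cpu", "ok")); // recovery
    assert!(!should_notify(&prev, "system.cpu", "warning")); // unchanged
}
```
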
### Pure Auto-Discovery Implementation

**Agent Configuration:**
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: `cm-dashboard-agent [-v]` (intelligent caching enabled)

**Service Discovery:**
- Scans running systemd services
- Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
- No host-specific hardcoded service lists

### Current Implementation Status

**Completed:**

- [x] Pure auto-discovery agent (no config files)
- [x] Agent-side status calculations with defined thresholds
- [x] Dashboard displays agent status (no dashboard calculations)
- [x] Email notifications with Stockholm timezone
- [x] CPU temperature monitoring and notifications
- [x] ZMQ message format standardization
- [x] Removed all hardcoded dashboard thresholds
- [x] CPU thresholds restored to production values (5.0/8.0)
- [x] All collectors output standardized status strings (ok/warning/critical/unknown)
- [x] Dashboard connection loss detection with 5-second keep-alive
- [x] Removed excessive logging from agent
- [x] Fixed all compiler warnings in both agent and dashboard
- [x] **SystemCollector architecture refactoring completed (2025-10-12)**
- [x] Created SystemCollector for CPU load, memory, temperature, C-states
- [x] Moved system metrics from ServiceCollector to SystemCollector
- [x] Updated dashboard to parse and display SystemCollector data
- [x] Enhanced service notifications to include specific failure details
- [x] CPU temperature thresholds set to 100°C (effectively disabled)
- [x] **SystemCollector bug fixes completed (2025-10-12)**
- [x] Fixed CPU load parsing for comma decimal separator locale (", " split)
- [x] Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
- [x] Fixed C-state collection to discover all available states (including C10)
- [x] **Dashboard improvements and maintenance mode (2025-10-13)**
- [x] Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
- [x] Host navigation limited to connected hosts only (no disconnected host cycling)
- [x] Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
- [x] Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
- [x] Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
- [x] NixOS borgbackup integration with automatic maintenance mode during backups
- [x] System widget simplified to single row with C-states as description lines
- [x] CPU load thresholds updated to production values (9.0/10.0)
- [x] **Smart caching system implementation (2025-10-15)**
- [x] Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
- [x] Cache warming for instant dashboard startup responsiveness
- [x] Background refresh and proactive cache invalidation strategies
- [x] CPU usage optimization from 9.5% to <2% through smart polling reduction
- [x] Cache key consistency fixes for proper collector data flow
- [x] ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
- [x] Immich service quota detection fix (500GB instead of hardcoded 200GB)
- [x] Service-to-directory mapping for accurate disk usage calculation

**Production Configuration:**

- CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
- CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
- Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
- Connection timeout: 15 seconds (agents send data every 5 seconds)
- Email rate limiting: 30 minutes (set to 0 for testing)

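
The production thresholds translate into a simple agent-side classifier. A sketch (the function name is illustrative; the real code lives in the agent's collectors in `service.rs`):

```rust
// Agent-side CPU load classification with the production thresholds
// listed above: warning >= 9.0, critical >= 10.0. Check critical first
// so overlapping ranges resolve to the more severe status.
fn cpu_load_status(load_1: f32) -> &'static str {
    if load_1 >= 10.0 {
        "critical"
    } else if load_1 >= 9.0 {
        "warning"
    } else {
        "ok"
    }
}

fn main() {
    assert_eq!(cpu_load_status(2.18), "ok");
    assert_eq!(cpu_load_status(9.5), "warning");
    assert_eq!(cpu_load_status(12.0), "critical");
}
```

Only this string crosses the ZMQ boundary; the dashboard never re-derives it from the raw load values.
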
### Maintenance Mode

**Purpose:**
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped

**Implementation:**
- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

**Usage:**

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

**NixOS Integration:**
- The borgbackup script automatically creates/removes the maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick

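
A minimal sketch of the agent-side gate (the marker path is taken as a parameter for testability; `maintenance_active` and `notify_if_allowed` are illustrative names, not the actual agent API):

```rust
use std::path::Path;

// Maintenance-mode check: the marker file's mere existence suppresses
// notifications. Monitoring itself is unaffected.
fn maintenance_active(marker: &Path) -> bool {
    marker.exists()
}

fn notify_if_allowed(marker: &Path, subject: &str) {
    if maintenance_active(marker) {
        // Monitoring continues; only the email is suppressed.
        return;
    }
    println!("would send email: {subject}");
}

fn main() {
    let marker = Path::new("/tmp/cm-maintenance");
    notify_if_allowed(marker, "cmbox: warning: CPU load 9.2");
}
```
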
### Smart Caching System

**Purpose:**
- Reduce agent CPU usage from 9.5% to <2% through intelligent caching
- Maintain dashboard responsiveness with tiered refresh strategies
- Optimize for different data volatility characteristics

**Architecture:**

```
Cache Tiers:
- RealTime (5s):  CPU load, memory usage, quick-changing metrics
- Fast (30s):     network stats, process lists, medium-volatility data
- Medium (5min):  service status, disk usage, slow-changing data
- Slow (15min):   SMART data, backup status, rarely-changing metrics
- Static (1h):    hardware info, system capabilities, fixed data
```

**Implementation:**
- **SmartCache**: Central cache manager with RwLock for thread safety
- **CachedCollector**: Wrapper adding caching to any collector
- **CollectionScheduler**: Manages tier-based refresh timing
- **Cache warming**: Parallel startup population for instant responsiveness
- **Background refresh**: Proactive updates to prevent cache misses

**Usage:**

```bash
# Start the agent with intelligent caching
cm-dashboard-agent [-v]
```

**Performance Benefits:**
- CPU usage reduction: 9.5% → <2% expected
- Instant dashboard startup through cache warming
- Reduced disk I/O through intelligent `du` command caching
- Network efficiency with selective refresh strategies

**Configuration:**
- Cache warming timeout: 3 seconds
- Background refresh: enabled at 80% of tier interval
- Cache cleanup: every 30 minutes
- Stale data threshold: 2× tier interval

**Design Summary:**
- **Intelligent caching**: Tiered collection with optimal CPU usage
- **Auto-discovery**: No configuration files required
- **Responsive design**: Cache warming for instant dashboard startup

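
The tiers above map naturally onto `std::time::Duration` intervals, with background refresh firing at 80% of each tier's interval per the configuration notes. A sketch assuming illustrative names (`CacheTier`, `refresh_interval`), not the actual SmartCache API:

```rust
use std::time::Duration;

// Tier-to-interval mapping from the cache-tier table above.
#[derive(Clone, Copy)]
enum CacheTier {
    RealTime, // 5s
    Fast,     // 30s
    Medium,   // 5min
    Slow,     // 15min
    Static,   // 1h
}

fn refresh_interval(tier: CacheTier) -> Duration {
    match tier {
        CacheTier::RealTime => Duration::from_secs(5),
        CacheTier::Fast => Duration::from_secs(30),
        CacheTier::Medium => Duration::from_secs(5 * 60),
        CacheTier::Slow => Duration::from_secs(15 * 60),
        CacheTier::Static => Duration::from_secs(60 * 60),
    }
}

// Background refresh fires at 80% of the tier interval so fresh data
// is usually ready before the cached entry expires.
fn background_refresh_at(tier: CacheTier) -> Duration {
    refresh_interval(tier).mul_f64(0.8)
}

fn main() {
    assert_eq!(refresh_interval(CacheTier::Fast), Duration::from_secs(30));
    assert_eq!(refresh_interval(CacheTier::Static), Duration::from_secs(3600));
    assert_eq!(background_refresh_at(CacheTier::RealTime), Duration::from_secs(4));
}
```
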
### Development Guidelines

**When Adding New Metrics:**

1. Agent calculates status with thresholds
2. Agent adds a `{metric}_status` field to its JSON output
3. Dashboard data structure adds `{metric}_status: Option<String>`
4. Dashboard uses `status_level_from_agent_status()` for display
5. Agent adds notification monitoring for status changes

**Testing & Building:**

- ALWAYS use `cargo build --workspace` to match the NixOS build configuration
- Test with OpenSSL environment variables when building locally:

  ```bash
  OPENSSL_DIR=/nix/store/.../openssl-dev \
  OPENSSL_LIB_DIR=/nix/store/.../openssl/lib \
  OPENSSL_INCLUDE_DIR=/nix/store/.../openssl-dev/include \
  PKG_CONFIG_PATH=/nix/store/.../openssl-dev/lib/pkgconfig \
  OPENSSL_NO_VENDOR=1 cargo build --workspace
  ```

- This prevents build failures that only appear in NixOS deployment

**Notification System:**
- Universal automatic detection of all `_status` fields across all collectors
- Sends emails from `hostname@cmtec.se` to `cm@cmtec.se` for any status change
- Status stored in-memory: `HashMap<"component.metric", status>`
- Recovery emails sent when status changes from warning/critical → ok

**NEVER:**
- Add hardcoded thresholds to dashboard widgets
- Calculate status in the dashboard with thresholds different from the agent's
- Use "ok" as the default when agent status is missing (use "unknown")
- Calculate colors in widgets (that is TableBuilder's responsibility)
- Use `cargo build` without `--workspace` for final testing

# Important Communication Guidelines

NEVER write that you have "successfully implemented" something, or generate extensive summary text, without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

## Commit Message Guidelines

**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## NixOS Configuration Updates

When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosbox` must be updated to deploy the changes.

### Update Process

1. **Get Latest Commit Hash**

   ```bash
   git log -1 --format="%H"
   ```

2. **Update NixOS Configuration**

   Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:

   ```nix
   src = pkgs.fetchgit {
     url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
     rev = "NEW_COMMIT_HASH_HERE";
     sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
   };
   ```

3. **Get Correct Source Hash**

   Build with the placeholder hash to get the actual hash:

   ```bash
   cd ~/nixosbox
   nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
     url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
     rev = "NEW_COMMIT_HASH";
     sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
   }' 2>&1 | grep "got:"
   ```

   Example output:

   ```
   error: hash mismatch in fixed-output derivation '/nix/store/...':
            specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
               got:    sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
   ```

4. **Update Configuration with Correct Hash**

   Replace the placeholder with the hash from the error message (the "got:" line).

5. **Commit NixOS Configuration**

   ```bash
   cd ~/nixosbox
   git add hosts/common/cm-dashboard.nix
   git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
   git push
   ```

6. **Rebuild System**

   The user handles the system rebuild step - this cannot be automated.
|