# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.

## Project Goals

### Core Objectives

- **Real-time monitoring** of all infrastructure components
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)

### Key Features

- **NVMe health monitoring** with wear prediction
- **CPU / memory / GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **Unified alert pipeline** summarising host health
- **Historical data tracking** and trend analysis

## Technical Architecture

### Technology Stack

- **Language**: Rust 🦀
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **HTTP Client**: reqwest
- **Serialization**: serde
- **CLI**: clap
- **Error Handling**: anyhow
- **Time**: chrono

### Dependencies

```toml
[dependencies]
ratatui = "0.24"                                     # Modern TUI framework
crossterm = "0.27"                                   # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }     # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client
serde = { version = "1.0", features = ["derive"] }   # JSON parsing
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"                                       # Error handling
chrono = "0.4"                                       # Time handling
```

## Project Structure

```
cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md                  # This file
├── src/
│   ├── main.rs                # Entry point & CLI
│   ├── app.rs                 # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs       # Main dashboard layout
│   │   ├── nvme.rs            # NVMe health widget
│   │   ├── services.rs        # Services status widget
│   │   ├── memory.rs          # RAM optimization widget
│   │   ├── backup.rs          # Backup status widget
│   │   └── alerts.rs          # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs          # HTTP client wrapper
│   │   ├── smart.rs           # Smart metrics API (port 6127)
│   │   ├── service.rs         # Service metrics API (port 6128)
│   │   └── backup.rs          # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs         # Data structures
│   │   ├── history.rs         # Historical data storage
│   │   └── config.rs          # Host configuration
│   └── config.rs              # Application configuration
├── config/
│   ├── hosts.toml             # Host definitions
│   └── dashboard.toml         # Dashboard layout config
└── docs/
    ├── API.md                 # API integration documentation
    └── WIDGETS.md             # Widget development guide
```

### Data Structures

```rust
#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}
```

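
These summary fields are rendered, never interpreted: the dashboard only formats what the agent sent. A minimal display-only sketch (the helper name `format_memory` is illustrative, not the actual widget code):

```rust
// Display-only helper: format memory as "used/total MiB", the way the
// services widget shows it. The dashboard never derives status from
// these numbers; it only renders what the agent provided.
fn format_memory(used_mb: f32, total_mb: f32) -> String {
    format!("{:.1}/{:.1} MiB", used_mb, total_mb)
}

fn main() {
    // Matches the "Service memory: 7.1/23899.7 MiB" line in the layout below.
    println!("Service memory: {}", format_memory(7.1, 23899.7));
}
```
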
## Dashboard Layout Design

### Main Dashboard View

```
┌─────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                │
├─────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0        │ Services • ok:1 warn:0 fail:0 │
│ ┌─────────────────────────────────┐ │ ┌─────────────────────────────┐ │
│ │Drive    Temp  Wear  Spare Hours │ │ │Service memory: 7.1/23899.7 MiB│ │
│ │nvme0n1  28°C  1%    100%  14489 │ │ │Disk usage: —                │ │
│ │  Capacity  Usage                │ │ │  Service  Memory   Disk     │ │
│ │  954G      77G (8%)             │ │ │✔ sshd     7.1 MiB  —        │ │
│ └─────────────────────────────────┘ │ └─────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn                 │ Backups                       │
│ System memory: 5251.7/23899.7 MiB   │ Host cmbox awaiting backup    │
│ CPU load (1/5/15): 2.18 2.66 2.56   │ metrics                       │
│ CPU freq: 1100.1 MHz                │                               │
│ CPU temp: 47.0°C                    │                               │
├─────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0         │ Status • ZMQ connected        │
│ cmbox: warning: CPU load 2.18       │ Monitoring • hosts: 3         │
│ srv01: pending: awaiting metrics    │ Data source: ZMQ – connected  │
│ labbox: pending: awaiting metrics   │ Active host: cmbox (1/3)      │
└─────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```

### Multi-Host View

```
┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️ CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host     │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01    │ 4% ✅     │ 32% ✅    │ 8/8 ✅   │ 04:00 Backup OK       │
│ cmbox    │ 12% ✅    │ 45% ✅    │ 3/3 ✅   │ Yesterday Email test  │
│ labbox   │ 8% ✅     │ 28% ✅    │ 2/2 ✅   │ 2h ago NVMe temp OK   │
│ simonbox │ 15% ✅    │ 67% ⚠️    │ 4/4 ✅   │ Gaming session active │
│ steambox │ 23% ✅    │ 78% ⚠️    │ 2/2 ✅   │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```

## Architecture Principles - CRITICAL

### Agent-Dashboard Separation of Concerns

**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**

- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
- Agent sends status to dashboard via ZMQ
- Dashboard NEVER calculates status - only displays what the agent provides

**Data Flow Architecture:**

```
Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
```

**Status Handling Rules:**

- Agent provides status → Dashboard uses agent status
- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
- Dashboard widgets NEVER contain hardcoded thresholds
- TableBuilder converts status to colors for display
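
These rules reduce to a single mapping on the dashboard side. A sketch (the source names `status_level_from_agent_status()`, but the exact signature and the `StatusLevel` enum here are assumptions):

```rust
// Dashboard-side mapping of the agent's status string. An absent or
// unrecognized status maps to Unknown, never to Ok.
#[derive(Debug, PartialEq)]
enum StatusLevel {
    Ok,
    Warning,
    Critical,
    Unknown,
}

fn status_level_from_agent_status(status: Option<&str>) -> StatusLevel {
    match status {
        Some("ok") => StatusLevel::Ok,
        Some("warning") => StatusLevel::Warning,
        Some("critical") => StatusLevel::Critical,
        // Missing status is Unknown, NOT Ok.
        _ => StatusLevel::Unknown,
    }
}

fn main() {
    assert_eq!(status_level_from_agent_status(Some("warning")), StatusLevel::Warning);
    assert_eq!(status_level_from_agent_status(None), StatusLevel::Unknown);
}
```

TableBuilder then converts the `StatusLevel` to a color; the widget itself never inspects the underlying metric values.
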

### Current Agent Thresholds (as of 2025-10-12)

**CPU Load (service.rs:392-400):**
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)

**CPU Temperature (service.rs:412-420):**
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C

**Memory Usage (service.rs:402-410):**
- Warning: ≥ 80%
- Critical: ≥ 95%

### Email Notifications

**System Configuration:**
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- To: `cm@cmtec.se`
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)

**Notification Triggers:**
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)

**Monitored Components:**
- system.cpu (load status) - SystemCollector
- system.memory (usage status) - SystemCollector
- system.cpu_temp (temperature status) - SystemCollector (disabled)
- system.services (service health status) - ServiceCollector
- storage.smart (drive health) - SmartCollector
- backup.overall (backup status) - BackupCollector

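
The trigger rules above can be sketched as a comparison against the in-memory status map keyed by `"component.metric"` (the function name `should_notify` is illustrative; rate limiting is omitted for brevity):

```rust
use std::collections::HashMap;

// Decide whether a status change warrants an email: degradation
// (any -> warning/critical) or recovery (warning/critical -> ok).
fn should_notify(previous: &HashMap<String, String>, key: &str, new_status: &str) -> bool {
    let old = previous.get(key).map(String::as_str).unwrap_or("unknown");
    if old == new_status {
        return false; // no change, no email
    }
    let degraded = matches!(new_status, "warning" | "critical");
    let recovered = matches!(old, "warning" | "critical") && new_status == "ok";
    degraded || recovered
}

fn main() {
    let mut prev = HashMap::new();
    prev.insert("system.cpu".to_string(), "ok".to_string());
    assert!(should_notify(&prev, "system.cpu", "warning")); // degradation
    prev.insert("system.cpu".to_string(), "warning".to_string());
    assert!(should_notify(&prev, "system.cpu", "ok")); // recovery
    assert!(!should_notify(&prev, "system.cpu", "warning")); // unchanged
}
```
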
### Pure Auto-Discovery Implementation

**Agent Configuration:**
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: `cm-dashboard-agent [-v]` (intelligent caching enabled)

**Service Discovery:**
- Scans running systemd services
- Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
- No host-specific hardcoded service lists

### Current Implementation Status

**Completed:**

- [x] Pure auto-discovery agent (no config files)
- [x] Agent-side status calculations with defined thresholds
- [x] Dashboard displays agent status (no dashboard calculations)
- [x] Email notifications with Stockholm timezone
- [x] CPU temperature monitoring and notifications
- [x] ZMQ message format standardization
- [x] Removed all hardcoded dashboard thresholds
- [x] CPU thresholds restored to production values (5.0/8.0)
- [x] All collectors output standardized status strings (ok/warning/critical/unknown)
- [x] Dashboard connection loss detection with 5-second keep-alive
- [x] Removed excessive logging from agent
- [x] Fixed all compiler warnings in both agent and dashboard
- [x] **SystemCollector architecture refactoring completed (2025-10-12)**
- [x] Created SystemCollector for CPU load, memory, temperature, C-states
- [x] Moved system metrics from ServiceCollector to SystemCollector
- [x] Updated dashboard to parse and display SystemCollector data
- [x] Enhanced service notifications to include specific failure details
- [x] CPU temperature thresholds set to 100°C (effectively disabled)
- [x] **SystemCollector bug fixes completed (2025-10-12)**
- [x] Fixed CPU load parsing for comma decimal separator locale (", " split)
- [x] Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
- [x] Fixed C-state collection to discover all available states (including C10)
- [x] **Dashboard improvements and maintenance mode (2025-10-13)**
- [x] Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
- [x] Host navigation limited to connected hosts only (no disconnected host cycling)
- [x] Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
- [x] Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
- [x] Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
- [x] NixOS borgbackup integration with automatic maintenance mode during backups
- [x] System widget simplified to single row with C-states as description lines
- [x] CPU load thresholds updated to production values (9.0/10.0)
- [x] **Smart caching system implementation (2025-10-15)**
- [x] Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
- [x] Cache warming for instant dashboard startup responsiveness
- [x] Background refresh and proactive cache invalidation strategies
- [x] CPU usage optimization from 9.5% to <2% through smart polling reduction
- [x] Cache key consistency fixes for proper collector data flow
- [x] ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
- [x] Immich service quota detection fix (500GB instead of hardcoded 200GB)
- [x] Service-to-directory mapping for accurate disk usage calculation

**Production Configuration:**

- CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
- CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
- Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
- Connection timeout: 15 seconds (agents send data every 5 seconds)
- Email rate limiting: 30 minutes (set to 0 for testing)

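
The production thresholds translate into a simple agent-side classifier. A sketch (the function name is illustrative; the real code lives in the agent's collectors in `service.rs`):

```rust
// Agent-side CPU load classification with the production thresholds
// listed above: warning >= 9.0, critical >= 10.0. Check critical first
// so overlapping ranges resolve to the more severe status.
fn cpu_load_status(load_1: f32) -> &'static str {
    if load_1 >= 10.0 {
        "critical"
    } else if load_1 >= 9.0 {
        "warning"
    } else {
        "ok"
    }
}

fn main() {
    assert_eq!(cpu_load_status(2.18), "ok");
    assert_eq!(cpu_load_status(9.5), "warning");
    assert_eq!(cpu_load_status(12.0), "critical");
}
```

Only this string crosses the ZMQ boundary; the dashboard never re-derives it from the raw load values.
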
### Maintenance Mode

**Purpose:**
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped

**Implementation:**
- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

**Usage:**

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

**NixOS Integration:**
- The borgbackup script automatically creates/removes the maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick

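
A minimal sketch of the agent-side gate (the marker path is taken as a parameter for testability; `maintenance_active` and `notify_if_allowed` are illustrative names, not the actual agent API):

```rust
use std::path::Path;

// Maintenance-mode check: the marker file's mere existence suppresses
// notifications. Monitoring itself is unaffected.
fn maintenance_active(marker: &Path) -> bool {
    marker.exists()
}

fn notify_if_allowed(marker: &Path, subject: &str) {
    if maintenance_active(marker) {
        // Monitoring continues; only the email is suppressed.
        return;
    }
    println!("would send email: {subject}");
}

fn main() {
    let marker = Path::new("/tmp/cm-maintenance");
    notify_if_allowed(marker, "cmbox: warning: CPU load 9.2");
}
```
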
### Smart Caching System

**Purpose:**
- Reduce agent CPU usage from 9.5% to <2% through intelligent caching
- Maintain dashboard responsiveness with tiered refresh strategies
- Optimize for different data volatility characteristics

**Architecture:**

```
Cache Tiers:
- RealTime (5s):  CPU load, memory usage, quick-changing metrics
- Fast (30s):     network stats, process lists, medium-volatility data
- Medium (5min):  service status, disk usage, slow-changing data
- Slow (15min):   SMART data, backup status, rarely-changing metrics
- Static (1h):    hardware info, system capabilities, fixed data
```

**Implementation:**
- **SmartCache**: Central cache manager with RwLock for thread safety
- **CachedCollector**: Wrapper adding caching to any collector
- **CollectionScheduler**: Manages tier-based refresh timing
- **Cache warming**: Parallel startup population for instant responsiveness
- **Background refresh**: Proactive updates to prevent cache misses

**Usage:**

```bash
# Start the agent with intelligent caching
cm-dashboard-agent [-v]
```

**Performance Benefits:**
- CPU usage reduction: 9.5% → <2% expected
- Instant dashboard startup through cache warming
- Reduced disk I/O through intelligent `du` command caching
- Network efficiency with selective refresh strategies

**Configuration:**
- Cache warming timeout: 3 seconds
- Background refresh: enabled at 80% of tier interval
- Cache cleanup: every 30 minutes
- Stale data threshold: 2× tier interval

**Design Summary:**
- **Intelligent caching**: Tiered collection with optimal CPU usage
- **Auto-discovery**: No configuration files required
- **Responsive design**: Cache warming for instant dashboard startup

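
The tiers above map naturally onto `std::time::Duration` intervals, with background refresh firing at 80% of each tier's interval per the configuration notes. A sketch assuming illustrative names (`CacheTier`, `refresh_interval`), not the actual SmartCache API:

```rust
use std::time::Duration;

// Tier-to-interval mapping from the cache-tier table above.
#[derive(Clone, Copy)]
enum CacheTier {
    RealTime, // 5s
    Fast,     // 30s
    Medium,   // 5min
    Slow,     // 15min
    Static,   // 1h
}

fn refresh_interval(tier: CacheTier) -> Duration {
    match tier {
        CacheTier::RealTime => Duration::from_secs(5),
        CacheTier::Fast => Duration::from_secs(30),
        CacheTier::Medium => Duration::from_secs(5 * 60),
        CacheTier::Slow => Duration::from_secs(15 * 60),
        CacheTier::Static => Duration::from_secs(60 * 60),
    }
}

// Background refresh fires at 80% of the tier interval so fresh data
// is usually ready before the cached entry expires.
fn background_refresh_at(tier: CacheTier) -> Duration {
    refresh_interval(tier).mul_f64(0.8)
}

fn main() {
    assert_eq!(refresh_interval(CacheTier::Fast), Duration::from_secs(30));
    assert_eq!(refresh_interval(CacheTier::Static), Duration::from_secs(3600));
    assert_eq!(background_refresh_at(CacheTier::RealTime), Duration::from_secs(4));
}
```
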
### Development Guidelines

**When Adding New Metrics:**

1. Agent calculates status with thresholds
2. Agent adds a `{metric}_status` field to its JSON output
3. Dashboard data structure adds `{metric}_status: Option<String>`
4. Dashboard uses `status_level_from_agent_status()` for display
5. Agent adds notification monitoring for status changes

**Testing & Building:**

- ALWAYS use `cargo build --workspace` to match the NixOS build configuration
- Test with OpenSSL environment variables when building locally:

  ```bash
  OPENSSL_DIR=/nix/store/.../openssl-dev \
  OPENSSL_LIB_DIR=/nix/store/.../openssl/lib \
  OPENSSL_INCLUDE_DIR=/nix/store/.../openssl-dev/include \
  PKG_CONFIG_PATH=/nix/store/.../openssl-dev/lib/pkgconfig \
  OPENSSL_NO_VENDOR=1 cargo build --workspace
  ```

- This prevents build failures that only appear in NixOS deployment

**Notification System:**
- Universal automatic detection of all `_status` fields across all collectors
- Sends emails from `hostname@cmtec.se` to `cm@cmtec.se` for any status change
- Status stored in-memory: `HashMap<"component.metric", status>`
- Recovery emails sent when status changes from warning/critical → ok

**NEVER:**
- Add hardcoded thresholds to dashboard widgets
- Calculate status in the dashboard with thresholds different from the agent's
- Use "ok" as the default when agent status is missing (use "unknown")
- Calculate colors in widgets (that is TableBuilder's responsibility)
- Use `cargo build` without `--workspace` for final testing

# Important Communication Guidelines

NEVER write that you have "successfully implemented" something, or generate extensive summary text, without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

## Commit Message Guidelines

**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## NixOS Configuration Updates

When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosbox` must be updated to deploy the changes.

### Update Process

1. **Get Latest Commit Hash**

   ```bash
   git log -1 --format="%H"
   ```

2. **Update NixOS Configuration**

   Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:

   ```nix
   src = pkgs.fetchgit {
     url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
     rev = "NEW_COMMIT_HASH_HERE";
     sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
   };
   ```

3. **Get Correct Source Hash**

   Build with the placeholder hash to get the actual hash:

   ```bash
   cd ~/nixosbox
   nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
     url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
     rev = "NEW_COMMIT_HASH";
     sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
   }' 2>&1 | grep "got:"
   ```

   Example output:

   ```
   error: hash mismatch in fixed-output derivation '/nix/store/...':
            specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
               got:    sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
   ```

4. **Update Configuration with Correct Hash**

   Replace the placeholder with the hash from the error message (the "got:" line).

5. **Commit NixOS Configuration**

   ```bash
   cd ~/nixosbox
   git add hosts/common/cm-dashboard.nix
   git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
   git push
   ```

6. **Rebuild System**

   The user handles the system rebuild step - this cannot be automated.
|