
CM Dashboard - Infrastructure Monitoring TUI

Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.

CRITICAL: Architecture Redesign in Progress

LEGACY CODE DEPRECATION: The current codebase is being completely rewritten with a new individual metrics architecture. ALL existing code will be moved to a backup folder for reference only.

NEW IMPLEMENTATION STRATEGY:

  • NO legacy code reuse - Fresh implementation following ARCHITECT.md
  • Clean slate approach - Build entirely new codebase structure
  • Reference-only legacy - Current code preserved only for functionality reference

Implementation Strategy

Phase 1: Legacy Code Backup (IMMEDIATE)

Backup Current Implementation:

# Create backup folder for reference
mkdir -p backup/legacy-2025-10-16

# Move all current source code to backup  
mv agent/ backup/legacy-2025-10-16/
mv dashboard/ backup/legacy-2025-10-16/
mv shared/ backup/legacy-2025-10-16/

# Preserve configuration examples
cp -r config/ backup/legacy-2025-10-16/

# Keep important documentation
cp CLAUDE.md backup/legacy-2025-10-16/CLAUDE-legacy.md
cp README.md backup/legacy-2025-10-16/README-legacy.md

Reference Usage Rules:

  • Legacy code is REFERENCE ONLY - never copy/paste
  • Study existing functionality and UI layout patterns
  • Understand current widget behavior and status mapping
  • Reference notification logic and email formatting
  • NO legacy code in new implementation

Phase 2: Clean Slate Implementation

New Codebase Structure: Following ARCHITECT.md precisely with zero legacy dependencies:

cm-dashboard/                      # New clean repository root
├── ARCHITECT.md                   # Architecture documentation  
├── CLAUDE.md                      # This file (updated)
├── README.md                      # New implementation documentation
├── Cargo.toml                     # Workspace configuration
├── agent/                         # New agent implementation
│   ├── Cargo.toml
│   └── src/ ... (per ARCHITECT.md)
├── dashboard/                     # New dashboard implementation  
│   ├── Cargo.toml
│   └── src/ ... (per ARCHITECT.md)
├── shared/                        # New shared types
│   ├── Cargo.toml
│   └── src/ ... (per ARCHITECT.md)
├── config/                        # New configuration examples
└── backup/                        # Legacy code for reference
    └── legacy-2025-10-16/

Phase 3: Implementation Priorities

Agent Implementation (Priority 1):

  1. Individual metrics collection system
  2. ZMQ communication protocol
  3. Basic collectors (CPU, memory, disk, services)
  4. Status calculation and thresholds
  5. Email notification system

Dashboard Implementation (Priority 2):

  1. ZMQ metric consumer
  2. Metric storage and subscription system
  3. Base widget trait and framework
  4. Core widgets (CPU, memory, storage, services)
  5. Host management and navigation

Testing & Integration (Priority 3):

  1. End-to-end metric flow validation
  2. Multi-host connection testing
  3. UI layout validation against legacy appearance
  4. Performance benchmarking

Project Goals (Updated)

Core Objectives

  • Individual metric architecture for maximum dashboard flexibility
  • Multi-host support for cmbox, labbox, simonbox, steambox, srv01
  • Performance-focused with minimal resource usage
  • Keyboard-driven interface preserving current UI layout
  • ZMQ-based communication replacing HTTP API polling

Key Features

  • Granular metric collection (cpu_load_1min, memory_usage_percent, etc.)
  • Widget-based metric subscription for flexible dashboard composition
  • Preserved UI layout maintaining current visual design
  • Intelligent caching for optimal performance
  • Auto-discovery of services and system components
  • Email notifications for status changes with rate limiting
  • Maintenance mode integration for planned downtime

New Technical Architecture

Technology Stack (Updated)

  • Language: Rust 🦀
  • Communication: ZMQ (zeromq) for agent-dashboard messaging
  • TUI Framework: ratatui (modern tui-rs fork)
  • Async Runtime: tokio
  • Serialization: serde (JSON for metrics)
  • CLI: clap
  • Error Handling: thiserror + anyhow
  • Time: chrono
  • Email: lettre (SMTP notifications)

New Dependencies

# Workspace Cargo.toml
[workspace]
members = ["agent", "dashboard", "shared"]

# Agent dependencies (agent/Cargo.toml)
[dependencies]
zmq = "0.10"                                    # ZMQ communication
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }
lettre = { version = "0.11", features = ["smtp-transport"] }
gethostname = "0.4"

# Dashboard dependencies (dashboard/Cargo.toml)
[dependencies]
ratatui = "0.24"
crossterm = "0.27"
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }

# Shared dependencies (shared/Cargo.toml)
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
chrono = { version = "0.4", features = ["serde"] }
thiserror = "1.0"

New Project Structure

REFERENCE: See ARCHITECT.md for complete folder structure specification.

Current Status: Legacy code preserved in backup/legacy-2025-10-16/ for reference only.

Implementation Progress:

  • Architecture documentation (ARCHITECT.md)
  • Implementation strategy (CLAUDE.md updates)
  • Legacy code backup
  • New workspace setup
  • Shared types implementation
  • Agent implementation
  • Dashboard implementation
  • Integration testing

New Individual Metrics Architecture

REPLACED: Legacy grouped structures (SmartMetrics, ServiceMetrics, etc.) are replaced with individual metrics.

New Approach: See ARCHITECT.md for individual metric definitions:

// Individual metrics examples:
"cpu_load_1min" -> 2.5
"cpu_temperature_celsius" -> 45.0  
"memory_usage_percent" -> 78.5
"disk_nvme0_wear_percent" -> 12.3
"service_ssh_status" -> "active"
"backup_last_run_timestamp" -> 1697123456

Shared Types: Located in shared/src/metrics.rs:

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]  
pub enum MetricValue {
    Float(f32),
    Integer(i64),
    String(String),
    Boolean(bool),
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}
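
ZMQ transport sketch: ARCHITECT.md defines the actual wire format; purely as a hedged illustration, the agent could publish each serialized Metric on a PUB socket with the metric name as the subscription topic. The PUB/SUB pattern, the tcp://*:5556 endpoint, and the two-frame layout below are assumptions:

// Illustrative only: socket pattern, endpoint, and framing are assumptions,
// not the ARCHITECT.md specification.
fn publish_metric(publisher: &zmq::Socket, name: &str, metric_json: &str) -> anyhow::Result<()> {
    // Topic frame first so dashboards can subscribe per metric name,
    // then the serialized Metric as the payload frame.
    publisher.send(name.as_bytes(), zmq::SNDMORE)?;
    publisher.send(metric_json.as_bytes(), 0)?;
    Ok(())
}

fn main() -> anyhow::Result<()> {
    let ctx = zmq::Context::new();
    let publisher = ctx.socket(zmq::PUB)?;
    publisher.bind("tcp://*:5556")?; // assumed endpoint

    // Payload shown in simplified form; in practice this would be
    // serde_json::to_string(&metric)? for the shared Metric type.
    publish_metric(&publisher, "cpu_load_1min", r#"{"name":"cpu_load_1min","value":2.5}"#)?;
    Ok(())
}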

UI Layout Preservation

CRITICAL: The exact visual layout of the legacy dashboard is PRESERVED in the new implementation.

Implementation Strategy:

  • New widgets subscribe to individual metrics but render identically
  • Same positions, colors, borders, and keyboard shortcuts
  • Enhanced with flexible metric composition under the hood

Reference: Legacy widgets in backup/legacy-2025-10-16/dashboard/src/ui/ show exact rendering logic to replicate.
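
As a hedged sketch of that composition model (the trait name, method signatures, and rendering to a plain String rather than ratatui are all assumptions), a new widget could declare its subscribed metrics as a const array and render only from values the dashboard has stored:

// Sketch only: names and signatures are assumptions, not the ARCHITECT.md contract.
use std::collections::HashMap;

// Simplified stand-ins for the shared types.
#[derive(Debug, Clone)]
pub enum Status { Ok, Warning, Critical, Unknown }

#[derive(Debug, Clone)]
pub struct Metric {
    pub name: String,
    pub value: f32,
    pub status: Status,
}

pub trait Widget {
    /// Metric names this widget subscribes to.
    fn subscribed_metrics(&self) -> &'static [&'static str];
    /// Render from whatever subscribed metrics are currently stored.
    fn render(&self, metrics: &HashMap<String, Metric>) -> String;
}

const CPU_METRICS: &[&str] = &["cpu_load_1min", "cpu_temperature_celsius"];

pub struct CpuWidget;

impl Widget for CpuWidget {
    fn subscribed_metrics(&self) -> &'static [&'static str] {
        CPU_METRICS
    }

    fn render(&self, metrics: &HashMap<String, Metric>) -> String {
        let load = metrics
            .get(CPU_METRICS[0])
            .map(|m| m.value)
            .unwrap_or(0.0);
        format!("CPU load (1m): {load:.2}")
    }
}

Keeping the metric names in a const array matches the "no hardcoded metric names in widgets" rule further below.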

Core Architecture Principles - CRITICAL

Individual Metrics Philosophy

NEW ARCHITECTURE: Agent collects individual metrics, dashboard composes widgets from those metrics.

Status Calculation:

  • Agent calculates status for each individual metric
  • Agent sends individual metrics with status via ZMQ
  • Dashboard aggregates metric statuses for widget-level status
  • Dashboard NEVER calculates metric status - only displays and aggregates

Data Flow Architecture:

Agent (individual metrics + status) → ZMQ → Dashboard (subscribe + display) → Widgets (compose + render)
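
Widget-level status then comes from aggregation alone. A minimal sketch, assuming the Status enum from shared/src/metrics.rs and treating Unknown as less severe than Warning (that ordering is an assumption):

// The dashboard never recalculates metric status; it only picks the worst
// agent-provided status among a widget's subscribed metrics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status { Ok, Warning, Critical, Unknown }

fn severity(status: Status) -> u8 {
    match status {
        Status::Ok => 0,
        Status::Unknown => 1, // assumed ranking: Unknown below Warning
        Status::Warning => 2,
        Status::Critical => 3,
    }
}

/// Widget status = worst status among its subscribed metrics.
pub fn aggregate_status(metric_statuses: &[Status]) -> Status {
    metric_statuses
        .iter()
        .copied()
        .max_by_key(|s| severity(*s))
        .unwrap_or(Status::Unknown)
}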

Migration from Legacy Architecture

OLD (DEPRECATED):

Agent → ServiceMetrics{summary, services} → Dashboard → Widget
Agent → SmartMetrics{drives, summary} → Dashboard → Widget  

NEW (IMPLEMENTING):

Agent → ["cpu_load_1min", "memory_usage_percent", ...] → Dashboard → Widgets subscribe to needed metrics

Current Agent Thresholds (as of 2025-10-12; superseded by the Production Configuration values below)

CPU Load (service.rs:392-400):

  • Warning: ≥ 2.0 (testing value, was 5.0)
  • Critical: ≥ 4.0 (testing value, was 8.0)

CPU Temperature (service.rs:412-420):

  • Warning: ≥ 70.0°C
  • Critical: ≥ 80.0°C

Memory Usage (service.rs:402-410):

  • Warning: ≥ 80%
  • Critical: ≥ 95%

Email Notifications

System Configuration:

  • From: {hostname}@cmtec.se (e.g., cmbox@cmtec.se)
  • To: cm@cmtec.se
  • SMTP: localhost:25 (postfix)
  • Timezone: Europe/Stockholm (not UTC)

Notification Triggers:

  • Status degradation: any → "warning" or "critical"
  • Recovery: "warning"/"critical" → "ok"
  • Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
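
A hedged sketch of those triggers with per-metric rate limiting follows; the field names and in-memory bookkeeping are assumptions, and the maintenance-mode check described later would gate the actual send:

// Sketch: notify on degradation or recovery, rate-limited per metric.
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status { Ok, Warning, Critical, Unknown }

struct Notifier {
    last_sent: HashMap<String, Instant>,
    rate_limit: Duration, // Duration::ZERO for testing, 30 minutes in production
}

impl Notifier {
    fn should_notify(&mut self, metric: &str, old: Status, new: Status) -> bool {
        let degraded = matches!(new, Status::Warning | Status::Critical) && old != new;
        let recovered = new == Status::Ok && matches!(old, Status::Warning | Status::Critical);
        if !(degraded || recovered) {
            return false;
        }
        // Rate limit per metric so flapping statuses do not flood cm@cmtec.se.
        match self.last_sent.get(metric) {
            Some(sent) if sent.elapsed() < self.rate_limit => false,
            _ => {
                self.last_sent.insert(metric.to_string(), Instant::now());
                true
            }
        }
    }
}

With rate_limit set to Duration::ZERO this collapses to notify-on-every-transition, matching the testing configuration above.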

Monitored Components:

  • system.cpu (load status) - SystemCollector
  • system.memory (usage status) - SystemCollector
  • system.cpu_temp (temperature status) - SystemCollector (disabled)
  • system.services (service health status) - ServiceCollector
  • storage.smart (drive health) - SmartCollector
  • backup.overall (backup status) - BackupCollector

Pure Auto-Discovery Implementation

Agent Configuration:

  • No config files required
  • Auto-detects storage devices, services, backup systems
  • Runtime discovery of system capabilities
  • CLI: cm-dashboard-agent [-v] (intelligent caching enabled)

Service Discovery:

  • Scans running systemd services
  • Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
  • No host-specific hardcoded service lists
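
A sketch of that discovery pass, assuming a systemctl subprocess and a deliberately small illustrative pattern list (the real pattern set lives in the agent):

// Sketch: list running systemd services and keep only the "interesting" ones.
use std::process::Command;

// Illustrative subset of the interesting-service patterns.
const INTERESTING_PATTERNS: &[&str] = &["gitea", "nginx", "docker"];

fn discover_services() -> anyhow::Result<Vec<String>> {
    let output = Command::new("systemctl")
        .args(["list-units", "--type=service", "--state=running", "--no-legend", "--plain"])
        .output()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    let services = stdout
        .lines()
        // First column is the unit name, e.g. "nginx.service".
        .filter_map(|line| line.split_whitespace().next())
        .filter(|unit| INTERESTING_PATTERNS.iter().any(|p| unit.contains(p)))
        .map(|unit| unit.trim_end_matches(".service").to_string())
        .collect();
    Ok(services)
}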

Current Implementation Status

Completed:

  • Pure auto-discovery agent (no config files)
  • Agent-side status calculations with defined thresholds
  • Dashboard displays agent status (no dashboard calculations)
  • Email notifications with Stockholm timezone
  • CPU temperature monitoring and notifications
  • ZMQ message format standardization
  • Removed all hardcoded dashboard thresholds
  • CPU thresholds restored to production values (5.0/8.0)
  • All collectors output standardized status strings (ok/warning/critical/unknown)
  • Dashboard connection loss detection with 5-second keep-alive
  • Removed excessive logging from agent
  • Fixed all compiler warnings in both agent and dashboard
  • SystemCollector architecture refactoring completed (2025-10-12)
  • Created SystemCollector for CPU load, memory, temperature, C-states
  • Moved system metrics from ServiceCollector to SystemCollector
  • Updated dashboard to parse and display SystemCollector data
  • Enhanced service notifications to include specific failure details
  • CPU temperature thresholds set to 100°C (effectively disabled)
  • SystemCollector bug fixes completed (2025-10-12)
  • Fixed CPU load parsing for comma decimal separator locale (", " split)
  • Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
  • Fixed C-state collection to discover all available states (including C10)
  • Dashboard improvements and maintenance mode (2025-10-13)
  • Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
  • Host navigation limited to connected hosts only (no disconnected host cycling)
  • Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
  • Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
  • Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
  • NixOS borgbackup integration with automatic maintenance mode during backups
  • System widget simplified to single row with C-states as description lines
  • CPU load thresholds updated to production values (9.0/10.0)
  • Smart caching system implementation (2025-10-15)
  • Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
  • Cache warming for instant dashboard startup responsiveness
  • Background refresh and proactive cache invalidation strategies
  • CPU usage optimization from 9.5% to <2% through smart polling reduction
  • Cache key consistency fixes for proper collector data flow
  • ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
  • Immich service quota detection fix (500GB instead of hardcoded 200GB)
  • Service-to-directory mapping for accurate disk usage calculation
  • Real-time process monitoring implementation (2025-10-16)
  • Fixed hardcoded top CPU/RAM process display with real data
  • Added top CPU and RAM process collection to CpuCollector
  • Implemented ps-based process monitoring with accurate percentages
  • Added intelligent filtering to avoid self-monitoring artifacts
  • Dashboard updated to display real-time top processes instead of placeholder text
  • Fixed disk metrics permission issues in systemd collector
  • Enhanced error logging for service directory access problems
  • Optimized service collection focusing on status, memory, and disk metrics only
  • Comprehensive backup monitoring implementation (2025-10-18)
  • Added BackupCollector for reading TOML status files with disk space metrics
  • Implemented BackupWidget with disk usage display and service status details
  • Fixed backup script disk space parsing by adding missing capture_output=True
  • Updated backup widget to show actual disk usage instead of repository size
  • Fixed timestamp parsing to use backup completion time instead of start time
  • Resolved timezone issues by using UTC timestamps in backup script
  • Added disk identification metrics (product name, serial number) to backup status
  • Enhanced UI layout with proper backup monitoring integration

Production Configuration:

  • CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
  • CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
  • Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
  • Connection timeout: 15 seconds (agents send data every 5 seconds)
  • Email rate limiting: 30 minutes (set to 0 for testing)
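
Expressed as agent-side status functions, the CPU load and memory thresholds above look roughly like this (a sketch; the real calculations live in the agent collectors):

// Production thresholds from the list above, as agent-side status calculations.
#[derive(Debug, Clone, Copy)]
pub enum Status { Ok, Warning, Critical }

pub fn cpu_load_status(load_1min: f32) -> Status {
    match load_1min {
        l if l >= 10.0 => Status::Critical,
        l if l >= 9.0 => Status::Warning,
        _ => Status::Ok,
    }
}

pub fn memory_status(usage_percent: f32) -> Status {
    match usage_percent {
        p if p >= 95.0 => Status::Critical,
        p if p >= 80.0 => Status::Warning,
        _ => Status::Ok,
    }
}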

Maintenance Mode

Purpose:

  • Suppress email notifications during planned maintenance or backups
  • Prevents false alerts when services are intentionally stopped

Implementation:

  • Agent checks for /tmp/cm-maintenance file before sending notifications
  • File presence suppresses all email notifications while continuing monitoring
  • Dashboard continues to show real status, only notifications are blocked
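
On the agent side the gate can be a simple file-existence check before any email is dispatched; a minimal sketch:

// Sketch: suppress notifications while /tmp/cm-maintenance exists.
use std::path::Path;

const MAINTENANCE_FILE: &str = "/tmp/cm-maintenance";

fn notifications_enabled() -> bool {
    !Path::new(MAINTENANCE_FILE).exists()
}

fn maybe_send_notification(send_email: impl FnOnce()) {
    // Monitoring and dashboard status updates continue either way;
    // only the outgoing email is suppressed.
    if notifications_enabled() {
        send_email();
    }
}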

Usage:

# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode  
rm /tmp/cm-maintenance

NixOS Integration:

  • Borgbackup script automatically creates/removes maintenance file
  • Automatic cleanup via trap ensures maintenance mode doesn't stick

Configuration-Based Smart Caching System

Purpose:

  • Reduce agent CPU usage from 10% to <1% through configuration-driven intelligent caching
  • Maintain dashboard responsiveness with configurable refresh strategies
  • Optimize for different data volatility characteristics via config files

Configuration-Driven Architecture:

# Cache tiers defined in agent.toml
[cache.tiers.realtime]
interval_seconds = 5
description = "High-frequency metrics (CPU load, memory usage)"

[cache.tiers.medium]
interval_seconds = 300
description = "Low-frequency metrics (service status, disk usage)"

[cache.tiers.slow]
interval_seconds = 900
description = "Very low-frequency metrics (SMART data, backup status)"

# Metric assignments via configuration
[cache.metric_assignments]
"cpu_load_*" = "realtime"
"service_*_disk_gb" = "medium"
"disk_*_temperature" = "slow"

Implementation:

  • ConfigurableCache: Central cache manager reading tier config from files
  • MetricCacheManager: Assigns metrics to tiers based on configuration patterns
  • TierScheduler: Manages configurable tier-based refresh timing
  • Cache warming: Parallel startup population for instant responsiveness
  • Background refresh: Proactive updates based on configured intervals
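
A sketch of how the [cache.metric_assignments] patterns above could be resolved to a tier, assuming simple single-* glob matching and a default tier of medium (both assumptions):

// Sketch: map a metric name to a cache tier from configured glob-like patterns.
// Only a single '*' wildcard per pattern is handled here, as an assumption.
fn matches_pattern(pattern: &str, metric: &str) -> bool {
    match pattern.split_once('*') {
        Some((prefix, suffix)) => {
            metric.len() >= prefix.len() + suffix.len()
                && metric.starts_with(prefix)
                && metric.ends_with(suffix)
        }
        None => pattern == metric,
    }
}

/// Returns the configured tier name for a metric, falling back to an assumed default.
fn tier_for_metric<'a>(assignments: &'a [(String, String)], metric: &str) -> &'a str {
    assignments
        .iter()
        .find(|(pattern, _)| matches_pattern(pattern, metric))
        .map(|(_, tier)| tier.as_str())
        .unwrap_or("medium")
}

With the example assignments above, cpu_load_1min resolves to realtime and disk_nvme0_temperature to slow.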

Configuration:

[cache]
enabled = true
default_ttl_seconds = 30
max_entries = 10000
warming_timeout_seconds = 3
background_refresh_enabled = true
cleanup_interval_seconds = 1800

Performance Benefits:

  • CPU usage reduction: 10% → <1% target through configuration optimization
  • Configurable cache intervals prevent expensive operations from running too frequently
  • Disk usage detection cached at 5-minute intervals instead of every 5 seconds
  • Selective metric refresh based on configured volatility patterns

Usage:

# Start agent with config-based caching
cm-dashboard-agent --config /etc/cm-dashboard/agent.toml [-v]

Architecture:

  • Configuration-driven caching: Tiered collection with configurable intervals
  • Config file management: All cache behavior defined in TOML configuration
  • Responsive design: Cache warming for instant dashboard startup

New Implementation Guidelines - CRITICAL

ARCHITECTURE ENFORCEMENT:

  • ZERO legacy code reuse - Fresh implementation following ARCHITECT.md exactly
  • Individual metrics only - NO grouped metric structures
  • Reference-only legacy - Study old functionality, implement new architecture
  • Clean slate mindset - Build as if legacy codebase never existed

Implementation Rules:

  1. Individual Metrics: Each metric is collected, transmitted, and stored individually
  2. Agent Status Authority: Agent calculates status for each metric using thresholds
  3. Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
  4. Status Aggregation: Dashboard aggregates individual metric statuses for widget status
  5. ZMQ Communication: All metrics transmitted via ZMQ, no HTTP APIs

When Adding New Metrics:

  1. Define metric name in shared registry (e.g., "disk_nvme1_temperature_celsius")
  2. Implement collector that returns individual Metric struct
  3. Agent calculates status using configured thresholds
  4. Dashboard widgets subscribe to metric by name
  5. Notification system automatically detects status changes
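
As a worked example of steps 1-3 for the disk_nvme1_temperature_celsius metric named above (the read_celsius closure and the 60/70 °C thresholds are hypothetical illustrations; the real Metric type is the one in shared/src/metrics.rs):

// Sketch: one individually collected metric with agent-side status calculation.
use std::time::{SystemTime, UNIX_EPOCH};

// Simplified stand-ins for the shared types.
#[derive(Debug, Clone)]
pub enum Status { Ok, Warning, Critical, Unknown }

#[derive(Debug, Clone)]
pub struct Metric {
    pub name: String,
    pub value: f32,
    pub status: Status,
    pub timestamp: u64,
}

// Hypothetical thresholds for illustration only.
fn status_for_temperature(celsius: f32) -> Status {
    match celsius {
        c if c >= 70.0 => Status::Critical,
        c if c >= 60.0 => Status::Warning,
        _ => Status::Ok,
    }
}

pub fn collect_nvme1_temperature(read_celsius: impl Fn() -> Option<f32>) -> Metric {
    let timestamp = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    let (value, status) = match read_celsius() {
        Some(celsius) => (celsius, status_for_temperature(celsius)),
        None => (0.0, Status::Unknown), // sensor unavailable
    };
    Metric {
        name: "disk_nvme1_temperature_celsius".to_string(),
        value,
        status,
        timestamp,
    }
}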

Testing & Building:

  • Workspace builds: cargo build --workspace for all testing
  • Clean compilation: Remove target/ between architecture changes
  • ZMQ testing: Test agent-dashboard communication independently
  • Widget testing: Verify UI layout matches legacy appearance exactly

NEVER in New Implementation:

  • Copy/paste ANY code from legacy backup
  • Create grouped metric structures (SystemMetrics, etc.)
  • Calculate status in dashboard widgets
  • Hardcode metric names in widgets (use const arrays)
  • Skip individual metric architecture for "simplicity"

Legacy Reference Usage:

  • Study UI layout and rendering logic only
  • Understand email notification formatting
  • Reference status color mapping
  • Learn host navigation patterns
  • NO code copying or structural influence

Important Communication Guidelines

NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

Commit Message Guidelines

NEVER mention:

  • Claude or any AI assistant names
  • Automation or AI-generated content
  • Any reference to automated code generation

ALWAYS:

  • Focus purely on technical changes and their purpose
  • Use standard software development commit message format
  • Describe what was changed and why, not how it was created
  • Write from the perspective of a human developer

Examples:

  • "Generated with Claude Code"
  • "AI-assisted implementation"
  • "Automated refactoring"
  • "Implement maintenance mode for backup operations"
  • "Restructure storage widget with improved layout"
  • "Update CPU thresholds to production values"

NixOS Configuration Updates

When code changes are made to cm-dashboard, the NixOS configuration at ~/nixosbox must be updated to deploy the changes.

Update Process

  1. Get Latest Commit Hash

    git log -1 --format="%H"
    
  2. Update NixOS Configuration: Edit ~/nixosbox/hosts/common/cm-dashboard.nix:

    src = pkgs.fetchgit {
      url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
      rev = "NEW_COMMIT_HASH_HERE";
      sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
    };
    
  3. Get Correct Source Hash: Build with the placeholder hash to obtain the actual hash:

    cd ~/nixosbox
    nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit { 
      url = "https://gitea.cmtec.se/cm/cm-dashboard.git"; 
      rev = "NEW_COMMIT_HASH"; 
      sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; 
    }' 2>&1 | grep "got:"
    

    Example output:

    error: hash mismatch in fixed-output derivation '/nix/store/...':
             specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
                got:    sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
    
  4. Update Configuration with Correct Hash: Replace the placeholder with the hash from the error message (the "got:" line).

  5. Commit NixOS Configuration

    cd ~/nixosbox
    git add hosts/common/cm-dashboard.nix
    git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
    git push
    
  6. Rebuild System: The user handles the system rebuild step; this cannot be automated.