CM Dashboard - Infrastructure Monitoring TUI
Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
CRITICAL: Architecture Redesign in Progress
LEGACY CODE DEPRECATION: The current codebase is being completely rewritten with a new individual metrics architecture. ALL existing code will be moved to a backup folder for reference only.
NEW IMPLEMENTATION STRATEGY:
- NO legacy code reuse - Fresh implementation following ARCHITECT.md
- Clean slate approach - Build entirely new codebase structure
- Reference-only legacy - Current code preserved only for functionality reference
Implementation Strategy
Phase 1: Legacy Code Backup (IMMEDIATE)
Backup Current Implementation:
# Create backup folder for reference
mkdir -p backup/legacy-2025-10-16
# Move all current source code to backup
mv agent/ backup/legacy-2025-10-16/
mv dashboard/ backup/legacy-2025-10-16/
mv shared/ backup/legacy-2025-10-16/
# Preserve configuration examples
cp -r config/ backup/legacy-2025-10-16/
# Keep important documentation
cp CLAUDE.md backup/legacy-2025-10-16/CLAUDE-legacy.md
cp README.md backup/legacy-2025-10-16/README-legacy.md
Reference Usage Rules:
- Legacy code is REFERENCE ONLY - never copy/paste
- Study existing functionality and UI layout patterns
- Understand current widget behavior and status mapping
- Reference notification logic and email formatting
- NO legacy code in new implementation
Phase 2: Clean Slate Implementation
New Codebase Structure: Following ARCHITECT.md precisely with zero legacy dependencies:
cm-dashboard/ # New clean repository root
├── ARCHITECT.md # Architecture documentation
├── CLAUDE.md # This file (updated)
├── README.md # New implementation documentation
├── Cargo.toml # Workspace configuration
├── agent/ # New agent implementation
│ ├── Cargo.toml
│ └── src/ ... (per ARCHITECT.md)
├── dashboard/ # New dashboard implementation
│ ├── Cargo.toml
│ └── src/ ... (per ARCHITECT.md)
├── shared/ # New shared types
│ ├── Cargo.toml
│ └── src/ ... (per ARCHITECT.md)
├── config/ # New configuration examples
└── backup/ # Legacy code for reference
└── legacy-2025-10-16/
Phase 3: Implementation Priorities
Agent Implementation (Priority 1):
- Individual metrics collection system
- ZMQ communication protocol
- Basic collectors (CPU, memory, disk, services)
- Status calculation and thresholds
- Email notification system
Dashboard Implementation (Priority 2):
- ZMQ metric consumer
- Metric storage and subscription system
- Base widget trait and framework
- Core widgets (CPU, memory, storage, services)
- Host management and navigation
Testing & Integration (Priority 3):
- End-to-end metric flow validation
- Multi-host connection testing
- UI layout validation against legacy appearance
- Performance benchmarking
Project Goals (Updated)
Core Objectives
- Individual metric architecture for maximum dashboard flexibility
- Multi-host support for cmbox, labbox, simonbox, steambox, srv01
- Performance-focused with minimal resource usage
- Keyboard-driven interface preserving current UI layout
- ZMQ-based communication replacing HTTP API polling
Key Features
- Granular metric collection (cpu_load_1min, memory_usage_percent, etc.)
- Widget-based metric subscription for flexible dashboard composition
- Preserved UI layout maintaining current visual design
- Intelligent caching for optimal performance
- Auto-discovery of services and system components
- Email notifications for status changes with rate limiting
- Maintenance mode integration for planned downtime
New Technical Architecture
Technology Stack (Updated)
- Language: Rust 🦀
- Communication: ZMQ (zeromq) for agent-dashboard messaging
- TUI Framework: ratatui (modern tui-rs fork)
- Async Runtime: tokio
- Serialization: serde (JSON for metrics)
- CLI: clap
- Error Handling: thiserror + anyhow
- Time: chrono
- Email: lettre (SMTP notifications)
New Dependencies
# Workspace Cargo.toml
[workspace]
members = ["agent", "dashboard", "shared"]
# Agent dependencies (agent/Cargo.toml)
[dependencies]
zmq = "0.10" # ZMQ communication
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }
lettre = { version = "0.11", features = ["smtp-transport"] }
gethostname = "0.4"
# Dashboard dependencies (dashboard/Cargo.toml)
[dependencies]
ratatui = "0.24"
crossterm = "0.27"
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }
# Shared dependencies (shared/Cargo.toml)
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
chrono = { version = "0.4", features = ["serde"] }
thiserror = "1.0"
New Project Structure
REFERENCE: See ARCHITECT.md for complete folder structure specification.
Current Status: Legacy code preserved in backup/legacy-2025-10-16/ for reference only.
Implementation Progress:
- Architecture documentation (ARCHITECT.md)
- Implementation strategy (CLAUDE.md updates)
- Legacy code backup
- New workspace setup
- Shared types implementation
- Agent implementation
- Dashboard implementation
- Integration testing
New Individual Metrics Architecture
REPLACED: Legacy grouped structures (SmartMetrics, ServiceMetrics, etc.) are replaced with individual metrics.
New Approach: See ARCHITECT.md for individual metric definitions:
// Individual metrics examples:
"cpu_load_1min" -> 2.5
"cpu_temperature_celsius" -> 45.0
"memory_usage_percent" -> 78.5
"disk_nvme0_wear_percent" -> 12.3
"service_ssh_status" -> "active"
"backup_last_run_timestamp" -> 1697123456
Shared Types: Located in shared/src/metrics.rs:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
pub name: String,
pub value: MetricValue,
pub status: Status,
pub timestamp: u64,
pub description: Option<String>,
pub unit: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MetricValue {
Float(f32),
Integer(i64),
String(String),
Boolean(bool),
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Status {
Ok,
Warning,
Critical,
Unknown,
}
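A minimal sketch (not part of the specification) of constructing one individual metric with the shared types above and serializing it with serde_json, the format named in the technology stack; the unit string is illustrative:
use shared::metrics::{Metric, MetricValue, Status};
fn main() -> serde_json::Result<()> {
    let metric = Metric {
        name: "cpu_load_1min".to_string(),
        value: MetricValue::Float(2.5),
        status: Status::Ok, // calculated by the agent, never by the dashboard
        timestamp: 1697123456,
        description: None,
        unit: Some("load average".to_string()),
    };
    // One JSON document per metric on the wire.
    println!("{}", serde_json::to_string_pretty(&metric)?);
    Ok(())
}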
UI Layout Preservation
CRITICAL: The exact visual layout of the legacy dashboard is PRESERVED in the new implementation.
Implementation Strategy:
- New widgets subscribe to individual metrics but render identically
- Same positions, colors, borders, and keyboard shortcuts
- Enhanced with flexible metric composition under the hood
Reference: Legacy widgets in backup/legacy-2025-10-16/dashboard/src/ui/ show exact rendering logic to replicate.
Core Architecture Principles - CRITICAL
Individual Metrics Philosophy
NEW ARCHITECTURE: Agent collects individual metrics, dashboard composes widgets from those metrics.
Status Calculation:
- Agent calculates status for each individual metric
- Agent sends individual metrics with status via ZMQ
- Dashboard aggregates metric statuses for widget-level status
- Dashboard NEVER calculates metric status - only displays and aggregates
Data Flow Architecture:
Agent (individual metrics + status) → ZMQ → Dashboard (subscribe + display) → Widgets (compose + render)
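A hedged sketch of the dashboard end of this flow, assuming a ZMQ PUB/SUB pair and a hypothetical tcp://cmbox:5555 endpoint; the real socket types and ports come from the agent/dashboard configuration, not from this example:
use shared::metrics::Metric;
fn main() -> anyhow::Result<()> {
    let ctx = zmq::Context::new();
    // Dashboard side: subscribe to the agent broadcast, route metrics to widgets by name.
    let subscriber = ctx.socket(zmq::SUB)?;
    subscriber.connect("tcp://cmbox:5555")?; // hypothetical endpoint
    subscriber.set_subscribe(b"")?;          // no topic filtering; filter by metric name instead
    loop {
        let raw = subscriber.recv_bytes(0)?;
        let metric: Metric = serde_json::from_slice(&raw)?;
        // Status is displayed and aggregated here, never recalculated.
        println!("{} -> {:?} ({:?})", metric.name, metric.value, metric.status);
    }
}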
Migration from Legacy Architecture
OLD (DEPRECATED):
Agent → ServiceMetrics{summary, services} → Dashboard → Widget
Agent → SmartMetrics{drives, summary} → Dashboard → Widget
NEW (IMPLEMENTING):
Agent → ["cpu_load_1min", "memory_usage_percent", ...] → Dashboard → Widgets subscribe to needed metrics
Current Agent Thresholds (as of 2025-10-12)
CPU Load (service.rs:392-400):
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)
CPU Temperature (service.rs:412-420):
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C
Memory Usage (service.rs:402-410):
- Warning: ≥ 80%
- Critical: ≥ 95%
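A small sketch of how the agent might map a value onto a Status using the warning/critical thresholds listed above; the function name is illustrative, and the actual logic lives in the agent collectors (legacy reference: service.rs):
use shared::metrics::Status;
/// Higher-is-worse threshold check (CPU load, memory usage, temperature).
fn status_from_thresholds(value: f32, warning: f32, critical: f32) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
fn main() {
    // Memory usage thresholds from above: warning >= 80%, critical >= 95%.
    assert!(matches!(status_from_thresholds(78.5, 80.0, 95.0), Status::Ok));
    assert!(matches!(status_from_thresholds(85.0, 80.0, 95.0), Status::Warning));
    assert!(matches!(status_from_thresholds(96.0, 80.0, 95.0), Status::Critical));
}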
Email Notifications
System Configuration:
- From: {hostname}@cmtec.se (e.g., cmbox@cmtec.se)
- To: cm@cmtec.se
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)
Notification Triggers:
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
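A sketch of the trigger rules above (degradation, recovery, rate limiting) using chrono from the stack; NotificationState and should_notify are illustrative names, not the agent's actual API:
use chrono::{DateTime, Duration, Utc};
use shared::metrics::Status;
use std::mem::discriminant;
struct NotificationState {
    last_sent: Option<DateTime<Utc>>,
    rate_limit: Duration, // Duration::zero() for testing, 30 minutes in production
}
fn should_notify(old: &Status, new: &Status, state: &NotificationState, now: DateTime<Utc>) -> bool {
    let changed = discriminant(old) != discriminant(new);
    let degradation = matches!(new, Status::Warning | Status::Critical);
    let recovery = matches!(old, Status::Warning | Status::Critical) && matches!(new, Status::Ok);
    if !changed || !(degradation || recovery) {
        return false;
    }
    // Rate limiting: suppress if the previous email went out inside the window.
    match state.last_sent {
        Some(last) => now - last >= state.rate_limit,
        None => true,
    }
}
fn main() {
    let state = NotificationState { last_sent: None, rate_limit: Duration::minutes(30) };
    let fire = should_notify(&Status::Ok, &Status::Critical, &state, Utc::now());
    println!("send email: {fire}");
}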
Monitored Components:
- system.cpu (load status) - SystemCollector
- system.memory (usage status) - SystemCollector
- system.cpu_temp (temperature status) - SystemCollector (disabled)
- system.services (service health status) - ServiceCollector
- storage.smart (drive health) - SmartCollector
- backup.overall (backup status) - BackupCollector
Pure Auto-Discovery Implementation
Agent Configuration:
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: cm-dashboard-agent [-v] (intelligent caching enabled)
Service Discovery:
- Scans ALL systemd services (active, inactive, failed, dead, etc.) using list-unit-files and list-units --all
- Discovers both system services and user services per host:
- steambox/cmbox: reads system + cm user services
- simonbox: reads system + simon user services
- Filters by service_name_filters patterns (gitea, nginx, docker, sunshine, etc.)
- Excludes maintenance services (docker-prune, sshd@, ark-permissions, etc.)
- No host-specific hardcoded service lists
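A sketch of the include/exclude filtering described above; the pattern lists are the examples from this section, while the real ones come from the service_name_filters configuration:
fn service_included(unit: &str, include: &[&str], exclude: &[&str]) -> bool {
    let name = unit.trim_end_matches(".service");
    // Exclusions win: maintenance units are never monitored.
    if exclude.iter().any(|pat| name.contains(pat)) {
        return false;
    }
    include.iter().any(|pat| name.contains(pat))
}
fn main() {
    let include = ["gitea", "nginx", "docker", "sunshine"];
    let exclude = ["docker-prune", "sshd@", "ark-permissions"];
    for unit in ["gitea.service", "docker-prune.service", "nginx.service", "systemd-udevd.service"] {
        println!("{unit}: monitored = {}", service_included(unit, &include, &exclude));
    }
}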
Current Implementation Status
Completed:
- Pure auto-discovery agent (no config files)
- Agent-side status calculations with defined thresholds
- Dashboard displays agent status (no dashboard calculations)
- Email notifications with Stockholm timezone
- CPU temperature monitoring and notifications
- ZMQ message format standardization
- Removed all hardcoded dashboard thresholds
- CPU thresholds restored to production values (5.0/8.0)
- All collectors output standardized status strings (ok/warning/critical/unknown)
- Dashboard connection loss detection with 5-second keep-alive
- Removed excessive logging from agent
- Reduced the compiler warnings introduced by the logging cleanup
- SystemCollector architecture refactoring completed (2025-10-12)
- Created SystemCollector for CPU load, memory, temperature, C-states
- Moved system metrics from ServiceCollector to SystemCollector
- Updated dashboard to parse and display SystemCollector data
- Enhanced service notifications to include specific failure details
- CPU temperature thresholds set to 100°C (effectively disabled)
- SystemCollector bug fixes completed (2025-10-12)
- Fixed CPU load parsing for comma decimal separator locale (", " split)
- Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
- Fixed C-state collection to discover all available states (including C10)
- Dashboard improvements and maintenance mode (2025-10-13)
- Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
- Host navigation limited to connected hosts only (no disconnected host cycling)
- Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
- Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
- Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
- NixOS borgbackup integration with automatic maintenance mode during backups
- System widget simplified to single row with C-states as description lines
- CPU load thresholds updated to production values (9.0/10.0)
- Smart caching system implementation (2025-10-15)
- Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
- Cache warming for instant dashboard startup responsiveness
- Background refresh and proactive cache invalidation strategies
- CPU usage optimization from 9.5% to <2% through smart polling reduction
- Cache key consistency fixes for proper collector data flow
- ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
- Immich service quota detection fix (500GB instead of hardcoded 200GB)
- Service-to-directory mapping for accurate disk usage calculation
- Real-time process monitoring implementation (2025-10-16)
- Fixed hardcoded top CPU/RAM process display with real data
- Added top CPU and RAM process collection to CpuCollector
- Implemented ps-based process monitoring with accurate percentages
- Added intelligent filtering to avoid self-monitoring artifacts
- Dashboard updated to display real-time top processes instead of placeholder text
- Fixed disk metrics permission issues in systemd collector
- Enhanced error logging for service directory access problems
- Optimized service collection focusing on status, memory, and disk metrics only
- Comprehensive backup monitoring implementation (2025-10-18)
- Added BackupCollector for reading TOML status files with disk space metrics
- Implemented BackupWidget with disk usage display and service status details
- Fixed backup script disk space parsing by adding missing capture_output=True
- Updated backup widget to show actual disk usage instead of repository size
- Fixed timestamp parsing to use backup completion time instead of start time
- Resolved timezone issues by using UTC timestamps in backup script
- Added disk identification metrics (product name, serial number) to backup status
- Enhanced UI layout with proper backup monitoring integration
- Complete warning elimination and code cleanup (2025-10-18)
- Removed all unused code including widget subscription system and WidgetType enum
- Eliminated unused cache utilities, error variants, and theme functions
- Removed unused struct fields and imports throughout codebase
- Fixed lifetime warnings and replaced subscription-based widgets with direct metric filtering
- Achieved zero build warnings in both agent and dashboard (down from 46 total warnings)
Production Configuration:
- CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
- CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
- Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
- Connection timeout: 15 seconds (agents send data every 5 seconds)
- Email rate limiting: 30 minutes (set to 0 for testing)
Maintenance Mode
Purpose:
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped
Implementation:
- Agent checks for the /tmp/cm-maintenance file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status, only notifications are blocked
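A minimal sketch of the maintenance-mode check described above; the path is the one this section names, the function name is illustrative:
use std::path::Path;
fn maintenance_mode_active() -> bool {
    Path::new("/tmp/cm-maintenance").exists()
}
fn main() {
    if maintenance_mode_active() {
        // Monitoring continues; only email notifications are suppressed.
        println!("maintenance mode: notifications suppressed");
    } else {
        println!("normal operation: notifications enabled");
    }
}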
Usage:
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
NixOS Integration:
- Borgbackup script automatically creates/removes maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick
Configuration-Based Smart Caching System
Purpose:
- Reduce agent CPU usage from 10% to <1% through configuration-driven intelligent caching
- Maintain dashboard responsiveness with configurable refresh strategies
- Optimize for different data volatility characteristics via config files
Configuration-Driven Architecture:
# Cache tiers defined in agent.toml
[cache.tiers.realtime]
interval_seconds = 5
description = "High-frequency metrics (CPU load, memory usage)"
[cache.tiers.medium]
interval_seconds = 300
description = "Low-frequency metrics (service status, disk usage)"
[cache.tiers.slow]
interval_seconds = 900
description = "Very low-frequency metrics (SMART data, backup status)"
# Metric assignments via configuration
[cache.metric_assignments]
"cpu_load_*" = "realtime"
"service_*_disk_gb" = "medium"
"disk_*_temperature" = "slow"
Implementation:
- ConfigurableCache: Central cache manager reading tier config from files
- MetricCacheManager: Assigns metrics to tiers based on configuration patterns
- TierScheduler: Manages configurable tier-based refresh timing
- Cache warming: Parallel startup population for instant responsiveness
- Background refresh: Proactive updates based on configured intervals
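A sketch of pattern-based tier assignment as configured in [cache.metric_assignments] above; the simple single-'*' wildcard matcher here is illustrative, not the agent's actual parser:
fn pattern_matches(pattern: &str, metric: &str) -> bool {
    match pattern.split_once('*') {
        Some((prefix, suffix)) => {
            metric.len() >= prefix.len() + suffix.len()
                && metric.starts_with(prefix)
                && metric.ends_with(suffix)
        }
        None => pattern == metric,
    }
}
fn tier_for_metric<'a>(metric: &str, assignments: &'a [(&'a str, &'a str)], default: &'a str) -> &'a str {
    assignments
        .iter()
        .find(|(pattern, _)| pattern_matches(pattern, metric))
        .map(|(_, tier)| *tier)
        .unwrap_or(default)
}
fn main() {
    // Assignments mirroring the TOML example above.
    let assignments = [
        ("cpu_load_*", "realtime"),
        ("service_*_disk_gb", "medium"),
        ("disk_*_temperature", "slow"),
    ];
    println!("{}", tier_for_metric("cpu_load_1min", &assignments, "medium"));         // realtime
    println!("{}", tier_for_metric("service_gitea_disk_gb", &assignments, "medium")); // medium
    println!("{}", tier_for_metric("memory_usage_percent", &assignments, "medium"));  // medium (default)
}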
Configuration:
[cache]
enabled = true
default_ttl_seconds = 30
max_entries = 10000
warming_timeout_seconds = 3
background_refresh_enabled = true
cleanup_interval_seconds = 1800
Performance Benefits:
- CPU usage reduction: 10% → <1% target through configuration optimization
- Configurable cache intervals prevent expensive operations from running too frequently
- Disk usage detection cached at 5-minute intervals instead of every 5 seconds
- Selective metric refresh based on configured volatility patterns
Usage:
# Start agent with config-based caching
cm-dashboard-agent --config /etc/cm-dashboard/agent.toml [-v]
Architecture:
- Configuration-driven caching: Tiered collection with configurable intervals
- Config file management: All cache behavior defined in TOML configuration
- Responsive design: Cache warming for instant dashboard startup
New Implementation Guidelines - CRITICAL
ARCHITECTURE ENFORCEMENT:
- ZERO legacy code reuse - Fresh implementation following ARCHITECT.md exactly
- Individual metrics only - NO grouped metric structures
- Reference-only legacy - Study old functionality, implement new architecture
- Clean slate mindset - Build as if legacy codebase never existed
Implementation Rules:
- Individual Metrics: Each metric is collected, transmitted, and stored individually
- Agent Status Authority: Agent calculates status for each metric using thresholds
- Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
- Status Aggregation: Dashboard aggregates individual metric statuses for widget status
- ZMQ Communication: All metrics transmitted via ZMQ, no HTTP APIs
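A sketch of the Status Aggregation rule above: the dashboard only aggregates agent-provided statuses into a widget-level status (worst status wins); the severity ordering for Unknown is an assumption:
use shared::metrics::Status;
fn severity(status: &Status) -> u8 {
    match status {
        Status::Ok => 0,
        Status::Unknown => 1, // ordering of Unknown is an assumption
        Status::Warning => 2,
        Status::Critical => 3,
    }
}
/// Widget status = worst status among the metrics the widget subscribes to.
fn aggregate_statuses<'a>(statuses: impl Iterator<Item = &'a Status>) -> Status {
    statuses
        .max_by_key(|s| severity(*s))
        .cloned()
        .unwrap_or(Status::Unknown)
}
fn main() {
    let cpu_widget = [Status::Ok, Status::Warning, Status::Ok];
    println!("{:?}", aggregate_statuses(cpu_widget.iter())); // Warning
}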
When Adding New Metrics:
- Define metric name in shared registry (e.g., "disk_nvme1_temperature_celsius")
- Implement collector that returns individual Metric struct
- Agent calculates status using configured thresholds
- Dashboard widgets subscribe to metric by name
- Notification system automatically detects status changes
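A sketch of the steps above for the hypothetical metric name from step 1; the threshold values and helper names are placeholders, and the real collector trait is defined per ARCHITECT.md:
use shared::metrics::{Metric, MetricValue, Status};
// Step 1: metric name in a shared registry (widgets reference names from const arrays).
pub const DISK_NVME1_TEMPERATURE_CELSIUS: &str = "disk_nvme1_temperature_celsius";
// Steps 2 + 3: a collector returns an individual Metric with agent-calculated status.
fn collect_nvme1_temperature(temp_c: f32, now: u64) -> Metric {
    let status = if temp_c >= 70.0 {
        Status::Critical
    } else if temp_c >= 60.0 {
        Status::Warning // placeholder thresholds, not project values
    } else {
        Status::Ok
    };
    Metric {
        name: DISK_NVME1_TEMPERATURE_CELSIUS.to_string(),
        value: MetricValue::Float(temp_c),
        status,
        timestamp: now,
        description: Some("NVMe drive 1 composite temperature".to_string()),
        unit: Some("celsius".to_string()),
    }
}
// Step 4: a dashboard widget subscribes by name via a const array, never by hardcoding strings inline.
pub const STORAGE_WIDGET_METRICS: &[&str] = &[DISK_NVME1_TEMPERATURE_CELSIUS];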
Testing & Building:
- Workspace builds: cargo build --workspace for all testing
- Clean compilation: Remove target/ between architecture changes
- ZMQ testing: Test agent-dashboard communication independently
- Widget testing: Verify UI layout matches legacy appearance exactly
NEVER in New Implementation:
- Copy/paste ANY code from legacy backup
- Create grouped metric structures (SystemMetrics, etc.)
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Skip individual metric architecture for "simplicity"
Legacy Reference Usage:
- Study UI layout and rendering logic only
- Understand email notification formatting
- Reference status color mapping
- Learn host navigation patterns
- NO code copying or structural influence
Important Communication Guidelines
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.
Commit Message Guidelines
NEVER mention:
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
ALWAYS:
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
Examples:
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
NixOS Configuration Updates
When code changes are made to cm-dashboard, the NixOS configuration at ~/nixosbox must be updated to deploy the changes.
Update Process
- Get Latest Commit Hash
git log -1 --format="%H"
- Update NixOS Configuration
Edit ~/nixosbox/hosts/common/cm-dashboard.nix:
src = pkgs.fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
};
- Get Correct Source Hash
Build with the placeholder hash to get the actual hash:
cd ~/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
Example output:
error: hash mismatch in fixed-output derivation '/nix/store/...':
  specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
  got:       sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
- Update Configuration with Correct Hash
Replace the placeholder with the hash from the error message (the "got:" line).
- Commit NixOS Configuration
cd ~/nixosbox
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
git push
- Rebuild System
The user handles the system rebuild step - this cannot be automated.