# CM Dashboard - Infrastructure Monitoring TUI
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
## CRITICAL: Architecture Redesign in Progress
**LEGACY CODE DEPRECATION**: The current codebase is being completely rewritten with a new individual metrics architecture. ALL existing code will be moved to a backup folder for reference only.
**NEW IMPLEMENTATION STRATEGY**:
- **NO legacy code reuse** - Fresh implementation following ARCHITECT.md
- **Clean slate approach** - Build entirely new codebase structure
- **Reference-only legacy** - Current code preserved only for functionality reference
## Implementation Strategy
### Phase 1: Legacy Code Backup (IMMEDIATE)
**Backup Current Implementation:**
```bash
# Create backup folder for reference
mkdir -p backup/legacy-2025-10-16
# Move all current source code to backup
mv agent/ backup/legacy-2025-10-16/
mv dashboard/ backup/legacy-2025-10-16/
mv shared/ backup/legacy-2025-10-16/
# Preserve configuration examples
cp -r config/ backup/legacy-2025-10-16/
# Keep important documentation
cp CLAUDE.md backup/legacy-2025-10-16/CLAUDE-legacy.md
cp README.md backup/legacy-2025-10-16/README-legacy.md
```
**Reference Usage Rules:**
- Legacy code is **REFERENCE ONLY** - never copy/paste
- Study existing functionality and UI layout patterns
- Understand current widget behavior and status mapping
- Reference notification logic and email formatting
- NO legacy code in new implementation
### Phase 2: Clean Slate Implementation
**New Codebase Structure:**
Following ARCHITECT.md precisely with zero legacy dependencies:
```
cm-dashboard/            # New clean repository root
├── ARCHITECT.md         # Architecture documentation
├── CLAUDE.md            # This file (updated)
├── README.md            # New implementation documentation
├── Cargo.toml           # Workspace configuration
├── agent/               # New agent implementation
│   ├── Cargo.toml
│   └── src/ ...         (per ARCHITECT.md)
├── dashboard/           # New dashboard implementation
│   ├── Cargo.toml
│   └── src/ ...         (per ARCHITECT.md)
├── shared/              # New shared types
│   ├── Cargo.toml
│   └── src/ ...         (per ARCHITECT.md)
├── config/              # New configuration examples
└── backup/              # Legacy code for reference
    └── legacy-2025-10-16/
```
### Phase 3: Implementation Priorities
**Agent Implementation (Priority 1):**
1. Individual metrics collection system
2. ZMQ communication protocol
3. Basic collectors (CPU, memory, disk, services)
4. Status calculation and thresholds
5. Email notification system
**Dashboard Implementation (Priority 2):**
1. ZMQ metric consumer
2. Metric storage and subscription system
3. Base widget trait and framework
4. Core widgets (CPU, memory, storage, services)
5. Host management and navigation
**Testing & Integration (Priority 3):**
1. End-to-end metric flow validation
2. Multi-host connection testing
3. UI layout validation against legacy appearance
4. Performance benchmarking
## Project Goals (Updated)
### Core Objectives
- **Individual metric architecture** for maximum dashboard flexibility
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** preserving current UI layout
- **ZMQ-based communication** replacing HTTP API polling
### Key Features
- **Granular metric collection** (cpu_load_1min, memory_usage_percent, etc.)
- **Widget-based metric subscription** for flexible dashboard composition
- **Preserved UI layout** maintaining current visual design
- **Intelligent caching** for optimal performance
- **Auto-discovery** of services and system components
- **Email notifications** for status changes with rate limiting
- **Maintenance mode** integration for planned downtime
## New Technical Architecture
### Technology Stack (Updated)
- **Language**: Rust 🦀
- **Communication**: ZMQ (zeromq) for agent-dashboard messaging
- **TUI Framework**: ratatui (modern tui-rs fork)
- **Async Runtime**: tokio
- **Serialization**: serde (JSON for metrics)
- **CLI**: clap
- **Error Handling**: thiserror + anyhow
- **Time**: chrono
- **Email**: lettre (SMTP notifications)
### New Dependencies
```toml
# Workspace Cargo.toml
[workspace]
members = ["agent", "dashboard", "shared"]

# agent/Cargo.toml
[dependencies]
zmq = "0.10"                                                 # ZMQ communication
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }
lettre = { version = "0.11", features = ["smtp-transport"] }
gethostname = "0.4"

# dashboard/Cargo.toml
[dependencies]
ratatui = "0.24"
crossterm = "0.27"
zmq = "0.10"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
clap = { version = "4.0", features = ["derive"] }
thiserror = "1.0"
anyhow = "1.0"
chrono = { version = "0.4", features = ["serde"] }

# shared/Cargo.toml
[dependencies]
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
chrono = { version = "0.4", features = ["serde"] }
thiserror = "1.0"
```
## New Project Structure
**REFERENCE**: See ARCHITECT.md for complete folder structure specification.
**Current Status**: Legacy code preserved in `backup/legacy-2025-10-16/` for reference only.
**Implementation Progress**:
- [x] Architecture documentation (ARCHITECT.md)
- [x] Implementation strategy (CLAUDE.md updates)
- [ ] Legacy code backup
- [ ] New workspace setup
- [ ] Shared types implementation
- [ ] Agent implementation
- [ ] Dashboard implementation
- [ ] Integration testing
### New Individual Metrics Architecture
**REPLACED**: Legacy grouped structures (SmartMetrics, ServiceMetrics, etc.) are replaced with individual metrics.
**New Approach**: See ARCHITECT.md for individual metric definitions:
```rust
// Individual metrics examples:
"cpu_load_1min" -> 2.5
"cpu_temperature_celsius" -> 45.0
"memory_usage_percent" -> 78.5
"disk_nvme0_wear_percent" -> 12.3
"service_ssh_status" -> "active"
"backup_last_run_timestamp" -> 1697123456
```
**Shared Types**: Located in `shared/src/metrics.rs`:
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MetricValue {
    Float(f32),
    Integer(i64),
    String(String),
    Boolean(bool),
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}
```
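As a concrete illustration of the wire format, here is a minimal sketch (assuming the types above live in the `shared` crate as `shared::metrics`, and that JSON via serde_json is the ZMQ payload, per the technology stack) of one individual metric being built on the agent and serialized:
```rust
// Minimal sketch, assuming the shared types above are importable as shared::metrics.
use shared::metrics::{Metric, MetricValue, Status};

fn cpu_load_metric(load: f32) -> Metric {
    Metric {
        name: "cpu_load_1min".to_string(),
        value: MetricValue::Float(load),
        // Status is always assigned by the agent; the dashboard only displays it.
        status: Status::Ok,
        timestamp: chrono::Utc::now().timestamp() as u64,
        description: None,
        unit: None,
    }
}

fn main() -> anyhow::Result<()> {
    let metric = cpu_load_metric(2.5);
    // This JSON string is what would travel over the agent's ZMQ socket.
    println!("{}", serde_json::to_string(&metric)?);
    Ok(())
}
```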
## UI Layout Preservation
**CRITICAL**: The exact visual layout of the legacy dashboard is **PRESERVED** in the new implementation.
**Implementation Strategy**:
- New widgets subscribe to individual metrics but render identically
- Same positions, colors, borders, and keyboard shortcuts
- Enhanced with flexible metric composition under the hood
**Reference**: Legacy widgets in `backup/legacy-2025-10-16/dashboard/src/ui/` show exact rendering logic to replicate.
## Core Architecture Principles - CRITICAL
### Individual Metrics Philosophy
**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.
**Status Calculation**:
- Agent calculates status for each individual metric
- Agent sends individual metrics with status via ZMQ
- Dashboard aggregates metric statuses for widget-level status
- Dashboard NEVER calculates metric status - only displays and aggregates
**Data Flow Architecture:**
```
Agent (individual metrics + status) → ZMQ → Dashboard (subscribe + display) → Widgets (compose + render)
```
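A minimal sketch of the aggregation rule described above (the assumption here is that a widget's status is simply the worst status among its subscribed metrics, with `Unknown` ranked between `Ok` and `Warning`):
```rust
// Dashboard-side aggregation sketch: statuses come from the agent, never recalculated.
use shared::metrics::{Metric, Status};

// Ordering is an assumption: Unknown is treated as more severe than Ok.
fn severity(status: &Status) -> u8 {
    match status {
        Status::Ok => 0,
        Status::Unknown => 1,
        Status::Warning => 2,
        Status::Critical => 3,
    }
}

/// Widget status = worst agent-provided status among its subscribed metrics.
fn aggregate_status(metrics: &[Metric]) -> Status {
    metrics
        .iter()
        .map(|m| m.status.clone())
        .max_by_key(severity)
        .unwrap_or(Status::Unknown)
}
```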
### Migration from Legacy Architecture
**OLD (DEPRECATED)**:
```
Agent → ServiceMetrics{summary, services} → Dashboard → Widget
Agent → SmartMetrics{drives, summary} → Dashboard → Widget
```
**NEW (IMPLEMENTING)**:
```
Agent → ["cpu_load_1min", "memory_usage_percent", ...] → Dashboard → Widgets subscribe to needed metrics
```
### Current Agent Thresholds (as of 2025-10-12)
**CPU Load (service.rs:392-400):**
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)
**CPU Temperature (service.rs:412-420):**
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C
**Memory Usage (service.rs:402-410):**
- Warning: ≥ 80%
- Critical: ≥ 95%
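A minimal sketch of the comparison these thresholds imply (assuming a shared helper; the actual code in service.rs may be organized differently):
```rust
// Agent-side threshold check sketch; thresholds come from configuration.
use shared::metrics::Status;

fn status_from_thresholds(value: f32, warning: f32, critical: f32) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    // Memory usage example: warning at 80%, critical at 95%.
    assert!(matches!(status_from_thresholds(78.5, 80.0, 95.0), Status::Ok));
    assert!(matches!(status_from_thresholds(96.0, 80.0, 95.0), Status::Critical));
}
```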
### Email Notifications
**System Configuration:**
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- To: `cm@cmtec.se`
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)
**Notification Triggers:**
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
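A sketch of these trigger rules (assumptions: statuses are compared per component, and a single last-sent timestamp per component implements the rate limit; the agent's real bookkeeping may differ):
```rust
// Notification trigger sketch: degradation, recovery, and per-component rate limiting.
use shared::metrics::Status;
use std::collections::HashMap;
use std::mem::discriminant;
use std::time::{Duration, Instant};

fn is_alerting(status: &Status) -> bool {
    matches!(status, Status::Warning | Status::Critical)
}

struct Notifier {
    rate_limit: Duration, // Duration::ZERO for testing, 30 minutes in production
    last_sent: HashMap<String, Instant>,
}

impl Notifier {
    fn should_notify(&mut self, component: &str, old: &Status, new: &Status) -> bool {
        // No status change, no email.
        if discriminant(old) == discriminant(new) {
            return false;
        }
        let degraded = is_alerting(new);
        let recovered = is_alerting(old) && matches!(new, Status::Ok);
        if !(degraded || recovered) {
            return false;
        }
        // Rate limiting: skip if this component was notified too recently.
        if let Some(sent) = self.last_sent.get(component) {
            if sent.elapsed() < self.rate_limit {
                return false;
            }
        }
        self.last_sent.insert(component.to_string(), Instant::now());
        true
    }
}
```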
**Monitored Components:**
- system.cpu (load status) - SystemCollector
- system.memory (usage status) - SystemCollector
- system.cpu_temp (temperature status) - SystemCollector (disabled)
- system.services (service health status) - ServiceCollector
- storage.smart (drive health) - SmartCollector
- backup.overall (backup status) - BackupCollector
### Pure Auto-Discovery Implementation
**Agent Configuration:**
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: `cm-dashboard-agent [-v]` (intelligent caching enabled)
**Service Discovery:**
- Scans ALL systemd services (active, inactive, failed, dead, etc.) using list-unit-files and list-units --all
- Discovers both system services and user services per host:
  - steambox/cmbox: reads system + cm user services
  - simonbox: reads system + simon user services
- Filters by service_name_filters patterns (gitea, nginx, docker, sunshine, etc.), as sketched below
- Excludes maintenance services (docker-prune, sshd@, ark-permissions, etc.)
- No host-specific hardcoded service lists
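A sketch of that filtering, under the assumption of plain substring matching; the real service_name_filters patterns are configured in agent.toml and may be richer:
```rust
// Include/exclude filtering sketch for discovered systemd units.
fn service_is_monitored(unit: &str, include: &[&str], exclude: &[&str]) -> bool {
    // Exclusions win over inclusions (maintenance units are never monitored).
    if exclude.iter().any(|pattern| unit.contains(pattern)) {
        return false;
    }
    include.iter().any(|pattern| unit.contains(pattern))
}

fn main() {
    let include = ["gitea", "nginx", "docker", "sunshine"];
    let exclude = ["docker-prune", "sshd@", "ark-permissions"];
    assert!(service_is_monitored("gitea.service", &include, &exclude));
    assert!(!service_is_monitored("docker-prune.service", &include, &exclude));
}
```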
### Current Implementation Status
**Completed:**
- [x] Pure auto-discovery agent (no config files)
- [x] Agent-side status calculations with defined thresholds
- [x] Dashboard displays agent status (no dashboard calculations)
- [x] Email notifications with Stockholm timezone
- [x] CPU temperature monitoring and notifications
- [x] ZMQ message format standardization
- [x] Removed all hardcoded dashboard thresholds
- [x] CPU thresholds restored to production values (5.0/8.0)
- [x] All collectors output standardized status strings (ok/warning/critical/unknown)
- [x] Dashboard connection loss detection with 5-second keep-alive
- [x] Removed excessive logging from agent
- [x] Reduced the compiler warnings left over after the logging cleanup
- [x] **SystemCollector architecture refactoring completed (2025-10-12)**
- [x] Created SystemCollector for CPU load, memory, temperature, C-states
- [x] Moved system metrics from ServiceCollector to SystemCollector
- [x] Updated dashboard to parse and display SystemCollector data
- [x] Enhanced service notifications to include specific failure details
- [x] CPU temperature thresholds set to 100°C (effectively disabled)
- [x] **SystemCollector bug fixes completed (2025-10-12)**
- [x] Fixed CPU load parsing for comma decimal separator locale (", " split)
- [x] Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
- [x] Fixed C-state collection to discover all available states (including C10)
- [x] **Dashboard improvements and maintenance mode (2025-10-13)**
- [x] Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
- [x] Host navigation limited to connected hosts only (no disconnected host cycling)
- [x] Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
- [x] Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
- [x] Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
- [x] NixOS borgbackup integration with automatic maintenance mode during backups
- [x] System widget simplified to single row with C-states as description lines
- [x] CPU load thresholds updated to production values (9.0/10.0)
- [x] **Smart caching system implementation (2025-10-15)**
- [x] Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
- [x] Cache warming for instant dashboard startup responsiveness
- [x] Background refresh and proactive cache invalidation strategies
- [x] CPU usage optimization from 9.5% to <2% through smart polling reduction
- [x] Cache key consistency fixes for proper collector data flow
- [x] ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
- [x] Immich service quota detection fix (500GB instead of hardcoded 200GB)
- [x] Service-to-directory mapping for accurate disk usage calculation
- [x] **Real-time process monitoring implementation (2025-10-16)**
- [x] Fixed hardcoded top CPU/RAM process display with real data
- [x] Added top CPU and RAM process collection to CpuCollector
- [x] Implemented ps-based process monitoring with accurate percentages
- [x] Added intelligent filtering to avoid self-monitoring artifacts
- [x] Dashboard updated to display real-time top processes instead of placeholder text
- [x] Fixed disk metrics permission issues in systemd collector
- [x] Enhanced error logging for service directory access problems
- [x] Optimized service collection focusing on status, memory, and disk metrics only
- [x] **Comprehensive backup monitoring implementation (2025-10-18)**
- [x] Added BackupCollector for reading TOML status files with disk space metrics
- [x] Implemented BackupWidget with disk usage display and service status details
- [x] Fixed backup script disk space parsing by adding missing capture_output=True
- [x] Updated backup widget to show actual disk usage instead of repository size
- [x] Fixed timestamp parsing to use backup completion time instead of start time
- [x] Resolved timezone issues by using UTC timestamps in backup script
- [x] Added disk identification metrics (product name, serial number) to backup status
- [x] Enhanced UI layout with proper backup monitoring integration
- [x] **Complete warning elimination and code cleanup (2025-10-18)**
- [x] Removed all unused code including widget subscription system and WidgetType enum
- [x] Eliminated unused cache utilities, error variants, and theme functions
- [x] Removed unused struct fields and imports throughout codebase
- [x] Fixed lifetime warnings and replaced subscription-based widgets with direct metric filtering
- [x] Achieved zero build warnings in both agent and dashboard (down from 46 total warnings)
**Production Configuration:**
- CPU load thresholds: Warning 9.0, Critical 10.0
- CPU temperature thresholds: Warning 100°C, Critical 100°C (effectively disabled)
- Memory usage thresholds: Warning 80%, Critical 95%
- Connection timeout: 15 seconds (agents send data every 5 seconds)
- Email rate limiting: 30 minutes (set to 0 for testing)
### Maintenance Mode
**Purpose:**
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped
**Implementation:**
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while continuing monitoring
- Dashboard continues to show real status, only notifications are blocked
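A minimal sketch of the flag-file check (the path is documented above; exactly where the check sits in the agent's notification path is an assumption):
```rust
// Maintenance mode gate sketch: file presence suppresses email notifications only.
use std::path::Path;

fn maintenance_mode_active() -> bool {
    // Monitoring and the dashboard keep running; only email sending is skipped.
    Path::new("/tmp/cm-maintenance").exists()
}

fn main() {
    if maintenance_mode_active() {
        println!("maintenance mode: suppressing notifications");
    }
}
```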
**Usage:**
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
```
**NixOS Integration:**
- Borgbackup script automatically creates/removes maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick
### Configuration-Based Smart Caching System
**Purpose:**
- Reduce agent CPU usage from 10% to <1% through configuration-driven intelligent caching
- Maintain dashboard responsiveness with configurable refresh strategies
- Optimize for different data volatility characteristics via config files
**Configuration-Driven Architecture:**
```toml
# Cache tiers defined in agent.toml
[cache.tiers.realtime]
interval_seconds = 5
description = "High-frequency metrics (CPU load, memory usage)"
[cache.tiers.medium]
interval_seconds = 300
description = "Low-frequency metrics (service status, disk usage)"
[cache.tiers.slow]
interval_seconds = 900
description = "Very low-frequency metrics (SMART data, backup status)"
# Metric assignments via configuration
[cache.metric_assignments]
"cpu_load_*" = "realtime"
"service_*_disk_gb" = "medium"
"disk_*_temperature" = "slow"
```
**Implementation:**
- **ConfigurableCache**: Central cache manager reading tier config from files
- **MetricCacheManager**: Assigns metrics to tiers based on configuration patterns
- **TierScheduler**: Manages configurable tier-based refresh timing
- **Cache warming**: Parallel startup population for instant responsiveness
- **Background refresh**: Proactive updates based on configured intervals
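A sketch of how such `*` patterns could map metric names to tiers (the matcher below is an assumption for illustration; ConfigurableCache, MetricCacheManager, and TierScheduler themselves are only described, not shown, here):
```rust
// Pattern-based tier assignment sketch for [cache.metric_assignments].
fn wildcard_match(pattern: &str, name: &str) -> bool {
    // Tiny '*'-only matcher: literal segments must appear in order, the first
    // anchored at the start and the last at the end of the metric name.
    let parts: Vec<&str> = pattern.split('*').collect();
    let mut rest = name;
    for (i, part) in parts.iter().copied().enumerate() {
        if part.is_empty() {
            continue;
        }
        if i == 0 {
            if !rest.starts_with(part) {
                return false;
            }
            rest = &rest[part.len()..];
        } else if i == parts.len() - 1 {
            return rest.ends_with(part);
        } else {
            match rest.find(part) {
                Some(pos) => rest = &rest[pos + part.len()..],
                None => return false,
            }
        }
    }
    true
}

fn tier_for_metric(assignments: &[(&str, &str)], metric: &str) -> Option<String> {
    assignments
        .iter()
        .find(|(pattern, _)| wildcard_match(pattern, metric))
        .map(|(_, tier)| tier.to_string())
}

fn main() {
    // Patterns mirroring the [cache.metric_assignments] example above.
    let assignments = [
        ("cpu_load_*", "realtime"),
        ("service_*_disk_gb", "medium"),
        ("disk_*_temperature", "slow"),
    ];
    assert_eq!(tier_for_metric(&assignments, "cpu_load_1min").as_deref(), Some("realtime"));
    assert_eq!(tier_for_metric(&assignments, "disk_nvme0_temperature").as_deref(), Some("slow"));
}
```
First match wins in this sketch; overlapping patterns would need an explicit precedence rule in the real configuration.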
**Configuration:**
```toml
[cache]
enabled = true
default_ttl_seconds = 30
max_entries = 10000
warming_timeout_seconds = 3
background_refresh_enabled = true
cleanup_interval_seconds = 1800
```
**Performance Benefits:**
- CPU usage reduction: from 10% to <1% target through configuration optimization
- Configurable cache intervals prevent expensive operations from running too frequently
- Disk usage detection cached at 5-minute intervals instead of every 5 seconds
- Selective metric refresh based on configured volatility patterns
**Usage:**
```bash
# Start agent with config-based caching
cm-dashboard-agent --config /etc/cm-dashboard/agent.toml [-v]
```
**Architecture:**
- **Configuration-driven caching**: Tiered collection with configurable intervals
- **Config file management**: All cache behavior defined in TOML configuration
- **Responsive design**: Cache warming for instant dashboard startup
### New Implementation Guidelines - CRITICAL
**ARCHITECTURE ENFORCEMENT**:
- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
- **Individual metrics only** - NO grouped metric structures
- **Reference-only legacy** - Study old functionality, implement new architecture
- **Clean slate mindset** - Build as if legacy codebase never existed
**Implementation Rules**:
1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
5. **ZMQ Communication**: All metrics transmitted via ZMQ, no HTTP APIs
**When Adding New Metrics**:
1. Define metric name in shared registry (e.g., "disk_nvme1_temperature_celsius")
2. Implement collector that returns individual Metric struct
3. Agent calculates status using configured thresholds
4. Dashboard widgets subscribe to metric by name
5. Notification system automatically detects status changes
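A hypothetical sketch of steps 1-3 (the `Collector` trait name, the collector struct, and its thresholds are illustrative assumptions; only the metric name and the agent-status rule come from this document):
```rust
// Collector sketch for a new individual metric; trait and struct names are assumptions.
use shared::metrics::{Metric, MetricValue, Status};

trait Collector {
    /// Every collector returns fully formed individual metrics with the
    /// status already calculated on the agent.
    fn collect(&self) -> Vec<Metric>;
}

struct DiskTemperatureCollector {
    warning_celsius: f32,
    critical_celsius: f32,
}

impl Collector for DiskTemperatureCollector {
    fn collect(&self) -> Vec<Metric> {
        let temp = 41.0_f32; // placeholder; a real collector would read SMART/hwmon data
        let status = if temp >= self.critical_celsius {
            Status::Critical
        } else if temp >= self.warning_celsius {
            Status::Warning
        } else {
            Status::Ok
        };
        vec![Metric {
            name: "disk_nvme1_temperature_celsius".to_string(),
            value: MetricValue::Float(temp),
            status,
            timestamp: chrono::Utc::now().timestamp() as u64,
            description: None,
            unit: Some("celsius".to_string()),
        }]
    }
}
```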
**Testing & Building**:
- **Workspace builds**: `cargo build --workspace` for all testing
- **Clean compilation**: Remove `target/` between architecture changes
- **ZMQ testing**: Test agent-dashboard communication independently
- **Widget testing**: Verify UI layout matches legacy appearance exactly
**NEVER in New Implementation**:
- Copy/paste ANY code from legacy backup
- Create grouped metric structures (SystemMetrics, etc.)
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Skip individual metric architecture for "simplicity"
**Legacy Reference Usage**:
- Study UI layout and rendering logic only
- Understand email notification formatting
- Reference status color mapping
- Learn host navigation patterns
- NO code copying or structural influence
## Important Communication Guidelines
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.
## Commit Message Guidelines
**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
**Bad examples (never use):**
- "Generated with Claude Code"
- "AI-assisted implementation"
- "Automated refactoring"
**Good examples:**
- "Implement maintenance mode for backup operations"
- "Restructure storage widget with improved layout"
- "Update CPU thresholds to production values"
## NixOS Configuration Updates
When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosbox` must be updated to deploy the changes.
### Update Process
1. **Get Latest Commit Hash**
```bash
git log -1 --format="%H"
```
2. **Update NixOS Configuration**
Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:
```nix
src = pkgs.fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
};
```
3. **Get Correct Source Hash**
Build with placeholder hash to get the actual hash:
```bash
cd ~/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
Example output:
```
error: hash mismatch in fixed-output derivation '/nix/store/...':
specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
got: sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
```
4. **Update Configuration with Correct Hash**
Replace the placeholder with the hash from the error message (the "got:" line).
5. **Commit NixOS Configuration**
```bash
cd ~/nixosbox
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
git push
```
6. **Rebuild System**
The user handles the system rebuild step - this cannot be automated.