Updated documentation
parent
14aae90954
commit
245e546f18
530
CLAUDE.md
@ -4,455 +4,29 @@
|
|||||||
|
|
||||||
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
|
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
|
||||||
|
|
||||||
## CRITICAL: Architecture Redesign in Progress
|
|
||||||
|
|
||||||
**LEGACY CODE DEPRECATION**: The current codebase is being completely rewritten with a new individual metrics architecture. ALL existing code will be moved to a backup folder for reference only.
|
|
||||||
|
|
||||||
**NEW IMPLEMENTATION STRATEGY**:
|
|
||||||
- **NO legacy code reuse** - Fresh implementation following ARCHITECT.md
|
|
||||||
- **Clean slate approach** - Build entirely new codebase structure
|
|
||||||
- **Reference-only legacy** - Current code preserved only for functionality reference
|
|
||||||
|
|
||||||
## Implementation Strategy
|
## Implementation Strategy
|
||||||
|
|
||||||
### Phase 1: Legacy Code Backup (IMMEDIATE)
|
|
||||||
|
|
||||||
**Backup Current Implementation:**
|
|
||||||
```bash
|
|
||||||
# Create backup folder for reference
|
|
||||||
mkdir -p backup/legacy-2025-10-16
|
|
||||||
|
|
||||||
# Move all current source code to backup
|
|
||||||
mv agent/ backup/legacy-2025-10-16/
|
|
||||||
mv dashboard/ backup/legacy-2025-10-16/
|
|
||||||
mv shared/ backup/legacy-2025-10-16/
|
|
||||||
|
|
||||||
# Preserve configuration examples
|
|
||||||
cp -r config/ backup/legacy-2025-10-16/
|
|
||||||
|
|
||||||
# Keep important documentation
|
|
||||||
cp CLAUDE.md backup/legacy-2025-10-16/CLAUDE-legacy.md
|
|
||||||
cp README.md backup/legacy-2025-10-16/README-legacy.md
|
|
||||||
```
|
|
||||||
|
|
||||||
**Reference Usage Rules:**
|
|
||||||
- Legacy code is **REFERENCE ONLY** - never copy/paste
|
|
||||||
- Study existing functionality and UI layout patterns
|
|
||||||
- Understand current widget behavior and status mapping
|
|
||||||
- Reference notification logic and email formatting
|
|
||||||
- NO legacy code in new implementation
|
|
||||||
|
|
||||||
### Phase 2: Clean Slate Implementation
|
|
||||||
|
|
||||||
**New Codebase Structure:**
|
|
||||||
Following ARCHITECT.md precisely with zero legacy dependencies:
|
|
||||||
|
|
||||||
```
|
|
||||||
cm-dashboard/ # New clean repository root
|
|
||||||
├── ARCHITECT.md # Architecture documentation
|
|
||||||
├── CLAUDE.md # This file (updated)
|
|
||||||
├── README.md # New implementation documentation
|
|
||||||
├── Cargo.toml # Workspace configuration
|
|
||||||
├── agent/ # New agent implementation
|
|
||||||
│ ├── Cargo.toml
|
|
||||||
│ └── src/ ... (per ARCHITECT.md)
|
|
||||||
├── dashboard/ # New dashboard implementation
|
|
||||||
│ ├── Cargo.toml
|
|
||||||
│ └── src/ ... (per ARCHITECT.md)
|
|
||||||
├── shared/ # New shared types
|
|
||||||
│ ├── Cargo.toml
|
|
||||||
│ └── src/ ... (per ARCHITECT.md)
|
|
||||||
├── config/ # New configuration examples
|
|
||||||
└── backup/ # Legacy code for reference
|
|
||||||
└── legacy-2025-10-16/
|
|
||||||
```
|
|
||||||
|
|
||||||
### Phase 3: Implementation Priorities
|
|
||||||
|
|
||||||
**Agent Implementation (Priority 1):**
|
|
||||||
1. Individual metrics collection system
|
|
||||||
2. ZMQ communication protocol
|
|
||||||
3. Basic collectors (CPU, memory, disk, services)
|
|
||||||
4. Status calculation and thresholds
|
|
||||||
5. Email notification system
|
|
||||||
|
|
||||||
**Dashboard Implementation (Priority 2):**
|
|
||||||
1. ZMQ metric consumer
|
|
||||||
2. Metric storage and subscription system
|
|
||||||
3. Base widget trait and framework
|
|
||||||
4. Core widgets (CPU, memory, storage, services)
|
|
||||||
5. Host management and navigation
|
|
||||||
|
|
||||||
**Testing & Integration (Priority 3):**
|
|
||||||
1. End-to-end metric flow validation
|
|
||||||
2. Multi-host connection testing
|
|
||||||
3. UI layout validation against legacy appearance
|
|
||||||
4. Performance benchmarking
|
|
||||||
|
|
||||||
## Project Goals (Updated)
|
|
||||||
|
|
||||||
### Core Objectives
|
|
||||||
|
|
||||||
- **Individual metric architecture** for maximum dashboard flexibility
|
|
||||||
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
|
|
||||||
- **Performance-focused** with minimal resource usage
|
|
||||||
- **Keyboard-driven interface** preserving current UI layout
|
|
||||||
- **ZMQ-based communication** replacing HTTP API polling
|
|
||||||
|
|
||||||
### Key Features
|
|
||||||
|
|
||||||
- **Granular metric collection** (cpu_load_1min, memory_usage_percent, etc.)
|
|
||||||
- **Widget-based metric subscription** for flexible dashboard composition
|
|
||||||
- **Preserved UI layout** maintaining current visual design
|
|
||||||
- **Intelligent caching** for optimal performance
|
|
||||||
- **Auto-discovery** of services and system components
|
|
||||||
- **Email notifications** for status changes with rate limiting
|
|
||||||
- **Maintenance mode** integration for planned downtime
|
|
||||||
|
|
||||||
## New Technical Architecture
|
|
||||||
|
|
||||||
### Technology Stack (Updated)
|
|
||||||
|
|
||||||
- **Language**: Rust 🦀
|
|
||||||
- **Communication**: ZMQ (zeromq) for agent-dashboard messaging
|
|
||||||
- **TUI Framework**: ratatui (modern tui-rs fork)
|
|
||||||
- **Async Runtime**: tokio
|
|
||||||
- **Serialization**: serde (JSON for metrics)
|
|
||||||
- **CLI**: clap
|
|
||||||
- **Error Handling**: thiserror + anyhow
|
|
||||||
- **Time**: chrono
|
|
||||||
- **Email**: lettre (SMTP notifications)
|
|
||||||
|
|
||||||
### New Dependencies
|
|
||||||
|
|
||||||
```toml
|
|
||||||
# Workspace Cargo.toml
|
|
||||||
[workspace]
|
|
||||||
members = ["agent", "dashboard", "shared"]
|
|
||||||
|
|
||||||
# Agent dependencies
|
|
||||||
[dependencies.agent]
|
|
||||||
zmq = "0.10" # ZMQ communication
|
|
||||||
serde = { version = "1.0", features = ["derive"] }
|
|
||||||
serde_json = "1.0"
|
|
||||||
tokio = { version = "1.0", features = ["full"] }
|
|
||||||
clap = { version = "4.0", features = ["derive"] }
|
|
||||||
thiserror = "1.0"
|
|
||||||
anyhow = "1.0"
|
|
||||||
chrono = { version = "0.4", features = ["serde"] }
|
|
||||||
lettre = { version = "0.11", features = ["smtp-transport"] }
|
|
||||||
gethostname = "0.4"
|
|
||||||
|
|
||||||
# Dashboard dependencies
|
|
||||||
[dependencies.dashboard]
|
|
||||||
ratatui = "0.24"
|
|
||||||
crossterm = "0.27"
|
|
||||||
zmq = "0.10"
|
|
||||||
serde = { version = "1.0", features = ["derive"] }
|
|
||||||
serde_json = "1.0"
|
|
||||||
tokio = { version = "1.0", features = ["full"] }
|
|
||||||
clap = { version = "4.0", features = ["derive"] }
|
|
||||||
thiserror = "1.0"
|
|
||||||
anyhow = "1.0"
|
|
||||||
chrono = { version = "0.4", features = ["serde"] }
|
|
||||||
|
|
||||||
# Shared dependencies
|
|
||||||
[dependencies.shared]
|
|
||||||
serde = { version = "1.0", features = ["derive"] }
|
|
||||||
serde_json = "1.0"
|
|
||||||
chrono = { version = "0.4", features = ["serde"] }
|
|
||||||
thiserror = "1.0"
|
|
||||||
```
|
|
||||||
|
|
||||||
## New Project Structure
|
|
||||||
|
|
||||||
**REFERENCE**: See ARCHITECT.md for complete folder structure specification.
|
|
||||||
|
|
||||||
**Current Status**: All configuration moved to NixOS declarative management. Zero hardcoded defaults remain.
|
|
||||||
|
|
||||||
**Implementation Progress**:
|
|
||||||
- [x] Architecture documentation (ARCHITECT.md)
|
|
||||||
- [x] Implementation strategy (CLAUDE.md updates)
|
|
||||||
- [x] Configuration migration to NixOS completed
|
|
||||||
- [x] Hardcoded defaults removal (347 lines removed)
|
|
||||||
- [x] NixOS module with comprehensive configuration generation
|
|
||||||
- [x] Host-specific filesystem configurations
|
|
||||||
- [x] Service include/exclude patterns in NixOS
|
|
||||||
- [x] Live testing and validation on production systems
|
|
||||||
|
|
||||||
### New Individual Metrics Architecture
|
|
||||||
|
|
||||||
**REPLACED**: Legacy grouped structures (SmartMetrics, ServiceMetrics, etc.) are replaced with individual metrics.
|
|
||||||
|
|
||||||
**New Approach**: See ARCHITECT.md for individual metric definitions:
|
|
||||||
|
|
||||||
```rust
|
|
||||||
// Individual metrics examples:
|
|
||||||
"cpu_load_1min" -> 2.5
|
|
||||||
"cpu_temperature_celsius" -> 45.0
|
|
||||||
"memory_usage_percent" -> 78.5
|
|
||||||
"disk_nvme0_wear_percent" -> 12.3
|
|
||||||
"service_ssh_status" -> "active"
|
|
||||||
"backup_last_run_timestamp" -> 1697123456
|
|
||||||
```
|
|
||||||
|
|
||||||
**Shared Types**: Located in `shared/src/metrics.rs`:
|
|
||||||
|
|
||||||
```rust
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
||||||
pub struct Metric {
|
|
||||||
pub name: String,
|
|
||||||
pub value: MetricValue,
|
|
||||||
pub status: Status,
|
|
||||||
pub timestamp: u64,
|
|
||||||
pub description: Option<String>,
|
|
||||||
pub unit: Option<String>,
|
|
||||||
}
|
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
||||||
pub enum MetricValue {
|
|
||||||
Float(f32),
|
|
||||||
Integer(i64),
|
|
||||||
String(String),
|
|
||||||
Boolean(bool),
|
|
||||||
}
|
|
||||||
|
|
||||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
||||||
pub enum Status {
|
|
||||||
Ok,
|
|
||||||
Warning,
|
|
||||||
Critical,
|
|
||||||
Unknown,
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## UI Layout Preservation
|
|
||||||
|
|
||||||
**CRITICAL**: The exact visual layout shown above is **PRESERVED** in the new implementation.
|
|
||||||
|
|
||||||
**Implementation Strategy**:
|
|
||||||
- New widgets subscribe to individual metrics but render identically
|
|
||||||
- Same positions, colors, borders, and keyboard shortcuts
|
|
||||||
- Enhanced with flexible metric composition under the hood
|
|
||||||
|
|
||||||
**Reference**: Legacy widgets in `backup/legacy-2025-10-16/dashboard/src/ui/` show exact rendering logic to replicate.
|
|
||||||
|
|
||||||
## Core Architecture Principles - CRITICAL
|
## Core Architecture Principles - CRITICAL
|
||||||
|
|
||||||
### Individual Metrics Philosophy
|
### Individual Metrics Philosophy
|
||||||
|
|
||||||
**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.
|
**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.
|
||||||
|
|
||||||
**Status Calculation**:
|
|
||||||
- Agent calculates status for each individual metric
|
|
||||||
- Agent sends individual metrics with status via ZMQ
|
|
||||||
- Dashboard aggregates metric statuses for widget-level status
|
|
||||||
- Dashboard NEVER calculates metric status - only displays and aggregates
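The aggregation rule above can be sketched as follows. This is a minimal illustration, not the project's code: the `Status` enum mirrors the one shown for `shared/src/metrics.rs`, while `severity`, `aggregate_status`, the severity ordering, and the choice to report an empty widget as `Unknown` are assumptions.

```rust
/// Mirrors the `Status` enum from the shared types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Hypothetical severity ranking (assumption): Ok < Unknown < Warning < Critical.
fn severity(status: Status) -> u8 {
    match status {
        Status::Ok => 0,
        Status::Unknown => 1,
        Status::Warning => 2,
        Status::Critical => 3,
    }
}

/// Widget-level status is the worst agent-reported status among its metrics.
/// No thresholds are evaluated here; the dashboard only aggregates.
pub fn aggregate_status(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .max_by_key(|s| severity(*s))
        .unwrap_or(Status::Unknown) // assumption: no metrics means Unknown
}
```

A widget would pass in the statuses of the metrics it displays and colour its border or title from the result.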
|
|
||||||
|
|
||||||
**Data Flow Architecture:**
|
|
||||||
```
|
|
||||||
Agent (individual metrics + status) → ZMQ → Dashboard (subscribe + display) → Widgets (compose + render)
|
|
||||||
```
|
|
||||||
|
|
||||||
### Migration from Legacy Architecture
|
|
||||||
|
|
||||||
**OLD (DEPRECATED)**:
|
|
||||||
```
|
|
||||||
Agent → ServiceMetrics{summary, services} → Dashboard → Widget
|
|
||||||
Agent → SmartMetrics{drives, summary} → Dashboard → Widget
|
|
||||||
```
|
|
||||||
|
|
||||||
**NEW (IMPLEMENTING)**:
|
|
||||||
```
|
|
||||||
Agent → ["cpu_load_1min", "memory_usage_percent", ...] → Dashboard → Widgets subscribe to needed metrics
|
|
||||||
```
|
|
||||||
|
|
||||||
### Current Agent Thresholds (as of 2025-10-12)
|
|
||||||
|
|
||||||
**CPU Load (service.rs:392-400):**
|
|
||||||
- Warning: ≥ 2.0 (testing value, was 5.0)
|
|
||||||
- Critical: ≥ 4.0 (testing value, was 8.0)
|
|
||||||
|
|
||||||
**CPU Temperature (service.rs:412-420):**
|
|
||||||
- Warning: ≥ 70.0°C
|
|
||||||
- Critical: ≥ 80.0°C
|
|
||||||
|
|
||||||
**Memory Usage (service.rs:402-410):**
|
|
||||||
- Warning: ≥ 80%
|
|
||||||
- Critical: ≥ 95%
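As a rough sketch of how the agent might turn a raw value into a status using "≥" thresholds like those listed above: the `Thresholds` struct and `evaluate` function are hypothetical names, and mapping a NaN reading to `Unknown` is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Hypothetical threshold pair; the field names are illustrative.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub warning: f32,
    pub critical: f32,
}

/// Compare a reading against its thresholds, checking the critical bound first
/// so that a value above both bounds is reported as Critical.
pub fn evaluate(value: f32, thresholds: Thresholds) -> Status {
    if value.is_nan() {
        Status::Unknown // assumption: an unreadable value has no status
    } else if value >= thresholds.critical {
        Status::Critical
    } else if value >= thresholds.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With the memory thresholds above (`warning: 80.0`, `critical: 95.0`), a reading of 78.5 would be `Ok` and a reading of exactly 80.0 would already be `Warning`.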
|
|
||||||
|
|
||||||
### Email Notifications
|
|
||||||
|
|
||||||
**System Configuration:**
|
|
||||||
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
|
|
||||||
- To: `cm@cmtec.se`
|
|
||||||
- SMTP: localhost:25 (postfix)
|
|
||||||
- Timezone: Europe/Stockholm (not UTC)
|
|
||||||
|
|
||||||
**Notification Triggers:**
|
|
||||||
- Status degradation: any → "warning" or "critical"
|
|
||||||
- Recovery: "warning"/"critical" → "ok"
|
|
||||||
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
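The trigger rules above could be expressed roughly as below. This is a sketch under assumptions: `Notifier` and `should_notify` are hypothetical names, an unchanged status is assumed not to trigger an email, and the rate limit is assumed to apply to recoveries as well as degradations.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Hypothetical per-component notification state.
pub struct Notifier {
    rate_limit: Duration,
    last_sent: Option<Instant>,
}

impl Notifier {
    pub fn new(rate_limit: Duration) -> Self {
        Notifier { rate_limit, last_sent: None }
    }

    /// Returns true when the transition `old -> new` should produce an email.
    pub fn should_notify(&mut self, old: Status, new: Status, now: Instant) -> bool {
        // Degradation: any different status -> warning or critical.
        let degraded = old != new && matches!(new, Status::Warning | Status::Critical);
        // Recovery: warning or critical -> ok.
        let recovered = matches!(old, Status::Warning | Status::Critical) && new == Status::Ok;
        if !(degraded || recovered) {
            return false;
        }
        // Suppress the email if the previous one was sent too recently.
        if let Some(last) = self.last_sent {
            if now.duration_since(last) < self.rate_limit {
                return false;
            }
        }
        self.last_sent = Some(now);
        true
    }
}
```

A rate limit of `Duration::from_secs(0)` never suppresses anything, which matches the "set to 0 for testing" setting; production would use `Duration::from_secs(30 * 60)`.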
|
|
||||||
|
|
||||||
**Monitored Components:**
|
|
||||||
- system.cpu (load status) - SystemCollector
|
|
||||||
- system.memory (usage status) - SystemCollector
|
|
||||||
- system.cpu_temp (temperature status) - SystemCollector (disabled)
|
|
||||||
- system.services (service health status) - ServiceCollector
|
|
||||||
- storage.smart (drive health) - SmartCollector
|
|
||||||
- backup.overall (backup status) - BackupCollector
|
|
||||||
|
|
||||||
### Pure Auto-Discovery Implementation
|
|
||||||
|
|
||||||
**Agent Configuration:**
|
|
||||||
- No config files required
|
|
||||||
- Auto-detects storage devices, services, backup systems
|
|
||||||
- Runtime discovery of system capabilities
|
|
||||||
- CLI: `cm-dashboard-agent [-v]` (intelligent caching enabled)
|
|
||||||
|
|
||||||
**Service Discovery:**
|
|
||||||
- Scans ALL systemd services (active, inactive, failed, dead, etc.) using list-unit-files and list-units --all
|
|
||||||
- Discovers both system services and user services per host:
|
|
||||||
- steambox/cmbox: reads system + cm user services
|
|
||||||
- simonbox: reads system + simon user services
|
|
||||||
- Filters by service_name_filters patterns (gitea, nginx, docker, sunshine, etc.)
|
|
||||||
- Excludes maintenance services (docker-prune, sshd@, ark-permissions, etc.)
|
|
||||||
- No host-specific hardcoded service lists
|
|
||||||
|
|
||||||
### Current Implementation Status
|
|
||||||
|
|
||||||
**Completed:**
|
|
||||||
- [x] Pure auto-discovery agent (no config files)
|
|
||||||
- [x] Agent-side status calculations with defined thresholds
|
|
||||||
- [x] Dashboard displays agent status (no dashboard calculations)
|
|
||||||
- [x] Email notifications with Stockholm timezone
|
|
||||||
- [x] CPU temperature monitoring and notifications
|
|
||||||
- [x] ZMQ message format standardization
|
|
||||||
- [x] Removed all hardcoded dashboard thresholds
|
|
||||||
- [x] CPU thresholds restored to production values (5.0/8.0)
|
|
||||||
- [x] All collectors output standardized status strings (ok/warning/critical/unknown)
|
|
||||||
- [x] Dashboard connection loss detection with 5-second keep-alive
|
|
||||||
- [x] Removed excessive logging from agent
|
|
||||||
- [x] Reduced initial compiler warnings from excessive logging cleanup
|
|
||||||
- [x] **SystemCollector architecture refactoring completed (2025-10-12)**
|
|
||||||
- [x] Created SystemCollector for CPU load, memory, temperature, C-states
|
|
||||||
- [x] Moved system metrics from ServiceCollector to SystemCollector
|
|
||||||
- [x] Updated dashboard to parse and display SystemCollector data
|
|
||||||
- [x] Enhanced service notifications to include specific failure details
|
|
||||||
- [x] CPU temperature thresholds set to 100°C (effectively disabled)
|
|
||||||
- [x] **SystemCollector bug fixes completed (2025-10-12)**
|
|
||||||
- [x] Fixed CPU load parsing for comma decimal separator locale (", " split)
|
|
||||||
- [x] Fixed CPU temperature to prioritize x86_pkg_temp over generic thermal zones
|
|
||||||
- [x] Fixed C-state collection to discover all available states (including C10)
|
|
||||||
- [x] **Dashboard improvements and maintenance mode (2025-10-13)**
|
|
||||||
- [x] Host auto-discovery with predefined CMTEC infrastructure hosts (cmbox, labbox, simonbox, steambox, srv01)
|
|
||||||
- [x] Host navigation limited to connected hosts only (no disconnected host cycling)
|
|
||||||
- [x] Storage widget restructured: Name/Temp/Wear/Usage columns with SMART details as descriptions
|
|
||||||
- [x] Agent-provided descriptions for Storage widget (agent is source of truth for formatting)
|
|
||||||
- [x] Maintenance mode implementation: /tmp/cm-maintenance file suppresses notifications
|
|
||||||
- [x] NixOS borgbackup integration with automatic maintenance mode during backups
|
|
||||||
- [x] System widget simplified to single row with C-states as description lines
|
|
||||||
- [x] CPU load thresholds updated to production values (9.0/10.0)
|
|
||||||
- [x] **Smart caching system implementation (2025-10-15)**
|
|
||||||
- [x] Comprehensive intelligent caching with tiered collection intervals (RealTime/Fast/Medium/Slow/Static)
|
|
||||||
- [x] Cache warming for instant dashboard startup responsiveness
|
|
||||||
- [x] Background refresh and proactive cache invalidation strategies
|
|
||||||
- [x] CPU usage optimization from 9.5% to <2% through smart polling reduction
|
|
||||||
- [x] Cache key consistency fixes for proper collector data flow
|
|
||||||
- [x] ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
|
|
||||||
- [x] Immich service quota detection fix (500GB instead of hardcoded 200GB)
|
|
||||||
- [x] Service-to-directory mapping for accurate disk usage calculation
|
|
||||||
- [x] **Real-time process monitoring implementation (2025-10-16)**
|
|
||||||
- [x] Fixed hardcoded top CPU/RAM process display with real data
|
|
||||||
- [x] Added top CPU and RAM process collection to CpuCollector
|
|
||||||
- [x] Implemented ps-based process monitoring with accurate percentages
|
|
||||||
- [x] Added intelligent filtering to avoid self-monitoring artifacts
|
|
||||||
- [x] Dashboard updated to display real-time top processes instead of placeholder text
|
|
||||||
- [x] Fixed disk metrics permission issues in systemd collector
|
|
||||||
- [x] Enhanced error logging for service directory access problems
|
|
||||||
- [x] Optimized service collection focusing on status, memory, and disk metrics only
|
|
||||||
- [x] **Comprehensive backup monitoring implementation (2025-10-18)**
|
|
||||||
- [x] Added BackupCollector for reading TOML status files with disk space metrics
|
|
||||||
- [x] Implemented BackupWidget with disk usage display and service status details
|
|
||||||
- [x] Fixed backup script disk space parsing by adding missing capture_output=True
|
|
||||||
- [x] Updated backup widget to show actual disk usage instead of repository size
|
|
||||||
- [x] Fixed timestamp parsing to use backup completion time instead of start time
|
|
||||||
- [x] Resolved timezone issues by using UTC timestamps in backup script
|
|
||||||
- [x] Added disk identification metrics (product name, serial number) to backup status
|
|
||||||
- [x] Enhanced UI layout with proper backup monitoring integration
|
|
||||||
- [x] **Complete warning elimination and code cleanup (2025-10-18)**
|
|
||||||
- [x] Removed all unused code including widget subscription system and WidgetType enum
|
|
||||||
- [x] Eliminated unused cache utilities, error variants, and theme functions
|
|
||||||
- [x] Removed unused struct fields and imports throughout codebase
|
|
||||||
- [x] Fixed lifetime warnings and replaced subscription-based widgets with direct metric filtering
|
|
||||||
- [x] Achieved zero build warnings in both agent and dashboard (down from 46 total warnings)
|
|
||||||
- [x] **Complete NixOS configuration migration (2025-10-20)**
|
|
||||||
- [x] Removed all hardcoded defaults from agent (347 lines eliminated)
|
|
||||||
- [x] Created comprehensive NixOS module for declarative configuration management
|
|
||||||
- [x] Added complete agent.toml generation with all settings (thresholds, intervals, cache, notifications)
|
|
||||||
- [x] Implemented host-specific filesystem configurations for all CMTEC infrastructure
|
|
||||||
- [x] Added service include/exclude patterns to NixOS configuration
|
|
||||||
- [x] Made configuration file required for agent startup (fails fast if missing)
|
|
||||||
- [x] Live tested and validated on production systems
|
|
||||||
- [x] Eliminated configuration drift between defaults and deployed settings
|
|
||||||
- [x] All cm-dashboard configuration now managed declaratively through NixOS
|
|
||||||
|
|
||||||
**In Progress:**
|
|
||||||
- [ ] **Storage Widget Tree Structure Implementation (2025-10-22)**
|
|
||||||
- [ ] Replace flat storage display with proper tree structure format
|
|
||||||
- [ ] Implement themed status icons for pool/drive/usage status
|
|
||||||
- [ ] Add tree symbols (├─, └─) with proper indentation for hierarchical display
|
|
||||||
- [ ] Support T: and W: prefixes for temperature and wear metrics
|
|
||||||
- [ ] Use agent-calculated status from NixOS-configured thresholds (no dashboard calculations)
|
|
||||||
|
|
||||||
### Storage Widget Tree Structure Specification
|
|
||||||
|
|
||||||
**Target Display Format:**
|
|
||||||
```
|
|
||||||
● Storage steampool (Raid0):
|
|
||||||
├─ ● sdb T:35°C W:12%
|
|
||||||
├─ ● sdc T:38°C W:8%
|
|
||||||
└─ ● 78.1% 1250.3GB/1600.0GB
|
|
||||||
```
|
|
||||||
|
|
||||||
**Status Icon Sources:**
|
|
||||||
- **Pool Status**: Aggregated status from pool health + usage (`disk_{pool}_usage_percent` metric status)
|
|
||||||
- **Drive Status**: Individual SMART health status (`disk_{pool}_{drive}_health` metric status)
|
|
||||||
- **Usage Status**: Disk usage level status (`disk_{pool}_usage_percent` metric status)
|
|
||||||
|
|
||||||
**Implementation Details:**
|
|
||||||
- **Tree Symbols**: `├─` for intermediate lines, `└─` for final line
|
|
||||||
- **Indentation**: 2 spaces before tree symbols
|
|
||||||
- **Status Icons**: Use `StatusIcons::get_icon(status)` with themed colors
|
|
||||||
- **Temperature Format**: `T:{temp}°C` from `disk_{pool}_{drive}_temperature` metrics
|
|
||||||
- **Wear Format**: `W:{wear}%` from `disk_{pool}_{drive}_wear_percent` metrics
|
|
||||||
- **Pool Type**: Determine from drive count (Single/multi-drive) or RAID type detection
|
|
||||||
- **Status Calculation**: Dashboard displays agent-calculated status, no threshold evaluation
|
|
||||||
|
|
||||||
**Layout Constraints:**
|
|
||||||
- Dynamic: 1 header + N drives + 1 usage line per storage pool
|
|
||||||
- Supports multiple storage pools with proper spacing
|
|
||||||
- Truncation indicator if pools exceed available display space
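The "1 header + N drives + 1 usage line" layout could be produced along these lines. This is only a sketch: `Pool`, `Drive`, and `render_pool` are hypothetical names, the status icon is hardcoded to `●` rather than obtained from `StatusIcons::get_icon(status)`, and colours and truncation are omitted.

```rust
/// Hypothetical per-drive data, as it might be assembled from individual metrics.
pub struct Drive {
    pub name: String,
    pub temp_c: f32,
    pub wear_percent: f32,
}

/// Hypothetical per-pool data.
pub struct Pool {
    pub name: String,
    pub kind: String,
    pub drives: Vec<Drive>,
    pub usage_percent: f32,
    pub used_gb: f32,
    pub total_gb: f32,
}

/// Render one pool as text lines: header, one line per drive, then the usage line.
pub fn render_pool(pool: &Pool) -> Vec<String> {
    let mut lines = vec![format!("● Storage {} ({}):", pool.name, pool.kind)];
    for drive in &pool.drives {
        // Drive lines are intermediate, so they use the `├─` symbol.
        lines.push(format!(
            "  ├─ ● {} T:{:.0}°C W:{:.0}%",
            drive.name, drive.temp_c, drive.wear_percent
        ));
    }
    // The usage line is always last, so it uses the `└─` symbol.
    lines.push(format!(
        "  └─ ● {:.1}% {:.1}GB/{:.1}GB",
        pool.usage_percent, pool.used_gb, pool.total_gb
    ));
    lines
}
```

Because the usage line always closes the tree, the drive loop never needs to know which drive is last.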
|
|
||||||
|
|
||||||
**Production Configuration:**
|
|
||||||
- CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
|
|
||||||
- CPU temperature thresholds: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
|
|
||||||
- Memory usage thresholds: Warning ≥ 80%, Critical ≥ 95%
|
|
||||||
- Connection timeout: 15 seconds (agents send data every 5 seconds)
|
|
||||||
- Email rate limiting: 30 minutes (set to 0 for testing)
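The connection timeout above (15 seconds against a 5-second send interval, so roughly three missed messages) might be tracked per host as in this sketch. `HostConnection` and its methods are hypothetical names, and treating a gap of exactly the timeout as still connected is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-host connection tracker on the dashboard side.
pub struct HostConnection {
    timeout: Duration,
    last_seen: Option<Instant>,
}

impl HostConnection {
    pub fn new(timeout: Duration) -> Self {
        HostConnection { timeout, last_seen: None }
    }

    /// Call whenever a message from this host's agent arrives.
    pub fn record_message(&mut self, now: Instant) {
        self.last_seen = Some(now);
    }

    /// A host counts as connected if a message arrived within the timeout window.
    /// A host that has never sent anything is treated as disconnected.
    pub fn is_connected(&self, now: Instant) -> bool {
        match self.last_seen {
            Some(seen) => now.duration_since(seen) <= self.timeout,
            None => false,
        }
    }
}
```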
|
|
||||||
|
|
||||||
### Maintenance Mode
|
### Maintenance Mode
|
||||||
|
|
||||||
**Purpose:**
|
**Purpose:**
|
||||||
|
|
||||||
- Suppress email notifications during planned maintenance or backups
|
- Suppress email notifications during planned maintenance or backups
|
||||||
- Prevents false alerts when services are intentionally stopped
|
- Prevents false alerts when services are intentionally stopped
|
||||||
|
|
||||||
**Implementation:**
|
**Implementation:**
|
||||||
|
|
||||||
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
|
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
|
||||||
- File presence suppresses all email notifications while continuing monitoring
|
- File presence suppresses all email notifications while continuing monitoring
|
||||||
- Dashboard continues to show real status, only notifications are blocked
|
- Dashboard continues to show real status, only notifications are blocked
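A minimal sketch of the marker-file check described above. Only the path `/tmp/cm-maintenance` comes from the text; `default_marker`, `maintenance_active`, and `should_send_email` are hypothetical helper names.

```rust
use std::path::Path;

/// Marker file path as given in the text above.
const MAINTENANCE_FILE: &str = "/tmp/cm-maintenance";

/// Default marker location used by the agent (hypothetical helper).
pub fn default_marker() -> &'static Path {
    Path::new(MAINTENANCE_FILE)
}

/// Maintenance mode is active whenever the marker file exists.
pub fn maintenance_active(marker: &Path) -> bool {
    marker.exists()
}

/// Monitoring continues regardless; only the email is gated on the marker.
pub fn should_send_email(status_changed: bool, marker: &Path) -> bool {
    status_changed && !maintenance_active(marker)
}
```

Taking the marker path as a parameter keeps the check testable without touching the real `/tmp/cm-maintenance` file.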
|
||||||
|
|
||||||
**Usage:**
|
**Usage:**
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Enable maintenance mode
|
# Enable maintenance mode
|
||||||
touch /tmp/cm-maintenance
|
touch /tmp/cm-maintenance
|
||||||
@ -467,114 +41,36 @@ rm /tmp/cm-maintenance
|
|||||||
```
|
```
|
||||||
|
|
||||||
**NixOS Integration:**
|
**NixOS Integration:**
|
||||||
|
|
||||||
- Borgbackup script automatically creates/removes maintenance file
|
- Borgbackup script automatically creates/removes maintenance file
|
||||||
- Automatic cleanup via trap ensures maintenance mode doesn't stick
|
- Automatic cleanup via trap ensures maintenance mode doesn't stick
|
||||||
|
- All configuration shall be done from the NixOS config
|
||||||
### Configuration-Based Smart Caching System
|
|
||||||
|
|
||||||
**Purpose:**
|
|
||||||
- Reduce agent CPU usage from 10% to <1% through configuration-driven intelligent caching
|
|
||||||
- Maintain dashboard responsiveness with configurable refresh strategies
|
|
||||||
- Optimize for different data volatility characteristics via config files
|
|
||||||
|
|
||||||
**Configuration-Driven Architecture:**
|
|
||||||
```toml
|
|
||||||
# Cache tiers defined in agent.toml
|
|
||||||
[cache.tiers.realtime]
|
|
||||||
interval_seconds = 5
|
|
||||||
description = "High-frequency metrics (CPU load, memory usage)"
|
|
||||||
|
|
||||||
[cache.tiers.medium]
|
|
||||||
interval_seconds = 300
|
|
||||||
description = "Low-frequency metrics (service status, disk usage)"
|
|
||||||
|
|
||||||
[cache.tiers.slow]
|
|
||||||
interval_seconds = 900
|
|
||||||
description = "Very low-frequency metrics (SMART data, backup status)"
|
|
||||||
|
|
||||||
# Metric assignments via configuration
|
|
||||||
[cache.metric_assignments]
|
|
||||||
"cpu_load_*" = "realtime"
|
|
||||||
"service_*_disk_gb" = "medium"
|
|
||||||
"disk_*_temperature" = "slow"
|
|
||||||
```
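To illustrate how `[cache.metric_assignments]` patterns such as `"cpu_load_*"` might be resolved to a tier, here is a small sketch. It assumes that `*` is the only wildcard and that the first matching pattern wins; `glob_match` and `tier_for` are hypothetical names, not the MetricCacheManager API.

```rust
/// Minimal `*` wildcard matcher (no `?`, no character classes).
pub fn glob_match(pattern: &str, name: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        return pattern == name; // no wildcard: exact match only
    }
    let first = parts[0];
    let last = parts[parts.len() - 1];
    // The literal prefix and suffix must both fit without overlapping.
    if name.len() < first.len() + last.len()
        || !name.starts_with(first)
        || !name.ends_with(last)
    {
        return false;
    }
    // Any literals between wildcards must appear in order in the remainder.
    let mut rest = &name[first.len()..name.len() - last.len()];
    for &part in &parts[1..parts.len() - 1] {
        match rest.find(part) {
            Some(idx) => rest = &rest[idx + part.len()..],
            None => return false,
        }
    }
    true
}

/// Return the tier of the first pattern that matches the metric name.
pub fn tier_for<'a>(assignments: &[(&'a str, &'a str)], metric: &str) -> Option<&'a str> {
    assignments
        .iter()
        .find(|(pattern, _)| glob_match(pattern, metric))
        .map(|(_, tier)| *tier)
}
```

With the assignments from the configuration above, `cpu_load_1min` would resolve to `realtime` and `service_gitea_disk_gb` to `medium`; a metric that matches no pattern returns `None`, leaving the fallback policy to the caller.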
|
|
||||||
|
|
||||||
**Implementation:**
|
|
||||||
- **ConfigurableCache**: Central cache manager reading tier config from files
|
|
||||||
- **MetricCacheManager**: Assigns metrics to tiers based on configuration patterns
|
|
||||||
- **TierScheduler**: Manages configurable tier-based refresh timing
|
|
||||||
- **Cache warming**: Parallel startup population for instant responsiveness
|
|
||||||
- **Background refresh**: Proactive updates based on configured intervals
|
|
||||||
|
|
||||||
**Configuration:**
|
|
||||||
```toml
|
|
||||||
[cache]
|
|
||||||
enabled = true
|
|
||||||
default_ttl_seconds = 30
|
|
||||||
max_entries = 10000
|
|
||||||
warming_timeout_seconds = 3
|
|
||||||
background_refresh_enabled = true
|
|
||||||
cleanup_interval_seconds = 1800
|
|
||||||
```
|
|
||||||
|
|
||||||
**Performance Benefits:**
|
|
||||||
- CPU usage reduction: 10% → <1% target through configuration optimization
|
|
||||||
- Configurable cache intervals prevent expensive operations from running too frequently
|
|
||||||
- Disk usage detection cached at 5-minute intervals instead of every 5 seconds
|
|
||||||
- Selective metric refresh based on configured volatility patterns
|
|
||||||
|
|
||||||
**Usage:**
|
|
||||||
```bash
|
|
||||||
# Start agent with config-based caching
|
|
||||||
cm-dashboard-agent --config /etc/cm-dashboard/agent.toml [-v]
|
|
||||||
```
|
|
||||||
|
|
||||||
**Architecture:**
|
|
||||||
- **Configuration-driven caching**: Tiered collection with configurable intervals
|
|
||||||
- **Config file management**: All cache behavior defined in TOML configuration
|
|
||||||
- **Responsive design**: Cache warming for instant dashboard startup
|
|
||||||
|
|
||||||
### New Implementation Guidelines - CRITICAL
|
|
||||||
|
|
||||||
**ARCHITECTURE ENFORCEMENT**:
|
**ARCHITECTURE ENFORCEMENT**:
|
||||||
|
|
||||||
- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
|
- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
|
||||||
- **Individual metrics only** - NO grouped metric structures
|
- **Individual metrics only** - NO grouped metric structures
|
||||||
- **Reference-only legacy** - Study old functionality, implement new architecture
|
- **Reference-only legacy** - Study old functionality, implement new architecture
|
||||||
- **Clean slate mindset** - Build as if legacy codebase never existed
|
- **Clean slate mindset** - Build as if legacy codebase never existed
|
||||||
|
|
||||||
**Implementation Rules**:
|
**Implementation Rules**:
|
||||||
|
|
||||||
1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
|
1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
|
||||||
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
|
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
|
||||||
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
|
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
|
||||||
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
|
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
|
||||||
5. **ZMQ Communication**: All metrics transmitted via ZMQ, no HTTP APIs
|
**Testing & Building**:
|
||||||
|
|
||||||
**When Adding New Metrics**:
|
|
||||||
1. Define metric name in shared registry (e.g., "disk_nvme1_temperature_celsius")
|
|
||||||
2. Implement collector that returns individual Metric struct
|
|
||||||
3. Agent calculates status using configured thresholds
|
|
||||||
4. Dashboard widgets subscribe to metric by name
|
|
||||||
5. Notification system automatically detects status changes
|
|
||||||
|
|
||||||
**Testing & Building**:
|
|
||||||
- **Workspace builds**: `cargo build --workspace` for all testing
|
- **Workspace builds**: `cargo build --workspace` for all testing
|
||||||
- **Clean compilation**: Remove `target/` between architecture changes
|
- **Clean compilation**: Remove `target/` between architecture changes
|
||||||
- **ZMQ testing**: Test agent-dashboard communication independently
|
- **ZMQ testing**: Test agent-dashboard communication independently
|
||||||
- **Widget testing**: Verify UI layout matches legacy appearance exactly
|
- **Widget testing**: Verify UI layout matches legacy appearance exactly
|
||||||
|
|
||||||
**NEVER in New Implementation**:
|
**NEVER in New Implementation**:
|
||||||
|
|
||||||
- Copy/paste ANY code from legacy backup
|
- Copy/paste ANY code from legacy backup
|
||||||
- Create grouped metric structures (SystemMetrics, etc.)
|
|
||||||
- Calculate status in dashboard widgets
|
- Calculate status in dashboard widgets
|
||||||
- Hardcode metric names in widgets (use const arrays)
|
- Hardcode metric names in widgets (use const arrays)
|
||||||
- Skip individual metric architecture for "simplicity"
|
|
||||||
|
|
||||||
**Legacy Reference Usage**:
|
|
||||||
- Study UI layout and rendering logic only
|
|
||||||
- Understand email notification formatting
|
|
||||||
- Reference status color mapping
|
|
||||||
- Learn host navigation patterns
|
|
||||||
- NO code copying or structural influence
|
|
||||||
|
|
||||||
# Important Communication Guidelines
|
# Important Communication Guidelines
|
||||||
|
|
||||||
@ -585,17 +81,20 @@ NEVER implement code without first getting explicit user agreement on the approa
|
|||||||
## Commit Message Guidelines
|
## Commit Message Guidelines
|
||||||
|
|
||||||
**NEVER mention:**
|
**NEVER mention:**
|
||||||
|
|
||||||
- Claude or any AI assistant names
|
- Claude or any AI assistant names
|
||||||
- Automation or AI-generated content
|
- Automation or AI-generated content
|
||||||
- Any reference to automated code generation
|
- Any reference to automated code generation
|
||||||
|
|
||||||
**ALWAYS:**
|
**ALWAYS:**
|
||||||
|
|
||||||
- Focus purely on technical changes and their purpose
|
- Focus purely on technical changes and their purpose
|
||||||
- Use standard software development commit message format
|
- Use standard software development commit message format
|
||||||
- Describe what was changed and why, not how it was created
|
- Describe what was changed and why, not how it was created
|
||||||
- Write from the perspective of a human developer
|
- Write from the perspective of a human developer
|
||||||
|
|
||||||
**Examples:**
|
**Examples:**
|
||||||
|
|
||||||
- ❌ "Generated with Claude Code"
|
- ❌ "Generated with Claude Code"
|
||||||
- ❌ "AI-assisted implementation"
|
- ❌ "AI-assisted implementation"
|
||||||
- ❌ "Automated refactoring"
|
- ❌ "Automated refactoring"
|
||||||
@ -610,12 +109,14 @@ When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosb
|
|||||||
### Update Process
|
### Update Process
|
||||||
|
|
||||||
1. **Get Latest Commit Hash**
|
1. **Get Latest Commit Hash**
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git log -1 --format="%H"
|
git log -1 --format="%H"
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **Update NixOS Configuration**
|
2. **Update NixOS Configuration**
|
||||||
Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:
|
Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:
|
||||||
|
|
||||||
```nix
|
```nix
|
||||||
src = pkgs.fetchgit {
|
src = pkgs.fetchgit {
|
||||||
url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
|
url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
|
||||||
@ -626,6 +127,7 @@ When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosb
|
|||||||
|
|
||||||
3. **Get Correct Source Hash**
|
3. **Get Correct Source Hash**
|
||||||
Build with placeholder hash to get the actual hash:
|
Build with placeholder hash to get the actual hash:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd ~/nixosbox
|
cd ~/nixosbox
|
||||||
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
|
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
|
||||||
@ -636,6 +138,7 @@ When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosb
|
|||||||
```
|
```
|
||||||
|
|
||||||
Example output:
|
Example output:
|
||||||
|
|
||||||
```
|
```
|
||||||
error: hash mismatch in fixed-output derivation '/nix/store/...':
|
error: hash mismatch in fixed-output derivation '/nix/store/...':
|
||||||
specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
|
specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
|
||||||
@ -646,6 +149,7 @@ When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosb
|
|||||||
Replace the placeholder with the hash from the error message (the "got:" line).
|
Replace the placeholder with the hash from the error message (the "got:" line).
|
||||||
|
|
||||||
5. **Commit NixOS Configuration**
|
5. **Commit NixOS Configuration**
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
cd ~/nixosbox
|
cd ~/nixosbox
|
||||||
git add hosts/common/cm-dashboard.nix
|
git add hosts/common/cm-dashboard.nix
|
||||||
|
|||||||
17
TODO.md
Normal file
@ -0,0 +1,17 @@
# TODO

## Show logged-in users (agent/dashboard)

- Add support for showing logged-in users

## Keyboard navigation and scrolling (dashboard)

- Change the host-switching keybinding to "Shift-Tab"
- Add keyboard navigation between panels with "Tab"
- Add scrolling support when text does not fit

## Remote execution (agent/dashboard)

- Add a lower status bar whose shortcuts update dynamically when switching between panels
- Add support for sending a command from the dashboard to the agent to run a NixOS rebuild
- Add support for navigating services in the dashboard and triggering start/stop/restart