Implement real-time process monitoring and fix UI hardcoded data
This commit addresses several key issues identified during development:

Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using the ps command
- Fix disk metrics permission issues in the systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information

Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (ps commands)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"

Service Collection Optimization:
- Removed CPU monitoring from the systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified the services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy

UI and Dashboard Improvements:
- Reorganized dashboard layout with a btop-inspired multi-panel design
- Updated the system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface

Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details
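For orientation, a minimal sketch of the ps-based approach described above, assuming GNU ps. `TopProcess`, the argument list, and the parsing are illustrative stand-ins, not the actual code in agent/src/collectors/cpu.rs:

```rust
use std::process::Command;

/// The most CPU-hungry process reported by `ps`.
#[derive(Debug)]
pub struct TopProcess {
    pub pid: u32,
    pub name: String,
    pub cpu_percent: f32,
}

/// Run `ps` sorted by CPU usage and return the first row that is not the
/// `ps` invocation itself (the self-monitoring artifact filter).
pub fn collect_top_cpu_process() -> Option<TopProcess> {
    let output = Command::new("ps")
        .args(["-eo", "pid,comm,%cpu", "--sort=-%cpu", "--no-headers"])
        .output()
        .ok()?;
    let stdout = String::from_utf8_lossy(&output.stdout);
    for line in stdout.lines() {
        // Columns: PID, command name (assumed space-free here), CPU percent.
        let cols: Vec<&str> = line.split_whitespace().collect();
        if cols.len() < 3 || cols[1] == "ps" {
            continue;
        }
        let pid = cols[0].parse().ok()?;
        let cpu_percent = cols[2].parse().ok()?;
        return Some(TopProcess { pid, name: cols[1].to_string(), cpu_percent });
    }
    None
}
```

A `collect_top_ram_process()` counterpart would run the same query sorted by `-%mem`; the dashboard then formats the winner as, for example, "claude (PID 2974) 11.0%".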
CLAUDE.md
@@ -2,207 +2,270 @@
 ## Overview

-A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and API integrations.
+A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.

-## Project Goals
+## CRITICAL: Architecture Redesign in Progress
+
+**LEGACY CODE DEPRECATION**: The current codebase is being completely rewritten with a new individual metrics architecture. ALL existing code will be moved to a backup folder for reference only.
+
+**NEW IMPLEMENTATION STRATEGY**:
+- **NO legacy code reuse** - Fresh implementation following ARCHITECT.md
+- **Clean slate approach** - Build entirely new codebase structure
+- **Reference-only legacy** - Current code preserved only for functionality reference
+
+## Implementation Strategy
+
+### Phase 1: Legacy Code Backup (IMMEDIATE)
+
+**Backup Current Implementation:**
+```bash
+# Create backup folder for reference
+mkdir -p backup/legacy-2025-10-16
+
+# Move all current source code to backup
+mv agent/ backup/legacy-2025-10-16/
+mv dashboard/ backup/legacy-2025-10-16/
+mv shared/ backup/legacy-2025-10-16/
+
+# Preserve configuration examples
+cp -r config/ backup/legacy-2025-10-16/
+
+# Keep important documentation
+cp CLAUDE.md backup/legacy-2025-10-16/CLAUDE-legacy.md
+cp README.md backup/legacy-2025-10-16/README-legacy.md
+```
+**Reference Usage Rules:**
+- Legacy code is **REFERENCE ONLY** - never copy/paste
+- Study existing functionality and UI layout patterns
+- Understand current widget behavior and status mapping
+- Reference notification logic and email formatting
+- NO legacy code in new implementation
+
+### Phase 2: Clean Slate Implementation
+
+**New Codebase Structure:**
+Following ARCHITECT.md precisely with zero legacy dependencies:
+
+```
+cm-dashboard/          # New clean repository root
+├── ARCHITECT.md       # Architecture documentation
+├── CLAUDE.md          # This file (updated)
+├── README.md          # New implementation documentation
+├── Cargo.toml         # Workspace configuration
+├── agent/             # New agent implementation
+│   ├── Cargo.toml
+│   └── src/ ... (per ARCHITECT.md)
+├── dashboard/         # New dashboard implementation
+│   ├── Cargo.toml
+│   └── src/ ... (per ARCHITECT.md)
+├── shared/            # New shared types
+│   ├── Cargo.toml
+│   └── src/ ... (per ARCHITECT.md)
+├── config/            # New configuration examples
+└── backup/            # Legacy code for reference
+    └── legacy-2025-10-16/
+```
+### Phase 3: Implementation Priorities
+
+**Agent Implementation (Priority 1):**
+1. Individual metrics collection system
+2. ZMQ communication protocol
+3. Basic collectors (CPU, memory, disk, services)
+4. Status calculation and thresholds
+5. Email notification system
+
+**Dashboard Implementation (Priority 2):**
+1. ZMQ metric consumer
+2. Metric storage and subscription system
+3. Base widget trait and framework
+4. Core widgets (CPU, memory, storage, services)
+5. Host management and navigation
+
+**Testing & Integration (Priority 3):**
+1. End-to-end metric flow validation
+2. Multi-host connection testing
+3. UI layout validation against legacy appearance
+4. Performance benchmarking
+## Project Goals (Updated)

 ### Core Objectives

 - **Real-time monitoring** of all infrastructure components
+- **Individual metric architecture** for maximum dashboard flexibility
 - **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
 - **Performance-focused** with minimal resource usage
-- **Keyboard-driven interface** for power users
-- **Integration** with existing monitoring APIs (ports 6127, 6128, 6129)
+- **Keyboard-driven interface** preserving current UI layout
+- **ZMQ-based communication** replacing HTTP API polling

 ### Key Features

 - **NVMe health monitoring** with wear prediction
 - **CPU / memory / GPU telemetry** with automatic thresholding
 - **Service resource monitoring** with per-service CPU and RAM usage
 - **Disk usage overview** for root filesystems
 - **Backup status** with detailed metrics and history
 - **Unified alert pipeline** summarising host health
 - **Historical data tracking** and trend analysis
+- **Granular metric collection** (cpu_load_1min, memory_usage_percent, etc.)
+- **Widget-based metric subscription** for flexible dashboard composition
+- **Preserved UI layout** maintaining current visual design
 - **Intelligent caching** for optimal performance
 - **Auto-discovery** of services and system components
 - **Email notifications** for status changes with rate limiting
 - **Maintenance mode** integration for planned downtime
-## Technical Architecture
+## New Technical Architecture

-### Technology Stack
+### Technology Stack (Updated)

 - **Language**: Rust 🦀
+- **Communication**: ZMQ (zeromq) for agent-dashboard messaging
 - **TUI Framework**: ratatui (modern tui-rs fork)
 - **Async Runtime**: tokio
-- **HTTP Client**: reqwest
-- **Serialization**: serde
+- **Serialization**: serde (JSON for metrics)
 - **CLI**: clap
-- **Error Handling**: anyhow
+- **Error Handling**: thiserror + anyhow
 - **Time**: chrono
 - **Email**: lettre (SMTP notifications)
-### Dependencies
+### New Dependencies

 ```toml
-[dependencies]
-ratatui = "0.24"       # Modern TUI framework
-crossterm = "0.27"     # Cross-platform terminal handling
-tokio = { version = "1.0", features = ["full"] }     # Async runtime
-reqwest = { version = "0.11", features = ["json"] }  # HTTP client
-serde = { version = "1.0", features = ["derive"] }   # JSON parsing
-clap = { version = "4.0", features = ["derive"] }    # CLI args
-anyhow = "1.0"         # Error handling
-chrono = "0.4"         # Time handling
+# Workspace Cargo.toml
+[workspace]
+members = ["agent", "dashboard", "shared"]
+
+# Agent dependencies
+[dependencies.agent]
+zmq = "0.10"           # ZMQ communication
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+tokio = { version = "1.0", features = ["full"] }
+clap = { version = "4.0", features = ["derive"] }
+thiserror = "1.0"
+anyhow = "1.0"
+chrono = { version = "0.4", features = ["serde"] }
+lettre = { version = "0.11", features = ["smtp-transport"] }
+gethostname = "0.4"
+
+# Dashboard dependencies
+[dependencies.dashboard]
+ratatui = "0.24"
+crossterm = "0.27"
+zmq = "0.10"
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+tokio = { version = "1.0", features = ["full"] }
+clap = { version = "4.0", features = ["derive"] }
+thiserror = "1.0"
+anyhow = "1.0"
+chrono = { version = "0.4", features = ["serde"] }
+
+# Shared dependencies
+[dependencies.shared]
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+chrono = { version = "0.4", features = ["serde"] }
+thiserror = "1.0"
 ```
-## Project Structure
+## New Project Structure

-```
-cm-dashboard/
-├── Cargo.toml
-├── README.md
-├── CLAUDE.md            # This file
-├── src/
-│   ├── main.rs          # Entry point & CLI
-│   ├── app.rs           # Main application state
-│   ├── ui/
-│   │   ├── mod.rs
-│   │   ├── dashboard.rs # Main dashboard layout
-│   │   ├── nvme.rs      # NVMe health widget
-│   │   ├── services.rs  # Services status widget
-│   │   ├── memory.rs    # RAM optimization widget
-│   │   ├── backup.rs    # Backup status widget
-│   │   └── alerts.rs    # Alerts/notifications widget
-│   ├── api/
-│   │   ├── mod.rs
-│   │   ├── client.rs    # HTTP client wrapper
-│   │   ├── smart.rs     # Smart metrics API (port 6127)
-│   │   ├── service.rs   # Service metrics API (port 6128)
-│   │   └── backup.rs    # Backup metrics API (port 6129)
-│   ├── data/
-│   │   ├── mod.rs
-│   │   ├── metrics.rs   # Data structures
-│   │   ├── history.rs   # Historical data storage
-│   │   └── config.rs    # Host configuration
-│   └── config.rs        # Application configuration
-├── config/
-│   ├── hosts.toml       # Host definitions
-│   └── dashboard.toml   # Dashboard layout config
-└── docs/
-    ├── API.md           # API integration documentation
-    └── WIDGETS.md       # Widget development guide
-```
+**REFERENCE**: See ARCHITECT.md for complete folder structure specification.
-### Data Structures
+**Current Status**: Legacy code preserved in `backup/legacy-2025-10-16/` for reference only.
+
+**Implementation Progress**:
+- [x] Architecture documentation (ARCHITECT.md)
+- [x] Implementation strategy (CLAUDE.md updates)
+- [ ] Legacy code backup
+- [ ] New workspace setup
+- [ ] Shared types implementation
+- [ ] Agent implementation
+- [ ] Dashboard implementation
+- [ ] Integration testing
+
+### New Individual Metrics Architecture
+
+**REPLACED**: Legacy grouped structures (SmartMetrics, ServiceMetrics, etc.) are replaced with individual metrics.
+
+**New Approach**: See ARCHITECT.md for individual metric definitions:

 ```rust
-#[derive(Deserialize, Debug)]
-pub struct SmartMetrics {
-    pub status: String,
-    pub drives: Vec<DriveInfo>,
-    pub summary: DriveSummary,
-    pub issues: Vec<String>,
-}
+// Individual metrics examples:
+"cpu_load_1min" -> 2.5
+"cpu_temperature_celsius" -> 45.0
+"memory_usage_percent" -> 78.5
+"disk_nvme0_wear_percent" -> 12.3
+"service_ssh_status" -> "active"
+"backup_last_run_timestamp" -> 1697123456
 ```
+**Shared Types**: Located in `shared/src/metrics.rs`:

 ```rust
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct Metric {
+    pub name: String,
+    pub value: MetricValue,
+    pub status: Status,
+    pub timestamp: u64,
+    pub description: Option<String>,
+    pub unit: Option<String>,
+}

-#[derive(Deserialize, Debug)]
-pub struct ServiceMetrics {
-    pub summary: ServiceSummary,
-    pub services: Vec<ServiceInfo>,
-    pub timestamp: u64,
-}
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub enum MetricValue {
+    Float(f32),
+    Integer(i64),
+    String(String),
+    Boolean(bool),
+}

-#[derive(Deserialize, Debug)]
-pub struct ServiceSummary {
-    pub healthy: usize,
-    pub degraded: usize,
-    pub failed: usize,
-    pub memory_used_mb: f32,
-    pub memory_quota_mb: f32,
-    pub system_memory_used_mb: f32,
-    pub system_memory_total_mb: f32,
-    pub disk_used_gb: f32,
-    pub disk_total_gb: f32,
-    pub cpu_load_1: f32,
-    pub cpu_load_5: f32,
-    pub cpu_load_15: f32,
-    pub cpu_freq_mhz: Option<f32>,
-    pub cpu_temp_c: Option<f32>,
-    pub gpu_load_percent: Option<f32>,
-    pub gpu_temp_c: Option<f32>,
-}

-#[derive(Deserialize, Debug)]
-pub struct BackupMetrics {
-    pub overall_status: String,
-    pub backup: BackupInfo,
-    pub service: BackupServiceInfo,
-    pub timestamp: u64,
-}
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub enum Status {
+    Ok,
+    Warning,
+    Critical,
+    Unknown,
+}
 ```
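To make the shared types concrete, here is a sketch of how an agent collector might assemble and serialize one metric for the wire. The helper is hypothetical; the thresholds mirror the production CPU load values quoted later in this document (warning ≥ 9.0, critical ≥ 10.0):

```rust
use std::time::{SystemTime, UNIX_EPOCH};
// Metric, MetricValue, and Status as defined in shared/src/metrics.rs above.

fn cpu_load_metric(load_1min: f32) -> Metric {
    // The agent is the status authority: thresholds live here, never in widgets.
    let status = if load_1min >= 10.0 {
        Status::Critical
    } else if load_1min >= 9.0 {
        Status::Warning
    } else {
        Status::Ok
    };
    Metric {
        name: "cpu_load_1min".to_string(),
        value: MetricValue::Float(load_1min),
        status,
        timestamp: SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before UNIX epoch")
            .as_secs(),
        description: Some("1-minute load average".to_string()),
        unit: None,
    }
}

// On the wire this is plain serde JSON:
// let payload = serde_json::to_string(&cpu_load_metric(2.18))?;
```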
-## Dashboard Layout Design
+## UI Layout Preservation

-### Main Dashboard View
+**CRITICAL**: The exact visual layout shown below is **PRESERVED** in the new implementation.

 ```
 ┌─────────────────────────────────────────────────────────────────────┐
 │ CM Dashboard • cmbox │
 ├─────────────────────────────────────────────────────────────────────┤
 │ Storage • ok:1 warn:0 crit:0 │ Services • ok:1 warn:0 fail:0 │
 │ ┌─────────────────────────────────┐ │ ┌─────────────────────────────── │ │
 │ │Drive Temp Wear Spare Hours │ │ │Service memory: 7.1/23899.7 MiB│ │
 │ │nvme0n1 28°C 1% 100% 14489 │ │ │Disk usage: — │ │
 │ │ Capacity Usage │ │ │ Service Memory Disk │ │
 │ │ 954G 77G (8%) │ │ │✔ sshd 7.1 MiB — │ │
 │ └─────────────────────────────────┘ │ └─────────────────────────────── │ │
 ├─────────────────────────────────────────────────────────────────────┤
 │ CPU / Memory • warn │ Backups │
 │ System memory: 5251.7/23899.7 MiB │ Host cmbox awaiting backup │ │
 │ CPU load (1/5/15): 2.18 2.66 2.56 │ metrics │ │
 │ CPU freq: 1100.1 MHz │ │ │
 │ CPU temp: 47.0°C │ │ │
 ├─────────────────────────────────────────────────────────────────────┤
 │ Alerts • ok:0 warn:3 fail:0 │ Status • ZMQ connected │
 │ cmbox: warning: CPU load 2.18 │ Monitoring • hosts: 3 │ │
 │ srv01: pending: awaiting metrics │ Data source: ZMQ – connected │ │
 │ labbox: pending: awaiting metrics │ Active host: cmbox (1/3) │ │
 └─────────────────────────────────────────────────────────────────────┘
 Keys: [←→] hosts [r]efresh [q]uit
 ```

+**Implementation Strategy**:
+- New widgets subscribe to individual metrics but render identically
+- Same positions, colors, borders, and keyboard shortcuts
+- Enhanced with flexible metric composition under the hood (see the status-aggregation sketch below)
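As an illustration of that composition, a sketch of how a dashboard widget could subscribe to metrics and derive its widget-level status. The metric list, `widget_status`, and the severity ordering are assumptions, not the shipped widget code:

```rust
use std::collections::HashMap;
// Metric and Status from shared/src/metrics.rs.

/// Subscribed metric names kept in a const array instead of strings
/// scattered through render code.
const CPU_MEMORY_METRICS: &[&str] = &[
    "cpu_load_1min",
    "memory_usage_percent",
];

/// Widget status = worst status among subscribed metrics; a metric that is
/// missing from the store counts as Unknown, never silently Ok.
fn widget_status(store: &HashMap<String, Metric>) -> Status {
    CPU_MEMORY_METRICS
        .iter()
        .map(|name| {
            store
                .get(*name)
                .map(|metric| metric.status.clone())
                .unwrap_or(Status::Unknown)
        })
        .fold(Status::Ok, worst_of)
}

/// Severity ordering is a design choice here: Critical > Warning > Unknown > Ok.
fn worst_of(a: Status, b: Status) -> Status {
    fn rank(s: &Status) -> u8 {
        match s {
            Status::Ok => 0,
            Status::Unknown => 1,
            Status::Warning => 2,
            Status::Critical => 3,
        }
    }
    if rank(&b) > rank(&a) { b } else { a }
}
```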
-### Multi-Host View
+**Reference**: Legacy widgets in `backup/legacy-2025-10-16/dashboard/src/ui/` show exact rendering logic to replicate.

-```
-┌─────────────────────────────────────────────────────────────────────┐
-│ 🖥️ CMTEC Host Overview │
-├─────────────────────────────────────────────────────────────────────┤
-│ Host │ NVMe Wear │ RAM Usage │ Services │ Last Alert │
-├─────────────────────────────────────────────────────────────────────┤
-│ srv01 │ 4% ✅ │ 32% ✅ │ 8/8 ✅ │ 04:00 Backup OK │
-│ cmbox │ 12% ✅ │ 45% ✅ │ 3/3 ✅ │ Yesterday Email test │
-│ labbox │ 8% ✅ │ 28% ✅ │ 2/2 ✅ │ 2h ago NVMe temp OK │
-│ simonbox │ 15% ✅ │ 67% ⚠️ │ 4/4 ✅ │ Gaming session active │
-│ steambox │ 23% ✅ │ 78% ⚠️ │ 2/2 ✅ │ High RAM usage │
-└─────────────────────────────────────────────────────────────────────┘
-Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
-```
-## Core Architecture Principles - CRITICAL
+## Architecture Principles - CRITICAL
+
+### Individual Metrics Philosophy

-### Agent-Dashboard Separation of Concerns
+**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.

-**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**
-- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
-- Agent sends status to dashboard via ZMQ
-- Dashboard NEVER calculates status - only displays what agent provides
+**Status Calculation**:
+- Agent calculates status for each individual metric
+- Agent sends individual metrics with status via ZMQ
+- Dashboard aggregates metric statuses for widget-level status
+- Dashboard NEVER calculates metric status - only displays and aggregates

 **Data Flow Architecture:**
 ```
-Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
+Agent (individual metrics + status) → ZMQ → Dashboard (subscribe + display) → Widgets (compose + render)
 ```
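A minimal sketch of that flow with the `zmq` crate. The port, endpoint addresses, and one-JSON-message-per-metric framing are assumptions; ARCHITECT.md defines the actual protocol:

```rust
// Agent side: publish each metric as a standalone JSON message.
fn publish_metrics(metrics: &[Metric]) -> anyhow::Result<()> {
    let ctx = zmq::Context::new();
    let publisher = ctx.socket(zmq::PUB)?;
    publisher.bind("tcp://0.0.0.0:5556")?; // port chosen for illustration
    for metric in metrics {
        let payload = serde_json::to_string(metric)?;
        publisher.send(payload.as_bytes(), 0)?;
    }
    Ok(())
}

// Dashboard side: subscribe to everything; widgets filter by metric name.
fn consume_metrics() -> anyhow::Result<()> {
    let ctx = zmq::Context::new();
    let subscriber = ctx.socket(zmq::SUB)?;
    subscriber.connect("tcp://cmbox:5556")?;
    subscriber.set_subscribe(b"")?;
    loop {
        if let Ok(text) = subscriber.recv_string(0)? {
            let metric: Metric = serde_json::from_str(&text)?;
            // Hand off to the metric store / widget subscription layer here.
            let _ = metric;
        }
    }
}
```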
-**Status Handling Rules:**
-- Agent provides status → Dashboard uses agent status
-- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
-- Dashboard widgets NEVER contain hardcoded thresholds
-- TableBuilder converts status to colors for display
+### Migration from Legacy Architecture

+**OLD (DEPRECATED)**:
+```
+Agent → ServiceMetrics{summary, services} → Dashboard → Widget
+Agent → SmartMetrics{drives, summary} → Dashboard → Widget
+```

+**NEW (IMPLEMENTING)**:
+```
+Agent → ["cpu_load_1min", "memory_usage_percent", ...] → Dashboard → Widgets subscribe to needed metrics
+```

 ### Current Agent Thresholds (as of 2025-10-12)
@@ -295,6 +358,15 @@ Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
 - [x] ZMQ broadcast mechanism ensuring continuous data delivery to dashboard
 - [x] Immich service quota detection fix (500GB instead of hardcoded 200GB)
 - [x] Service-to-directory mapping for accurate disk usage calculation
+- [x] **Real-time process monitoring implementation (2025-10-16)**
+  - [x] Fixed hardcoded top CPU/RAM process display with real data
+  - [x] Added top CPU and RAM process collection to CpuCollector
+  - [x] Implemented ps-based process monitoring with accurate percentages
+  - [x] Added intelligent filtering to avoid self-monitoring artifacts
+  - [x] Dashboard updated to display real-time top processes instead of placeholder text
+- [x] Fixed disk metrics permission issues in systemd collector
+- [x] Enhanced error logging for service directory access problems
+- [x] Optimized service collection focusing on status, memory, and disk metrics only

 **Production Configuration:**
 - CPU load thresholds: Warning ≥ 9.0, Critical ≥ 10.0
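For illustration, the same rule expressed as a small agent-side helper; `Thresholds` is hypothetical, with the production values above plugged in:

```rust
/// A warning/critical pair a collector can apply to any numeric metric.
struct Thresholds {
    warning: f32,
    critical: f32,
}

impl Thresholds {
    /// Checked worst-first so a load of 10.2 reports Critical, not Warning.
    fn status(&self, value: f32) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

// CPU load per the production configuration:
// let cpu_load = Thresholds { warning: 9.0, critical: 10.0 };
// cpu_load.status(2.18) == Status::Ok
```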
@@ -332,86 +404,111 @@ rm /tmp/cm-maintenance
 - Borgbackup script automatically creates/removes maintenance file
 - Automatic cleanup via trap ensures maintenance mode doesn't stick
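A sketch of how the agent can gate notifications on that file; the path comes from the commands above, the function name is illustrative:

```rust
use std::path::Path;

/// Notifications are suppressed while the maintenance file exists; the
/// borgbackup script creates it on entry and its trap removes it on exit.
fn maintenance_mode_active() -> bool {
    Path::new("/tmp/cm-maintenance").exists()
}
```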
-### Smart Caching System
+### Configuration-Based Smart Caching System

 **Purpose:**
-- Reduce agent CPU usage from 9.5% to <2% through intelligent caching
-- Maintain dashboard responsiveness with tiered refresh strategies
-- Optimize for different data volatility characteristics
+- Reduce agent CPU usage from 10% to <1% through configuration-driven intelligent caching
+- Maintain dashboard responsiveness with configurable refresh strategies
+- Optimize for different data volatility characteristics via config files

-**Architecture:**
-```
-Cache Tiers:
-- RealTime (5s): CPU load, memory usage, quick-changing metrics
-- Fast (30s): Network stats, process lists, medium-volatility
-- Medium (5min): Service status, disk usage, slow-changing data
-- Slow (15min): SMART data, backup status, rarely-changing metrics
-- Static (1h): Hardware info, system capabilities, fixed data
-```
+**Configuration-Driven Architecture:**
+```toml
+# Cache tiers defined in agent.toml
+[cache.tiers.realtime]
+interval_seconds = 5
+description = "High-frequency metrics (CPU load, memory usage)"
+
+[cache.tiers.medium]
+interval_seconds = 300
+description = "Low-frequency metrics (service status, disk usage)"
+
+[cache.tiers.slow]
+interval_seconds = 900
+description = "Very low-frequency metrics (SMART data, backup status)"
+
+# Metric assignments via configuration
+[cache.metric_assignments]
+"cpu_load_*" = "realtime"
+"service_*_disk_gb" = "medium"
+"disk_*_temperature" = "slow"
+```
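A sketch of how those assignments could be resolved at runtime. `tier_for_metric` and the single-`*` glob are assumptions, sufficient for the patterns shown above:

```rust
/// Resolve a metric name to its cache tier; first matching pattern wins,
/// mirroring the order of [cache.metric_assignments].
fn tier_for_metric<'a>(assignments: &'a [(String, String)], metric: &str) -> Option<&'a str> {
    assignments
        .iter()
        .find(|(pattern, _)| glob_match(pattern, metric))
        .map(|(_, tier)| tier.as_str())
}

/// Minimal matcher: at most one '*' wildcard, enough for "cpu_load_*"
/// or "service_*_disk_gb".
fn glob_match(pattern: &str, name: &str) -> bool {
    match pattern.split_once('*') {
        Some((prefix, suffix)) => {
            name.len() >= prefix.len() + suffix.len()
                && name.starts_with(prefix)
                && name.ends_with(suffix)
        }
        None => pattern == name,
    }
}

// tier_for_metric(&assignments, "cpu_load_1min") -> Some("realtime")
```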
 **Implementation:**
-- **SmartCache**: Central cache manager with RwLock for thread safety
-- **CachedCollector**: Wrapper adding caching to any collector
-- **CollectionScheduler**: Manages tier-based refresh timing
+- **ConfigurableCache**: Central cache manager reading tier config from files
+- **MetricCacheManager**: Assigns metrics to tiers based on configuration patterns
+- **TierScheduler**: Manages configurable tier-based refresh timing
 - **Cache warming**: Parallel startup population for instant responsiveness
-- **Background refresh**: Proactive updates to prevent cache misses
+- **Background refresh**: Proactive updates based on configured intervals (see the cache sketch below)
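A compact sketch of the cache shape this implies (an RwLock-guarded map with tier-interval freshness checks); the names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// One cached metric plus the moment it was collected.
struct CacheEntry {
    metric: Metric, // shared/src/metrics.rs type
    collected_at: Instant,
}

/// Central cache keyed by metric name: many concurrent readers (the ZMQ
/// publisher), exclusive writers (the collectors).
struct ConfigurableCache {
    entries: RwLock<HashMap<String, CacheEntry>>,
}

impl ConfigurableCache {
    /// Serve the cached value unless it is older than its tier interval.
    fn get_fresh(&self, name: &str, tier_interval: Duration) -> Option<Metric> {
        let entries = self.entries.read().ok()?;
        let entry = entries.get(name)?;
        (entry.collected_at.elapsed() < tier_interval).then(|| entry.metric.clone())
    }

    fn insert(&self, metric: Metric) {
        let mut entries = self.entries.write().expect("cache lock poisoned");
        entries.insert(
            metric.name.clone(),
            CacheEntry { metric, collected_at: Instant::now() },
        );
    }
}
```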
-**Usage:**
-```bash
-# Start the agent with intelligent caching
-cm-dashboard-agent [-v]
-```
+**Configuration:**
+```toml
+[cache]
+enabled = true
+default_ttl_seconds = 30
+max_entries = 10000
+warming_timeout_seconds = 3
+background_refresh_enabled = true
+cleanup_interval_seconds = 1800
+```

 **Performance Benefits:**
-- CPU usage reduction: 9.5% → <2% expected
-- Instant dashboard startup through cache warming
-- Reduced disk I/O through intelligent du command caching
-- Network efficiency with selective refresh strategies
+- CPU usage reduction: 10% → <1% target through configuration optimization
+- Configurable cache intervals prevent expensive operations from running too frequently
+- Disk usage detection cached at 5-minute intervals instead of every 5 seconds
+- Selective metric refresh based on configured volatility patterns

-**Configuration:**
-- Cache warming timeout: 3 seconds
-- Background refresh: Enabled at 80% of tier interval
-- Cache cleanup: Every 30 minutes
-- Stale data threshold: 2x tier interval
+**Usage:**
+```bash
+# Start agent with config-based caching
+cm-dashboard-agent --config /etc/cm-dashboard/agent.toml [-v]
+```
 **Architecture:**
-- **Intelligent caching**: Tiered collection with optimal CPU usage
-- **Auto-discovery**: No configuration files required
+- **Configuration-driven caching**: Tiered collection with configurable intervals
+- **Config file management**: All cache behavior defined in TOML configuration
 - **Responsive design**: Cache warming for instant dashboard startup

-### Development Guidelines
+### New Implementation Guidelines - CRITICAL

-**When Adding New Metrics:**
-1. Agent calculates status with thresholds
-2. Agent adds `{metric}_status` field to JSON output
-3. Dashboard data structure adds `{metric}_status: Option<String>`
-4. Dashboard uses `status_level_from_agent_status()` for display
-5. Agent adds notification monitoring for status changes
+**ARCHITECTURE ENFORCEMENT**:
+- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
+- **Individual metrics only** - NO grouped metric structures
+- **Reference-only legacy** - Study old functionality, implement new architecture
+- **Clean slate mindset** - Build as if legacy codebase never existed
-**Testing & Building:**
-- ALWAYS use `cargo build --workspace` to match NixOS build configuration
-- Test with OpenSSL environment variables when building locally:
-  ```bash
-  OPENSSL_DIR=/nix/store/.../openssl-dev \
-  OPENSSL_LIB_DIR=/nix/store/.../openssl/lib \
-  OPENSSL_INCLUDE_DIR=/nix/store/.../openssl-dev/include \
-  PKG_CONFIG_PATH=/nix/store/.../openssl-dev/lib/pkgconfig \
-  OPENSSL_NO_VENDOR=1 cargo build --workspace
-  ```
-- This prevents build failures that only appear in NixOS deployment
+**Implementation Rules**:
+1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
+2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
+3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
+4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
+5. **ZMQ Communication**: All metrics transmitted via ZMQ, no HTTP APIs

-**Notification System:**
-- Universal automatic detection of all `_status` fields across all collectors
-- Sends emails from `hostname@cmtec.se` to `cm@cmtec.se` for any status changes
-- Status stored in-memory: `HashMap<"component.metric", status>`
-- Recovery emails sent when status changes from warning/critical → ok
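A sketch of the status-change bookkeeping those rules describe; the message strings are illustrative, and actual delivery would go through lettre as listed in the technology stack:

```rust
use std::collections::HashMap;

/// In-memory state: "component.metric" -> last observed status string.
struct NotificationState {
    last: HashMap<String, String>,
}

impl NotificationState {
    /// Compare a freshly collected status against the stored one and
    /// describe the email to send, if any.
    fn on_status(&mut self, key: &str, status: &str) -> Option<String> {
        let previous = self.last.insert(key.to_string(), status.to_string());
        match previous.as_deref() {
            Some(prev) if prev == status => None, // unchanged: nothing to send
            Some(prev) => {
                let recovered = matches!(prev, "warning" | "critical") && status == "ok";
                Some(if recovered {
                    format!("RECOVERY: {key} {prev} -> {status}")
                } else {
                    format!("ALERT: {key} {prev} -> {status}")
                })
            }
            None => None, // first observation establishes the baseline
        }
    }
}
```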
+**When Adding New Metrics**:
+1. Define metric name in shared registry (e.g., "disk_nvme1_temperature_celsius")
+2. Implement collector that returns individual Metric struct
+3. Agent calculates status using configured thresholds
+4. Dashboard widgets subscribe to metric by name
+5. Notification system automatically detects status changes (see the collector sketch below)
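To ground those steps, a sketch of a collector shape that satisfies steps 1-3; the `Collector` trait and the example struct are illustrative, reusing the registry name from step 1:

```rust
/// Each collector yields individual Metric values; the agent loop publishes
/// them over ZMQ and feeds the notification system.
trait Collector {
    fn name(&self) -> &'static str;
    fn collect(&mut self) -> anyhow::Result<Vec<Metric>>;
}

struct DiskTemperatureCollector;

impl Collector for DiskTemperatureCollector {
    fn name(&self) -> &'static str {
        "disk_temperature"
    }

    fn collect(&mut self) -> anyhow::Result<Vec<Metric>> {
        let celsius = 38.0_f32; // stand-in for a real SMART read
        Ok(vec![Metric {
            name: "disk_nvme1_temperature_celsius".to_string(),
            value: MetricValue::Float(celsius),
            status: Status::Ok, // agent thresholds decide this in real code
            timestamp: 0,       // real code stamps UNIX seconds
            description: None,
            unit: Some("celsius".to_string()),
        }])
    }
}
```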
-**NEVER:**
-- Add hardcoded thresholds to dashboard widgets
-- Calculate status in dashboard with different thresholds than agent
-- Use "ok" as default when agent status is missing (use "unknown")
-- Calculate colors in widgets (TableBuilder's responsibility)
-- Use `cargo build` without `--workspace` for final testing
+**Testing & Building**:
+- **Workspace builds**: `cargo build --workspace` for all testing
+- **Clean compilation**: Remove `target/` between architecture changes
+- **ZMQ testing**: Test agent-dashboard communication independently
+- **Widget testing**: Verify UI layout matches legacy appearance exactly

+**NEVER in New Implementation**:
+- Copy/paste ANY code from legacy backup
+- Create grouped metric structures (SystemMetrics, etc.)
+- Calculate status in dashboard widgets
+- Hardcode metric names in widgets (use const arrays)
+- Skip individual metric architecture for "simplicity"

+**Legacy Reference Usage**:
+- Study UI layout and rendering logic only
+- Understand email notification formatting
+- Reference status color mapping
+- Learn host navigation patterns
+- NO code copying or structural influence

 # Important Communication Guidelines