Christoffer Martinsson 9e344fb66d Testing

2025-10-12 22:31:46 +02:00

CM Dashboard - Infrastructure Monitoring TUI

Overview

A Rust-based TUI dashboard for monitoring CMTEC infrastructure, designed for low resource usage. It replaces Glance with a custom solution tailored to our monitoring needs and API integrations.

Project Goals

Core Objectives

Real-time monitoring of all infrastructure components
Multi-host support for cmbox, labbox, simonbox, steambox, srv01
Performance-focused with minimal resource usage
Keyboard-driven interface for power users
Integration with existing monitoring APIs (ports 6127, 6128, 6129)

Key Features

NVMe health monitoring with wear prediction
CPU / memory / GPU telemetry with automatic thresholding
Service resource monitoring with per-service CPU and RAM usage
Disk usage overview for root filesystems
Backup status with detailed metrics and history
Unified alert pipeline summarising host health
Historical data tracking and trend analysis

Technical Architecture

Technology Stack

Language: Rust 🦀
TUI Framework: ratatui (modern tui-rs fork)
Async Runtime: tokio
HTTP Client: reqwest
Serialization: serde
CLI: clap
Error Handling: anyhow
Time: chrono

Dependencies

[dependencies]
ratatui = "0.24"           # Modern TUI framework
crossterm = "0.27"         # Cross-platform terminal handling
tokio = { version = "1.0", features = ["full"] }  # Async runtime
reqwest = { version = "0.11", features = ["json"] }  # HTTP client
serde = { version = "1.0", features = ["derive"] }   # (De)serialization of API payloads
clap = { version = "4.0", features = ["derive"] }    # CLI args
anyhow = "1.0"             # Error handling
chrono = "0.4"             # Time handling

Project Structure

cm-dashboard/
├── Cargo.toml
├── README.md
├── CLAUDE.md              # This file
├── src/
│   ├── main.rs            # Entry point & CLI
│   ├── app.rs             # Main application state
│   ├── ui/
│   │   ├── mod.rs
│   │   ├── dashboard.rs   # Main dashboard layout
│   │   ├── nvme.rs        # NVMe health widget
│   │   ├── services.rs    # Services status widget
│   │   ├── memory.rs      # RAM optimization widget
│   │   ├── backup.rs      # Backup status widget
│   │   └── alerts.rs      # Alerts/notifications widget
│   ├── api/
│   │   ├── mod.rs
│   │   ├── client.rs      # HTTP client wrapper
│   │   ├── smart.rs       # Smart metrics API (port 6127)
│   │   ├── service.rs     # Service metrics API (port 6128)
│   │   └── backup.rs      # Backup metrics API (port 6129)
│   ├── data/
│   │   ├── mod.rs
│   │   ├── metrics.rs     # Data structures
│   │   ├── history.rs     # Historical data storage
│   │   └── config.rs      # Host configuration
│   └── config.rs          # Application configuration
├── config/
│   ├── hosts.toml         # Host definitions
│   └── dashboard.toml     # Dashboard layout config
└── docs/
    ├── API.md             # API integration documentation
    └── WIDGETS.md         # Widget development guide

Data Structures

use serde::Deserialize;

#[derive(Deserialize, Debug)]
pub struct SmartMetrics {
    pub status: String,
    pub drives: Vec<DriveInfo>,
    pub summary: DriveSummary,
    pub issues: Vec<String>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceMetrics {
    pub summary: ServiceSummary,
    pub services: Vec<ServiceInfo>,
    pub timestamp: u64,
}

#[derive(Deserialize, Debug)]
pub struct ServiceSummary {
    pub healthy: usize,
    pub degraded: usize,
    pub failed: usize,
    pub memory_used_mb: f32,
    pub memory_quota_mb: f32,
    pub system_memory_used_mb: f32,
    pub system_memory_total_mb: f32,
    pub disk_used_gb: f32,
    pub disk_total_gb: f32,
    pub cpu_load_1: f32,
    pub cpu_load_5: f32,
    pub cpu_load_15: f32,
    pub cpu_freq_mhz: Option<f32>,
    pub cpu_temp_c: Option<f32>,
    pub gpu_load_percent: Option<f32>,
    pub gpu_temp_c: Option<f32>,
}

#[derive(Deserialize, Debug)]
pub struct BackupMetrics {
    pub overall_status: String,
    pub backup: BackupInfo,
    pub service: BackupServiceInfo,
    pub timestamp: u64,
}

Dashboard Layout Design

Main Dashboard View

┌─────────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                    │
├─────────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0       │ Services • ok:1 warn:0 fail:0      │
│ ┌────────────────────────────────┐ │ ┌────────────────────────────────┐ │
│ │Drive    Temp  Wear Spare Hours │ │ │Service memory: 7.1/23899.7 MiB │ │
│ │nvme0n1  28°C  1%   100%  14489 │ │ │Disk usage: —                   │ │
│ │         Capacity Usage         │ │ │  Service  Memory     Disk      │ │
│ │         954G     77G (8%)      │ │ │✔ sshd     7.1 MiB    —         │ │
│ └────────────────────────────────┘ │ └────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn                │ Backups                            │
│ System memory: 5251.7/23899.7 MiB  │ Host cmbox awaiting backup         │
│ CPU load (1/5/15): 2.18 2.66 2.56  │ metrics                            │
│ CPU freq: 1100.1 MHz               │                                    │
│ CPU temp: 47.0°C                   │                                    │
├─────────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0        │ Status • ZMQ connected             │
│ cmbox: warning: CPU load 2.18      │ Monitoring • hosts: 3              │
│ srv01: pending: awaiting metrics   │ Data source: ZMQ – connected       │
│ labbox: pending: awaiting metrics  │ Active host: cmbox (1/3)           │
└─────────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
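
The `[←→] hosts` binding and the `Active host: cmbox (1/3)` status line suggest a small piece of selection state. The sketch below is one possible shape for it, not the project's actual code: `HostSelector`, its method names, and the wrap-around behaviour at either end of the list are all assumptions made for illustration.

```rust
/// Hypothetical selection state behind the [←→] key binding.
pub struct HostSelector {
    hosts: Vec<String>,
    active: usize,
}

impl HostSelector {
    /// Returns `None` for an empty host list so `active` is always a valid index.
    pub fn new(hosts: Vec<String>) -> Option<Self> {
        if hosts.is_empty() {
            None
        } else {
            Some(Self { hosts, active: 0 })
        }
    }

    /// Move to the next host (→), wrapping to the first after the last (assumed).
    pub fn select_next(&mut self) {
        self.active = (self.active + 1) % self.hosts.len();
    }

    /// Move to the previous host (←), wrapping to the last before the first (assumed).
    pub fn select_previous(&mut self) {
        let len = self.hosts.len();
        self.active = (self.active + len - 1) % len;
    }

    /// Text for the status panel, in the format shown in the mockup.
    pub fn label(&self) -> String {
        format!(
            "Active host: {} ({}/{})",
            self.hosts[self.active],
            self.active + 1,
            self.hosts.len()
        )
    }
}
```

Keeping the index arithmetic in one place means the key handler only has to call `select_next()` or `select_previous()` and redraw.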

Multi-Host View

┌─────────────────────────────────────────────────────────────────────┐
│ 🖥️  CMTEC Host Overview                                              │
├─────────────────────────────────────────────────────────────────────┤
│ Host      │ NVMe Wear │ RAM Usage │ Services │ Last Alert            │
├─────────────────────────────────────────────────────────────────────┤
│ srv01     │ 4%   ✅   │ 32%  ✅   │ 8/8  ✅  │ 04:00 Backup OK       │
│ cmbox     │ 12%  ✅   │ 45%  ✅   │ 3/3  ✅  │ Yesterday Email test  │
│ labbox    │ 8%   ✅   │ 28%  ✅   │ 2/2  ✅  │ 2h ago NVMe temp OK   │
│ simonbox  │ 15%  ✅   │ 67%  ⚠️   │ 4/4  ✅  │ Gaming session active │
│ steambox  │ 23%  ✅   │ 78%  ⚠️   │ 2/2  ✅  │ High RAM usage        │
└─────────────────────────────────────────────────────────────────────┘
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit

Architecture Principles - CRITICAL

Agent-Dashboard Separation of Concerns

AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS

Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
Agent sends status to dashboard via ZMQ
Dashboard NEVER calculates status - only displays what agent provides

Data Flow Architecture:

Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)

Status Handling Rules:

Agent provides status → Dashboard uses agent status
Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
Dashboard widgets NEVER contain hardcoded thresholds
TableBuilder converts status to colors for display
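
The rules above can be sketched in a few lines. The function name `status_level_from_agent_status` appears later in this document, but its signature here is an assumption, as are the `StatusLevel` and `DisplayColor` types and the `color_for` helper (a real TableBuilder would presumably map to ratatui colors instead).

```rust
/// Display level derived purely from the agent's status string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatusLevel {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Hypothetical stand-in for the terminal colors used by TableBuilder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisplayColor {
    Green,
    Yellow,
    Red,
    Gray,
}

/// Translate the agent-provided status without applying any thresholds.
/// A missing or unrecognised status maps to `Unknown`, never to `Ok`.
pub fn status_level_from_agent_status(status: Option<&str>) -> StatusLevel {
    match status {
        Some("ok") => StatusLevel::Ok,
        Some("warning") => StatusLevel::Warning,
        Some("critical") => StatusLevel::Critical,
        _ => StatusLevel::Unknown,
    }
}

/// Color selection lives in one place (TableBuilder's job), not in the widgets.
pub fn color_for(level: StatusLevel) -> DisplayColor {
    match level {
        StatusLevel::Ok => DisplayColor::Green,
        StatusLevel::Warning => DisplayColor::Yellow,
        StatusLevel::Critical => DisplayColor::Red,
        StatusLevel::Unknown => DisplayColor::Gray,
    }
}
```

A widget holding a `cpu_status: Option<String>` field would call `status_level_from_agent_status(cpu_status.as_deref())` and hand the result to the table layer.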

Current Agent Thresholds (as of 2025-10-12)

CPU Load (service.rs:392-400):

Warning: ≥ 2.0 (testing value, was 5.0)
Critical: ≥ 4.0 (testing value, was 8.0)

CPU Temperature (service.rs:412-420):

Warning: ≥ 70.0°C
Critical: ≥ 80.0°C

Memory Usage (service.rs:402-410):

Warning: ≥ 80%
Critical: ≥ 95%
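
As an illustration of how the agent might apply these thresholds, here is a minimal sketch. The threshold values are the ones listed above; the `AgentStatus`, `Thresholds`, and `classify` names are hypothetical and do not reflect the actual contents of service.rs.

```rust
/// Status strings the agent reports to the dashboard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentStatus {
    Ok,
    Warning,
    Critical,
    Unknown,
}

impl AgentStatus {
    /// Wire representation, matching the strings named in this document.
    pub fn as_str(self) -> &'static str {
        match self {
            AgentStatus::Ok => "ok",
            AgentStatus::Warning => "warning",
            AgentStatus::Critical => "critical",
            AgentStatus::Unknown => "unknown",
        }
    }
}

/// Inclusive lower bounds for the warning and critical levels.
pub struct Thresholds {
    pub warning: f32,
    pub critical: f32,
}

// Testing values; the production values were 5.0 / 8.0.
pub const CPU_LOAD: Thresholds = Thresholds { warning: 2.0, critical: 4.0 };
pub const CPU_TEMP_C: Thresholds = Thresholds { warning: 70.0, critical: 80.0 };
pub const MEMORY_PERCENT: Thresholds = Thresholds { warning: 80.0, critical: 95.0 };

/// Classify a reading; a missing or NaN reading yields `Unknown`.
pub fn classify(value: Option<f32>, limits: &Thresholds) -> AgentStatus {
    match value {
        None => AgentStatus::Unknown,
        Some(v) if v.is_nan() => AgentStatus::Unknown,
        Some(v) if v >= limits.critical => AgentStatus::Critical,
        Some(v) if v >= limits.warning => AgentStatus::Warning,
        Some(_) => AgentStatus::Ok,
    }
}
```

With the testing thresholds, the 1-minute load of 2.18 from the dashboard mockup classifies as "warning", which matches the alert shown there for cmbox.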

Email Notifications

System Configuration:

From: {hostname}@cmtec.se (e.g., cmbox@cmtec.se)
To: cm@cmtec.se
SMTP: localhost:25 (postfix)
Timezone: Europe/Stockholm (not UTC)

Notification Triggers:

Status degradation: any → "warning" or "critical"
Recovery: "warning"/"critical" → "ok"
Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
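
The triggers above amount to a transition check plus a rate limiter. The following sketch reads the rules literally and makes several assumptions: the type and function names are hypothetical, an unchanged status sends nothing, any change into "warning" or "critical" counts as an alert (including critical → warning), and a single limiter is used rather than one per component.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Notification {
    Alert,
    Recovery,
}

/// Decide whether a status change should produce an email.
pub fn transition(prev: Status, next: Status) -> Option<Notification> {
    if prev == next {
        return None; // assumption: only changes trigger a notification
    }
    match (prev, next) {
        // any -> "warning" or "critical"
        (_, Status::Warning) | (_, Status::Critical) => Some(Notification::Alert),
        // "warning"/"critical" -> "ok"
        (Status::Warning, Status::Ok) | (Status::Critical, Status::Ok) => {
            Some(Notification::Recovery)
        }
        _ => None,
    }
}

/// Allows at most one notification per interval; an interval of 0 disables limiting.
pub struct RateLimiter {
    min_interval: Duration,
    last_sent: Option<Instant>,
}

impl RateLimiter {
    pub fn new(min_interval_minutes: u64) -> Self {
        Self {
            min_interval: Duration::from_secs(min_interval_minutes * 60),
            last_sent: None,
        }
    }

    /// Returns true and records `now` if a notification may be sent.
    pub fn allow(&mut self, now: Instant) -> bool {
        let allowed = match self.last_sent {
            Some(last) => now.duration_since(last) >= self.min_interval,
            None => true,
        };
        if allowed {
            self.last_sent = Some(now);
        }
        allowed
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside `allow`, keeps the limiter easy to test with the production value of 30 minutes.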

Monitored Components:

system.cpu (load status)
system.cpu_temp (temperature status)
system.memory (usage status)
system.services (service health status)
storage.smart (drive health)
backup.overall (backup status)

Pure Auto-Discovery Implementation

Agent Configuration:

No config files required
Auto-detects storage devices, services, backup systems
Runtime discovery of system capabilities
CLI: cm-dashboard-agent [-v] (only verbose flag)

Service Discovery:

Scans running systemd services
Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
No host-specific hardcoded service lists
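
The filtering step can be sketched as a pure function over unit names. How the agent enumerates running systemd units is not described here, so the sketch takes the list as input. The function name, the case-insensitive substring match, and the exact pattern list (only the three patterns named above) are assumptions.

```rust
/// Patterns named in this document; the agent's real list may be longer ("etc.").
pub const INTERESTING_PATTERNS: &[&str] = &["gitea", "nginx", "docker"];

/// Keep only the units whose name contains one of the patterns.
/// Matching is a case-insensitive substring test (an assumption).
pub fn filter_interesting<'a>(running_units: &[&'a str], patterns: &[&str]) -> Vec<&'a str> {
    running_units
        .iter()
        .copied()
        .filter(|unit| {
            let name = unit.to_ascii_lowercase();
            patterns.iter().any(|pattern| name.contains(*pattern))
        })
        .collect()
}
```

Because the pattern list is shared by all hosts, the same binary reports different services on each host without any host-specific configuration.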

Current Implementation Status

Completed:

Pure auto-discovery agent (no config files)
Agent-side status calculations with defined thresholds
Dashboard displays agent status (no dashboard calculations)
Email notifications with Stockholm timezone
CPU temperature monitoring and notifications
ZMQ message format standardization
Removed all hardcoded dashboard thresholds

Testing Configuration (REVERT FOR PRODUCTION):

CPU thresholds lowered to 2.0/4.0 for easy testing
Email rate limiting disabled (0 minutes)

Development Guidelines

When Adding New Metrics:

Agent calculates status with thresholds
Agent adds {metric}_status field to JSON output
Dashboard data structure adds {metric}_status: Option<String>
Dashboard uses status_level_from_agent_status() for display
Agent adds notification monitoring for status changes

NEVER:

Add hardcoded thresholds to dashboard widgets
Calculate status in the dashboard, even with the same thresholds as the agent
Use "ok" as default when agent status is missing (use "unknown")
Calculate colors in widgets (TableBuilder's responsibility)

Important Communication Guidelines

NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

NEVER mention Claude or automation in commit messages. Keep commit messages focused on the technical changes only.
