CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

Current Implementation

This is a complete rewrite implementing an individual metrics architecture where:

  • Agent collects individual metrics (e.g., cpu_load_1min, memory_usage_percent) and calculates status
  • Dashboard subscribes to specific metrics and composes widgets
  • Status Aggregation provides intelligent email notifications with batching
  • Persistent Cache prevents false notifications on restart
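
The shared types behind this flow are not shown in this README; as a hedged sketch (field and variant names inferred from the code examples later in this document, not confirmed against shared/metrics.rs), the core metric type could look like:

```rust
// Sketch of the shared metric types; names are assumptions based on this README.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

#[derive(Debug, Clone, PartialEq)]
enum MetricValue {
    Float(f64),
    String(String),
}

#[derive(Debug, Clone)]
struct Metric {
    name: String,
    value: MetricValue,
    status: Status,
}

impl Metric {
    fn new(name: String, value: MetricValue, status: Status) -> Self {
        Metric { name, value, status }
    }
}

fn main() {
    // The agent emits one Metric per measurement; the dashboard routes on `name`.
    let m = Metric::new(
        "cpu_load_1min".to_string(),
        MetricValue::Float(0.42),
        Status::Ok,
    );
    println!("{} -> {:?}", m.name, m.status);
}
```

Because each metric carries its own status, the dashboard can compose widgets from any subset of metrics without re-implementing threshold logic.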

Dashboard Interface

cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system───────────────────────────────────────────┐┌services────────────────────────────────────────────────────┐
│CPU:                                             ││Service:                  Status:    RAM:     Disk:         │
│● Load: 0.10 0.52 0.88 • 400.0 MHz               ││● docker                  active     27M      496MB         │
│RAM:                                             ││● docker-registry         active     19M      496MB         │
│● Used: 30% 2.3GB/7.6GB                          ││● gitea                   active     579M     2.6GB         │
│● tmp: 0.0% 0B/2.0GB                             ││● gitea-runner-default    active     11M      2.6GB         │
│Disk nvme0n1:                                    ││● haasp-core              active     9M       1MB           │
│● Health: PASSED                                 ││● haasp-mqtt              active     3M       1MB           │
│● Usage @root: 8.3% • 75.4/906.2 GB              ││● haasp-webgrid           active     10M      1MB           │
│● Usage @boot: 5.9% • 0.1/1.0 GB                 ││● immich-server           active     240M     45.1GB        │
│                                                 ││● mosquitto               active     1M       1MB           │
│                                                 ││● mysql                   active     38M      225MB         │
│                                                 ││● nginx                   active     28M      24MB          │
│                                                 ││  ├─ ● gitea.cmtec.se     51ms                              │
│                                                 ││  ├─ ● haasp.cmtec.se     43ms                              │
│                                                 ││  ├─ ● haasp.net          43ms                              │
│                                                 ││  ├─ ● pages.cmtec.se     45ms                              │
└─────────────────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                              │
┌backup───────────────────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                              │
│Latest backup:                                   ││  ├─ ● vault.cmtec.se     47ms                              │
│● Status: OK                                     ││  ├─ ● www.kryddorten.se  81ms                              │
│Duration: 54s • Last: 4h ago                     ││  ├─ ● www.mariehall2.se  86ms                              │
│Disk usage: 48.2GB/915.8GB                       ││● postgresql              active     112M     357MB         │
│P/N: Samsung SSD 870 QVO 1TB                     ││● redis-immich            active     8M       45.1GB        │
│S/N: S5RRNF0W800639Y                             ││● sshd                    active     2M       0             │
│● gitea 2 archives 2.7GB                         ││● unifi                   active     594M     495MB         │
│● immich 2 archives 45.0GB                       ││● vaultwarden             active     12M      1MB           │
│● kryddorten 2 archives 67.6MB                   ││                                                            │
│● mariehall2 2 archives 321.8MB                  ││                                                            │
│● nixosbox 2 archives 4.5MB                      ││                                                            │
│● unifi 2 archives 2.9MB                         ││                                                            │
│● vaultwarden 2 archives 305kB                   ││                                                            │
└─────────────────────────────────────────────────┘└────────────────────────────────────────────────────────────┘

Navigation: ←→ switch hosts, r refresh, q quit

Features

  • Real-time monitoring - Dashboard updates every 1-2 seconds
  • Individual metric collection - Granular data for flexible dashboard composition
  • Intelligent status aggregation - Host-level status calculated from all services
  • Smart email notifications - Batched, detailed alerts with service groupings
  • Persistent state - Prevents false notifications on restarts
  • ZMQ communication - Efficient agent-to-dashboard messaging
  • Clean TUI - Terminal-based dashboard with color-coded status indicators

Architecture

Core Components

  • Agent (cm-dashboard-agent) - Collects metrics and sends via ZMQ
  • Dashboard (cm-dashboard) - Real-time TUI display consuming metrics
  • Shared (cm-dashboard-shared) - Common types and protocol
  • Status Aggregation - Intelligent batching and notification management
  • Persistent Cache - Maintains state across restarts

Status Levels

  • 🟢 Ok - Service running normally
  • 🔵 Pending - Service starting/stopping/reloading
  • 🟡 Warning - Service issues (high load, memory, disk usage)
  • 🔴 Critical - Service failed or critical thresholds exceeded
  • ⚪ Unknown - Service state cannot be determined

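
With the worst_case aggregation method configured later in this README, the host-level status is simply the most severe status of any contributing metric. A minimal sketch (the relative severity of Unknown is an assumption, since the source does not specify where it ranks):

```rust
// Sketch of worst_case status aggregation; Unknown's rank is an assumption.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

fn severity(s: Status) -> u8 {
    match s {
        Status::Ok => 0,
        Status::Pending => 1,
        Status::Unknown => 2, // assumption: worse than Pending, better than Warning
        Status::Warning => 3,
        Status::Critical => 4,
    }
}

/// Host status is the worst status of any contributing metric.
fn aggregate_worst_case(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .max_by_key(|s| severity(*s))
        .unwrap_or(Status::Unknown) // no metrics yet => Unknown
}

fn main() {
    let host = aggregate_worst_case(&[Status::Ok, Status::Warning, Status::Ok]);
    println!("host status: {:?}", host);
}
```

This is why a single failed service turns the whole host red in the header bar: one Critical metric dominates the aggregate.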
Quick Start

Build

# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace

Run

# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard 
./target/debug/cm-dashboard --config /path/to/dashboard.toml

Configuration

Agent Configuration (agent.toml)

The agent requires a comprehensive TOML configuration file:

collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx", "postgresql", "redis", "docker", "sshd"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"

Dashboard Configuration (dashboard.toml)

[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"

Collectors

The agent implements several specialized collectors:

CPU Collector (cpu.rs)

  • Load average (1, 5, 15 minute)
  • CPU temperature monitoring
  • Real-time process monitoring (top CPU consumers)
  • Status calculation with configurable thresholds
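
The threshold logic can be illustrated with a pure sketch: parse the 1-minute load from a /proc/loadavg-style line and classify it against the load_warning_threshold and load_critical_threshold values from agent.toml (the actual collector internals are not shown in this README):

```rust
// Hedged sketch: classify the 1-minute load average against configured thresholds.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// /proc/loadavg looks like: "0.10 0.52 0.88 1/234 5678"; take the first field.
fn parse_load_1min(loadavg_line: &str) -> Option<f64> {
    loadavg_line.split_whitespace().next()?.parse().ok()
}

fn load_status(load_1min: f64, warning: f64, critical: f64) -> Status {
    if load_1min >= critical {
        Status::Critical
    } else if load_1min >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    let load = parse_load_1min("0.10 0.52 0.88 1/234 5678").unwrap();
    // Thresholds taken from the sample agent.toml in this README.
    println!("{:?}", load_status(load, 9.0, 10.0));
}
```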

Memory Collector (memory.rs)

  • RAM usage (total, used, available)
  • Swap monitoring
  • Real-time process monitoring (top RAM consumers)
  • Memory pressure detection

Disk Collector (disk.rs)

  • Filesystem usage per mount point
  • SMART health monitoring
  • Temperature and wear tracking
  • Configurable filesystem monitoring

Systemd Collector (systemd.rs)

  • Service status monitoring (active, inactive, failed)
  • Memory usage per service
  • Service filtering and exclusions
  • Handles transitional states (Status::Pending)
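
One way to picture the state handling: map systemd's ActiveState strings onto dashboard statuses, with transitional states becoming Pending instead of triggering alerts. The exact mapping below is an assumption based on the states listed above, not taken from systemd.rs:

```rust
// Assumed mapping from systemd ActiveState strings to dashboard statuses.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

fn status_from_active_state(state: &str) -> Status {
    match state {
        "active" => Status::Ok,
        // Transitional states map to Pending so restarts don't fire alerts.
        "activating" | "deactivating" | "reloading" => Status::Pending,
        "inactive" => Status::Warning,
        "failed" => Status::Critical,
        _ => Status::Unknown,
    }
}

fn main() {
    println!("{:?}", status_from_active_state("reloading"));
}
```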

Backup Collector (backup.rs)

  • Reads TOML status files from backup systems
  • Archive age verification
  • Disk usage tracking
  • Repository health monitoring
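
The archive-age check reduces to comparing the time since the last archive against staleness thresholds. A small sketch (the threshold values and function names here are illustrative, not from backup.rs):

```rust
// Illustrative archive-age check; thresholds are assumptions, not source values.

use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

fn backup_status(age: Duration, warn_after: Duration, crit_after: Duration) -> Status {
    if age >= crit_after {
        Status::Critical
    } else if age >= warn_after {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    // Example: last archive 4 hours ago, warn after 24h, critical after 48h.
    let s = backup_status(
        Duration::from_secs(4 * 3600),
        Duration::from_secs(24 * 3600),
        Duration::from_secs(48 * 3600),
    );
    println!("{:?}", s);
}
```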

Email Notifications

Intelligent Batching

The system implements smart notification batching to prevent email spam:

  • Real-time dashboard updates - Status changes appear immediately
  • Batched email notifications - Aggregated every 30 seconds
  • Detailed groupings - Services organized by severity
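
The batching idea can be sketched as a queue of status transitions that is drained once per notification interval and grouped by resulting severity for the email summary. All names here are illustrative, not the agent's actual internals:

```rust
// Hedged sketch of notification batching: queue transitions, flush on an interval.

use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

#[derive(Debug, Clone)]
#[allow(dead_code)] // `service` and `from` would feed the detailed email body
struct Transition {
    service: String,
    from: Status,
    to: Status,
}

#[derive(Default)]
struct NotificationBatcher {
    pending: Vec<Transition>,
}

impl NotificationBatcher {
    /// Called immediately when a status change is observed; nothing is sent yet.
    fn record(&mut self, service: &str, from: Status, to: Status) {
        self.pending.push(Transition {
            service: service.to_string(),
            from,
            to,
        });
    }

    /// Called every notification_interval_seconds: drains the queue and returns
    /// transition counts per resulting status for the summary subject line.
    fn flush(&mut self) -> HashMap<Status, usize> {
        let mut groups = HashMap::new();
        for t in self.pending.drain(..) {
            *groups.entry(t.to).or_insert(0) += 1;
        }
        groups
    }
}

fn main() {
    let mut batcher = NotificationBatcher::default();
    batcher.record("postgresql", Status::Ok, Status::Critical);
    batcher.record("redis", Status::Ok, Status::Warning);
    let summary = batcher.flush();
    println!("critical: {}", summary.get(&Status::Critical).unwrap_or(&0));
}
```

The dashboard still sees every transition immediately over ZMQ; only the email path goes through this queue.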

Example Alert Email

Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Critical

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET

Individual Metrics Architecture

The system follows a metrics-first architecture:

Agent Side

// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]

Dashboard Side

// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}

Persistent Cache

The cache system prevents false notifications:

  • Automatic saving - Saves when service status changes
  • Persistent storage - Maintains state across agent restarts
  • Simple design - No complex TTL or cleanup logic
  • Status preservation - Prevents duplicate notifications
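
The mechanism can be sketched with plain std file I/O: on startup a missing or restored cache means the first observed statuses are treated as known state rather than as changes, which is what suppresses the false alerts. The real cache.json format is not documented here, so this sketch uses simple "service=status" lines purely for illustration:

```rust
// Simplified persistent-cache sketch; the real on-disk format is JSON, not shown here.

use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

fn save_cache(path: &Path, statuses: &HashMap<String, String>) -> io::Result<()> {
    let mut entries: Vec<_> = statuses.iter().collect();
    entries.sort(); // deterministic output for easy diffing
    let body: String = entries
        .iter()
        .map(|(k, v)| format!("{}={}\n", k, v))
        .collect();
    fs::write(path, body)
}

fn load_cache(path: &Path) -> HashMap<String, String> {
    // Missing file => empty cache; first real statuses then count as startup,
    // not as changes, so no notifications fire.
    let Ok(body) = fs::read_to_string(path) else {
        return HashMap::new();
    };
    body.lines()
        .filter_map(|l| l.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("cm-dashboard-cache-demo.txt");
    let mut statuses = HashMap::new();
    statuses.insert("nginx".to_string(), "Ok".to_string());
    save_cache(&path, &statuses)?;
    let restored = load_cache(&path);
    println!("nginx = {}", restored["nginx"]);
    Ok(())
}
```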

Development

Project Structure

cm-dashboard/
├── agent/                  # Metrics collection agent
│   ├── src/
│   │   ├── collectors/     # CPU, memory, disk, systemd, backup
│   │   ├── status/         # Status aggregation and notifications
│   │   ├── cache/          # Persistent metric caching
│   │   ├── config/         # TOML configuration loading
│   │   └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/     # CPU, memory, services, backup widgets
│   │   ├── metrics/        # Metric storage and filtering
│   │   └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md              # This file

Building

# Debug build
cargo build --workspace

# Release build  
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings

Dependencies

  • tokio - Async runtime
  • zmq - Message passing between agent and dashboard
  • ratatui - Terminal user interface
  • serde - Serialization for metrics and config
  • anyhow/thiserror - Error handling
  • tracing - Structured logging
  • lettre - SMTP email notifications
  • clap - Command-line argument parsing
  • toml - Configuration file parsing

NixOS Integration

This project is designed for declarative deployment via NixOS:

Configuration Generation

The NixOS module automatically generates the agent configuration:

# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};

Deployment

# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .

Monitoring Intervals

  • CPU/Memory: 2 seconds (real-time monitoring)
  • Disk usage: 300 seconds (5 minutes)
  • Systemd services: 10 seconds
  • SMART health: 600 seconds (10 minutes)
  • Backup status: 60 seconds (1 minute)
  • Email notifications: 30 seconds (batched)
  • Dashboard updates: 1 second (real-time display)

License

MIT License - see LICENSE file for details