
CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

Current Implementation

This is a complete rewrite implementing an individual metrics architecture where:

  • Agent collects individual metrics (e.g., cpu_load_1min, memory_usage_percent) and calculates status
  • Dashboard subscribes to specific metrics and composes widgets
  • Status Aggregation provides intelligent email notifications with batching
  • Persistent Cache prevents false notifications on restart
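
Later snippets in this README construct metrics with Metric::new(name, value, status). A minimal sketch of what the shared types in shared/src/metrics.rs could look like, consistent with those snippets (field and variant names beyond what they show are assumptions):

use serde::{Deserialize, Serialize};

// Severity levels; see the "Status Levels" section below.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Serialize, Deserialize)]
pub enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

// The value carried by a single collected metric.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum MetricValue {
    Float(f32),
    Integer(i64),
    String(String),
}

// One named metric, e.g. "cpu_load_1min", with its agent-computed status.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
}

impl Metric {
    pub fn new(name: String, value: MetricValue, status: Status) -> Self {
        Self { name, value, status }
    }
}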

Dashboard Interface

cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                  Status:  RAM:   Disk:  │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                  active   27M    496MB  │
│RAM:                                ││● docker-registry         active   19M    496MB  │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                   active   579M   2.6GB  │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default    active   11M    2.6GB  │
│Disk nvme0n1:                       ││● haasp-core              active   9M     1MB    │
│● Health: PASSED                    ││● haasp-mqtt              active   3M     1MB    │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid           active   10M    1MB    │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server           active   240M   45.1GB │
│                                    ││● mosquitto               active   1M     1MB    │
│                                    ││● mysql                   active   38M    225MB  │
│                                    ││● nginx                   active   28M    24MB   │
│                                    ││  ├─ ● gitea.cmtec.se     51ms                   │
│                                    ││  ├─ ● haasp.cmtec.se     43ms                   │
│                                    ││  ├─ ● haasp.net          43ms                   │
│                                    ││  ├─ ● pages.cmtec.se     45ms                   │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                   │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                   │
│Latest backup:                      ││  ├─ ● vault.cmtec.se     47ms                   │
│● Status: OK                        ││  ├─ ● www.kryddorten.se  81ms                   │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se  86ms                   │
│Disk usage: 48.2GB/915.8GB          ││● postgresql              active   112M   357MB  │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich            active   8M     45.1GB │
│S/N: S5RRNF0W800639Y                ││● sshd                    active   2M     0      │
│● gitea 2 archives 2.7GB            ││● unifi                   active   594M   495MB  │
│● immich 2 archives 45.0GB          ││● vaultwarden             active   12M    1MB    │
│● kryddorten 2 archives 67.6MB      ││                                                 │
│● mariehall2 2 archives 321.8MB     ││                                                 │
│● nixosbox 2 archives 4.5MB         ││                                                 │
│● unifi 2 archives 2.9MB            ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘

Navigation: ←→ switch hosts, r refresh, q quit

Features

  • Real-time monitoring - Dashboard updates every 1-2 seconds
  • Individual metric collection - Granular data for flexible dashboard composition
  • Intelligent status aggregation - Host-level status calculated from all services
  • Smart email notifications - Batched, detailed alerts with service groupings
  • Persistent state - Prevents false notifications on restarts
  • ZMQ communication - Efficient agent-to-dashboard messaging
  • Clean TUI - Terminal-based dashboard with color-coded status indicators

Architecture

Core Components

  • Agent (cm-dashboard-agent) - Collects metrics and sends via ZMQ
  • Dashboard (cm-dashboard) - Real-time TUI display consuming metrics
  • Shared (cm-dashboard-shared) - Common types and protocol
  • Status Aggregation - Intelligent batching and notification management
  • Persistent Cache - Maintains state across restarts
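
The agent publishes metrics over a ZMQ PUB socket and the dashboard connects with one SUB socket per host. A minimal sketch of that wiring with the zmq crate, reusing the Metric sketch above (JSON framing here is an assumption; the real message format lives in shared/src/protocol.rs):

use anyhow::Result;

// Agent side: bind a PUB socket and broadcast one serialized batch.
fn publish(metrics: &[Metric]) -> Result<()> {
    let ctx = zmq::Context::new();
    let publisher = ctx.socket(zmq::PUB)?;
    publisher.bind("tcp://0.0.0.0:6130")?; // publisher_port from agent.toml
    publisher.send(serde_json::to_vec(metrics)?, 0)?;
    Ok(())
}

// Dashboard side: connect a SUB socket per host and take every message.
fn subscribe(addr: &str) -> Result<Vec<Metric>> {
    let ctx = zmq::Context::new();
    let subscriber = ctx.socket(zmq::SUB)?;
    subscriber.connect(addr)?; // e.g. "tcp://192.168.1.100:6130"
    subscriber.set_subscribe(b"")?; // empty prefix: no topic filtering
    let bytes = subscriber.recv_bytes(0)?;
    Ok(serde_json::from_slice(&bytes)?)
}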

Status Levels

  • 🟢 Ok - Service running normally
  • 🔵 Pending - Service starting/stopping/reloading
  • 🟡 Warning - Service issues (high load, memory, disk usage)
  • 🔴 Critical - Service failed or critical thresholds exceeded
  • ⚪ Unknown - Service state cannot be determined

Quick Start

Build

# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace

Run

# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml

Configuration

Agent Configuration (agent.toml)

The agent requires a comprehensive TOML configuration file:

collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx*", "postgresql*", "redis*", "docker*", "sshd*", 
  "gitea*", "immich*", "haasp*", "mosquitto*", "mysql*", 
  "unifi*", "vaultwarden*"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen", "systemd-", 
  "getty@", "user@", "dbus-", "NetworkManager-"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"
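
The aggregation_method = "worst_case" setting above means the host-level status is the most severe status across all collected metrics. A minimal sketch of that fold, with Status as sketched earlier (the exact severity ranking is an assumption):

// Rank statuses by severity, then take the maximum across all metrics.
fn severity(status: Status) -> u8 {
    match status {
        Status::Ok => 0,
        Status::Pending => 1,
        Status::Unknown => 2,
        Status::Warning => 3,
        Status::Critical => 4,
    }
}

fn aggregate_worst_case(metrics: &[Metric]) -> Status {
    metrics
        .iter()
        .map(|m| m.status)
        .max_by_key(|s| severity(*s))
        .unwrap_or(Status::Unknown) // no metrics yet: undetermined
}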

Dashboard Configuration (dashboard.toml)

[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"

Collectors

The agent implements several specialized collectors:

CPU Collector (cpu.rs)

  • Load average (1, 5, 15 minute)
  • CPU temperature monitoring
  • Real-time process monitoring (top CPU consumers)
  • Status calculation with configurable thresholds
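
A minimal sketch of the load-average part of this collector: read /proc/loadavg and apply the thresholds from agent.toml (function names are illustrative; Status as sketched earlier):

use std::fs;

// Read the 1/5/15-minute load averages from /proc/loadavg.
fn read_load_averages() -> std::io::Result<(f32, f32, f32)> {
    let content = fs::read_to_string("/proc/loadavg")?;
    let mut it = content
        .split_whitespace()
        .map(|v| v.parse::<f32>().unwrap_or(0.0));
    Ok((
        it.next().unwrap_or(0.0),
        it.next().unwrap_or(0.0),
        it.next().unwrap_or(0.0),
    ))
}

// Map a load value onto a status using the configured thresholds
// (load_warning_threshold = 9.0, load_critical_threshold = 10.0).
fn load_status(load: f32, warning: f32, critical: f32) -> Status {
    if load >= critical {
        Status::Critical
    } else if load >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}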

Memory Collector (memory.rs)

  • RAM usage (total, used, available)
  • Swap monitoring
  • Real-time process monitoring (top RAM consumers)
  • Memory pressure detection

Disk Collector (disk.rs)

  • Filesystem usage per mount point
  • SMART health monitoring
  • Temperature and wear tracking
  • Configurable filesystem monitoring

Systemd Collector (systemd.rs)

  • Service status monitoring (active, inactive, failed)
  • Memory usage per service
  • Service filtering and exclusions
  • Handles transitional states (Status::Pending)
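
A minimal sketch of how a systemd ActiveState string could map onto Status, with the transitional states producing Status::Pending (the exact mapping is an assumption):

// Map systemd's ActiveState onto a dashboard status; the three
// transitional states are what Status::Pending represents.
fn service_status(active_state: &str) -> Status {
    match active_state {
        "active" => Status::Ok,
        "activating" | "deactivating" | "reloading" => Status::Pending,
        "failed" => Status::Critical,
        "inactive" => Status::Warning, // treatment of stopped services is an assumption
        _ => Status::Unknown,
    }
}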

Backup Collector (backup.rs)

  • Reads TOML status files from backup systems
  • Archive age verification
  • Disk usage tracking
  • Repository health monitoring
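
A minimal sketch of reading such a status file with the toml crate (the struct fields here are hypothetical; the real file format is defined by the backup system):

use serde::Deserialize;

// Hypothetical shape of a backup status file; real fields may differ.
#[derive(Deserialize)]
struct BackupStatus {
    status: String,        // e.g. "OK"
    duration_seconds: u64, // e.g. 54
    timestamp: String,     // time of the last run
}

fn read_backup_status(path: &str) -> anyhow::Result<BackupStatus> {
    let content = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&content)?)
}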

Email Notifications

Intelligent Batching

The system implements smart notification batching to prevent email spam:

  • Real-time dashboard updates - Status changes appear immediately
  • Batched email notifications - Aggregated every 30 seconds
  • Detailed groupings - Services organized by severity
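
A minimal sketch of the batching idea: status transitions accumulate in memory and are released as a single batch when the 30-second window closes (names are illustrative):

use std::time::{Duration, Instant};

// Accumulates status transitions and releases them as one batch.
struct NotificationBatcher {
    pending: Vec<(String, Status, Status)>, // (service, old, new)
    window: Duration,
    window_start: Instant,
}

impl NotificationBatcher {
    fn new(window: Duration) -> Self {
        Self { pending: Vec::new(), window, window_start: Instant::now() }
    }

    fn record(&mut self, service: String, old: Status, new: Status) {
        self.pending.push((service, old, new));
    }

    // One email per window instead of one email per transition.
    fn flush(&mut self) -> Option<Vec<(String, Status, Status)>> {
        if self.window_start.elapsed() >= self.window && !self.pending.is_empty() {
            self.window_start = Instant::now();
            Some(std::mem::take(&mut self.pending))
        } else {
            None
        }
    }
}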

Example Alert Email

Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Critical

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET

Individual Metrics Architecture

The system follows a metrics-first architecture:

Agent Side

// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]

Dashboard Side

// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}

Persistent Cache

The cache system prevents false notifications:

  • Automatic saving - Saves when service status changes
  • Persistent storage - Maintains state across agent restarts
  • Simple design - No complex TTL or cleanup logic
  • Status preservation - Prevents duplicate notifications
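
A minimal sketch of that persistence, assuming a JSON map from service name to last known status as the cache.json path suggests (the structure is illustrative):

use std::collections::HashMap;
use std::fs;

// Last known status per service, persisted across restarts so a
// restart does not re-trigger "Unknown -> Ok" notifications.
type StatusCache = HashMap<String, Status>;

fn save_cache(path: &str, cache: &StatusCache) -> anyhow::Result<()> {
    fs::write(path, serde_json::to_vec_pretty(cache)?)?;
    Ok(())
}

fn load_cache(path: &str) -> StatusCache {
    fs::read_to_string(path)
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or_default() // missing or corrupt cache: start empty
}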

Development

Project Structure

cm-dashboard/
├── agent/                  # Metrics collection agent
│   ├── src/
│   │   ├── collectors/     # CPU, memory, disk, systemd, backup
│   │   ├── status/         # Status aggregation and notifications
│   │   ├── cache/          # Persistent metric caching
│   │   ├── config/         # TOML configuration loading
│   │   └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/     # CPU, memory, services, backup widgets
│   │   ├── metrics/        # Metric storage and filtering
│   │   └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md              # This file

Building

# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings

Dependencies

  • tokio - Async runtime
  • zmq - Message passing between agent and dashboard
  • ratatui - Terminal user interface
  • serde - Serialization for metrics and config
  • anyhow/thiserror - Error handling
  • tracing - Structured logging
  • lettre - SMTP email notifications
  • clap - Command-line argument parsing
  • toml - Configuration file parsing

NixOS Integration

This project is designed for declarative deployment via NixOS:

Configuration Generation

The NixOS module automatically generates the agent configuration:

# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};

Deployment

# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .

Monitoring Intervals

  • CPU/Memory: 2 seconds (real-time monitoring)
  • Disk usage: 300 seconds (5 minutes)
  • Systemd services: 10 seconds
  • SMART health: 600 seconds (10 minutes)
  • Backup status: 60 seconds (1 minute)
  • Email notifications: 30 seconds (batched)
  • Dashboard updates: 1 second (real-time display)
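
These cadences map naturally onto one tokio interval per collector. A minimal sketch (the collector closure is illustrative):

use std::time::Duration;

// One polling loop per collector, each on its own cadence.
async fn run_collector<F: FnMut()>(interval_seconds: u64, mut collect: F) {
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_seconds));
    loop {
        ticker.tick().await; // the first tick fires immediately
        collect();
    }
}

// Spawned once per collector, e.g.:
//   tokio::spawn(run_collector(2, || { /* CPU and memory */ }));
//   tokio::spawn(run_collector(300, || { /* disk usage */ }));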

License

MIT License - see LICENSE file for details
