
CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

Current Implementation

This is a complete rewrite implementing an individual metrics architecture where:

  • Agent collects individual metrics (e.g., cpu_load_1min, memory_usage_percent) and calculates status
  • Dashboard subscribes to specific metrics and composes widgets
  • Status Aggregation provides intelligent email notifications with batching
  • Persistent Cache prevents false notifications on restart

Dashboard Interface

cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                  Status:  RAM:   Disk:  │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                  active   27M    496MB  │
│RAM:                                ││● docker-registry         active   19M    496MB  │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                   active   579M   2.6GB  │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default    active   11M    2.6GB  │
│Disk nvme0n1:                       ││● haasp-core              active   9M     1MB    │
│● Health: PASSED                    ││● haasp-mqtt              active   3M     1MB    │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid           active   10M    1MB    │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server           active   240M   45.1GB │
│                                    ││● mosquitto               active   1M     1MB    │
│                                    ││● mysql                   active   38M    225MB  │
│                                    ││● nginx                   active   28M    24MB   │
│                                    ││  ├─ ● gitea.cmtec.se     51ms                   │
│                                    ││  ├─ ● haasp.cmtec.se     43ms                   │
│                                    ││  ├─ ● haasp.net          43ms                   │
│                                    ││  ├─ ● pages.cmtec.se     45ms                   │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                   │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                   │
│Latest backup:                      ││  ├─ ● vault.cmtec.se     47ms                   │
│● Status: OK                        ││  ├─ ● www.kryddorten.se  81ms                   │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se  86ms                   │
│Disk usage: 48.2GB/915.8GB          ││● postgresql              active   112M   357MB  │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich            active   8M     45.1GB │
│S/N: S5RRNF0W800639Y                ││● sshd                    active   2M     0      │
│● gitea 2 archives 2.7GB            ││● unifi                   active   594M   495MB  │
│● immich 2 archives 45.0GB          ││● vaultwarden             active   12M    1MB    │
│● kryddorten 2 archives 67.6MB      ││                                                 │
│● mariehall2 2 archives 321.8MB     ││                                                 │
│● nixosbox 2 archives 4.5MB         ││                                                 │
│● unifi 2 archives 2.9MB            ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘

Navigation: ←→ switch hosts, r refresh, q quit

Features

  • Real-time monitoring - Dashboard updates every 1-2 seconds
  • Individual metric collection - Granular data for flexible dashboard composition
  • Intelligent status aggregation - Host-level status calculated from all services
  • Smart email notifications - Batched, detailed alerts with service groupings
  • Persistent state - Prevents false notifications on restarts
  • ZMQ communication - Efficient agent-to-dashboard messaging
  • Clean TUI - Terminal-based dashboard with color-coded status indicators

Architecture

Core Components

  • Agent (cm-dashboard-agent) - Collects metrics and sends via ZMQ
  • Dashboard (cm-dashboard) - Real-time TUI display consuming metrics
  • Shared (cm-dashboard-shared) - Common types and protocol
  • Status Aggregation - Intelligent batching and notification management
  • Persistent Cache - Maintains state across restarts

Status Levels

  • 🟢 Ok - Service running normally
  • 🔵 Pending - Service starting/stopping/reloading
  • 🟡 Warning - Service issues (high load, memory, disk usage)
  • 🔴 Critical - Service failed or critical thresholds exceeded
  • ⚪ Unknown - Service state cannot be determined
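The `[status_aggregation]` configuration below selects `worst_case` aggregation over these levels. A minimal sketch of what that could look like — this is a hypothetical illustration, not the actual `Status` type from `cm-dashboard-shared`:

```rust
// Hypothetical sketch: status levels ordered by severity, with worst-case
// host aggregation. Treating Unknown as most severe is an assumption of
// this sketch; the real type in cm-dashboard-shared may order it differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

/// Worst-case aggregation: the host status is the most severe status
/// reported by any service or metric.
fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Unknown)
}

fn main() {
    let host = aggregate(&[Status::Ok, Status::Warning, Status::Ok]);
    assert_eq!(host, Status::Warning);
    println!("host status: {:?}", host);
}
```

Deriving `Ord` from declaration order keeps the aggregation a one-liner: the worst status is simply the maximum.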

Quick Start

Build

# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace

Run

# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml

Configuration

Agent Configuration (agent.toml)

The agent requires a comprehensive TOML configuration file:

collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx", "postgresql", "redis", "docker", "sshd"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"

Dashboard Configuration (dashboard.toml)

[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"

Collectors

The agent implements several specialized collectors:

CPU Collector (cpu.rs)

  • Load average (1, 5, 15 minute)
  • CPU temperature monitoring
  • Real-time process monitoring (top CPU consumers)
  • Status calculation with configurable thresholds
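The threshold-based status calculation can be sketched against the `load_warning_threshold` / `load_critical_threshold` settings from the example `agent.toml` — a hypothetical simplification, not the actual code in `cpu.rs`:

```rust
// Hypothetical sketch of threshold-based status calculation for the
// 1-minute load average; the real collector in cpu.rs may differ.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

fn load_status(load_1min: f32, warning: f32, critical: f32) -> Status {
    if load_1min >= critical {
        Status::Critical
    } else if load_1min >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    // Thresholds taken from the example configuration above.
    assert_eq!(load_status(0.5, 9.0, 10.0), Status::Ok);
    assert_eq!(load_status(9.5, 9.0, 10.0), Status::Warning);
    assert_eq!(load_status(12.0, 9.0, 10.0), Status::Critical);
}
```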

Memory Collector (memory.rs)

  • RAM usage (total, used, available)
  • Swap monitoring
  • Real-time process monitoring (top RAM consumers)
  • Memory pressure detection

Disk Collector (disk.rs)

  • Filesystem usage per mount point
  • SMART health monitoring
  • Temperature and wear tracking
  • Configurable filesystem monitoring

Systemd Collector (systemd.rs)

  • Service status monitoring (active, inactive, failed)
  • Memory usage per service
  • Service filtering and exclusions
  • Handles transitional states (Status::Pending)

Backup Collector (backup.rs)

  • Reads TOML status files from backup systems
  • Archive age verification
  • Disk usage tracking
  • Repository health monitoring

Email Notifications

Intelligent Batching

The system implements smart notification batching to prevent email spam:

  • Real-time dashboard updates - Status changes appear immediately
  • Batched email notifications - Aggregated every 30 seconds
  • Detailed groupings - Services organized by severity
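The batching step can be sketched as counting status transitions accumulated over the 30-second window and folding them into a subject line like the example email below. This is a hypothetical illustration; the real notifier in `agent/src/notifications` may differ:

```rust
// Hypothetical sketch: summarize a 30-second batch of status transitions
// into an alert subject line.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transition {
    Critical,
    Warning,
    Recovery,
    Startup,
}

fn subject(batch: &[Transition]) -> String {
    let count = |t: Transition| batch.iter().filter(|&&x| x == t).count();
    format!(
        "Status Alert: {} critical, {} warning, {} started",
        count(Transition::Critical),
        count(Transition::Warning),
        count(Transition::Startup)
    )
}

fn main() {
    use Transition::*;
    let batch = [Critical, Critical, Warning, Recovery, Startup];
    println!("{}", subject(&batch));
    assert_eq!(subject(&batch), "Status Alert: 2 critical, 1 warning, 1 started");
}
```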

Example Alert Email

Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET

Individual Metrics Architecture

The system follows a metrics-first architecture:

Agent Side

// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]

Dashboard Side

// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
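The `as_f32()` conversion used above implies a `MetricValue` enum along these lines — a hypothetical sketch; the real type in `cm-dashboard-shared` may carry more variants and handle the non-numeric case differently:

```rust
// Hypothetical sketch of the MetricValue type used in the snippets above.
#[derive(Debug, Clone)]
enum MetricValue {
    Float(f64),
    Integer(i64),
    String(String),
}

impl MetricValue {
    /// Convert numeric variants to f32 for display; non-numeric values
    /// fall back to 0.0 in this sketch.
    fn as_f32(&self) -> f32 {
        match self {
            MetricValue::Float(f) => *f as f32,
            MetricValue::Integer(i) => *i as f32,
            MetricValue::String(_) => 0.0,
        }
    }
}

fn main() {
    assert_eq!(MetricValue::Float(2.5).as_f32(), 2.5);
    assert_eq!(MetricValue::Integer(78).as_f32(), 78.0);
    assert_eq!(MetricValue::String("active".to_string()).as_f32(), 0.0);
}
```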

Persistent Cache

The cache system prevents false notifications:

  • Automatic saving - Saves when service status changes
  • Persistent storage - Maintains state across agent restarts
  • Simple design - No complex TTL or cleanup logic
  • Status preservation - Prevents duplicate notifications
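The persist-on-change idea can be sketched with a simple save/load round trip. This std-only illustration uses a line-based `key=value` format for self-containment; the real cache serializes to JSON (via serde) at the configured `persist_path`:

```rust
// Minimal std-only sketch of the persistent status cache; the real
// implementation writes JSON to /var/lib/cm-dashboard/cache.json.
use std::collections::BTreeMap;
use std::fs;
use std::io;

fn save(path: &str, statuses: &BTreeMap<String, String>) -> io::Result<()> {
    let body: String = statuses
        .iter()
        .map(|(k, v)| format!("{}={}\n", k, v))
        .collect();
    fs::write(path, body)
}

fn load(path: &str) -> BTreeMap<String, String> {
    fs::read_to_string(path)
        .unwrap_or_default()
        .lines()
        .filter_map(|l| l.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() -> io::Result<()> {
    let mut cache = BTreeMap::new();
    cache.insert("nginx".to_string(), "Ok".to_string());
    let path = std::env::temp_dir().join("cm-cache-demo.txt");
    let path = path.to_str().unwrap().to_string();
    save(&path, &cache)?;
    // After a restart, the agent reloads the last known statuses and
    // skips notifications for states that have not actually changed.
    assert_eq!(load(&path), cache);
    Ok(())
}
```

Because `load` returns an empty map when the file is missing, a first run simply treats every service as new.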

Development

Project Structure

cm-dashboard/
├── agent/                  # Metrics collection agent
│   ├── src/
│   │   ├── collectors/     # CPU, memory, disk, systemd, backup
│   │   ├── status/         # Status aggregation and notifications
│   │   ├── cache/          # Persistent metric caching
│   │   ├── config/         # TOML configuration loading
│   │   └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/     # CPU, memory, services, backup widgets
│   │   ├── metrics/        # Metric storage and filtering
│   │   └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md              # This file

Building

# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings

Dependencies

  • tokio - Async runtime
  • zmq - Message passing between agent and dashboard
  • ratatui - Terminal user interface
  • serde - Serialization for metrics and config
  • anyhow/thiserror - Error handling
  • tracing - Structured logging
  • lettre - SMTP email notifications
  • clap - Command-line argument parsing
  • toml - Configuration file parsing

NixOS Integration

This project is designed for declarative deployment via NixOS:

Configuration Generation

The NixOS module automatically generates the agent configuration:

# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};

Deployment

# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .

Monitoring Intervals

  • CPU/Memory: 2 seconds (real-time monitoring)
  • Disk usage: 300 seconds (5 minutes)
  • Systemd services: 10 seconds
  • SMART health: 600 seconds (10 minutes)
  • Backup status: 60 seconds (1 minute)
  • Email notifications: 30 seconds (batched)
  • Dashboard updates: 1 second (real-time display)

License

MIT License - see LICENSE file for details
