
CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

Current Implementation

This is a complete rewrite implementing an individual metrics architecture where:

  • Agent collects individual metrics (e.g., cpu_load_1min, memory_usage_percent) and calculates status
  • Dashboard subscribes to specific metrics and composes widgets
  • Status Aggregation provides intelligent email notifications with batching
  • Persistent Cache prevents false notifications on restart

Dashboard Interface

cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                  Status:  RAM:   Disk:  │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                  active   27M    496MB  │
│RAM:                                ││● docker-registry         active   19M    496MB  │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                   active   579M   2.6GB  │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default    active   11M    2.6GB  │
│Disk nvme0n1:                       ││● haasp-core              active   9M     1MB    │
│● Health: PASSED                    ││● haasp-mqtt              active   3M     1MB    │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid           active   10M    1MB    │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server           active   240M   45.1GB │
│                                    ││● mosquitto               active   1M     1MB    │
│                                    ││● mysql                   active   38M    225MB  │
│                                    ││● nginx                   active   28M    24MB   │
│                                    ││  ├─ ● gitea.cmtec.se     51ms                   │
│                                    ││  ├─ ● haasp.cmtec.se     43ms                   │
│                                    ││  ├─ ● haasp.net          43ms                   │
│                                    ││  ├─ ● pages.cmtec.se     45ms                   │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                   │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                   │
│Latest backup:                      ││  ├─ ● vault.cmtec.se     47ms                   │
│● Status: OK                        ││  ├─ ● www.kryddorten.se  81ms                   │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se  86ms                   │
│Disk usage: 48.2GB/915.8GB          ││● postgresql              active   112M   357MB  │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich            active   8M     45.1GB │
│S/N: S5RRNF0W800639Y                ││● sshd                    active   2M     0      │
│● gitea 2 archives 2.7GB            ││● unifi                   active   594M   495MB  │
│● immich 2 archives 45.0GB          ││● vaultwarden             active   12M    1MB    │
│● kryddorten 2 archives 67.6MB      ││                                                 │
│● mariehall2 2 archives 321.8MB     ││                                                 │
│● nixosbox 2 archives 4.5MB         ││                                                 │
│● unifi 2 archives 2.9MB            ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘

Navigation: ←→ switch hosts, r refresh, q quit

Features

  • Real-time monitoring - Dashboard updates every 1-2 seconds
  • Individual metric collection - Granular data for flexible dashboard composition
  • Intelligent status aggregation - Host-level status calculated from all services
  • Smart email notifications - Batched, detailed alerts with service groupings
  • Persistent state - Prevents false notifications on restarts
  • ZMQ communication - Efficient agent-to-dashboard messaging
  • Clean TUI - Terminal-based dashboard with color-coded status indicators

Architecture

Core Components

  • Agent (cm-dashboard-agent) - Collects metrics and sends via ZMQ
  • Dashboard (cm-dashboard) - Real-time TUI display consuming metrics
  • Shared (cm-dashboard-shared) - Common types and protocol
  • Status Aggregation - Intelligent batching and notification management
  • Persistent Cache - Maintains state across restarts

Status Levels

  • 🟢 Ok - Service running normally
  • 🔵 Pending - Service starting/stopping/reloading
  • 🟡 Warning - Service issues (high load, memory, disk usage)
  • 🔴 Critical - Service failed or critical thresholds exceeded
  • ⚪ Unknown - Service state cannot be determined
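The `[status_aggregation]` configuration below selects `worst_case` aggregation over these levels. A minimal sketch of what that could look like — this is a hypothetical illustration, not the actual `Status` type from `cm-dashboard-shared`:

```rust
// Hypothetical sketch: status levels ordered by severity, with worst-case
// host aggregation. Treating Unknown as most severe is an assumption of
// this sketch; the real type in cm-dashboard-shared may order it differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

/// Worst-case aggregation: the host status is the most severe status
/// reported by any service or metric.
fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Unknown)
}

fn main() {
    let host = aggregate(&[Status::Ok, Status::Warning, Status::Ok]);
    assert_eq!(host, Status::Warning);
    println!("host status: {:?}", host);
}
```

Deriving `Ord` from declaration order keeps the aggregation a one-liner: the worst status is simply the maximum.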

Quick Start

Build

# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace

Run

# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml

Configuration

Agent Configuration (agent.toml)

The agent requires a comprehensive TOML configuration file:

collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx", "postgresql", "redis", "docker", "sshd"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"

Dashboard Configuration (dashboard.toml)

[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"

Collectors

The agent implements several specialized collectors:

CPU Collector (cpu.rs)

  • Load average (1, 5, 15 minute)
  • CPU temperature monitoring
  • Real-time process monitoring (top CPU consumers)
  • Status calculation with configurable thresholds
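The threshold-based status calculation can be sketched against the `load_warning_threshold` / `load_critical_threshold` settings from the example `agent.toml` — a hypothetical simplification, not the actual code in `cpu.rs`:

```rust
// Hypothetical sketch of threshold-based status calculation for the
// 1-minute load average; the real collector in cpu.rs may differ.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

fn load_status(load_1min: f32, warning: f32, critical: f32) -> Status {
    if load_1min >= critical {
        Status::Critical
    } else if load_1min >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

fn main() {
    // Thresholds taken from the example configuration above.
    assert_eq!(load_status(0.5, 9.0, 10.0), Status::Ok);
    assert_eq!(load_status(9.5, 9.0, 10.0), Status::Warning);
    assert_eq!(load_status(12.0, 9.0, 10.0), Status::Critical);
}
```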

Memory Collector (memory.rs)

  • RAM usage (total, used, available)
  • Swap monitoring
  • Real-time process monitoring (top RAM consumers)
  • Memory pressure detection

Disk Collector (disk.rs)

  • Filesystem usage per mount point
  • SMART health monitoring
  • Temperature and wear tracking
  • Configurable filesystem monitoring

Systemd Collector (systemd.rs)

  • Service status monitoring (active, inactive, failed)
  • Memory usage per service
  • Service filtering and exclusions
  • Handles transitional states (Status::Pending)

Backup Collector (backup.rs)

  • Reads TOML status files from backup systems
  • Archive age verification
  • Disk usage tracking
  • Repository health monitoring

Email Notifications

Intelligent Batching

The system implements smart notification batching to prevent email spam:

  • Real-time dashboard updates - Status changes appear immediately
  • Batched email notifications - Aggregated every 30 seconds
  • Detailed groupings - Services organized by severity
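The batching step can be sketched as counting status transitions accumulated over the 30-second window and folding them into a subject line like the example email below. This is a hypothetical illustration; the real notifier in `agent/src/notifications` may differ:

```rust
// Hypothetical sketch: summarize a 30-second batch of status transitions
// into an alert subject line.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Transition {
    Critical,
    Warning,
    Recovery,
    Startup,
}

fn subject(batch: &[Transition]) -> String {
    let count = |t: Transition| batch.iter().filter(|&&x| x == t).count();
    format!(
        "Status Alert: {} critical, {} warning, {} started",
        count(Transition::Critical),
        count(Transition::Warning),
        count(Transition::Startup)
    )
}

fn main() {
    use Transition::*;
    let batch = [Critical, Critical, Warning, Recovery, Startup];
    println!("{}", subject(&batch));
    assert_eq!(subject(&batch), "Status Alert: 2 critical, 1 warning, 1 started");
}
```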

Example Alert Email

Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET

Individual Metrics Architecture

The system follows a metrics-first architecture:

Agent Side

// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]

Dashboard Side

// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
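The `as_f32()` conversion used above implies a `MetricValue` enum along these lines — a hypothetical sketch; the real type in `cm-dashboard-shared` may carry more variants and handle the non-numeric case differently:

```rust
// Hypothetical sketch of the MetricValue type used in the snippets above.
#[derive(Debug, Clone)]
enum MetricValue {
    Float(f64),
    Integer(i64),
    String(String),
}

impl MetricValue {
    /// Convert numeric variants to f32 for display; non-numeric values
    /// fall back to 0.0 in this sketch.
    fn as_f32(&self) -> f32 {
        match self {
            MetricValue::Float(f) => *f as f32,
            MetricValue::Integer(i) => *i as f32,
            MetricValue::String(_) => 0.0,
        }
    }
}

fn main() {
    assert_eq!(MetricValue::Float(2.5).as_f32(), 2.5);
    assert_eq!(MetricValue::Integer(78).as_f32(), 78.0);
    assert_eq!(MetricValue::String("active".to_string()).as_f32(), 0.0);
}
```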

Persistent Cache

The cache system prevents false notifications:

  • Automatic saving - Saves when service status changes
  • Persistent storage - Maintains state across agent restarts
  • Simple design - No complex TTL or cleanup logic
  • Status preservation - Prevents duplicate notifications
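The persist-on-change idea can be sketched with a simple save/load round trip. This std-only illustration uses a line-based `key=value` format for self-containment; the real cache serializes to JSON (via serde) at the configured `persist_path`:

```rust
// Minimal std-only sketch of the persistent status cache; the real
// implementation writes JSON to /var/lib/cm-dashboard/cache.json.
use std::collections::BTreeMap;
use std::fs;
use std::io;

fn save(path: &str, statuses: &BTreeMap<String, String>) -> io::Result<()> {
    let body: String = statuses
        .iter()
        .map(|(k, v)| format!("{}={}\n", k, v))
        .collect();
    fs::write(path, body)
}

fn load(path: &str) -> BTreeMap<String, String> {
    fs::read_to_string(path)
        .unwrap_or_default()
        .lines()
        .filter_map(|l| l.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() -> io::Result<()> {
    let mut cache = BTreeMap::new();
    cache.insert("nginx".to_string(), "Ok".to_string());
    let path = std::env::temp_dir().join("cm-cache-demo.txt");
    let path = path.to_str().unwrap().to_string();
    save(&path, &cache)?;
    // After a restart, the agent reloads the last known statuses and
    // skips notifications for states that have not actually changed.
    assert_eq!(load(&path), cache);
    Ok(())
}
```

Because `load` returns an empty map when the file is missing, a first run simply treats every service as new.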

Development

Project Structure

cm-dashboard/
├── agent/                  # Metrics collection agent
│   ├── src/
│   │   ├── collectors/     # CPU, memory, disk, systemd, backup
│   │   ├── status/         # Status aggregation and notifications
│   │   ├── cache/          # Persistent metric caching
│   │   ├── config/         # TOML configuration loading
│   │   └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/     # CPU, memory, services, backup widgets
│   │   ├── metrics/        # Metric storage and filtering
│   │   └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md              # This file

Building

# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings

Dependencies

  • tokio - Async runtime
  • zmq - Message passing between agent and dashboard
  • ratatui - Terminal user interface
  • serde - Serialization for metrics and config
  • anyhow/thiserror - Error handling
  • tracing - Structured logging
  • lettre - SMTP email notifications
  • clap - Command-line argument parsing
  • toml - Configuration file parsing

NixOS Integration

This project is designed for declarative deployment via NixOS:

Configuration Generation

The NixOS module automatically generates the agent configuration:

# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};

Deployment

# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .

Monitoring Intervals

  • CPU/Memory: 2 seconds (real-time monitoring)
  • Disk usage: 300 seconds (5 minutes)
  • Systemd services: 10 seconds
  • SMART health: 600 seconds (10 minutes)
  • Backup status: 60 seconds (1 minute)
  • Email notifications: 30 seconds (batched)
  • Dashboard updates: 1 second (real-time display)

License

MIT License - see LICENSE file for details
