# CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

## Current Implementation

This is a complete rewrite implementing an **individual metrics architecture** where:

- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart

## Dashboard Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                  Status:  RAM:   Disk:  │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                  active   27M    496MB  │
│RAM:                                ││● docker-registry         active   19M    496MB  │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                   active   579M   2.6GB  │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default    active   11M    2.6GB  │
│Disk nvme0n1:                       ││● haasp-core              active   9M     1MB    │
│● Health: PASSED                    ││● haasp-mqtt              active   3M     1MB    │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid           active   10M    1MB    │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server           active   240M   45.1GB │
│                                    ││● mosquitto               active   1M     1MB    │
│                                    ││● mysql                   active   38M    225MB  │
│                                    ││● nginx                   active   28M    24MB   │
│                                    ││  ├─ ● gitea.cmtec.se     51ms                   │
│                                    ││  ├─ ● haasp.cmtec.se     43ms                   │
│                                    ││  ├─ ● haasp.net          43ms                   │
│                                    ││  ├─ ● pages.cmtec.se     45ms                   │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                   │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                   │
│Latest backup:                      ││  ├─ ● vault.cmtec.se     47ms                   │
│● Status: OK                        ││  ├─ ● www.kryddorten.se  81ms                   │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se  86ms                   │
│Disk usage: 48.2GB/915.8GB          ││● postgresql              active   112M   357MB  │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich            active   8M     45.1GB │
│S/N: S5RRNF0W800639Y                ││● sshd                    active   2M     0      │
│● gitea 2 archives 2.7GB            ││● unifi                   active   594M   495MB  │
│● immich 2 archives 45.0GB          ││● vaultwarden             active   12M    1MB    │
│● kryddorten 2 archives 67.6MB      ││                                                 │
│● mariehall2 2 archives 321.8MB     ││                                                 │
│● nixosbox 2 archives 4.5MB         ││                                                 │
│● unifi 2 archives 2.9MB            ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

**Navigation**: `←→` switch hosts, `r` refresh, `q` quit

## Features

- **Real-time monitoring** - CPU and memory metrics are collected every 2 seconds and the dashboard redraws every second
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
- **ZMQ communication** - Efficient agent-to-dashboard messaging
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators

## Architecture

### Core Components

- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts

### Status Levels

- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined
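
As a rough sketch of how these levels might combine under the `worst_case` aggregation method named in the agent configuration, the enum below ranks each level and takes the maximum. The `severity` ranking, the position of `Unknown` in it, and the `aggregate_worst_case` helper are assumptions made for illustration, not the project's actual code.

```rust
/// Status levels as listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

impl Status {
    /// Hypothetical severity rank; higher is worse.
    /// Ranking Unknown between Pending and Warning is an assumption.
    fn severity(self) -> u8 {
        match self {
            Status::Ok => 0,
            Status::Pending => 1,
            Status::Unknown => 2,
            Status::Warning => 3,
            Status::Critical => 4,
        }
    }
}

/// Worst-case aggregation: the host takes the most severe service status.
/// An empty input yields Unknown because nothing has been observed yet.
fn aggregate_worst_case(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .max_by_key(|s| s.severity())
        .unwrap_or(Status::Unknown)
}
```

With this ranking, a host running one `Critical` service and any number of `Ok` services is reported as `Critical`.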

## Quick Start

### Build

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace
```

### Run

```bash
# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml
```

## Configuration

### Agent Configuration (`agent.toml`)

The agent requires a comprehensive TOML configuration file:

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx*", "postgresql*", "redis*", "docker*", "sshd*",
  "gitea*", "immich*", "haasp*", "mosquitto*", "mysql*",
  "unifi*", "vaultwarden*"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen", "systemd-",
  "getty@", "user@", "dbus-", "NetworkManager-"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"
```

### Dashboard Configuration (`dashboard.toml`)

```toml
[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"
```
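
To illustrate how a `hosts` entry could translate into a ZMQ connection target, the sketch below builds the standard `tcp://<address>:<port>` endpoint string. The `HostConfig` struct and the `endpoint` and `endpoints` helpers are hypothetical names that mirror the TOML fields; the dashboard's real types may differ.

```rust
/// Mirrors one entry of `[zmq].hosts` in `dashboard.toml` (hypothetical type).
struct HostConfig {
    name: String,
    address: String,
    port: u16,
}

impl HostConfig {
    /// ZMQ TCP endpoints take the form `tcp://<address>:<port>`.
    fn endpoint(&self) -> String {
        format!("tcp://{}:{}", self.address, self.port)
    }
}

/// Pairs each host name with the endpoint a subscriber would connect to.
fn endpoints(hosts: &[HostConfig]) -> Vec<(String, String)> {
    hosts
        .iter()
        .map(|h| (h.name.clone(), h.endpoint()))
        .collect()
}
```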

## Collectors

The agent implements several specialized collectors:

### CPU Collector (`cpu.rs`)

- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds
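
Threshold-based status calculation can be sketched as a simple comparison against the warning and critical values from `agent.toml`. The `Thresholds` struct and `status_for` function are illustrative names, and treating a value equal to a threshold as exceeding it is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Warning and critical limits, e.g. `load_warning_threshold` and
/// `load_critical_threshold` from the `[collectors.cpu]` section.
struct Thresholds {
    warning: f32,
    critical: f32,
}

/// Maps a measured value to a status.
/// Assumption: comparisons are inclusive (value >= threshold).
fn status_for(value: f32, thresholds: &Thresholds) -> Status {
    if value >= thresholds.critical {
        Status::Critical
    } else if value >= thresholds.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With the example thresholds of 9.0 and 10.0, a 1-minute load of 0.10 maps to `Ok` and a load of 9.5 maps to `Warning`.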

### Memory Collector (`memory.rs`)

- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection

### Disk Collector (`disk.rs`)

- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring

### Systemd Collector (`systemd.rs`)

- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)
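
One plausible reading of `service_name_filters` and `excluded_services` is shown below: a trailing `*` in a filter means "starts with", and each exclusion is treated as a name prefix. Both rules and the helper names `matches_pattern` and `is_monitored` are assumptions; the agent's actual matching logic is not documented here.

```rust
/// Matches `name` against a pattern where a trailing `*` means "starts with".
/// Patterns without `*` must match the name exactly.
fn matches_pattern(name: &str, pattern: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => name == pattern,
    }
}

/// A service is monitored when some filter matches it and no exclusion does.
/// Assumption: exclusions are prefixes, so "systemd-" excludes "systemd-logind".
fn is_monitored(name: &str, filters: &[&str], excluded: &[&str]) -> bool {
    let included = filters.iter().any(|p| matches_pattern(name, p));
    let is_excluded = excluded.iter().any(|e| name.starts_with(*e));
    included && !is_excluded
}
```

Under these rules `nginx` is monitored, while `nginx-config-reload` matches the `nginx*` filter but is dropped by the exclusion list.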

### Backup Collector (`backup.rs`)

- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
- Repository health monitoring

## Email Notifications

### Intelligent Batching

The system implements smart notification batching to prevent email spam:

- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity
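
The batching behaviour described above could be approximated by a small accumulator that collects status changes and releases them once the interval has elapsed. This is a sketch only: `Transition`, `NotificationBatch`, and `flush_if_due` are hypothetical names, and the real agent may organise this differently.

```rust
use std::time::{Duration, Instant};

/// One status change waiting to be reported (statuses kept as plain text here).
struct Transition {
    service: String,
    from: String,
    to: String,
}

/// Collects transitions and hands them out at most once per interval.
struct NotificationBatch {
    interval: Duration,
    window_start: Instant,
    pending: Vec<Transition>,
}

impl NotificationBatch {
    fn new(interval: Duration, now: Instant) -> Self {
        NotificationBatch { interval, window_start: now, pending: Vec::new() }
    }

    /// Queues a change; nothing is sent yet.
    fn record(&mut self, transition: Transition) {
        self.pending.push(transition);
    }

    /// Returns the queued changes once the interval has elapsed.
    /// Returns None while the window is still open or when nothing is queued.
    fn flush_if_due(&mut self, now: Instant) -> Option<Vec<Transition>> {
        if now.duration_since(self.window_start) < self.interval {
            return None;
        }
        self.window_start = now;
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}
```

Passing `now` explicitly keeps the logic deterministic, so the 30-second window can be exercised without sleeping.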

### Example Alert Email

```
Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Critical

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CEST
```
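
The grouping in the example email can be sketched as a classification of each transition followed by a count per group. The `Group` enum, the `classify` rules (for instance, treating `Unknown → Ok` as a startup), and the `subject` helper are assumptions inferred from the sample alert, not the agent's confirmed logic.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

/// Sections of the alert email (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Group {
    Critical,
    Warning,
    Recovery,
    Startup,
    Other,
}

/// Assigns a transition to an email section based on where it ends up.
fn classify(from: Status, to: Status) -> Group {
    match (from, to) {
        (_, Status::Critical) => Group::Critical,
        (_, Status::Warning) => Group::Warning,
        (Status::Unknown, Status::Ok) => Group::Startup,
        (Status::Warning | Status::Critical, Status::Ok) => Group::Recovery,
        _ => Group::Other,
    }
}

/// Builds a subject line listing every non-empty group with its count.
fn subject(transitions: &[(Status, Status)]) -> String {
    let count = |g: Group| {
        transitions
            .iter()
            .filter(|(from, to)| classify(*from, *to) == g)
            .count()
    };
    let mut parts = Vec::new();
    for (group, label) in [
        (Group::Critical, "critical"),
        (Group::Warning, "warning"),
        (Group::Recovery, "recovered"),
        (Group::Startup, "started"),
    ] {
        let n = count(group);
        if n > 0 {
            parts.push(format!("{} {}", n, label));
        }
    }
    format!("Status Alert: {}", parts.join(", "))
}
```

Groups with a zero count are left out of the subject, which matches the sample alert where no recoveries are mentioned in the subject line.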

## Individual Metrics Architecture

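The system follows a **metrics-first architecture**: the agent publishes each measurement as a separate named metric with its own status, and each dashboard widget picks out the metrics it needs.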

### Agent Side

```rust
// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]
```

### Dashboard Side

```rust
// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
```
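
For readers who want to try the two snippets together, here is a self-contained approximation of the types they rely on. The definitions of `Metric`, `MetricValue`, and `Status` below are minimal stand-ins; the real ones live in `cm-dashboard-shared` and may differ, in particular in what `as_f32` returns (an `Option<f32>` is assumed here).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Warning,
}

#[derive(Debug, Clone, PartialEq)]
enum MetricValue {
    Float(f32),
    String(String),
}

impl MetricValue {
    /// Returns the numeric value, or None for non-numeric metrics (assumed signature).
    fn as_f32(&self) -> Option<f32> {
        match self {
            MetricValue::Float(v) => Some(*v),
            MetricValue::String(_) => None,
        }
    }
}

struct Metric {
    name: String,
    value: MetricValue,
    status: Status,
}

impl Metric {
    fn new(name: String, value: MetricValue, status: Status) -> Self {
        Metric { name, value, status }
    }
}

/// Stand-in widget holding only the values it subscribes to.
#[derive(Default)]
struct CpuWidget {
    load_1min: Option<f32>,
    load_5min: Option<f32>,
    temperature: Option<f32>,
}

impl CpuWidget {
    /// Picks out the CPU metrics by name and ignores everything else.
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
```

Because widgets match on metric names, adding a new metric on the agent side does not affect widgets that do not subscribe to it.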

## Persistent Cache

The agent persists the last known status of each service, so a restart does not re-announce states that were already reported:

- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications
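
The idea can be sketched as a map of last-seen statuses that is restored on startup and consulted before any notification is queued. `StatusCache`, `restore`, and `observe` are hypothetical names, and reading and writing the file at `persist_path` is left out to keep the example dependency-free.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Last known status per service (in-memory stand-in for the persisted cache).
#[derive(Default)]
struct StatusCache {
    last_seen: HashMap<String, Status>,
}

impl StatusCache {
    /// Rebuilds the cache from entries loaded at startup.
    fn restore(entries: Vec<(String, Status)>) -> Self {
        StatusCache { last_seen: entries.into_iter().collect() }
    }

    /// Records the current status and reports whether it differs from the
    /// previous one. A true result is the point where the caller would save
    /// the cache and queue a notification.
    fn observe(&mut self, service: &str, current: Status) -> bool {
        let previous = self.last_seen.insert(service.to_string(), current);
        previous != Some(current)
    }
}
```

With a restored cache, the first observation after a restart compares against the stored status rather than an empty state, so an unchanged service produces no notification.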

## Development

### Project Structure

```
cm-dashboard/
├── agent/                  # Metrics collection agent
│   └── src/
│       ├── collectors/     # CPU, memory, disk, systemd, backup
│       ├── status/         # Status aggregation and notifications
│       ├── cache/          # Persistent metric caching
│       ├── config/         # TOML configuration loading
│       └── notifications/  # Email notification system
├── dashboard/              # TUI dashboard application
│   └── src/
│       ├── ui/widgets/     # CPU, memory, services, backup widgets
│       ├── metrics/        # Metric storage and filtering
│       └── communication/  # ZMQ metric consumption
├── shared/                 # Shared types and utilities
│   └── src/
│       ├── metrics.rs      # Metric, Status, and Value types
│       ├── protocol.rs     # ZMQ message format
│       └── cache.rs        # Cache configuration
└── README.md               # This file
```

### Building

```bash
# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings
```

### Dependencies

- **tokio** - Async runtime
- **zmq** - Message passing between agent and dashboard
- **ratatui** - Terminal user interface
- **serde** - Serialization for metrics and config
- **anyhow/thiserror** - Error handling
- **tracing** - Structured logging
- **lettre** - SMTP email notifications
- **clap** - Command-line argument parsing
- **toml** - Configuration file parsing

## NixOS Integration

This project is designed for declarative deployment via NixOS.

### Configuration Generation

The NixOS module automatically generates the agent configuration:

```nix
# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};
```

### Deployment

```bash
# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .
```

## Monitoring Intervals

- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)
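
A loop that honours different per-collector intervals might look like the sketch below, where each collector runs only when its own interval has elapsed. `Schedule` and `run_due` are illustrative names, and the agent's real scheduling (for example via tokio timers) may work differently.

```rust
use std::time::{Duration, Instant};

/// Tracks when a collector last ran and how often it should run.
struct Schedule {
    name: &'static str,
    interval: Duration,
    last_run: Option<Instant>,
}

impl Schedule {
    fn new(name: &'static str, interval_seconds: u64) -> Self {
        Schedule {
            name,
            interval: Duration::from_secs(interval_seconds),
            last_run: None,
        }
    }

    /// A collector is due if it has never run or its interval has elapsed.
    fn is_due(&self, now: Instant) -> bool {
        match self.last_run {
            None => true,
            Some(last) => now.duration_since(last) >= self.interval,
        }
    }
}

/// Marks every due collector as run and returns their names in order.
fn run_due(schedules: &mut [Schedule], now: Instant) -> Vec<&'static str> {
    let mut ran = Vec::new();
    for schedule in schedules.iter_mut() {
        if schedule.is_due(now) {
            schedule.last_run = Some(now);
            ran.push(schedule.name);
        }
    }
    ran
}
```

On the first tick every collector runs; afterwards the CPU collector fires every 2 seconds while the disk collector waits the full 300 seconds.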

## License

MIT License - see LICENSE file for details