# CM Dashboard

A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.

## Current Implementation

This is a complete rewrite implementing an **individual metrics architecture** where:

- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart

## Dashboard Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                 Status:  RAM:   Disk:   │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                 active   27M    496MB   │
│RAM:                                ││● docker-registry        active   19M    496MB   │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                  active   579M   2.6GB   │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default   active   11M    2.6GB   │
│Disk nvme0n1:                       ││● haasp-core             active   9M     1MB     │
│● Health: PASSED                    ││● haasp-mqtt             active   3M     1MB     │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid          active   10M    1MB     │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server          active   240M   45.1GB  │
│                                    ││● mosquitto              active   1M     1MB     │
│                                    ││● mysql                  active   38M    225MB   │
│                                    ││● nginx                  active   28M    24MB    │
│                                    ││  ├─ ● gitea.cmtec.se             51ms           │
│                                    ││  ├─ ● haasp.cmtec.se             43ms           │
│                                    ││  ├─ ● haasp.net                  43ms           │
│                                    ││  ├─ ● pages.cmtec.se             45ms           │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se            41ms           │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se             46ms           │
│Latest backup:                      ││  ├─ ● vault.cmtec.se             47ms           │
│● Status: OK                        ││  ├─ ● www.kryddorten.se          81ms           │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se          86ms           │
│Disk usage: 48.2GB/915.8GB          ││● postgresql             active   112M   357MB   │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich           active   8M     45.1GB  │
│S/N: S5RRNF0W800639Y                ││● sshd                   active   2M     0       │
│● gitea       2 archives 2.7GB      ││● unifi                  active   594M   495MB   │
│● immich      2 archives 45.0GB     ││● vaultwarden            active   12M    1MB     │
│● kryddorten  2 archives 67.6MB     ││                                                 │
│● mariehall2  2 archives 321.8MB    ││                                                 │
│● nixosbox    2 archives 4.5MB      ││                                                 │
│● unifi       2 archives 2.9MB      ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

**Navigation**: `←→` switch hosts, `r` refresh, `q` quit

## Features

- **Real-time monitoring** - Dashboard updates every 1-2 seconds
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
- **ZMQ communication** - Efficient agent-to-dashboard messaging
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators

## Architecture

### Core Components

- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts

### Status Levels

- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined

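The status levels above feed the host-level status, and the agent configuration names a `worst_case` aggregation method. A minimal sketch of what that could look like follows. The `severity` ranking (in particular where `Pending` and `Unknown` sit) and the `aggregate_worst_case` function are assumptions made for illustration, not taken from the project source.

```rust
/// Status levels from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

impl Status {
    /// Hypothetical severity rank: a higher number means a worse state.
    /// The relative order of Pending and Unknown is an assumption.
    fn severity(self) -> u8 {
        match self {
            Status::Ok => 0,
            Status::Pending => 1,
            Status::Unknown => 2,
            Status::Warning => 3,
            Status::Critical => 4,
        }
    }
}

/// Worst-case aggregation: the host takes the most severe service status.
/// An empty input yields `Status::Unknown` (also an assumption).
pub fn aggregate_worst_case(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .max_by_key(|s| s.severity())
        .unwrap_or(Status::Unknown)
}
```

With this ranking, a host running one `Warning` service and any number of `Ok` services reports `Warning`, and a single `Critical` service is enough to make the whole host `Critical`.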
## Quick Start

### Build

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config # Ubuntu/Debian
cargo build --workspace
```

### Run

```bash
# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml
```

## Configuration

### Agent Configuration (`agent.toml`)

The agent requires a comprehensive TOML configuration file:

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true

[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
  "nginx*", "postgresql*", "redis*", "docker*", "sshd*",
  "gitea*", "immich*", "haasp*", "mosquitto*", "mysql*",
  "unifi*", "vaultwarden*"
]
excluded_services = [
  "nginx-config-reload", "sshd-keygen", "systemd-",
  "getty@", "user@", "dbus-", "NetworkManager-"
]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true

[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30

[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"
```

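To make the interplay of `service_name_filters` and `excluded_services` concrete, here is a small sketch of one plausible matching rule. It assumes that a trailing `*` acts as a prefix wildcard and that each excluded entry is matched as a name prefix; the agent's actual matching logic may differ, and the function names `matches_pattern` and `is_monitored` are hypothetical.

```rust
/// Returns true if `name` matches `pattern`.
/// Assumed semantics: a trailing `*` is a prefix wildcard,
/// any other pattern must match the name exactly.
pub fn matches_pattern(name: &str, pattern: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => name == pattern,
    }
}

/// Decides whether a service should be monitored: it must match at least
/// one include filter and must not start with any excluded entry.
pub fn is_monitored(name: &str, filters: &[&str], excluded: &[&str]) -> bool {
    let included = filters.iter().any(|&p| matches_pattern(name, p));
    let is_excluded = excluded.iter().any(|&e| name.starts_with(e));
    included && !is_excluded
}
```

Under these assumptions `nginx` passes the `"nginx*"` filter, while `nginx-config-reload` also matches the filter but is dropped by the exclusion list.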
### Dashboard Configuration (`dashboard.toml`)

```toml
[zmq]
hosts = [
  { name = "server1", address = "192.168.1.100", port = 6130 },
  { name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000

[ui]
refresh_interval_ms = 1000
theme = "dark"
```

## Collectors

The agent implements several specialized collectors:

### CPU Collector (`cpu.rs`)

- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds

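The threshold-based status calculation listed for the CPU collector can be sketched as a simple comparison against the warning and critical values from `agent.toml`. This is an illustration only: the function name `status_from_thresholds` is hypothetical, and whether the real collector uses `>=` or `>` at the boundary is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Maps a measured value to a status using two thresholds.
/// Values at or above a threshold trigger it (assumed boundary behaviour).
pub fn status_from_thresholds(value: f32, warning: f32, critical: f32) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With the example configuration (`load_warning_threshold = 9.0`, `load_critical_threshold = 10.0`), a 1-minute load of 0.88 maps to `Ok` and a load of 9.5 maps to `Warning`.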
### Memory Collector (`memory.rs`)

- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection

### Disk Collector (`disk.rs`)

- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring

### Systemd Collector (`systemd.rs`)

- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)

### Backup Collector (`backup.rs`)

- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
- Repository health monitoring

## Email Notifications

### Intelligent Batching

The system implements smart notification batching to prevent email spam:

- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity

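The batching behaviour described above can be sketched as a queue of status transitions that is flushed once per interval. The types `Transition` and `NotificationBatcher` are hypothetical names used for illustration, and passing the current time in as a parameter is a choice made here to keep the sketch deterministic; the agent's actual implementation is not shown in this README.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// One recorded status change for a service.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Transition {
    pub service: String,
    pub from: Status,
    pub to: Status,
}

/// Collects transitions and releases them at most once per interval.
pub struct NotificationBatcher {
    interval: Duration,
    last_flush: Instant,
    pending: Vec<Transition>,
}

impl NotificationBatcher {
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last_flush: now, pending: Vec::new() }
    }

    /// Queues a status change; unchanged statuses are ignored.
    pub fn record(&mut self, service: &str, from: Status, to: Status) {
        if from != to {
            self.pending.push(Transition { service: service.to_string(), from, to });
        }
    }

    /// Returns the queued transitions, most severe first, once the interval
    /// has elapsed. Returns `None` if it is too early or nothing is queued.
    pub fn flush_if_due(&mut self, now: Instant) -> Option<Vec<Transition>> {
        if now.duration_since(self.last_flush) < self.interval {
            return None;
        }
        self.last_flush = now;
        if self.pending.is_empty() {
            return None;
        }
        let mut batch = std::mem::take(&mut self.pending);
        // Stable sort: critical first, then warnings, then everything else.
        batch.sort_by_key(|t| match t.to {
            Status::Critical => 0,
            Status::Warning => 1,
            Status::Ok => 2,
        });
        Some(batch)
    }
}
```

A caller would invoke `record` whenever a collector reports a change and `flush_if_due` on every loop iteration, sending one email per returned batch. With a 30 second interval this yields at most one email per 30 seconds, regardless of how many services changed state.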
### Example Alert Email

```
Subject: Status Alert: 2 critical, 1 warning, 15 started

Status Summary (30s duration)
Host Status: Ok → Critical

🔴 CRITICAL ISSUES (2):
  postgresql: Ok → Critical
  nginx: Warning → Critical

🟡 WARNINGS (1):
  redis: Ok → Warning (memory usage 85%)

✅ RECOVERIES (0):

🟢 SERVICE STARTUPS (15):
  docker: Unknown → Ok
  sshd: Unknown → Ok
  ...

--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET
```

## Individual Metrics Architecture

The system follows a **metrics-first architecture**:

### Agent Side

```rust
// Agent collects individual metrics
vec![
    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]
```

### Dashboard Side

```rust
// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
        for metric in metrics {
            match metric.name.as_str() {
                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
                _ => {}
            }
        }
    }
}
```

## Persistent Cache

The cache system prevents false notifications:

- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications

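The idea behind the cache can be illustrated with a small sketch: remember the last known status per service, report only real changes, and reload that state after a restart so an unchanged service does not trigger a fresh notification. The `StatusCache` type is hypothetical, and the `service=status` text format is a stand-in chosen to keep the example dependency-free; the agent itself persists to the JSON file configured under `[cache]`.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of the persisted status cache.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct StatusCache {
    entries: HashMap<String, String>,
}

impl StatusCache {
    /// Records the latest status and reports whether it changed.
    /// A service seen for the first time counts as a change.
    pub fn update(&mut self, service: &str, status: &str) -> bool {
        let unchanged = self.entries.get(service).map(String::as_str) == Some(status);
        if !unchanged {
            self.entries.insert(service.to_string(), status.to_string());
        }
        !unchanged
    }

    /// Serializes the cache as sorted `service=status` lines
    /// (a simplified stand-in for the JSON file).
    pub fn to_text(&self) -> String {
        let mut lines: Vec<String> = self
            .entries
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect();
        lines.sort();
        lines.join("\n")
    }

    /// Restores a cache from the text produced by `to_text`.
    /// Lines without a `=` separator are skipped.
    pub fn from_text(text: &str) -> Self {
        let entries = text
            .lines()
            .filter_map(|line| line.split_once('='))
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Self { entries }
    }
}
```

Because `update` returns `false` for a status that matches the restored state, an agent that restarts while `nginx` is still `Critical` would not queue a second alert for it.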
## Development

### Project Structure

```
cm-dashboard/
├── agent/                    # Metrics collection agent
│   ├── src/
│   │   ├── collectors/       # CPU, memory, disk, systemd, backup
│   │   ├── status/           # Status aggregation and notifications
│   │   ├── cache/            # Persistent metric caching
│   │   ├── config/           # TOML configuration loading
│   │   └── notifications/    # Email notification system
├── dashboard/                # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/       # CPU, memory, services, backup widgets
│   │   ├── metrics/          # Metric storage and filtering
│   │   └── communication/    # ZMQ metric consumption
├── shared/                   # Shared types and utilities
│   └── src/
│       ├── metrics.rs        # Metric, Status, and Value types
│       ├── protocol.rs       # ZMQ message format
│       └── cache.rs          # Cache configuration
└── README.md                 # This file
```

### Building

```bash
# Debug build
cargo build --workspace

# Release build
cargo build --workspace --release

# Run tests
cargo test --workspace

# Check code formatting
cargo fmt --all -- --check

# Run clippy linter
cargo clippy --workspace -- -D warnings
```

### Dependencies

- **tokio** - Async runtime
- **zmq** - Message passing between agent and dashboard
- **ratatui** - Terminal user interface
- **serde** - Serialization for metrics and config
- **anyhow/thiserror** - Error handling
- **tracing** - Structured logging
- **lettre** - SMTP email notifications
- **clap** - Command-line argument parsing
- **toml** - Configuration file parsing

## NixOS Integration

This project is designed for declarative deployment via NixOS:

### Configuration Generation

The NixOS module automatically generates the agent configuration:

```nix
# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
  enable = true;
  port = 6130;
};
```

### Deployment

```bash
# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push

# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .
```

## Monitoring Intervals

- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)

## License

MIT License - see LICENSE file for details