# CM Dashboard
A high-performance, Rust-based TUI dashboard for monitoring CMTEC infrastructure. It is built on ZMQ-based metric collection and an individual-metrics architecture.
## Features
### Core Monitoring
- **Real-time metrics**: CPU, RAM, Storage, and Service status
- **Multi-host support**: Monitor multiple servers from a single dashboard
- **Service management**: Start/stop services with intelligent status tracking
- **NixOS integration**: System rebuild via SSH + tmux popup
- **Backup monitoring**: Borgbackup status and scheduling
- **Email notifications**: Intelligent batching prevents spam
### User-Stopped Service Tracking
Services stopped via the dashboard are intelligently tracked to prevent false alerts:
- **Smart status reporting**: User-stopped services show as Status::OK instead of Warning
- **Persistent storage**: Tracking survives agent restarts via JSON storage
- **Automatic management**: Flags are cleared when services are restarted via the dashboard
- **Maintenance friendly**: No false alerts during intentional service operations
## Architecture
### Individual Metrics Philosophy
- **Agent**: Collects individual metrics, calculates status using thresholds
- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
- **ZMQ Communication**: Efficient real-time metric transmission
- **Status Aggregation**: Host-level status calculated from all service metrics
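As a rough illustration of the status aggregation described above, the sketch below takes the host-level status to be the most severe status among its service metrics. The `Status` variants, their ordering, and the `aggregate_host_status` helper are assumptions for illustration; the project's real types may differ.

```rust
/// Hypothetical status type, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Assumed rule: the host is as unhealthy as its worst service.
/// An empty slice yields `Status::Ok`.
pub fn aggregate_host_status(services: &[Status]) -> Status {
    services.iter().copied().max().unwrap_or(Status::Ok)
}
```

Deriving `Ord` on the enum keeps the rule to a single `max()` call; a different severity ordering would only require reordering the variants.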
### Components
```
┌─────────────────┐    ZMQ     ┌─────────────────┐
│                 │◄──────────►│                 │
│      Agent      │  Metrics   │    Dashboard    │
│ - Collectors    │            │ - TUI           │
│ - Status        │            │ - Widgets       │
│ - Tracking      │            │ - Commands      │
│                 │            │                 │
└─────────────────┘            └─────────────────┘
         │                              │
         ▼                              ▼
┌─────────────────┐            ┌─────────────────┐
│  JSON Storage   │            │   SSH + tmux    │
│ - User-stopped  │            │ - Remote rebuild│
│ - Cache         │            │ - Process       │
│ - State         │            │   isolation     │
└─────────────────┘            └─────────────────┘
```
### Service Control Flow
1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
2. **Agent Processing**:
   - Marks the service as user-stopped (if stopping) or clears the flag (if starting)
   - Executes `systemctl start/stop service`
   - Syncs state to the global tracker
3. **Status Calculation**:
   - Systemd collector checks the user-stopped flag
   - Reports Status::OK for user-stopped inactive services
   - Reports the normal Warning status for system failures
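The control flow above can be sketched as a small state update followed by a status decision. This is a simplified model, not the agent's actual code: the `Tracker` type, its methods, and the two-variant `Status` are hypothetical, and the real agent additionally runs `systemctl` and persists the tracker state.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceAction {
    UserStart,
    UserStop,
}

/// Hypothetical in-memory stand-in for the user-stopped tracker.
#[derive(Default)]
pub struct Tracker {
    user_stopped: HashSet<String>,
}

impl Tracker {
    /// Step 2: update the flag and return the systemctl verb to run.
    pub fn apply(&mut self, service: &str, action: ServiceAction) -> &'static str {
        match action {
            ServiceAction::UserStop => {
                self.user_stopped.insert(service.to_string());
                "stop"
            }
            ServiceAction::UserStart => {
                self.user_stopped.remove(service);
                "start"
            }
        }
    }

    /// Step 3: an inactive service is Ok only if the user stopped it.
    pub fn service_status(&self, service: &str, active: bool) -> Status {
        if active || self.user_stopped.contains(service) {
            Status::Ok
        } else {
            Status::Warning
        }
    }
}
```

In this model, stopping `redis-immich` through the dashboard flips its inactive status from `Warning` to `Ok`, and starting it again clears the flag so a later unexpected stop is reported normally.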
## Interface
```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│NixOS:                              ││Service:              Status:      RAM:   Disk:  │
│Build: 25.05.20251004.3bcc93c       ││● docker              active       27M    496MB  │
│Agent: v0.1.43                      ││● gitea               active       579M   2.6GB  │
│Active users: cm, simon             ││● nginx               active       28M    24MB   │
│CPU:                                ││  ├─ ● gitea.cmtec.se              51ms          │
│● Load: 0.10 0.52 0.88 • 3000MHz    ││  ├─ ● photos.cmtec.se             41ms          │
│RAM:                                ││● postgresql          active       112M   357MB  │
│● Usage: 33% 2.6GB/7.6GB            ││● redis-immich        user-stopped               │
│● /tmp: 0% 0B/2.0GB                 ││● sshd                active       2M     0      │
│Storage:                            ││● unifi               active       594M   495MB  │
│● root (Single):                    ││                                                 │
│  ├─ ● nvme0n1 W: 1%                ││                                                 │
│  └─ ● 18% 167.4GB/928.2GB          ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```
### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Navigate services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl in tmux popup)
- **R**: Rebuild current host
- **q**: Quit
### Status Indicators
- **Green ●**: Active service
- **Yellow ◐**: Inactive service (system issue)
- **Red ◯**: Failed service
- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
- **"user-stopped"**: Service stopped via dashboard (Status::OK)
## Quick Start
### Building
```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"
# Or with system dependencies
sudo apt install libssl-dev pkg-config # Ubuntu/Debian
cargo build --workspace
```
### Running
```bash
# Start agent (requires configuration)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml
# Start dashboard (inside tmux session)
tmux
./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
```
## Configuration
### Agent Configuration
```toml
collection_interval_seconds = 2
[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
transmission_interval_seconds = 2
[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 5.0
load_critical_threshold = 10.0
[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 90.0
[collectors.systemd]
enabled = true
interval_seconds = 10
service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
nginx_latency_critical_ms = 1000.0
http_timeout_seconds = 10
[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
aggregation_interval_seconds = 30
```
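To illustrate how warning and critical thresholds such as `load_warning_threshold` and `load_critical_threshold` might translate into a status, here is a minimal sketch. The `Status` enum and `status_from_thresholds` are hypothetical, and treating a value exactly at a threshold as exceeding it is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Compares a measured value against the two configured thresholds.
/// Assumption: comparisons are inclusive (value >= threshold).
pub fn status_from_thresholds(value: f32, warning: f32, critical: f32) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With the CPU values from the configuration above (5.0 and 10.0), a load of 8.5 would map to `Warning`; with the memory values (80.0 and 90.0), a usage of 95% would map to `Critical`.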
### Dashboard Configuration
```toml
[zmq]
subscriber_ports = [6130]
[hosts]
predefined_hosts = ["cmbox", "srv01", "srv02"]
[ui]
ssh_user = "cm"
rebuild_alias = "nixos-rebuild-cmtec"
```
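One plausible way to turn `predefined_hosts` and `subscriber_ports` into ZMQ connection strings is sketched below. ZeroMQ's TCP transport uses endpoints of the form `tcp://host:port`; that the dashboard connects to every host on every port, and the `subscriber_endpoints` helper itself, are assumptions for illustration.

```rust
/// Builds one `tcp://host:port` endpoint per (host, port) pair.
/// Hypothetical helper; the dashboard's real connection logic may differ.
pub fn subscriber_endpoints(hosts: &[&str], ports: &[u16]) -> Vec<String> {
    hosts
        .iter()
        .flat_map(|host| ports.iter().map(move |port| format!("tcp://{host}:{port}")))
        .collect()
}
```

For the configuration above this would produce one endpoint per host, for example `tcp://cmbox:6130`.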
## Technical Implementation
### Collectors
#### Systemd Collector
- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
- **Status Calculation**: Checks user-stopped flag before assigning Warning status
- **Memory Tracking**: Per-service memory usage via `systemctl show`
- **Sub-services**: Nginx site latency, Docker containers
- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`
#### User-Stopped Service Tracker
- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
- **Thread Safety**: Global singleton with `Arc<Mutex<>>`
- **Persistence**: Automatic save on state changes
- **Global Access**: Static methods for collector integration
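A global, mutex-guarded tracker of the kind described above might be structured like the following sketch. It swaps the `Arc<Mutex<>>` mentioned above for a `std::sync::OnceLock<Mutex<...>>` static, which gives the same shared, thread-safe access without extra dependencies. Apart from `is_service_user_stopped`, the function names are hypothetical, and the JSON persistence step is omitted.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Process-wide set of services the user stopped via the dashboard.
static USER_STOPPED: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

/// Lazily initializes and returns the shared set.
fn tracker() -> &'static Mutex<HashSet<String>> {
    USER_STOPPED.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Hypothetical: called when a `UserStop` command is handled.
pub fn mark_user_stopped(service: &str) {
    tracker().lock().unwrap().insert(service.to_string());
}

/// Hypothetical: called when a `UserStart` command is handled.
pub fn clear_user_stopped(service: &str) {
    tracker().lock().unwrap().remove(service);
}

/// Static-style query used by collectors.
pub fn is_service_user_stopped(service: &str) -> bool {
    tracker().lock().unwrap().contains(service)
}
```

A real implementation would also write the set to the JSON file after each change and load it on startup, so the flags survive agent restarts.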
#### Other Collectors
- **CPU**: Load average, temperature, frequency monitoring
- **Memory**: RAM/swap usage, tmpfs monitoring
- **Disk**: Filesystem usage, SMART health data
- **NixOS**: Build version, active users, agent version
- **Backup**: Borgbackup repository status and metrics
### ZMQ Protocol
```rust
// Metric Message
#[derive(Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

// Service Commands
pub enum AgentCommand {
    ServiceControl {
        service_name: String,
        action: ServiceAction,
    },
    SystemRebuild { /* SSH config */ },
    CollectNow,
}

pub enum ServiceAction {
    Start,     // System-initiated
    Stop,      // System-initiated
    UserStart, // User via dashboard (clears user-stopped)
    UserStop,  // User via dashboard (marks user-stopped)
    Status,
}
```
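On the receiving side, the dashboard composes widgets from the individual metrics carried in a `MetricMessage`. The sketch below shows one way a widget might pick out its metrics by name prefix. The `Metric` fields, the metric names, and `metrics_with_prefix` are hypothetical, and the serde derives are left out so the example has no external dependencies.

```rust
/// Hypothetical metric shape; the real `Metric` type may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: String,
    pub value: f32,
}

/// Same fields as the message above, without the serde derives.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

/// Returns references to the metrics whose name starts with `prefix`.
pub fn metrics_with_prefix<'a>(msg: &'a MetricMessage, prefix: &str) -> Vec<&'a Metric> {
    msg.metrics
        .iter()
        .filter(|m| m.name.starts_with(prefix))
        .collect()
}
```

A CPU widget could then call `metrics_with_prefix(&msg, "cpu_")` and ignore everything else in the message.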
### Maintenance Mode
Suppress notifications during planned maintenance:
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance
# Perform maintenance
systemctl stop service
# ... work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
```
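On the agent side, honoring the flag file can be as simple as checking whether it exists before sending a notification. The sketch below assumes exactly that; `maintenance_active` and `should_send_notification` are hypothetical names, and only the path `/tmp/cm-maintenance` comes from the example above.

```rust
use std::path::Path;

/// Flag file used in the shell example above.
pub const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

/// True while the flag file exists.
pub fn maintenance_active(flag: &Path) -> bool {
    flag.exists()
}

/// Hypothetical gate in front of the notification sender:
/// send only when something changed and maintenance mode is off.
pub fn should_send_notification(flag: &Path, has_status_change: bool) -> bool {
    has_status_change && !maintenance_active(flag)
}
```

Passing the path as a parameter keeps the check testable; production code would call it with `Path::new(MAINTENANCE_FLAG)`.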
## Email Notifications
### Intelligent Batching
- **Real-time dashboard**: Immediate status updates
- **Batched emails**: Aggregated every 30 seconds
- **Smart grouping**: Services organized by severity
- **Recovery suppression**: Reduces notification spam
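The batching idea can be sketched as a buffer that is only flushed once the aggregation interval has elapsed. This is an illustrative model, not the agent's implementation: the `Batcher` type and its methods are hypothetical, and events are plain strings for brevity.

```rust
use std::time::{Duration, Instant};

/// Hypothetical buffer for status changes awaiting the next email.
pub struct Batcher {
    interval: Duration,
    last_flush: Instant,
    pending: Vec<String>,
}

impl Batcher {
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, last_flush: now, pending: Vec::new() }
    }

    /// Queues an event; nothing is sent yet.
    pub fn push(&mut self, event: String) {
        self.pending.push(event);
    }

    /// Returns the queued events once the interval has elapsed,
    /// or `None` if it is too early or nothing is queued.
    pub fn flush_if_due(&mut self, now: Instant) -> Option<Vec<String>> {
        if self.pending.is_empty() || now.duration_since(self.last_flush) < self.interval {
            return None;
        }
        self.last_flush = now;
        Some(std::mem::take(&mut self.pending))
    }
}
```

With a 30-second interval, two status changes arriving 10 and 20 seconds after the last flush end up in the same email instead of two separate ones.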
### Example Alert
```
Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries
Status Summary (30s duration)
Host Status: Ok → Warning
🔴 CRITICAL ISSUES (1):
postgresql: Ok → Critical (memory usage 95%)
🟡 WARNINGS (2):
nginx: Ok → Warning (high load 8.5)
redis: user-stopped → Warning (restarted by system)
✅ RECOVERIES (0):
--
CM Dashboard Agent v0.1.43
```
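The counting behind a subject line like the one above can be sketched as follows. The `Status` and `Change` types and the classification rule (a change into `Ok` counts as a recovery, any other change is counted under the status it moved to) are assumptions for illustration, not the agent's actual logic. The sketch also does not adjust the wording for singular counts.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical record of one status transition within a batch.
pub struct Change {
    pub service: String,
    pub from: Status,
    pub to: Status,
}

/// Builds the email subject from the changes collected in one batch.
pub fn subject(changes: &[Change]) -> String {
    let critical = changes.iter().filter(|c| c.to == Status::Critical).count();
    let warnings = changes.iter().filter(|c| c.to == Status::Warning).count();
    let recoveries = changes
        .iter()
        .filter(|c| c.to == Status::Ok && c.from != Status::Ok)
        .count();
    format!("Status Alert: {critical} critical, {warnings} warnings, {recoveries} recoveries")
}
```

Since user-stopped services report as OK, the `redis` entry in the example is modeled here as a change from `Ok` to `Warning`.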
## Development
### Project Structure
```
cm-dashboard/
├── agent/                      # Metrics collection agent
│   └── src/
│       ├── collectors/         # CPU, memory, disk, systemd, backup, nixos
│       ├── service_tracker.rs  # User-stopped service tracking
│       ├── status/             # Status aggregation and notifications
│       ├── config/             # TOML configuration loading
│       └── communication/      # ZMQ message handling
├── dashboard/                  # TUI dashboard application
│   └── src/
│       ├── ui/widgets/         # CPU, memory, services, backup, system
│       ├── communication/      # ZMQ consumption and commands
│       └── app.rs              # Main application loop
├── shared/                     # Shared types and utilities
│   └── src/
│       ├── metrics.rs          # Metric, Status, StatusTracker types
│       ├── protocol.rs         # ZMQ message format
│       └── cache.rs            # Cache configuration
└── CLAUDE.md                   # Development guidelines and rules
```
### Testing
```bash
# Build and test
nix-shell -p openssl pkg-config --run "cargo build --workspace"
nix-shell -p openssl pkg-config --run "cargo test --workspace"
# Code quality
cargo fmt --all
cargo clippy --workspace -- -D warnings
```
## Deployment
### Automated Binary Releases
```bash
# Create new release
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
Pushing the tag triggers the automated release workflow:
- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
- GitHub-style release creation
- Tarball upload to Gitea
### NixOS Integration
Update `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:
```nix
version = "v0.1.43";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-HASH";
};
```
Get the hash by fetching with a placeholder hash; the build fails and reports the real hash on the `got:` line:
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "URL_HERE";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
## Monitoring Intervals
- **Metrics Collection**: 2 seconds (CPU, memory, services)
- **Metric Transmission**: 2 seconds (ZMQ publish)
- **Dashboard Updates**: 1 second (UI refresh)
- **Email Notifications**: 30 seconds (batched)
- **Disk Monitoring**: 300 seconds (5 minutes)
- **Service Discovery**: 300 seconds (5 minutes cache)
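Running collectors at different intervals can be modeled with a small per-collector schedule that is polled from a single loop. The sketch below is an assumption about how such scheduling could work, not a description of the agent's internals; `Schedule`, `every`, and `due` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-collector schedule.
pub struct Schedule {
    interval: Duration,
    last_run: Option<Instant>,
}

impl Schedule {
    /// Creates a schedule that fires every `seconds` seconds.
    pub fn every(seconds: u64) -> Self {
        Self { interval: Duration::from_secs(seconds), last_run: None }
    }

    /// Returns true, and records the run, if the collector has never
    /// run or its interval has elapsed since the last run.
    pub fn due(&mut self, now: Instant) -> bool {
        let is_due = match self.last_run {
            None => true,
            Some(last) => now.duration_since(last) >= self.interval,
        };
        if is_due {
            self.last_run = Some(now);
        }
        is_due
    }
}
```

A main loop could hold one `Schedule::every(2)` for CPU and one `Schedule::every(300)` for disk, call `due(Instant::now())` on each tick, and run only the collectors that return true.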
## License
MIT License - see LICENSE file for details.