# CM Dashboard

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built on ZMQ-based metric collection and an individual-metrics architecture.

## Features

### Core Monitoring
- **Real-time metrics**: CPU, RAM, Storage, and Service status
- **Multi-host support**: Monitor multiple servers from a single dashboard
- **Service management**: Start/stop services with intelligent status tracking
- **NixOS integration**: System rebuild via SSH + tmux popup
- **Backup monitoring**: Borgbackup status and scheduling
- **Email notifications**: Intelligent batching prevents spam

### User-Stopped Service Tracking
Services stopped via the dashboard are tracked so that they do not raise false alerts:

- **Smart status reporting**: User-stopped services show as Status::OK instead of Warning
- **Persistent storage**: Tracking survives agent restarts via JSON storage
- **Automatic management**: Flags are cleared when services are restarted via the dashboard
- **Maintenance friendly**: No false alerts during intentional service operations

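The status rule above can be sketched as a small pure function. This is a minimal illustration, not the agent's actual code: the `Status` enum and `service_status` function here are hypothetical stand-ins, and mapping a failed unit to `Critical` is an assumption based on the red indicator described under Status Indicators.

```rust
/// Hypothetical status levels; the real `Status` type lives in the shared crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Maps a systemd active state plus the user-stopped flag to a status.
pub fn service_status(active_state: &str, user_stopped: bool) -> Status {
    match active_state {
        "active" => Status::Ok,
        // An inactive service is fine if the user stopped it on purpose.
        "inactive" if user_stopped => Status::Ok,
        "inactive" => Status::Warning,
        // Assumption: a failed unit is critical regardless of the flag.
        "failed" => Status::Critical,
        // Assumption: unknown or transitional states are reported as warnings.
        _ => Status::Warning,
    }
}
```

With this rule, `service_status("inactive", true)` yields `Status::Ok`, while the same inactive state without the flag yields `Status::Warning`.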
## Architecture

### Individual Metrics Philosophy
- **Agent**: Collects individual metrics, calculates status using thresholds
- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
- **ZMQ Communication**: Efficient real-time metric transmission
- **Status Aggregation**: Host-level status calculated from all service metrics

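As a rough sketch of the first and last points, the snippet below derives a per-metric status from warning/critical thresholds and then takes the worst status as the host-level result. The names `threshold_status` and `host_status` are hypothetical, and both the inclusive `>=` comparison and the "worst status wins" aggregation are assumptions rather than confirmed agent behavior.

```rust
/// Hypothetical status levels, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Compares one metric value against its warning and critical thresholds.
pub fn threshold_status(value: f32, warning: f32, critical: f32) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

/// Host-level status is the most severe status among all metrics.
/// Assumption: a host with no metrics is reported as Ok.
pub fn host_status(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Ok)
}
```

For example, a load of 8.5 against thresholds of 5.0 and 10.0 gives `Warning`, and a host whose metrics are `[Ok, Warning]` aggregates to `Warning`.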
### Components

```
┌─────────────────┐    ZMQ     ┌─────────────────┐
│                 │◄──────────►│                 │
│     Agent       │   Metrics  │    Dashboard    │
│  - Collectors   │            │  - TUI          │
│  - Status       │            │  - Widgets      │
│  - Tracking     │            │  - Commands     │
│                 │            │                 │
└─────────────────┘            └─────────────────┘
         │                              │
         ▼                              ▼
┌─────────────────┐            ┌─────────────────┐
│ JSON Storage    │            │   SSH + tmux    │
│ - User-stopped  │            │ - Remote rebuild│
│ - Cache         │            │ - Process       │
│ - State         │            │   isolation     │
└─────────────────┘            └─────────────────┘
```

### Service Control Flow

1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
2. **Agent Processing**:
   - Marks service as user-stopped (if stopping)
   - Executes `systemctl start/stop service`
   - Syncs state to global tracker
3. **Status Calculation**:
   - Systemd collector checks user-stopped flag
   - Reports Status::OK for user-stopped inactive services
   - Normal Warning status for system failures

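The agent-side step of this flow might look roughly like the sketch below. It is an in-memory illustration only: `UserStoppedTracker` and its `apply` method are hypothetical names, the `Status` action is left out, and the JSON persistence and actual `systemctl` invocation are omitted. `apply` just returns the verb that would be passed to `systemctl`.

```rust
use std::collections::HashSet;

/// Trimmed copy of the service actions (the `Status` query is left out).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceAction {
    Start,
    Stop,
    UserStart,
    UserStop,
}

/// Hypothetical in-memory tracker; the real one also persists to JSON.
#[derive(Debug, Default)]
pub struct UserStoppedTracker {
    stopped: HashSet<String>,
}

impl UserStoppedTracker {
    pub fn is_user_stopped(&self, service: &str) -> bool {
        self.stopped.contains(service)
    }

    /// Updates the user-stopped flag and returns the systemctl verb to run.
    pub fn apply(&mut self, service: &str, action: ServiceAction) -> &'static str {
        match action {
            ServiceAction::UserStop => {
                // Mark first, so the next collection sees the flag.
                self.stopped.insert(service.to_string());
                "stop"
            }
            ServiceAction::UserStart => {
                self.stopped.remove(service);
                "start"
            }
            // System-initiated actions never touch the flag.
            ServiceAction::Stop => "stop",
            ServiceAction::Start => "start",
        }
    }
}
```

Only the `UserStart`/`UserStop` variants change the tracked set, which is what lets the collector tell an intentional stop apart from a system failure.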
## Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│NixOS:                              ││Service:                Status:   RAM:    Disk:  │
│Build: 25.05.20251004.3bcc93c       ││● docker                active    27M     496MB  │
│Agent: v0.1.43                      ││● gitea                 active    579M    2.6GB  │
│Active users: cm, simon             ││● nginx                 active    28M     24MB   │
│CPU:                                ││  ├─ ● gitea.cmtec.se   51ms                     │
│● Load: 0.10 0.52 0.88 • 3000MHz    ││  ├─ ● photos.cmtec.se  41ms                     │
│RAM:                                ││● postgresql            active    112M    357MB  │
│● Usage: 33% 2.6GB/7.6GB            ││● redis-immich          user-stopped             │
│● /tmp: 0% 0B/2.0GB                 ││● sshd                  active    2M      0      │
│Storage:                            ││● unifi                 active    594M    495MB  │
│● root (Single):                    ││                                                 │
│  ├─ ● nvme0n1 W: 1%                ││                                                 │
│  └─ ● 18% 167.4GB/928.2GB          ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Navigate services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl in tmux popup)
- **R**: Rebuild current host
- **q**: Quit

### Status Indicators
- **Green ●**: Active service
- **Yellow ◐**: Inactive service (system issue)
- **Red ◯**: Failed service
- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
- **"user-stopped"**: Service stopped via dashboard (Status::OK)

## Quick Start

### Building

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
cargo build --workspace
```

### Running

```bash
# Start agent (requires configuration)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard (inside tmux session)
tmux
./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
```

## Configuration

### Agent Configuration

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
transmission_interval_seconds = 2

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 5.0
load_critical_threshold = 10.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[collectors.systemd]
enabled = true
interval_seconds = 10
service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
nginx_latency_critical_ms = 1000.0
http_timeout_seconds = 10

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
aggregation_interval_seconds = 30
```

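To illustrate how `service_name_filters` and `excluded_services` could interact, here is a minimal sketch. The matching semantics are assumptions, not confirmed agent behavior: a trailing `*` is treated as a prefix wildcard, exclusions are treated as name prefixes, and exclusions win over filters. `matches_filter` and `is_monitored` are hypothetical names.

```rust
/// Matches `name` against a pattern with an optional trailing `*` wildcard.
fn matches_filter(name: &str, pattern: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => name == pattern,
    }
}

/// Decides whether a discovered service should be monitored.
pub fn is_monitored(name: &str, filters: &[&str], excluded: &[&str]) -> bool {
    // Assumption: an exclusion applies when the service name starts with it,
    // and exclusions take priority over filters.
    if excluded.iter().any(|ex| name.starts_with(*ex)) {
        return false;
    }
    filters.iter().any(|f| matches_filter(name, f))
}
```

Under these assumptions, `nginx` is monitored via the `nginx*` filter, while `nginx-config-reload` is dropped by the exclusion list even though it also matches that filter.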
### Dashboard Configuration

```toml
[zmq]
subscriber_ports = [6130]

[hosts]
predefined_hosts = ["cmbox", "srv01", "srv02"]

[ui]
ssh_user = "cm"
rebuild_alias = "nixos-rebuild-cmtec"
```

## Technical Implementation

### Collectors

#### Systemd Collector
- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
- **Status Calculation**: Checks user-stopped flag before assigning Warning status
- **Memory Tracking**: Per-service memory usage via `systemctl show`
- **Sub-services**: Nginx site latency, Docker containers
- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`

#### User-Stopped Service Tracker
- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
- **Thread Safety**: Global singleton with `Arc<Mutex<>>`
- **Persistence**: Automatic save on state changes
- **Global Access**: Static methods for collector integration

#### Other Collectors
- **CPU**: Load average, temperature, frequency monitoring
- **Memory**: RAM/swap usage, tmpfs monitoring
- **Disk**: Filesystem usage, SMART health data
- **NixOS**: Build version, active users, agent version
- **Backup**: Borgbackup repository status and metrics

### ZMQ Protocol

```rust
// Assumes the serde derive macros are in scope
use serde::{Deserialize, Serialize};

// Metric Message
#[derive(Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

// Service Commands
pub enum AgentCommand {
    ServiceControl {
        service_name: String,
        action: ServiceAction,
    },
    SystemRebuild { /* SSH config */ },
    CollectNow,
}

pub enum ServiceAction {
    Start,     // System-initiated
    Stop,      // System-initiated
    UserStart, // User via dashboard (clears user-stopped)
    UserStop,  // User via dashboard (marks user-stopped)
    Status,
}
```

### Maintenance Mode

Suppress notifications during planned maintenance:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Perform maintenance
systemctl stop service
# ... work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

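On the agent side, honoring the flag file can be as simple as an existence check before sending. The sketch below assumes the check happens at notification time; `maintenance_active` and `should_notify` are hypothetical names, not the agent's actual API.

```rust
use std::path::Path;

/// Returns true while the maintenance flag file exists.
pub fn maintenance_active(flag_file: &Path) -> bool {
    flag_file.exists()
}

/// Notifications go out only when there is something to report
/// and maintenance mode is off.
pub fn should_notify(flag_file: &Path, has_status_changes: bool) -> bool {
    has_status_changes && !maintenance_active(flag_file)
}
```

A caller would pass `Path::new("/tmp/cm-maintenance")` as the flag file. Because the check is a plain file lookup, enabling or disabling maintenance mode takes effect without restarting anything.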
## Email Notifications

### Intelligent Batching
- **Real-time dashboard**: Immediate status updates
- **Batched emails**: Aggregated every 30 seconds
- **Smart grouping**: Services organized by severity
- **Recovery suppression**: Reduces notification spam

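A rough sketch of how a batch of status transitions could be summarized into a single subject line is shown below. The `Transition` struct and `alert_subject` function are hypothetical, and counting a recovery as any transition back to `Ok` is an assumption. The subject format follows the example alert in this README.

```rust
/// Hypothetical status levels used for grouping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// One status change collected during the aggregation window.
#[derive(Debug, Clone)]
pub struct Transition {
    pub service: String,
    pub from: Status,
    pub to: Status,
}

/// Builds one subject line for everything collected in the window.
pub fn alert_subject(batch: &[Transition]) -> String {
    let critical = batch.iter().filter(|t| t.to == Status::Critical).count();
    let warnings = batch.iter().filter(|t| t.to == Status::Warning).count();
    // Assumption: a recovery is any transition that ends in Ok.
    let recoveries = batch
        .iter()
        .filter(|t| t.to == Status::Ok && t.from != Status::Ok)
        .count();
    format!(
        "Status Alert: {} critical, {} warnings, {} recoveries",
        critical, warnings, recoveries
    )
}
```

Grouping by the target status keeps one email per window no matter how many services changed state.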
### Example Alert
```
Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (1):
  postgresql: Ok → Critical (memory usage 95%)

🟡 WARNINGS (2):
  nginx: Ok → Warning (high load 8.5)
  redis: user-stopped → Warning (restarted by system)

✅ RECOVERIES (0):

--
CM Dashboard Agent v0.1.43
```

## Development

### Project Structure
```
cm-dashboard/
├── agent/                         # Metrics collection agent
│   ├── src/
│   │   ├── collectors/            # CPU, memory, disk, systemd, backup, nixos
│   │   ├── service_tracker.rs     # User-stopped service tracking
│   │   ├── status/                # Status aggregation and notifications
│   │   ├── config/                # TOML configuration loading
│   │   └── communication/         # ZMQ message handling
├── dashboard/                     # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/            # CPU, memory, services, backup, system
│   │   ├── communication/         # ZMQ consumption and commands
│   │   └── app.rs                 # Main application loop
├── shared/                        # Shared types and utilities
│   └── src/
│       ├── metrics.rs             # Metric, Status, StatusTracker types
│       ├── protocol.rs            # ZMQ message format
│       └── cache.rs               # Cache configuration
└── CLAUDE.md                      # Development guidelines and rules
```

### Testing
```bash
# Build and test
nix-shell -p openssl pkg-config --run "cargo build --workspace"
nix-shell -p openssl pkg-config --run "cargo test --workspace"

# Code quality
cargo fmt --all
cargo clippy --workspace -- -D warnings
```

## Deployment

### Automated Binary Releases
```bash
# Create new release
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

Pushing the tag triggers an automated workflow:
- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
- GitHub-style release creation
- Tarball upload to Gitea

### NixOS Integration
Update `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.43";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-HASH";
};
```

Get the hash by building with a placeholder hash and reading the `got:` line from the mismatch error:
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "URL_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

## Monitoring Intervals

- **Metrics Collection**: 2 seconds (CPU, memory, services)
- **Metric Transmission**: 2 seconds (ZMQ publish)
- **Dashboard Updates**: 1 second (UI refresh)
- **Email Notifications**: 30 seconds (batched)
- **Disk Monitoring**: 300 seconds (5 minutes)
- **Service Discovery**: 300 seconds (5 minutes cache)

## License

MIT License - see LICENSE file for details.