# CM Dashboard

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built on ZMQ-based metric collection and an individual-metrics architecture.

## Features

### Core Monitoring
- **Real-time metrics**: CPU, RAM, Storage, and Service status
- **Multi-host support**: Monitor multiple servers from a single dashboard
- **Service management**: Start/stop services with intelligent status tracking
- **NixOS integration**: System rebuild via SSH + tmux popup
- **Backup monitoring**: Borgbackup status and scheduling
- **Email notifications**: Intelligent batching prevents spam

### User-Stopped Service Tracking
Services stopped via the dashboard are tracked so that they do not raise false alerts:

- **Smart status reporting**: User-stopped services show as `Status::OK` instead of Warning
- **Persistent storage**: Tracking survives agent restarts via JSON storage
- **Automatic management**: Flags are cleared when services are restarted via the dashboard
- **Maintenance friendly**: No false alerts during intentional service operations

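The status rule described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: the `Status` enum here is a local stand-in (spelled `Ok` per Rust naming conventions), `service_status` is a hypothetical helper, and mapping `failed` to `Critical` is an assumption.

```rust
/// Local stand-in for the project's status type (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical helper: derive a status from systemd's active state
/// and the user-stopped flag.
pub fn service_status(active_state: &str, user_stopped: bool) -> Status {
    match active_state {
        "active" => Status::Ok,
        // Stopped on purpose via the dashboard: do not alert.
        "inactive" if user_stopped => Status::Ok,
        // Inactive without the flag is treated as a system issue.
        "inactive" => Status::Warning,
        // Assumption: a failed unit is always critical.
        "failed" => Status::Critical,
        // Assumption: any other state (activating, ...) is a warning.
        _ => Status::Warning,
    }
}
```

With this rule, `service_status("inactive", true)` yields `Status::Ok`, while the same state without the flag yields `Status::Warning`.
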
## Architecture

### Individual Metrics Philosophy
- **Agent**: Collects individual metrics, calculates status using thresholds
- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
- **ZMQ Communication**: Efficient real-time metric transmission
- **Status Aggregation**: Host-level status calculated from all service metrics

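One plausible reading of the status aggregation bullet is "the worst service status wins". The sketch below assumes exactly that; the `Status` enum and `aggregate_host_status` are illustrative names, and treating a host with no services as `Ok` is also an assumption.

```rust
/// Local stand-in for the project's status type (assumption).
/// Variant order matters: the derived `Ord` ranks Ok < Warning < Critical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical aggregation: the host status is the most severe
/// status among its services, or Ok when there are none.
pub fn aggregate_host_status(service_statuses: &[Status]) -> Status {
    service_statuses
        .iter()
        .copied()
        .max()
        .unwrap_or(Status::Ok)
}
```
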
### Components

```
┌─────────────────┐    ZMQ     ┌─────────────────┐
│                 │◄──────────►│                 │
│      Agent      │  Metrics   │    Dashboard    │
│  - Collectors   │            │  - TUI          │
│  - Status       │            │  - Widgets      │
│  - Tracking     │            │  - Commands     │
│                 │            │                 │
└─────────────────┘            └─────────────────┘
         │                              │
         ▼                              ▼
┌─────────────────┐            ┌─────────────────┐
│  JSON Storage   │            │   SSH + tmux    │
│  - User-stopped │            │ - Remote rebuild│
│  - Cache        │            │  - Process      │
│  - State        │            │    isolation    │
└─────────────────┘            └─────────────────┘
```

### Service Control Flow

1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
2. **Agent Processing**:
   - Marks service as user-stopped (if stopping)
   - Executes `systemctl start/stop service`
   - Syncs state to global tracker
3. **Status Calculation**:
   - Systemd collector checks user-stopped flag
   - Reports `Status::OK` for user-stopped inactive services
   - Normal Warning status for system failures

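The agent processing step can be sketched as below. This is only an illustration under stated assumptions: `handle_action` is a hypothetical function, the tracker is reduced to a plain `HashSet`, and the command is returned as an argument list instead of being executed.

```rust
use std::collections::HashSet;

/// Only the two user-initiated actions are modelled in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceAction {
    UserStart,
    UserStop,
}

/// Hypothetical handler: update the user-stopped flag first, then
/// return the systemctl invocation the agent would run.
pub fn handle_action(
    user_stopped: &mut HashSet<String>,
    service: &str,
    action: ServiceAction,
) -> Vec<String> {
    let verb = match action {
        ServiceAction::UserStop => {
            // Mark before stopping so the collector never sees an
            // unflagged inactive service.
            user_stopped.insert(service.to_string());
            "stop"
        }
        ServiceAction::UserStart => {
            // A restart via the dashboard clears the flag.
            user_stopped.remove(service);
            "start"
        }
    };
    vec!["systemctl".to_string(), verb.to_string(), service.to_string()]
}
```

The returned vector could be handed to `std::process::Command` by the caller; persisting the set to JSON is left out here.
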
## Interface

```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│NixOS:                              ││Service:        Status:   RAM:   Disk:           │
│Build: 25.05.20251004.3bcc93c       ││● docker        active    27M    496MB           │
│Agent: v0.1.43                      ││● gitea         active    579M   2.6GB           │
│Active users: cm, simon             ││● nginx         active    28M    24MB            │
│CPU:                                ││  ├─ ● gitea.cmtec.se     51ms                   │
│● Load: 0.10 0.52 0.88 • 3000MHz    ││  ├─ ● photos.cmtec.se    41ms                   │
│RAM:                                ││● postgresql    active    112M   357MB           │
│● Usage: 33% 2.6GB/7.6GB            ││● redis-immich  user-stopped                     │
│● /tmp: 0% 0B/2.0GB                 ││● sshd          active    2M     0               │
│Storage:                            ││● unifi         active    594M   495MB           │
│● root (Single):                    ││                                                 │
│  ├─ ● nvme0n1 W: 1%                ││                                                 │
│  └─ ● 18% 167.4GB/928.2GB          ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```

### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Navigate services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl in tmux popup)
- **L**: Show custom log files (tail -f custom paths in tmux popup)
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit

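As an illustration of the key bindings listed above, a dispatch table might look like the following. The `Key` and `Action` enums are hypothetical stand-ins for whatever the TUI library and dashboard actually use, and mapping `j` to down and `k` to up follows the usual vi convention as an assumption.

```rust
/// Hypothetical key type; a real TUI would use its library's key events.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Tab,
    Up,
    Down,
    Char(char),
}

/// Hypothetical dashboard actions, one per documented binding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    NextHost,
    SelectPrevious,
    SelectNext,
    StartService,
    StopService,
    ShowJournal,
    ShowCustomLogs,
    RebuildHost,
    RunBackup,
    Quit,
}

/// Map a key press to an action; unbound keys return None.
pub fn action_for_key(key: Key) -> Option<Action> {
    match key {
        Key::Tab => Some(Action::NextHost),
        Key::Up | Key::Char('k') => Some(Action::SelectPrevious),
        Key::Down | Key::Char('j') => Some(Action::SelectNext),
        // Case matters: lowercase starts, uppercase stops.
        Key::Char('s') => Some(Action::StartService),
        Key::Char('S') => Some(Action::StopService),
        Key::Char('J') => Some(Action::ShowJournal),
        Key::Char('L') => Some(Action::ShowCustomLogs),
        Key::Char('R') => Some(Action::RebuildHost),
        Key::Char('B') => Some(Action::RunBackup),
        Key::Char('q') => Some(Action::Quit),
        _ => None,
    }
}
```
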
### Status Indicators
- **Green ●**: Active service
- **Yellow ◐**: Inactive service (system issue)
- **Red ◯**: Failed service
- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
- **"user-stopped"**: Service stopped via dashboard (`Status::OK`)

## Quick Start

### Building

```bash
# With Nix (recommended)
nix-shell -p openssl pkg-config --run "cargo build --workspace"

# Or with system dependencies
sudo apt install libssl-dev pkg-config # Ubuntu/Debian
cargo build --workspace
```

### Running

```bash
# Start agent (requires configuration)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

# Start dashboard (inside tmux session)
tmux
./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
```

## Configuration

### Agent Configuration

```toml
collection_interval_seconds = 2

[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
transmission_interval_seconds = 2

[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 5.0
load_critical_threshold = 10.0

[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 90.0

[collectors.systemd]
enabled = true
interval_seconds = 10
service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
nginx_latency_critical_ms = 1000.0
http_timeout_seconds = 10

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
aggregation_interval_seconds = 30
```

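To show how the warning and critical thresholds in the agent configuration might translate into a status, here is a minimal sketch. `classify` is a hypothetical function, the `Status` enum is a local stand-in, and using inclusive (`>=`) comparisons is an assumption.

```rust
/// Local stand-in for the project's status type (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical threshold check: compare a measured value against a
/// warning and a critical limit, checking the critical limit first.
pub fn classify(value: f64, warning: f64, critical: f64) -> Status {
    if value >= critical {
        Status::Critical
    } else if value >= warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

With the CPU thresholds above, a load of 8.5 would classify as `Warning`; with the memory thresholds, 95% usage would classify as `Critical`.
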
### Dashboard Configuration

```toml
[zmq]
subscriber_ports = [6130]

[hosts]
predefined_hosts = ["cmbox", "srv01", "srv02"]

[ssh]
rebuild_user = "cm"
rebuild_alias = "nixos-rebuild-cmtec"
backup_alias = "cm-backup-run"
```

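The `[ssh]` settings could be combined into a tmux popup invocation along these lines. This is a sketch under assumptions: `popup_args` is a hypothetical helper, the exact command line the dashboard builds is not documented here, and whether backups run as `rebuild_user` is also assumed. The only tmux feature relied on is `display-popup -E`, which closes the popup when the command exits.

```rust
/// Hypothetical helper: build the arguments for
/// `tmux display-popup -E "ssh user@host command"`.
pub fn popup_args(user: &str, host: &str, remote_command: &str) -> Vec<String> {
    vec![
        "display-popup".to_string(),
        // -E: close the popup automatically when the command finishes.
        "-E".to_string(),
        format!("ssh {user}@{host} {remote_command}"),
    ]
}
```

A caller might pass the result to `std::process::Command::new("tmux").args(...)`, for example with `popup_args("cm", "srv01", "nixos-rebuild-cmtec")` for a rebuild or the `backup_alias` value for a backup.
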
## Technical Implementation

### Collectors

#### Systemd Collector
- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
- **Status Calculation**: Checks user-stopped flag before assigning Warning status
- **Memory Tracking**: Per-service memory usage via `systemctl show`
- **Sub-services**: Nginx site latency, Docker containers
- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`

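How discovered units are narrowed down by `service_name_filters` and `excluded_services` is not spelled out here. The sketch below assumes a trailing `*` means a prefix match and that exclusions are prefix matches too; both function names are hypothetical.

```rust
/// Assumption: a pattern ending in `*` matches by prefix,
/// anything else must match the name exactly.
fn matches_pattern(name: &str, pattern: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => name == pattern,
    }
}

/// Hypothetical filter: keep a service when at least one include
/// pattern matches and no exclusion prefix matches.
pub fn is_monitored(name: &str, filters: &[&str], excluded: &[&str]) -> bool {
    let is_excluded = excluded.iter().any(|prefix| name.starts_with(*prefix));
    let is_included = filters.iter().any(|pattern| matches_pattern(name, *pattern));
    is_included && !is_excluded
}
```

Under these assumptions, `nginx` is monitored with the example configuration, while `nginx-config-reload` is dropped by the exclusion list.
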
#### User-Stopped Service Tracker
- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
- **Thread Safety**: Global singleton with `Arc<Mutex<>>`
- **Persistence**: Automatic save on state changes
- **Global Access**: Static methods for collector integration

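A global singleton with static accessors can be sketched with the standard library alone. This is an illustration, not the project's implementation: only `is_service_user_stopped` is named in this README, `mark_user_stopped` and `clear_user_stopped` are hypothetical, and saving to the JSON file is omitted.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, OnceLock};

/// Process-wide state, created lazily on first use.
static TRACKER: OnceLock<Arc<Mutex<HashSet<String>>>> = OnceLock::new();

/// Run a closure with exclusive access to the shared set.
fn with_state<R>(f: impl FnOnce(&mut HashSet<String>) -> R) -> R {
    let shared = TRACKER.get_or_init(|| Arc::new(Mutex::new(HashSet::new())));
    let mut guard = shared.lock().expect("tracker mutex poisoned");
    f(&mut guard)
}

pub struct UserStoppedServiceTracker;

impl UserStoppedServiceTracker {
    /// Hypothetical: flag a service as stopped by the user.
    pub fn mark_user_stopped(service: &str) {
        with_state(|set| {
            set.insert(service.to_string());
        });
    }

    /// Hypothetical: clear the flag after a user-initiated start.
    pub fn clear_user_stopped(service: &str) {
        with_state(|set| {
            set.remove(service);
        });
    }

    /// Static lookup used by collectors.
    pub fn is_service_user_stopped(service: &str) -> bool {
        with_state(|set| set.contains(service))
    }
}
```

A real tracker would also write the set to disk inside `mark_user_stopped` and `clear_user_stopped` so the flags survive agent restarts.
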
#### Other Collectors
- **CPU**: Load average, temperature, frequency monitoring
- **Memory**: RAM/swap usage, tmpfs monitoring
- **Disk**: Filesystem usage, SMART health data
- **NixOS**: Build version, active users, agent version
- **Backup**: Borgbackup repository status and metrics

### ZMQ Protocol

```rust
use serde::{Deserialize, Serialize};

// Metric Message
// `Metric` is defined in shared/src/metrics.rs.
#[derive(Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

// Service Commands
pub enum AgentCommand {
    ServiceControl {
        service_name: String,
        action: ServiceAction,
    },
    SystemRebuild { /* SSH config */ },
    CollectNow,
}

pub enum ServiceAction {
    Start,     // System-initiated
    Stop,      // System-initiated
    UserStart, // User via dashboard (clears user-stopped)
    UserStop,  // User via dashboard (marks user-stopped)
    Status,
}
```

### Maintenance Mode

Suppress notifications during planned maintenance:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Perform maintenance
systemctl stop service
# ... work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

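On the agent side, honouring the flag file could be as simple as checking whether it exists before sending anything. The sketch below assumes that behaviour; `maintenance_active` and `should_send_notification` are hypothetical names, and only the path `/tmp/cm-maintenance` comes from this README.

```rust
use std::path::Path;

/// Flag file created by `touch /tmp/cm-maintenance`.
pub const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

/// Maintenance mode is considered active while the flag file exists.
pub fn maintenance_active(flag: &Path) -> bool {
    flag.exists()
}

/// Hypothetical gate: notify only when there is something to report
/// and maintenance mode is off.
pub fn should_send_notification(flag: &Path, has_changes: bool) -> bool {
    has_changes && !maintenance_active(flag)
}
```

In the agent this would be called as `should_send_notification(Path::new(MAINTENANCE_FLAG), has_changes)`; taking the path as a parameter keeps the check easy to test.
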
## Email Notifications

### Intelligent Batching
- **Real-time dashboard**: Immediate status updates
- **Batched emails**: Aggregated every 30 seconds
- **Smart grouping**: Services organized by severity
- **Recovery suppression**: Reduces notification spam

### Example Alert
```
Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries

Status Summary (30s duration)
Host Status: Ok → Warning

🔴 CRITICAL ISSUES (1):
  postgresql: Ok → Critical (memory usage 95%)

🟡 WARNINGS (2):
  nginx: Ok → Warning (high load 8.5)
  redis: user-stopped → Warning (restarted by system)

✅ RECOVERIES (0):

--
CM Dashboard Agent v0.1.43
```

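The subject line of such a batched alert could be derived by counting the collected status changes per severity. The following is a sketch under assumptions: `StatusChange` and `subject_line` are hypothetical, the `Status` enum is a local stand-in, and a recovery is assumed to mean any transition back to `Ok`.

```rust
/// Local stand-in for the project's status type (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// One status transition collected during the aggregation window.
pub struct StatusChange {
    pub service: String,
    pub from: Status,
    pub to: Status,
}

/// Build the subject from per-severity counts of the batched changes.
pub fn subject_line(changes: &[StatusChange]) -> String {
    let critical = changes.iter().filter(|c| c.to == Status::Critical).count();
    let warnings = changes.iter().filter(|c| c.to == Status::Warning).count();
    // Assumption: a recovery is a change from a non-Ok status to Ok.
    let recoveries = changes
        .iter()
        .filter(|c| c.to == Status::Ok && c.from != Status::Ok)
        .count();
    format!("Status Alert: {critical} critical, {warnings} warnings, {recoveries} recoveries")
}
```

The sketch does not adjust plurals, so a single warning would read "1 warnings".
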
## Development

### Project Structure
```
cm-dashboard/
├── agent/                       # Metrics collection agent
│   ├── src/
│   │   ├── collectors/          # CPU, memory, disk, systemd, backup, nixos
│   │   ├── service_tracker.rs   # User-stopped service tracking
│   │   ├── status/              # Status aggregation and notifications
│   │   ├── config/              # TOML configuration loading
│   │   └── communication/       # ZMQ message handling
├── dashboard/                   # TUI dashboard application
│   ├── src/
│   │   ├── ui/widgets/          # CPU, memory, services, backup, system
│   │   ├── communication/       # ZMQ consumption and commands
│   │   └── app.rs               # Main application loop
├── shared/                      # Shared types and utilities
│   └── src/
│       ├── metrics.rs           # Metric, Status, StatusTracker types
│       ├── protocol.rs          # ZMQ message format
│       └── cache.rs             # Cache configuration
└── CLAUDE.md                    # Development guidelines and rules
```

### Testing
```bash
# Build and test
nix-shell -p openssl pkg-config --run "cargo build --workspace"
nix-shell -p openssl pkg-config --run "cargo test --workspace"

# Code quality
cargo fmt --all
cargo clippy --workspace -- -D warnings
```

## Deployment

### Automated Binary Releases
```bash
# Create new release
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

Pushing the tag automatically triggers:
- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
- GitHub-style release creation
- Tarball upload to Gitea

### NixOS Integration
Update `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.43";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-HASH";
};
```

Get hash via:
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "URL_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

## Monitoring Intervals

- **Metrics Collection**: 2 seconds (CPU, memory, services)
- **Metric Transmission**: 2 seconds (ZMQ publish)
- **Dashboard Updates**: 1 second (UI refresh)
- **Email Notifications**: 30 seconds (batched)
- **Disk Monitoring**: 300 seconds (5 minutes)
- **Service Discovery**: 300 seconds (5 minutes cache)

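Running several tasks at different intervals from one loop can be sketched with a small "is it due yet" helper. This is an illustrative assumption about scheduling, not the agent's actual code; `Schedule` and `due` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Tracks when a periodic task last ran.
pub struct Schedule {
    interval: Duration,
    last_run: Option<Instant>,
}

impl Schedule {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_run: None }
    }

    /// Returns true (and records the run) when the task has never run
    /// or at least `interval` has elapsed since the last run.
    pub fn due(&mut self, now: Instant) -> bool {
        let is_due = match self.last_run {
            None => true,
            Some(last) => now.duration_since(last) >= self.interval,
        };
        if is_due {
            self.last_run = Some(now);
        }
        is_due
    }
}
```

A main loop could hold one `Schedule` per row above, for example `Schedule::new(Duration::from_secs(300))` for disk monitoring, and call `due(Instant::now())` on each tick.
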
## License

MIT License - see LICENSE file for details.