# CM Dashboard - Infrastructure Monitoring TUI

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. It was built to replace Glance with a custom solution tailored to specific monitoring needs and API integrations. It monitors all infrastructure components in real time, sends email notifications, and shows status that the agents calculate automatically.

### System Widget
```
┌System───────────────────────────────────────────────────────┐
│  Memory usage                                               │
│✔ 3.0 / 7.8 GB                                               │
│  CPU load                     CPU temp                      │
│✔ 1.05 • 0.96 • 0.58           64.0°C                        │
│  C1E     C3      C6      C8      C9      C10                │
│✔ 0.5%    0.5%    10.4%   10.2%   0.4%    77.9%              │
│  GPU load                     GPU temp                      │
│✔ —                            —                             │
└─────────────────────────────────────────────────────────────┘
```

### Services Widget (Enhanced)
```
┌Services────────────────────────────────────────────────────┐
│  Service             Memory (GB)      CPU      Disk      │
│✔ Service Memory        7.1/23899.7 MiB  —                  │
│✔ Disk Usage            —                —        45/100 GB │
│⚠ CPU Load              —                2.18     —         │
│✔ CPU Temperature       —                47.0°C   —         │
│✔ docker-registry       0.0 GB           0.0%     <1 MB     │
│✔ gitea                 0.4/4.1 GB       0.2%     970 MB    │
│    1 active connections                                    │
│✔ nginx                 0.0/1.0 GB       0.0%     <1 MB     │
│✔ ├─ docker.cmtec.se                                        │
│✔ ├─ git.cmtec.se                                           │
│✔ ├─ gitea.cmtec.se                                         │
│✔ ├─ haasp.cmtec.se                                         │
│✔ ├─ pages.cmtec.se                                         │
│✔ └─ www.kryddorten.se                                      │
│✔ postgresql            0.1 GB           0.0%     378 MB    │
│    1 active connections                                    │
│✔ redis-immich          0.0 GB           0.4%     <1 MB     │
│✔ sshd                  0.0 GB           0.0%     <1 MB     │
│    1 SSH connection                                        │
│✔ unifi                 0.9/2.0 GB       0.4%     391 MB    │
└────────────────────────────────────────────────────────────┘
```

### Storage Widget
```
┌Storage──────────────────────────────────────────────────────┐
│  Drive     Temp   Wear  Spare  Hours   Capacity  Usage      │
│✔ nvme0n1   57°C   4%    100%   11463   932G      23G (2%)   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Backups Widget
```
┌Backups──────────────────────────────────────────────────────┐
│  Backup      Status        Details                          │
│✔ Latest      3h ago        1.4 GiB                          │
│                            8 archives, 2.4 GiB total        │
│✔ Disk        ok            2.4/468 GB (1%)                  │
└─────────────────────────────────────────────────────────────┘
```

### Hosts Widget
```
┌Hosts────────────────────────────────────────────────────────┐
│  Host        Status                Timestamp                │
│✔ cmbox       ok                    2025-10-13 05:45:28      │
│✔ srv01       ok                    2025-10-13 05:45:28      │
│? labbox      No data received      —                        │
└─────────────────────────────────────────────────────────────┘
```

**Navigation**: `←→` hosts, `r` refresh, `q` quit

## Key Features

### Real-time Monitoring
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **ZMQ gossip network** for efficient data distribution

### Infrastructure Monitoring
- **NVMe health monitoring** with wear prediction and temperature tracking
- **CPU/Memory/GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **C-state monitoring** for CPU power management analysis

### Intelligent Alerting
- **Agent-calculated status** with predefined thresholds
- **Email notifications** via SMTP with rate limiting
- **Recovery notifications** with context about original issues
- **Stockholm timezone** support for email timestamps
- **Unified alert pipeline** summarizing host health

## Architecture

### Agent-Dashboard Separation
The system follows a strict separation of concerns:

- **Agent**: Single source of truth for all status calculations using defined thresholds
- **Dashboard**: Display-only interface that shows agent-provided status
- **Data Flow**: Agent (calculations) → Status → Dashboard (display) → Colors

### Agent Thresholds (Production)
- **CPU Load**: Warning ≥ 5.0, Critical ≥ 8.0
- **Memory Usage**: Warning ≥ 80%, Critical ≥ 95%
- **CPU Temperature**: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)

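The thresholds above can be read as a simple inclusive comparison. The following is a minimal sketch under that assumption, not the agent's actual code; the `Status` and `Thresholds` types and the constant names are hypothetical, and only the numeric values come from the list.

```rust
/// Health status as computed by the agent (type name is illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// A warning/critical pair; a reading at or above a bound trips it.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub warning: f64,
    pub critical: f64,
}

impl Thresholds {
    /// Classify a reading. Critical is checked first, so it wins when
    /// both bounds are equal (as with the CPU temperature values).
    pub fn classify(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

// Values copied from the threshold list; the constant names are assumptions.
pub const CPU_LOAD: Thresholds = Thresholds { warning: 5.0, critical: 8.0 };
pub const MEMORY_PERCENT: Thresholds = Thresholds { warning: 80.0, critical: 95.0 };
pub const CPU_TEMP_C: Thresholds = Thresholds { warning: 100.0, critical: 100.0 };
```

With equal warning and critical bounds, a reading can never land in the warning state, which is one way to read "effectively disabled" for CPU temperature.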
### Email Notification System
- **From**: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- **To**: `cm@cmtec.se`
- **SMTP**: localhost:25 (postfix)
- **Rate Limiting**: 30 minutes (configurable)
- **Triggers**: Status degradation and recovery with detailed context

## Installation

### Requirements
- Rust toolchain 1.75+ (install via [`rustup`](https://rustup.rs))
- Root privileges for agent (hardware monitoring access)
- Network access for ZMQ communication (default port 6130)
- SMTP server for notifications (postfix recommended)

### Build from Source
```bash
git clone https://github.com/cmtec/cm-dashboard.git
cd cm-dashboard
cargo build --release
```

The optimized binaries are written to:
- Dashboard: `target/release/cm-dashboard`
- Agent: `target/release/cm-dashboard-agent`

### Installation
```bash
# Install dashboard
cargo install --path dashboard

# Install agent (requires root for hardware access)
sudo cargo install --path agent
```

## Quick Start

### Dashboard
```bash
# Run with default configuration
cm-dashboard

# Specify host to monitor
cm-dashboard --host cmbox

# Override ZMQ endpoints
cm-dashboard --zmq-endpoint tcp://srv01:6130,tcp://labbox:6130

# Increase logging verbosity
cm-dashboard -v
```

### Agent (Pure Auto-Discovery)
The agent requires **no configuration files** and auto-discovers all system components:

```bash
# Basic agent startup (auto-detects everything)
sudo cm-dashboard-agent

# With verbose logging for troubleshooting
sudo cm-dashboard-agent -v
```

The agent automatically:
- **Discovers storage devices** for SMART monitoring
- **Detects running systemd services** for resource tracking
- **Configures collection intervals** based on system capabilities
- **Sets up email notifications** using hostname@cmtec.se

## Configuration

### Dashboard Configuration
The dashboard creates `config/dashboard.toml` on first run:

```toml
[hosts]
default_host = "srv01"

[[hosts.hosts]]
name = "srv01"
enabled = true

[[hosts.hosts]]
name = "cmbox"
enabled = true

[dashboard]
tick_rate_ms = 250
history_duration_minutes = 60

[data_source]
kind = "zmq"

[data_source.zmq]
endpoints = ["tcp://127.0.0.1:6130"]
```

### Agent Configuration (Optional)
The agent works without configuration but supports optional settings:

```bash
# List the available options
cm-dashboard-agent --help

# Override specific settings (the bind address is quoted so the shell does not glob the *)
sudo cm-dashboard-agent \
  --hostname cmbox \
  --bind "tcp://*:6130" \
  --interval 5000
```

## Widget Layout

### Services Widget Structure
The Services widget now displays both system metrics and services in a unified table:

```
┌Services────────────────────────────────────────────────────┐
│  Service             Memory (GB)      CPU      Disk      │
│✔ Service Memory        7.1/23899.7 MiB  —                  │ ← System metric as service row
│✔ Disk Usage            —                —        45/100 GB │ ← System metric as service row
│⚠ CPU Load              —                2.18     —         │ ← System metric as service row
│✔ CPU Temperature       —                47.0°C   —         │ ← System metric as service row
│✔ docker-registry       0.0 GB           0.0%     <1 MB     │ ← Regular service
│✔ nginx                 0.0/1.0 GB       0.0%     <1 MB     │ ← Regular service
│✔ ├─ docker.cmtec.se                                        │ ← Nginx site (sub-service)
│✔ ├─ git.cmtec.se                                           │ ← Nginx site (sub-service)
│✔ └─ gitea.cmtec.se                                         │ ← Nginx site (sub-service)
│✔ sshd                  0.0 GB           0.0%     <1 MB     │ ← Regular service
│    1 SSH connection                                        │ ← Service description
└────────────────────────────────────────────────────────────┘
```

**Row Types:**
- **System Metrics**: CPU Load, Service Memory, Disk Usage, CPU Temperature with status indicators
- **Regular Services**: Full resource data (memory, CPU, disk) with optional description lines
- **Sub-services**: Nginx sites with tree structure, status indicators only (no resource columns)
- **Description Lines**: Connection counts and service-specific info without status indicators

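The row types listed above map naturally onto an enum with one variant per kind of line. This is an illustrative sketch only: the `Row` type, its field names, and the column widths are assumptions made for the example and are not taken from the dashboard source.

```rust
/// One line in the Services table (type and field names are hypothetical).
pub enum Row {
    /// System metric or regular service: status icon, name, three resource columns.
    Service { icon: char, name: String, memory: String, cpu: String, disk: String },
    /// Nginx site shown beneath its parent; `last` selects the closing branch glyph.
    SubService { icon: char, name: String, last: bool },
    /// Free-text detail such as a connection count; carries no status icon.
    Description(String),
}

/// Render a row as plain text. Column widths here are arbitrary example values.
pub fn render(row: &Row) -> String {
    match row {
        Row::Service { icon, name, memory, cpu, disk } => {
            format!("{icon} {name:<18}{memory:<14}{cpu:<8}{disk}")
        }
        Row::SubService { icon, name, last } => {
            // The final site in a group closes the tree with "└─".
            let branch = if *last { "└─" } else { "├─" };
            format!("{icon} {branch} {name}")
        }
        Row::Description(text) => format!("    {text}"),
    }
}
```

Keeping sub-services and description lines as separate variants means the renderer cannot accidentally print resource columns or a status icon for them.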
### Hosts Widget (formerly Alerts)
The Hosts widget provides a summary view of all monitored hosts:

```
┌Hosts────────────────────────────────────────────────────────┐
│  Host        Status                Timestamp                │
│✔ cmbox       ok                    2025-10-13 05:45:28      │
│✔ srv01       ok                    2025-10-13 05:45:28      │
│? labbox      No data received      —                        │
└─────────────────────────────────────────────────────────────┘
```

## Monitoring Components

### System Collector
- **CPU Load**: 1/5/15 minute averages with warning/critical thresholds
- **Memory Usage**: Used/total with percentage calculation
- **CPU Temperature**: x86_pkg_temp prioritized for accuracy
- **C-States**: Power management state distribution (C0-C10)

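As a rough illustration of the first two items, the sketch below parses load averages and computes a memory percentage. It assumes the load averages come from the text of `/proc/loadavg`, which this README does not state; the function names are hypothetical.

```rust
/// Parse the 1/5/15 minute load averages from text in the format of
/// `/proc/loadavg`, e.g. "1.05 0.96 0.58 2/1234 56789".
/// Returns `None` if fewer than three numeric fields are present.
pub fn parse_loadavg(contents: &str) -> Option<(f64, f64, f64)> {
    let mut fields = contents.split_whitespace();
    let one: f64 = fields.next()?.parse().ok()?;
    let five: f64 = fields.next()?.parse().ok()?;
    let fifteen: f64 = fields.next()?.parse().ok()?;
    Some((one, five, fifteen))
}

/// Memory usage as a percentage of total; returns 0.0 when total is not positive.
pub fn memory_percent(used: f64, total: f64) -> f64 {
    if total > 0.0 {
        used / total * 100.0
    } else {
        0.0
    }
}
```

For the System widget example above, `memory_percent(3.0, 7.8)` is roughly 38.5, well below the 80% warning threshold.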
### Service Collector
- **System Metrics as Services**: CPU Load, Service Memory, Disk Usage, CPU Temperature displayed as individual service rows
- **Systemd Services**: Auto-discovery of interesting services with resource monitoring
- **Nginx Site Monitoring**: Individual rows for each nginx virtual host with tree structure (`├─` and `└─`)
- **Resource Usage**: Per-service memory, CPU, and disk consumption
- **Service Health**: Running/stopped/degraded status with detailed failure info
- **Connection Tracking**: SSH connections, database connections as description lines

### SMART Collector
- **NVMe Health**: Temperature, wear leveling, spare blocks
- **Drive Capacity**: Total/used space with percentage
- **SMART Attributes**: Critical health indicators

### Backup Collector
- **Restic Integration**: Backup status and history
- **Health Monitoring**: Success/failure tracking
- **Storage Metrics**: Backup size and retention

## Keyboard Controls

| Key | Action |
|-----|--------|
| `←` / `h` | Previous host |
| `→` / `l` / `Tab` | Next host |
| `?` | Toggle help overlay |
| `r` | Force refresh |
| `q` / `Esc` | Quit |

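The key table translates directly into a match expression. The sketch below uses a simplified, hypothetical `Key` enum as a stand-in for whatever key type the terminal library provides; the `Action` names are likewise made up for the example.

```rust
/// Simplified stand-in for a terminal library's key type (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Char(char),
    Left,
    Right,
    Tab,
    Esc,
}

/// Actions from the keyboard controls table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    PreviousHost,
    NextHost,
    ToggleHelp,
    Refresh,
    Quit,
}

/// Map a key press to an action; unbound keys yield `None`.
pub fn action_for(key: Key) -> Option<Action> {
    match key {
        Key::Left | Key::Char('h') => Some(Action::PreviousHost),
        Key::Right | Key::Char('l') | Key::Tab => Some(Action::NextHost),
        Key::Char('?') => Some(Action::ToggleHelp),
        Key::Char('r') => Some(Action::Refresh),
        Key::Char('q') | Key::Esc => Some(Action::Quit),
        _ => None,
    }
}
```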
## Email Notifications

### Notification Triggers
- **Status Degradation**: Any status change to warning/critical
- **Recovery**: Warning/critical status returning to ok
- **Service Failures**: Individual service stop/start events

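The first two triggers can be expressed as a function of the previous and current status. This sketch follows the wording of the list literally, so a change from critical to warning also counts as a status change to warning; whether the agent treats that case the same way is an assumption. All type and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Notification {
    /// Any status change that ends in warning or critical.
    Degradation,
    /// Warning or critical returning to ok.
    Recovery,
}

/// Decide whether a status transition should produce an email.
pub fn notification_for(previous: Status, current: Status) -> Option<Notification> {
    if previous == current {
        // No change, nothing to report.
        None
    } else if current == Status::Ok {
        Some(Notification::Recovery)
    } else {
        Some(Notification::Degradation)
    }
}
```

Service stop/start events would need their own input (a service state rather than a metric status) and are left out of this sketch.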
### Example Recovery Email
```
✅ RESOLVED: system cpu on cmbox

Status Change Alert

Host: cmbox
Component: system
Metric: cpu
Status Change: warning → ok
Time: 2025-10-12 22:15:30 CET

Details:
Recovered from: CPU load (1/5/15min): 6.20 / 5.80 / 4.50
Current status: CPU load (1/5/15min): 3.30 / 3.17 / 2.84

--
CM Dashboard Agent
Generated at 2025-10-12 22:15:30 CET
```

### Rate Limiting
- **Default**: 30 minutes between notifications per component
- **Testing**: Set to 0 for immediate notifications
- **Configurable**: Adjustable per deployment needs

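A per-component rate limit like the one described can be kept in a map from component name to the time of the last notification. The following is a minimal sketch, not the agent's implementation; `RateLimiter` and its methods are hypothetical names, and the caller passes in the current time so the logic is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each component last triggered a notification.
pub struct RateLimiter {
    min_interval: Duration,
    last_sent: HashMap<String, Instant>,
}

impl RateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sent: HashMap::new() }
    }

    /// Returns true, and records `now`, if a notification for `component`
    /// may be sent. A zero interval allows every notification.
    pub fn allow(&mut self, component: &str, now: Instant) -> bool {
        match self.last_sent.get(component) {
            // Still inside the quiet period: suppress.
            Some(&last) if now.duration_since(last) < self.min_interval => false,
            _ => {
                self.last_sent.insert(component.to_string(), now);
                true
            }
        }
    }
}
```

Constructing the limiter with `Duration::from_secs(30 * 60)` gives the default above, and `Duration::ZERO` corresponds to the testing setting.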
## Development

### Project Structure
```
cm-dashboard/
├── agent/                   # Monitoring agent
│   ├── src/
│   │   ├── collectors/      # Data collection modules
│   │   ├── notifications.rs # Email notification system
│   │   └── simple_agent.rs  # Main agent logic
├── dashboard/               # TUI dashboard
│   ├── src/
│   │   ├── ui/              # Widget implementations
│   │   ├── data/            # Data structures
│   │   └── app.rs           # Application state
├── shared/                  # Common data structures
└── config/                  # Configuration files
```

### Development Commands
```bash
# Format code
cargo fmt

# Check all packages
cargo check

# Run tests
cargo test

# Build release
cargo build --release

# Run with logging
RUST_LOG=debug cargo run -p cm-dashboard-agent
```

### Architecture Principles

#### Status Calculation Rules
- **Agent calculates all status** using predefined thresholds
- **Dashboard never calculates status** - only displays agent data
- **No hardcoded thresholds in dashboard** widgets
- **Use "unknown" when agent status missing** (never default to "ok")

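On the dashboard side, these rules amount to a pure mapping from the agent's status to something displayable, with "unknown" as the fallback. The sketch below assumes the agent sends status as lowercase strings, which this README does not specify; the type and function names are hypothetical, and the critical icon is a placeholder since no critical row appears in the widget examples.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisplayStatus {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Map the agent-provided status to a display status.
/// A missing or unrecognized value becomes `Unknown`, never `Ok`.
pub fn display_status(agent_status: Option<&str>) -> DisplayStatus {
    match agent_status {
        Some("ok") => DisplayStatus::Ok,
        Some("warning") => DisplayStatus::Warning,
        Some("critical") => DisplayStatus::Critical,
        _ => DisplayStatus::Unknown,
    }
}

/// Status icon for a row. '✔', '⚠', and '?' match the widget examples;
/// '✖' for critical is a placeholder chosen for this sketch.
pub fn icon(status: DisplayStatus) -> char {
    match status {
        DisplayStatus::Ok => '✔',
        DisplayStatus::Warning => '⚠',
        DisplayStatus::Critical => '✖',
        DisplayStatus::Unknown => '?',
    }
}
```

Note that no numeric value or threshold appears anywhere in this mapping, which is the point of the rules above.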
#### Data Flow
```
System Metrics → Agent Collectors → Status Calculation → ZMQ → Dashboard → Display
                                            ↓
                                    Email Notifications
```

#### Pure Auto-Discovery
- **No config files required** for basic operation
- **Runtime discovery** of system capabilities
- **Service auto-detection** via systemd patterns
- **Storage device enumeration** via /sys filesystem

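To illustrate the last item, storage devices can be listed by reading the entries of a directory such as `/sys/block`. This is only a sketch: the exact path the agent reads and the name filter below (skipping loop, ram, zram, and device-mapper devices) are assumptions, and both function names are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Heuristic name filter (an assumption for this sketch): skip virtual
/// block devices that have no SMART data.
pub fn is_candidate_disk(name: &str) -> bool {
    !(name.starts_with("loop")
        || name.starts_with("ram")
        || name.starts_with("zram")
        || name.starts_with("dm-"))
}

/// List candidate device names under `sys_block` (typically `/sys/block`),
/// sorted alphabetically. Returns an empty list if the directory cannot be read.
pub fn list_block_devices(sys_block: &Path) -> Vec<String> {
    let mut names: Vec<String> = match fs::read_dir(sys_block) {
        Ok(entries) => entries
            .filter_map(|entry| entry.ok())
            .filter_map(|entry| entry.file_name().into_string().ok())
            .filter(|name| is_candidate_disk(name))
            .collect(),
        Err(_) => Vec::new(),
    };
    names.sort();
    names
}
```

Calling `list_block_devices(Path::new("/sys/block"))` on a Linux host would return names such as `nvme0n1`, which could then be handed to the SMART collector.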
## Troubleshooting

### Common Issues

#### Agent Won't Start
```bash
# Check permissions (agent requires root)
sudo cm-dashboard-agent -v

# Verify ZMQ binding
sudo netstat -tulpn | grep 6130

# Check system access
sudo smartctl --scan
```

#### Dashboard Connection Issues
```bash
# Test ZMQ connectivity
cm-dashboard --zmq-endpoint tcp://target-host:6130 -v

# Check network connectivity
telnet target-host 6130
```

#### Email Notifications Not Working
```bash
# Check postfix status
sudo systemctl status postfix

# Test SMTP manually
telnet localhost 25

# Verify notification settings (2>&1 also captures log lines written to stderr)
sudo cm-dashboard-agent -v 2>&1 | grep -i notification
```

### Logging
Set `RUST_LOG=debug` for detailed logging. With `sudo`, pass the variable on the `sudo` command line, because `sudo` resets the environment by default:
```bash
sudo RUST_LOG=debug cm-dashboard-agent
RUST_LOG=debug cm-dashboard
```

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create feature branch (`git checkout -b feature/amazing-feature`)
3. Commit changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open Pull Request

For bugs and feature requests, please use GitHub Issues.

## NixOS Integration

### Updating cm-dashboard in NixOS Configuration

When new code is pushed to the cm-dashboard repository, follow these steps to update the NixOS configuration:

#### 1. Get the Latest Commit Hash
```bash
# Get the latest commit from the API
curl -s "https://gitea.cmtec.se/api/v1/repos/cm/cm-dashboard/commits?sha=main&limit=1" | head -20

# Or use git in an up-to-date checkout (prints the full hash)
git rev-parse HEAD
```

#### 2. Update the NixOS Configuration
Edit `hosts/common/cm-dashboard.nix` and update the `rev` field:
```nix
src = pkgs.fetchFromGitea {
  domain = "gitea.cmtec.se";
  owner = "cm";
  repo = "cm-dashboard";
  rev = "f786d054f2ece80823f85e46933857af96e241b2"; # Update this
  hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Reset temporarily
};
```

#### 3. Get the Correct Hash
Build with the placeholder hash. The build fails with a hash mismatch and reports the actual hash:
```bash
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchFromGitea {
  domain = "gitea.cmtec.se";
  owner = "cm";
  repo = "cm-dashboard";
  rev = "YOUR_COMMIT_HASH";
  hash = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

Example output (without the `grep` filter):
```
error: hash mismatch in fixed-output derivation '/nix/store/...':
  specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
  got:       sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
```

#### 4. Update the Hash
Replace the placeholder with the correct hash from the error message (the "got:" line):
```nix
hash = "sha256-vjy+j91iDCHUf0RE43anK4WZ+rKcyohP/3SykwZGof8="; # Example only: use the hash from your own "got:" line
```

#### 5. Update Cargo Dependencies (if needed)
If `Cargo.lock` has changed, you may need to update `cargoHash`:
```bash
# Build to get cargo hash error
nix-build --no-out-link --expr 'with import <nixpkgs> {}; rustPlatform.buildRustPackage rec {
  pname = "cm-dashboard";
  version = "0.1.0";
  src = fetchFromGitea {
    domain = "gitea.cmtec.se";
    owner = "cm";
    repo = "cm-dashboard";
    rev = "YOUR_COMMIT_HASH";
    hash = "YOUR_SOURCE_HASH";
  };
  cargoHash = "";
  nativeBuildInputs = [ pkg-config ];
  buildInputs = [ openssl ];
  buildAndTestSubdir = ".";
  cargoBuildFlags = [ "--workspace" ];
}' 2>&1 | grep "got:"
```

Then update `cargoHash` in the configuration with the reported value.

#### 6. Commit the Changes
```bash
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard to latest version"
git push
```

### Example Update Process
```bash
# 1. Get latest commit (requires jq; the API returns a JSON array of commits)
LATEST_COMMIT=$(curl -s "https://gitea.cmtec.se/api/v1/repos/cm/cm-dashboard/commits?sha=main&limit=1" | jq -r '.[0].sha')

# 2. Get source hash (second whitespace-separated field on the "got:" line)
SOURCE_HASH=$(nix-build --no-out-link -E "with import <nixpkgs> {}; fetchFromGitea { domain = \"gitea.cmtec.se\"; owner = \"cm\"; repo = \"cm-dashboard\"; rev = \"$LATEST_COMMIT\"; hash = \"sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=\"; }" 2>&1 | awk '/got:/ {print $2}')

# 3. Update configuration and commit
echo "Latest commit: $LATEST_COMMIT"
echo "Source hash: $SOURCE_HASH"
```