# CM Dashboard - Infrastructure Monitoring TUI

A high-performance Rust TUI dashboard for monitoring CMTEC infrastructure. It replaces Glance with a custom solution tailored to CMTEC's monitoring needs and API integrations, and provides real-time monitoring of all infrastructure components with email notifications and automatic status calculation.

```
┌───────────────────────────────────────────────────────────────────────┐
│ CM Dashboard • cmbox                                                  │
├───────────────────────────────────────────────────────────────────────┤
│ Storage • ok:1 warn:0 crit:0      │ Services • ok:1 warn:0 fail:0     │
│ ┌───────────────────────────────┐ │ ┌───────────────────────────────┐ │
│ │Drive   Temp  Wear  Spare Hours│ │ │Service memory: 7.1/23899.7 MiB│ │
│ │nvme0n1 28°C  1%    100%  14489│ │ │Disk usage: —                  │ │
│ │        Capacity    Usage      │ │ │  Service     Memory    Disk   │ │
│ │        954G        77G (8%)   │ │ │✔ sshd        7.1 MiB   —      │ │
│ └───────────────────────────────┘ │ └───────────────────────────────┘ │
├───────────────────────────────────────────────────────────────────────┤
│ CPU / Memory • warn               │ Backups                           │
│ System memory: 5251.7/23899.7 MiB │ Host cmbox awaiting backup        │
│ CPU load (1/5/15): 2.18 2.66 2.56 │ metrics                           │
│ CPU freq: 1100.1 MHz              │                                   │
│ CPU temp: 47.0°C                  │                                   │
├───────────────────────────────────────────────────────────────────────┤
│ Alerts • ok:0 warn:3 fail:0       │ Status • ZMQ connected            │
│ cmbox: warning: CPU load 2.18     │ Monitoring • hosts: 3             │
│ srv01: pending: awaiting metrics  │ Data source: ZMQ – connected      │
│ labbox: pending: awaiting metrics │ Active host: cmbox (1/3)          │
└───────────────────────────────────────────────────────────────────────┘
Keys: [←→] hosts [r]efresh [q]uit
```

## Key Features

### Real-time Monitoring
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **ZMQ gossip network** for efficient data distribution

### Infrastructure Monitoring
- **NVMe health monitoring** with wear prediction and temperature tracking
- **CPU/Memory/GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **C-state monitoring** for CPU power management analysis

### Intelligent Alerting
- **Agent-calculated status** with predefined thresholds
- **Email notifications** via SMTP with rate limiting
- **Recovery notifications** with context about original issues
- **Stockholm timezone** support for email timestamps
- **Unified alert pipeline** summarizing host health

## Architecture

### Agent-Dashboard Separation
The system follows a strict separation of concerns:

- **Agent**: Single source of truth for all status calculations using defined thresholds
- **Dashboard**: Display-only interface that shows agent-provided status
- **Data Flow**: Agent (calculations) → Status → Dashboard (display) → Colors

### Agent Thresholds (Production)
- **CPU Load**: Warning ≥ 5.0, Critical ≥ 8.0
- **Memory Usage**: Warning ≥ 80%, Critical ≥ 95%
- **CPU Temperature**: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)

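The thresholds above can be read as a small classification function. The following is a minimal sketch of that idea, not the agent's actual code: the `Status` and `Threshold` types, the constant names, and the `worst` helper are all hypothetical; only the numeric limits come from the list above.

```rust
/// Health states, ordered from best to worst (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// A warning/critical pair for one metric (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Threshold {
    pub warning: f64,
    pub critical: f64,
}

impl Threshold {
    /// Classify a reading; a value at or above a limit trips that limit.
    pub fn evaluate(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

// Limits taken from the production thresholds listed above.
pub const CPU_LOAD: Threshold = Threshold { warning: 5.0, critical: 8.0 };
pub const MEMORY_PERCENT: Threshold = Threshold { warning: 80.0, critical: 95.0 };
pub const CPU_TEMP_C: Threshold = Threshold { warning: 100.0, critical: 100.0 };

/// Combine several metric statuses; the worst one wins.
/// Returns `None` for an empty slice instead of assuming `Ok`.
pub fn worst(statuses: &[Status]) -> Option<Status> {
    statuses.iter().copied().max()
}
```

Because the CPU temperature warning and critical limits are equal, a reading either stays `Ok` or jumps straight to `Critical`, which matches the "effectively disabled" note for the warning level.
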
### Email Notification System
- **From**: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- **To**: `cm@cmtec.se`
- **SMTP**: localhost:25 (postfix)
- **Rate Limiting**: 30 minutes (configurable)
- **Triggers**: Status degradation and recovery with detailed context

## Installation

### Requirements
- Rust toolchain 1.75+ (install via [`rustup`](https://rustup.rs))
- Root privileges for agent (hardware monitoring access)
- Network access for ZMQ communication (default port 6130)
- SMTP server for notifications (postfix recommended)

### Build from Source
```bash
git clone https://github.com/cmtec/cm-dashboard.git
cd cm-dashboard
cargo build --release
```

The optimized binaries are written to:
- Dashboard: `target/release/cm-dashboard`
- Agent: `target/release/cm-dashboard-agent`

### Install with Cargo
```bash
# Install dashboard
cargo install --path dashboard

# Install agent (requires root for hardware access)
sudo cargo install --path agent
```

## Quick Start

### Dashboard
```bash
# Run with default configuration
cm-dashboard

# Specify host to monitor
cm-dashboard --host cmbox

# Override ZMQ endpoints
cm-dashboard --zmq-endpoint tcp://srv01:6130,tcp://labbox:6130

# Increase logging verbosity
cm-dashboard -v
```

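The `--zmq-endpoint` example above passes several endpoints as one comma-separated value. A minimal sketch of how such a value could be split and sanity-checked follows. It assumes every endpoint has the form `tcp://host:port`; the function name `parse_endpoints` and its error messages are hypothetical and not taken from the dashboard source.

```rust
/// Split a comma-separated endpoint list and check that every entry
/// looks like `tcp://host:port`. Hypothetical helper for illustration.
pub fn parse_endpoints(arg: &str) -> Result<Vec<String>, String> {
    let mut endpoints = Vec::new();
    for raw in arg.split(',') {
        let endpoint = raw.trim();
        if endpoint.is_empty() {
            continue; // tolerate stray commas
        }
        let rest = endpoint
            .strip_prefix("tcp://")
            .ok_or_else(|| format!("expected tcp://host:port, got: {endpoint}"))?;
        let (host, port) = rest
            .rsplit_once(':')
            .ok_or_else(|| format!("missing port in endpoint: {endpoint}"))?;
        if host.is_empty() || port.parse::<u16>().is_err() {
            return Err(format!("invalid host or port in endpoint: {endpoint}"));
        }
        endpoints.push(endpoint.to_string());
    }
    if endpoints.is_empty() {
        return Err("no endpoints given".to_string());
    }
    Ok(endpoints)
}
```

Validating up front means a typo in an endpoint is reported at startup rather than surfacing later as a silent "awaiting metrics" state.
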
### Agent (Pure Auto-Discovery)
The agent requires **no configuration files** and auto-discovers all system components:

```bash
# Basic agent startup (auto-detects everything)
sudo cm-dashboard-agent

# With verbose logging for troubleshooting
sudo cm-dashboard-agent -v
```

The agent automatically:
- **Discovers storage devices** for SMART monitoring
- **Detects running systemd services** for resource tracking
- **Configures collection intervals** based on system capabilities
- **Sets up email notifications** using hostname@cmtec.se

## Configuration

### Dashboard Configuration
The dashboard creates `config/dashboard.toml` on first run:

```toml
[hosts]
default_host = "srv01"

[[hosts.hosts]]
name = "srv01"
enabled = true

[[hosts.hosts]]
name = "cmbox"
enabled = true

[dashboard]
tick_rate_ms = 250
history_duration_minutes = 60

[data_source]
kind = "zmq"

[data_source.zmq]
endpoints = ["tcp://127.0.0.1:6130"]
```

### Agent Configuration (Optional)
The agent works without configuration but supports optional settings:

```bash
# List the available options
cm-dashboard-agent --help

# Override specific settings
sudo cm-dashboard-agent \
  --hostname cmbox \
  --bind tcp://*:6130 \
  --interval 5000
```

## Monitoring Components

### System Collector
- **CPU Load**: 1/5/15 minute averages with warning/critical thresholds
- **Memory Usage**: Used/total with percentage calculation
- **CPU Temperature**: x86_pkg_temp prioritized for accuracy
- **C-States**: Power management state distribution (C0-C10)

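On Linux, the 1/5/15 minute load averages are exposed as the first three fields of `/proc/loadavg`. The sketch below shows one way a collector could read them. It assumes that file is the data source, which this README does not state, and the `LoadAverage` type and function names are hypothetical.

```rust
use std::fs;
use std::io;

/// The three load averages reported by the kernel (illustrative type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LoadAverage {
    pub one: f64,
    pub five: f64,
    pub fifteen: f64,
}

/// Parse the contents of `/proc/loadavg`, e.g. "2.18 2.66 2.56 3/1234 56789".
/// Only the first three whitespace-separated fields are used.
pub fn parse_loadavg(contents: &str) -> Option<LoadAverage> {
    let mut fields = contents.split_whitespace();
    let one: f64 = fields.next()?.parse().ok()?;
    let five: f64 = fields.next()?.parse().ok()?;
    let fifteen: f64 = fields.next()?.parse().ok()?;
    Some(LoadAverage { one, five, fifteen })
}

/// Read and parse the live value from the kernel.
pub fn read_loadavg() -> io::Result<LoadAverage> {
    let contents = fs::read_to_string("/proc/loadavg")?;
    parse_loadavg(&contents).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "unexpected /proc/loadavg format")
    })
}
```

Keeping the parser separate from the file read makes it testable with fixed strings on any platform.
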
### Service Collector
- **Systemd Services**: Auto-discovery of interesting services
- **Resource Usage**: Per-service memory and disk consumption
- **Service Health**: Running/stopped status with detailed failure info

### SMART Collector
- **NVMe Health**: Temperature, wear leveling, spare blocks
- **Drive Capacity**: Total/used space with percentage
- **SMART Attributes**: Critical health indicators

### Backup Collector
- **Restic Integration**: Backup status and history
- **Health Monitoring**: Success/failure tracking
- **Storage Metrics**: Backup size and retention

## Keyboard Controls

| Key | Action |
|-----|--------|
| `←` / `h` | Previous host |
| `→` / `l` / `Tab` | Next host |
| `?` | Toggle help overlay |
| `r` | Force refresh |
| `q` / `Esc` | Quit |

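The key table above maps naturally onto a small dispatch function. The sketch below is illustrative only: the `Key` and `Action` enums are hypothetical stand-ins for whatever terminal event types the dashboard uses, and wrapping around at the ends of the host list is an assumption.

```rust
/// Simplified key events (hypothetical; a real TUI would use its
/// terminal library's event type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Key {
    Left,
    Right,
    Tab,
    Esc,
    Char(char),
}

/// Actions listed in the keyboard table.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    PreviousHost,
    NextHost,
    ToggleHelp,
    Refresh,
    Quit,
}

/// Translate a key press into an action; unbound keys yield `None`.
pub fn action_for(key: Key) -> Option<Action> {
    match key {
        Key::Left | Key::Char('h') => Some(Action::PreviousHost),
        Key::Right | Key::Tab | Key::Char('l') => Some(Action::NextHost),
        Key::Char('?') => Some(Action::ToggleHelp),
        Key::Char('r') => Some(Action::Refresh),
        Key::Esc | Key::Char('q') => Some(Action::Quit),
        _ => None,
    }
}

/// Move the active host index, wrapping at both ends (assumed behavior).
pub fn cycle_host(current: usize, host_count: usize, action: &Action) -> usize {
    if host_count == 0 {
        return 0;
    }
    match action {
        Action::NextHost => (current + 1) % host_count,
        Action::PreviousHost => (current + host_count - 1) % host_count,
        _ => current,
    }
}
```

With three hosts, pressing `→` on the last host returns to index 0, and `←` on the first host jumps to index 2.
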
## Email Notifications

### Notification Triggers
- **Status Degradation**: Any status change to warning/critical
- **Recovery**: Warning/critical status returning to ok
- **Service Failures**: Individual service stop/start events

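The first two triggers amount to comparing the previous and current status of a component. A minimal sketch of that comparison follows, reading the rules above literally (any change that lands on warning or critical alerts; any return to ok is a recovery). The type and function names are hypothetical, and the subject format mirrors the recovery email example in this README.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Which kind of email, if any, a status transition should produce.
#[derive(Debug, PartialEq, Eq)]
pub enum Trigger {
    Alert,
    Recovery,
}

/// Compare the previous and current status of one component.
/// No change means no notification.
pub fn classify(previous: Status, current: Status) -> Option<Trigger> {
    if previous == current {
        None
    } else if current == Status::Ok {
        Some(Trigger::Recovery)
    } else {
        Some(Trigger::Alert)
    }
}

/// Build a recovery subject line in the style of the example email.
pub fn recovery_subject(host: &str, component: &str, metric: &str) -> String {
    format!("✅ RESOLVED: {component} {metric} on {host}")
}
```

Note that under this literal reading a critical → warning transition also produces an alert; whether the agent treats partial improvements that way is not specified here.
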
### Example Recovery Email
```
✅ RESOLVED: system cpu on cmbox

Status Change Alert

Host: cmbox
Component: system
Metric: cpu
Status Change: warning → ok
Time: 2025-10-12 22:15:30 CET

Details:
Recovered from: CPU load (1/5/15min): 6.20 / 5.80 / 4.50
Current status: CPU load (1/5/15min): 3.30 / 3.17 / 2.84

--
CM Dashboard Agent
Generated at 2025-10-12 22:15:30 CET
```

### Rate Limiting
- **Default**: 30 minutes between notifications per component
- **Testing**: Set to 0 for immediate notifications
- **Configurable**: Adjustable per deployment needs

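Per-component rate limiting can be sketched as a map from component name to the time of the last email sent. This is an illustration under assumptions rather than the agent's implementation: the `RateLimiter` type and the component key format (`"system.cpu"`) are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Allows at most one notification per component within `min_interval`.
pub struct RateLimiter {
    min_interval: Duration,
    last_sent: HashMap<String, Instant>,
}

impl RateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self {
            min_interval,
            last_sent: HashMap::new(),
        }
    }

    /// Returns `true` and records the send time if a notification for
    /// `component` may go out at `now`; returns `false` while throttled.
    pub fn allow(&mut self, component: &str, now: Instant) -> bool {
        match self.last_sent.get(component) {
            Some(&last) if now.duration_since(last) < self.min_interval => false,
            _ => {
                self.last_sent.insert(component.to_string(), now);
                true
            }
        }
    }
}
```

With the default of `Duration::from_secs(30 * 60)`, a second alert for the same component within half an hour is dropped, while a different component is unaffected. An interval of `Duration::ZERO` lets every notification through, matching the testing setting above.
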
## Development

### Project Structure
```
cm-dashboard/
├── agent/                      # Monitoring agent
│   ├── src/
│   │   ├── collectors/         # Data collection modules
│   │   ├── notifications.rs    # Email notification system
│   │   └── simple_agent.rs     # Main agent logic
├── dashboard/                  # TUI dashboard
│   ├── src/
│   │   ├── ui/                 # Widget implementations
│   │   ├── data/               # Data structures
│   │   └── app.rs              # Application state
├── shared/                     # Common data structures
└── config/                     # Configuration files
```

### Development Commands
```bash
# Format code
cargo fmt

# Check all packages
cargo check

# Run tests
cargo test

# Build release
cargo build --release

# Run with logging
RUST_LOG=debug cargo run -p cm-dashboard-agent
```

### Architecture Principles

#### Status Calculation Rules
- **Agent calculates all status** using predefined thresholds
- **Dashboard never calculates status** - only displays agent data
- **No hardcoded thresholds in dashboard** widgets
- **Use "unknown" when agent status missing** (never default to "ok")

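On the dashboard side, these rules mean a widget only translates the agent's status into a display style and falls back to "unknown" when nothing usable arrived. A minimal sketch, assuming the agent sends the status as a lowercase string; the enum, the function names, and the color names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Interpret the status string from the agent. A missing or
/// unrecognized value becomes `Unknown`, never `Ok`.
pub fn parse_status(raw: Option<&str>) -> Status {
    match raw {
        Some("ok") => Status::Ok,
        Some("warning") => Status::Warning,
        Some("critical") => Status::Critical,
        _ => Status::Unknown,
    }
}

/// Pick a display color for a status (color choices are illustrative).
/// No thresholds appear here: the dashboard only maps status to style.
pub fn color_name(status: Status) -> &'static str {
    match status {
        Status::Ok => "green",
        Status::Warning => "yellow",
        Status::Critical => "red",
        Status::Unknown => "gray",
    }
}
```

A host that has not reported yet therefore renders in the "unknown" style instead of appearing healthy.
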
#### Data Flow
```
System Metrics → Agent Collectors → Status Calculation → ZMQ → Dashboard → Display
                                            ↓
                                   Email Notifications
```

#### Pure Auto-Discovery
- **No config files required** for basic operation
- **Runtime discovery** of system capabilities
- **Service auto-detection** via systemd patterns
- **Storage device enumeration** via /sys filesystem

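Storage enumeration via the /sys filesystem can be as simple as listing `/sys/block` and skipping virtual devices. The sketch below illustrates that approach under assumptions: the function names and the list of prefixes to skip are hypothetical, and the real agent may filter differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Heuristic filter: skip loop, RAM, device-mapper and optical devices.
/// The prefix list is an assumption for this sketch.
pub fn is_physical_disk(name: &str) -> bool {
    const VIRTUAL_PREFIXES: [&str; 5] = ["loop", "ram", "zram", "dm-", "sr"];
    !VIRTUAL_PREFIXES.iter().any(|prefix| name.starts_with(prefix))
}

/// List candidate device names (e.g. "nvme0n1", "sda") found under
/// `sys_block`, which is normally `/sys/block`.
pub fn list_block_devices(sys_block: &Path) -> io::Result<Vec<String>> {
    let mut devices = Vec::new();
    for entry in fs::read_dir(sys_block)? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        if is_physical_disk(&name) {
            devices.push(name);
        }
    }
    devices.sort();
    Ok(devices)
}
```

Calling `list_block_devices(Path::new("/sys/block"))` on a typical Linux host yields the devices that a SMART collector could then probe. Taking the directory as a parameter keeps the function testable against a temporary directory.
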
## Troubleshooting

### Common Issues

#### Agent Won't Start
```bash
# Check permissions (agent requires root)
sudo cm-dashboard-agent -v

# Verify ZMQ binding
sudo netstat -tulpn | grep 6130

# Check system access
sudo smartctl --scan
```

#### Dashboard Connection Issues
```bash
# Test ZMQ connectivity
cm-dashboard --zmq-endpoint tcp://target-host:6130 -v

# Check network connectivity
telnet target-host 6130
```

#### Email Notifications Not Working
```bash
# Check postfix status
sudo systemctl status postfix

# Test SMTP manually
telnet localhost 25

# Verify notification settings (include stderr in case logs go there)
sudo cm-dashboard-agent -v 2>&1 | grep notification
```

### Logging
Set `RUST_LOG=debug` for detailed logging. For the agent, put the variable after `sudo`, because `sudo` resets the environment by default:
```bash
sudo RUST_LOG=debug cm-dashboard-agent
RUST_LOG=debug cm-dashboard
```

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a pull request

For bugs and feature requests, please use GitHub Issues.