
# CM Dashboard - Infrastructure Monitoring TUI

A high-performance Rust TUI dashboard for monitoring CMTEC infrastructure. It replaces Glance with a custom solution tailored to CMTEC's monitoring needs and API integrations, and provides real-time monitoring of all infrastructure components with agent-calculated status and rate-limited email notifications.

### System Widget
```
┌System───────────────────────────────────────────────────────┐
│  Memory usage                                               │
│✔ 3.0 / 7.8 GB                                               │
│  CPU load            CPU temp                               │
│✔ 1.05 • 0.96 • 0.58  64.0°C                                 │
│  C1E    C3     C6     C8     C9     C10                     │
│✔ 0.5%   0.5%   10.4%  10.2%  0.4%   77.9%                   │
│  GPU load  GPU temp                                         │
│✔ —         —                                                │
└─────────────────────────────────────────────────────────────┘
```

### Services Widget (Enhanced)
```
┌Services────────────────────────────────────────────────────┐
│  Service          Memory (GB)  CPU    Disk                 │
│✔ Service Memory   7.1/23899.7 MiB     —                   │
│✔ Disk Usage       —           —       45/100 GB           │
│⚠ CPU Load         —           2.18    —                   │
│✔ CPU Temperature  —           47.0°C  —                   │
│✔ docker-registry  0.0 GB       0.0%   <1 MB               │
│✔ gitea            0.4/4.1 GB   0.2%   970 MB               │
│  1 active connections                                      │
│✔ nginx            0.0/1.0 GB   0.0%   <1 MB                │
│✔  ├─ docker.cmtec.se                                      │
│✔  ├─ git.cmtec.se                                         │
│✔  ├─ gitea.cmtec.se                                       │
│✔  ├─ haasp.cmtec.se                                       │
│✔  ├─ pages.cmtec.se                                       │
│✔  └─ www.kryddorten.se                                    │
│✔ postgresql       0.1 GB       0.0%   378 MB               │
│  1 active connections                                      │
│✔ redis-immich     0.0 GB       0.4%   <1 MB                │
│✔ sshd             0.0 GB       0.0%   <1 MB                │
│  1 SSH connection                                          │
│✔ unifi            0.9/2.0 GB   0.4%   391 MB               │
└────────────────────────────────────────────────────────────┘
```

### Storage Widget
```
┌Storage──────────────────────────────────────────────────────┐
│  Drive    Temp   Wear   Spare  Hours  Capacity  Usage       │
│✔ nvme0n1  57°C   4%     100%   11463  932G      23G (2%)    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Backups Widget
```
┌Backups──────────────────────────────────────────────────────┐
│  Backup  Status  Details                                    │
│✔ Latest  3h ago  1.4 GiB                                    │
│  8 archives, 2.4 GiB total                                  │
│✔ Disk    ok      2.4/468 GB (1%)                            │
└─────────────────────────────────────────────────────────────┘
```

### Hosts Widget
```
┌Hosts────────────────────────────────────────────────────────┐
│  Host    Status            Timestamp                        │
│✔ cmbox   ok                2025-10-13 05:45:28              │
│✔ srv01   ok                2025-10-13 05:45:28              │
│? labbox  No data received  —                                │
└─────────────────────────────────────────────────────────────┘
```

**Navigation**: `←→` hosts, `r` refresh, `q` quit

## Key Features

### Real-time Monitoring
- **Multi-host support** for cmbox, labbox, simonbox, steambox, srv01
- **Performance-focused** with minimal resource usage
- **Keyboard-driven interface** for power users
- **ZMQ gossip network** for efficient data distribution

### Infrastructure Monitoring
- **NVMe health monitoring** with wear prediction and temperature tracking
- **CPU/Memory/GPU telemetry** with automatic thresholding
- **Service resource monitoring** with per-service CPU and RAM usage
- **Disk usage overview** for root filesystems
- **Backup status** with detailed metrics and history
- **C-state monitoring** for CPU power management analysis

### Intelligent Alerting
- **Agent-calculated status** with predefined thresholds
- **Email notifications** via SMTP with rate limiting
- **Recovery notifications** with context about original issues
- **Stockholm timezone** support for email timestamps
- **Unified alert pipeline** summarizing host health

## Architecture

### Agent-Dashboard Separation
The system follows a strict separation of concerns:

- **Agent**: Single source of truth for all status calculations using defined thresholds
- **Dashboard**: Display-only interface that shows agent-provided status
- **Data Flow**: Agent (calculations) → Status → Dashboard (display) → Colors

### Agent Thresholds (Production)
- **CPU Load**: Warning ≥ 5.0, Critical ≥ 8.0
- **Memory Usage**: Warning ≥ 80%, Critical ≥ 95%
- **CPU Temperature**: Warning ≥ 100°C, Critical ≥ 100°C (effectively disabled)
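
The threshold rules above can be sketched as a small classifier. This is an illustration only: `Status`, `Thresholds`, and the constant names are hypothetical and not taken from the agent's source; only the numeric bounds come from the list above.

```rust
/// Health status computed by the agent (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// A warning/critical pair; a reading at or above a bound trips it.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub warning: f64,
    pub critical: f64,
}

impl Thresholds {
    /// Classify a reading. The critical bound is checked first, so equal
    /// bounds (as for CPU temperature) can never produce `Warning`.
    pub fn classify(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

// Production values from the list above.
pub const CPU_LOAD: Thresholds = Thresholds { warning: 5.0, critical: 8.0 };
pub const MEMORY_PERCENT: Thresholds = Thresholds { warning: 80.0, critical: 95.0 };
pub const CPU_TEMP_C: Thresholds = Thresholds { warning: 100.0, critical: 100.0 };
```

With these values, `CPU_LOAD.classify(6.2)` yields `Status::Warning`, and a CPU temperature only changes status once it reaches 100°C.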

### Email Notification System
- **From**: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- **To**: `cm@cmtec.se`
- **SMTP**: localhost:25 (postfix)
- **Rate Limiting**: 30 minutes (configurable)
- **Triggers**: Status degradation and recovery with detailed context

## Installation

### Requirements
- Rust toolchain 1.75+ (install via [`rustup`](https://rustup.rs))
- Root privileges for agent (hardware monitoring access)
- Network access for ZMQ communication (default port 6130)
- SMTP server for notifications (postfix recommended)

### Build from Source
```bash
git clone https://github.com/cmtec/cm-dashboard.git
cd cm-dashboard
cargo build --release
```

The optimized binaries are written to:
- Dashboard: `target/release/cm-dashboard`
- Agent: `target/release/cm-dashboard-agent`

### Install Binaries
```bash
# Install dashboard
cargo install --path dashboard

# Install agent (requires root for hardware access)
sudo cargo install --path agent
```

## Quick Start

### Dashboard
```bash
# Run with default configuration
cm-dashboard

# Specify host to monitor
cm-dashboard --host cmbox

# Override ZMQ endpoints
cm-dashboard --zmq-endpoint tcp://srv01:6130,tcp://labbox:6130

# Increase logging verbosity
cm-dashboard -v
```

### Agent (Pure Auto-Discovery)
The agent requires **no configuration files** and auto-discovers all system components:

```bash
# Basic agent startup (auto-detects everything)
sudo cm-dashboard-agent

# With verbose logging for troubleshooting
sudo cm-dashboard-agent -v
```

The agent automatically:
- **Discovers storage devices** for SMART monitoring
- **Detects running systemd services** for resource tracking
- **Configures collection intervals** based on system capabilities
- **Sets up email notifications** using hostname@cmtec.se

## Configuration

### Dashboard Configuration
The dashboard creates `config/dashboard.toml` on first run:

```toml
[hosts]
default_host = "srv01"

[[hosts.hosts]]
name = "srv01"
enabled = true

[[hosts.hosts]]
name = "cmbox"
enabled = true

[dashboard]
tick_rate_ms = 250
history_duration_minutes = 60

[data_source]
kind = "zmq"

[data_source.zmq]
endpoints = ["tcp://127.0.0.1:6130"]
```

### Agent Configuration (Optional)
The agent works without configuration but supports optional settings:

```bash
# Show the available command-line options
cm-dashboard-agent --help

# Override specific settings
sudo cm-dashboard-agent \
    --hostname cmbox \
    --bind tcp://*:6130 \
    --interval 5000
```

## Widget Layout

### Services Widget Structure
The Services widget now displays both system metrics and services in a unified table:

```
┌Services────────────────────────────────────────────────────┐
│  Service          Memory (GB)  CPU    Disk                 │
│✔ Service Memory   7.1/23899.7 MiB     —                   │ ← System metric as service row
│✔ Disk Usage       —           —       45/100 GB           │ ← System metric as service row
│⚠ CPU Load         —           2.18    —                   │ ← System metric as service row
│✔ CPU Temperature  —           47.0°C  —                   │ ← System metric as service row
│✔ docker-registry  0.0 GB      0.0%    <1 MB               │ ← Regular service
│✔ nginx            0.0/1.0 GB  0.0%    <1 MB               │ ← Regular service
│✔  ├─ docker.cmtec.se                                      │ ← Nginx site (sub-service)
│✔  ├─ git.cmtec.se                                         │ ← Nginx site (sub-service)
│✔  └─ gitea.cmtec.se                                       │ ← Nginx site (sub-service)
│✔ sshd             0.0 GB      0.0%    <1 MB               │ ← Regular service
│  1 SSH connection                                          │ ← Service description
└────────────────────────────────────────────────────────────┘
```

**Row Types:**
- **System Metrics**: CPU Load, Service Memory, Disk Usage, CPU Temperature with status indicators
- **Regular Services**: Full resource data (memory, CPU, disk) with optional description lines
- **Sub-services**: Nginx sites with tree structure, status indicators only (no resource columns)
- **Description Lines**: Connection counts and service-specific info without status indicators
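
One way to model these row types is an enum with one variant per rendering style. The sketch below is hypothetical (the `Row` type, its fields, and the column widths are assumptions, not the dashboard's actual code), and it reduces status to a boolean for brevity.

```rust
/// One line of the Services widget (hypothetical model).
pub enum Row {
    /// System metric or regular service: status icon plus resource columns.
    Service { ok: bool, name: String, memory: String, cpu: String, disk: String },
    /// Nginx site: status icon and tree branch, no resource columns.
    SubService { ok: bool, name: String, last: bool },
    /// Free-form detail line without a status icon.
    Description(String),
}

/// Simplified icon choice; the real widget also has an "unknown" state.
fn icon(ok: bool) -> char {
    if ok { '✔' } else { '⚠' }
}

/// Render a row to a single line of text (column widths are illustrative).
pub fn render(row: &Row) -> String {
    match row {
        Row::Service { ok, name, memory, cpu, disk } => {
            format!("{} {:<16} {:<12} {:<6} {}", icon(*ok), name, memory, cpu, disk)
        }
        Row::SubService { ok, name, last } => {
            // The last site in a group closes the tree with `└─`.
            let branch = if *last { "└─" } else { "├─" };
            format!("{}  {} {}", icon(*ok), branch, name)
        }
        Row::Description(text) => format!("  {}", text),
    }
}
```

Keeping the row kind explicit means the renderer, not the data, decides which columns appear.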

### Hosts Widget (formerly Alerts)
The Hosts widget provides a summary view of all monitored hosts:

```
┌Hosts────────────────────────────────────────────────────────┐
│  Host    Status            Timestamp                        │
│✔ cmbox   ok                2025-10-13 05:45:28              │
│✔ srv01   ok                2025-10-13 05:45:28              │
│? labbox  No data received  —                                │
└─────────────────────────────────────────────────────────────┘
```

## Monitoring Components

### System Collector
- **CPU Load**: 1/5/15 minute averages with warning/critical thresholds
- **Memory Usage**: Used/total with percentage calculation
- **CPU Temperature**: x86_pkg_temp prioritized for accuracy
- **C-States**: Power management state distribution (C0-C10)
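
On Linux, the 1/5/15 minute load averages are the first three fields of `/proc/loadavg`. A minimal sketch of reading them follows; the function names are hypothetical and the agent's actual collector may work differently.

```rust
/// Parse the three load averages from the contents of `/proc/loadavg`,
/// for example "1.05 0.96 0.58 2/1234 5678".
pub fn parse_loadavg(contents: &str) -> Option<(f64, f64, f64)> {
    let mut fields = contents.split_whitespace();
    let one: f64 = fields.next()?.parse().ok()?;
    let five: f64 = fields.next()?.parse().ok()?;
    let fifteen: f64 = fields.next()?.parse().ok()?;
    Some((one, five, fifteen))
}

/// Read the current load averages; `Ok(None)` means the file was unparsable.
pub fn read_loadavg() -> std::io::Result<Option<(f64, f64, f64)>> {
    let contents = std::fs::read_to_string("/proc/loadavg")?;
    Ok(parse_loadavg(&contents))
}
```

Separating parsing from file access keeps the parser testable on machines without `/proc`.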

### Service Collector
- **System Metrics as Services**: CPU Load, Service Memory, Disk Usage, CPU Temperature displayed as individual service rows
- **Systemd Services**: Auto-discovery of interesting services with resource monitoring
- **Nginx Site Monitoring**: Individual rows for each nginx virtual host with tree structure (`├─` and `└─`)
- **Resource Usage**: Per-service memory, CPU, and disk consumption
- **Service Health**: Running/stopped/degraded status with detailed failure info
- **Connection Tracking**: SSH connections, database connections as description lines

### SMART Collector
- **NVMe Health**: Temperature, wear leveling, spare blocks
- **Drive Capacity**: Total/used space with percentage
- **SMART Attributes**: Critical health indicators

### Backup Collector
- **Restic Integration**: Backup status and history
- **Health Monitoring**: Success/failure tracking
- **Storage Metrics**: Backup size and retention

## Keyboard Controls

| Key | Action |
|-----|--------|
| `←` / `h` | Previous host |
| `→` / `l` / `Tab` | Next host |
| `?` | Toggle help overlay |
| `r` | Force refresh |
| `q` / `Esc` | Quit |

## Email Notifications

### Notification Triggers
- **Status Degradation**: Any status change to warning/critical
- **Recovery**: Warning/critical status returning to ok
- **Service Failures**: Individual service stop/start events
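
The first two triggers reduce to comparing the previous and current status. The sketch below reads the rules literally, so any change that lands on warning or critical counts as an alert (including critical → warning). The type and function names are hypothetical, not the agent's API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Notification {
    /// Status changed to warning or critical.
    Alert,
    /// Status returned to ok from warning or critical.
    Recovery,
}

/// Decide whether a status transition should produce a notification.
pub fn notification_for(previous: Status, current: Status) -> Option<Notification> {
    match (previous, current) {
        // No change, nothing to report.
        (a, b) if a == b => None,
        // Any change that ends at ok is a recovery.
        (_, Status::Ok) => Some(Notification::Recovery),
        // Every other change ends at warning or critical.
        _ => Some(Notification::Alert),
    }
}
```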

### Example Recovery Email
```
✅ RESOLVED: system cpu on cmbox

Status Change Alert

Host: cmbox
Component: system
Metric: cpu
Status Change: warning → ok
Time: 2025-10-12 22:15:30 CET

Details:
Recovered from: CPU load (1/5/15min): 6.20 / 5.80 / 4.50
Current status: CPU load (1/5/15min): 3.30 / 3.17 / 2.84

--
CM Dashboard Agent
Generated at 2025-10-12 22:15:30 CET
```

### Rate Limiting
- **Default**: 30 minutes between notifications per component
- **Testing**: Set to 0 for immediate notifications
- **Configurable**: Adjustable per deployment needs
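
Per-component rate limiting can be sketched with a map from component name to the time of the last notification. This is an assumption about the approach, not the agent's implementation; `RateLimiter` and the component keys are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Suppresses repeat notifications for the same component.
pub struct RateLimiter {
    min_interval: Duration,
    last_sent: HashMap<String, Instant>,
}

impl RateLimiter {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sent: HashMap::new() }
    }

    /// Returns `true` (and records `now`) if a notification for `component`
    /// may be sent; returns `false` while the component is still rate limited.
    pub fn allow(&mut self, component: &str, now: Instant) -> bool {
        match self.last_sent.get(component) {
            Some(&last) if now.saturating_duration_since(last) < self.min_interval => false,
            _ => {
                self.last_sent.insert(component.to_string(), now);
                true
            }
        }
    }
}
```

With `Duration::from_secs(30 * 60)` this matches the default; with `Duration::ZERO` every call is allowed, which corresponds to the testing setting of 0.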

## Development

### Project Structure
```
cm-dashboard/
├── agent/                 # Monitoring agent
│   ├── src/
│   │   ├── collectors/    # Data collection modules
│   │   ├── notifications.rs # Email notification system
│   │   └── simple_agent.rs # Main agent logic
├── dashboard/             # TUI dashboard
│   ├── src/
│   │   ├── ui/           # Widget implementations
│   │   ├── data/         # Data structures
│   │   └── app.rs        # Application state
├── shared/               # Common data structures
└── config/              # Configuration files
```

### Development Commands
```bash
# Format code
cargo fmt

# Check all packages
cargo check

# Run tests
cargo test

# Build release
cargo build --release

# Run with logging
RUST_LOG=debug cargo run -p cm-dashboard-agent
```

### Architecture Principles

#### Status Calculation Rules
- **Agent calculates all status** using predefined thresholds
- **Dashboard never calculates status** - only displays agent data
- **No hardcoded thresholds in dashboard** widgets
- **Use "unknown" when agent status is missing** (never default to "ok")
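
On the dashboard side, these rules amount to a pure mapping from the agent's status to a display value, with no thresholds involved. The sketch below assumes the status arrives as an optional lowercase string; that wire format, the function names, and the `✖` icon for critical are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

/// Map the agent-provided status to a display status.
/// A missing or unrecognized value becomes `Unknown`, never `Ok`.
pub fn status_from_agent(raw: Option<&str>) -> Status {
    match raw {
        Some("ok") => Status::Ok,
        Some("warning") => Status::Warning,
        Some("critical") => Status::Critical,
        _ => Status::Unknown,
    }
}

/// Icon shown in the first column of a widget row.
pub fn icon(status: Status) -> char {
    match status {
        Status::Ok => '✔',
        Status::Warning => '⚠',
        Status::Critical => '✖', // assumed; the README only shows ✔, ⚠ and ?
        Status::Unknown => '?',
    }
}
```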

#### Data Flow
```
System Metrics → Agent Collectors → Status Calculation → ZMQ → Dashboard → Display
                                         ↓
                                 Email Notifications
```

#### Pure Auto-Discovery
- **No config files required** for basic operation
- **Runtime discovery** of system capabilities
- **Service auto-detection** via systemd patterns
- **Storage device enumeration** via /sys filesystem
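
Storage enumeration via `/sys` can be sketched by listing `/sys/block`, which contains one entry per whole block device. The filter below is an assumed heuristic for skipping virtual devices; the agent's real selection logic is not documented here, and the function names are hypothetical.

```rust
use std::fs;
use std::io;

/// Heuristic filter: skip loop, RAM-backed, and device-mapper entries.
pub fn is_physical_disk(name: &str) -> bool {
    const VIRTUAL_PREFIXES: [&str; 4] = ["loop", "ram", "zram", "dm-"];
    !VIRTUAL_PREFIXES.iter().any(|prefix| name.starts_with(prefix))
}

/// List candidate devices for SMART monitoring, e.g. `["nvme0n1", "sda"]`.
pub fn list_block_devices() -> io::Result<Vec<String>> {
    let mut devices = Vec::new();
    for entry in fs::read_dir("/sys/block")? {
        let name = entry?.file_name().to_string_lossy().into_owned();
        if is_physical_disk(&name) {
            devices.push(name);
        }
    }
    devices.sort();
    Ok(devices)
}
```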

## Troubleshooting

### Common Issues

#### Agent Won't Start
```bash
# Check permissions (agent requires root)
sudo cm-dashboard-agent -v

# Verify ZMQ binding
sudo netstat -tulpn | grep 6130

# Check system access
sudo smartctl --scan
```

#### Dashboard Connection Issues
```bash
# Test ZMQ connectivity
cm-dashboard --zmq-endpoint tcp://target-host:6130 -v

# Check network connectivity
telnet target-host 6130
```

#### Email Notifications Not Working
```bash
# Check postfix status
sudo systemctl status postfix

# Test SMTP manually
telnet localhost 25

# Verify notification settings
sudo cm-dashboard-agent -v 2>&1 | grep -i notification
```

### Logging
Set `RUST_LOG=debug` for detailed logging:
```bash
# sudo resets the environment by default, so set the variable after sudo
sudo RUST_LOG=debug cm-dashboard-agent
RUST_LOG=debug cm-dashboard
```

## License

MIT License - see LICENSE file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

For bugs and feature requests, please use GitHub Issues.