Updated readme

Christoffer Martinsson 2025-10-21 20:47:30 +02:00
parent 0417e2c1f1
commit 1315ba1315


@@ -5,6 +5,7 @@ A real-time infrastructure monitoring system with intelligent status aggregation
## Current Implementation

This is a complete rewrite implementing an **individual metrics architecture** where:
- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
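
A minimal sketch of the split described above, assuming a hypothetical `Metric` struct and `select_metrics` helper (neither name comes from the project): the agent emits flat, individually named samples, and a widget picks out only the names it subscribes to. Status calculation is left out here.

```rust
/// Hypothetical shape of one metric sample; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: String,
    pub value: f64,
}

/// Returns the metrics whose names appear in `subscribed`, preserving input order.
pub fn select_metrics<'a>(all: &'a [Metric], subscribed: &[&str]) -> Vec<&'a Metric> {
    all.iter()
        .filter(|m| subscribed.contains(&m.name.as_str()))
        .collect()
}
```

A CPU widget would then call something like `select_metrics(&all, &["cpu_load_1min"])` and ignore everything else the agent sent.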
@@ -14,39 +15,39 @@ This is a complete rewrite implementing an **individual metrics architecture** w
```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
│CPU:                                ││Service:                 Status:  RAM:   Disk:   │
│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                 active   27M    496MB   │
│RAM:                                ││● docker-registry        active   19M    496MB   │
│● Used: 30% 2.3GB/7.6GB             ││● gitea                  active   579M   2.6GB   │
│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default   active   11M    2.6GB   │
│Disk nvme0n1:                       ││● haasp-core             active   9M     1MB     │
│● Health: PASSED                    ││● haasp-mqtt             active   3M     1MB     │
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid          active   10M    1MB     │
│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server          active   240M   45.1GB  │
│                                    ││● mosquitto              active   1M     1MB     │
│                                    ││● mysql                  active   38M    225MB   │
│                                    ││● nginx                  active   28M    24MB    │
│                                    ││  ├─ ● gitea.cmtec.se    51ms                    │
│                                    ││  ├─ ● haasp.cmtec.se    43ms                    │
│                                    ││  ├─ ● haasp.net         43ms                    │
│                                    ││  ├─ ● pages.cmtec.se    45ms                    │
└────────────────────────────────────┘│  ├─ ● photos.cmtec.se   41ms                    │
┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se    46ms                    │
│Latest backup:                      ││  ├─ ● vault.cmtec.se    47ms                    │
│● Status: OK                        ││  ├─ ● www.kryddorten.se 81ms                    │
│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se 86ms                    │
│Disk usage: 48.2GB/915.8GB          ││● postgresql             active   112M   357MB   │
│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich           active   8M     45.1GB  │
│S/N: S5RRNF0W800639Y                ││● sshd                   active   2M     0       │
│● gitea 2 archives 2.7GB            ││● unifi                  active   594M   495MB   │
│● immich 2 archives 45.0GB          ││● vaultwarden            active   12M    1MB     │
│● kryddorten 2 archives 67.6MB      ││                                                 │
│● mariehall2 2 archives 321.8MB     ││                                                 │
│● nixosbox 2 archives 4.5MB         ││                                                 │
│● unifi 2 archives 2.9MB            ││                                                 │
│● vaultwarden 2 archives 305kB      ││                                                 │
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```
**Navigation**: `←→` switch hosts, `r` refresh, `q` quit
@@ -54,7 +55,7 @@ cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
## Features
- **Real-time monitoring** - Dashboard updates every 1-2 seconds
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
@@ -66,7 +67,7 @@ cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
### Core Components
- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts
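
To illustrate what batching could look like, here is a small sketch. `StatusChange` and `Batcher` are hypothetical names, and the time-window approach is an assumption rather than the project's actual notification code. Changes are queued, and the whole batch is released once the window opened by the first queued change has elapsed.

```rust
/// One status transition awaiting notification; the fields are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct StatusChange {
    pub service: String,
    pub from: String,
    pub to: String,
}

/// Collects changes and hands them out as one batch per time window.
pub struct Batcher {
    window_secs: u64,
    window_start: Option<u64>,
    pending: Vec<StatusChange>,
}

impl Batcher {
    pub fn new(window_secs: u64) -> Self {
        Self { window_secs, window_start: None, pending: Vec::new() }
    }

    /// Queues a change; the window opens at the first queued change.
    pub fn push(&mut self, now_secs: u64, change: StatusChange) {
        if self.window_start.is_none() {
            self.window_start = Some(now_secs);
        }
        self.pending.push(change);
    }

    /// Returns the whole batch once the window has elapsed, otherwise `None`.
    pub fn flush_if_due(&mut self, now_secs: u64) -> Option<Vec<StatusChange>> {
        let start = self.window_start?;
        if now_secs.saturating_sub(start) < self.window_secs {
            return None;
        }
        self.window_start = None;
        Some(std::mem::take(&mut self.pending))
    }
}
```

With a 30-second window, two failures that arrive ten seconds apart would end up in a single email instead of two.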
@@ -74,7 +75,7 @@ cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
### Status Levels
- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined
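
One way to derive a host-level status from these levels is to take the most severe one. The sketch below assumes a severity order (in particular, where `Unknown` ranks is a guess) and a hypothetical `aggregate` function; the project's real rules may differ.

```rust
/// Status levels from the list above, declared in an ASSUMED order of
/// increasing severity so that the derived `Ord` can be used for comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Pending,
    Unknown,
    Warning,
    Critical,
}

/// Host status is the worst service status; an empty list yields `Unknown`.
pub fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Unknown)
}
```

Relying on the declaration order keeps the aggregation to a single `max()` call, at the cost of making the variant order significant.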
@@ -98,7 +99,7 @@ cargo build --workspace
# Start agent (requires configuration file)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml
# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml
```
@@ -199,30 +200,35 @@ theme = "dark"
The agent implements several specialized collectors:

### CPU Collector (`cpu.rs`)

- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds
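
As a rough illustration of threshold-based status calculation, the sketch below parses a `/proc/loadavg`-style line and compares one load value against a warning and a critical limit. `Thresholds`, `parse_loadavg` and `load_status` are hypothetical names, and the threshold values in any real configuration are not shown in this README.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical threshold pair; the real configuration keys are an assumption.
pub struct Thresholds {
    pub warning: f64,
    pub critical: f64,
}

/// Parses the 1, 5 and 15 minute load averages from a `/proc/loadavg`-style line.
pub fn parse_loadavg(line: &str) -> Option<[f64; 3]> {
    let mut fields = line.split_whitespace();
    let one: f64 = fields.next()?.parse().ok()?;
    let five: f64 = fields.next()?.parse().ok()?;
    let fifteen: f64 = fields.next()?.parse().ok()?;
    Some([one, five, fifteen])
}

/// Maps a load value to a status; a value equal to a limit counts as exceeding it.
pub fn load_status(load: f64, t: &Thresholds) -> Status {
    if load >= t.critical {
        Status::Critical
    } else if load >= t.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```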

### Memory Collector (`memory.rs`)

- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection

### Disk Collector (`disk.rs`)

- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring

### Systemd Collector (`systemd.rs`)

- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)
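
The mapping from systemd unit states to dashboard status levels could look roughly like this. The function name is hypothetical, and treating `inactive` as a warning is an assumption; only `active`, `failed` and the transitional states are implied by the descriptions in this README.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

/// Maps a systemd `ActiveState` string to a status level.
pub fn status_from_active_state(state: &str) -> Status {
    match state {
        "active" => Status::Ok,
        // Transitional states: the unit is starting, stopping or reloading.
        "activating" | "deactivating" | "reloading" => Status::Pending,
        // ASSUMPTION: a monitored service that is not running is worth a warning.
        "inactive" => Status::Warning,
        "failed" => Status::Critical,
        _ => Status::Unknown,
    }
}
```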

### Backup Collector (`backup.rs`)

- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
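
Archive age verification can be pictured as a comparison of the time since the last backup against two limits. This is only a sketch: `backup_age_status` and both limit parameters are hypothetical, and parsing of the TOML status file is left out.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Classifies a backup by the time elapsed since it finished.
/// Both limits are hypothetical parameters, not documented settings.
pub fn backup_age_status(age: Duration, warn_after: Duration, critical_after: Duration) -> Status {
    if age >= critical_after {
        Status::Critical
    } else if age >= warn_after {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

For example, a backup that finished four hours ago would be `Ok` if the warning limit were set to 24 hours.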
@@ -270,6 +276,7 @@ Generated at 2025-10-21 19:42:42 CET
The system follows a **metrics-first architecture**:
### Agent Side

```rust
// Agent collects individual metrics
vec![
@@ -280,6 +287,7 @@ vec![
```
### Dashboard Side

```rust
// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
@@ -337,7 +345,7 @@ cm-dashboard/
# Debug build
cargo build --workspace
# Release build
cargo build --workspace --release
# Run tests
@@ -395,11 +403,12 @@ sudo nixos-rebuild switch --flake .
- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)
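
The collection intervals above could drive a simple tick-based schedule. In this sketch the collector names in `INTERVALS` and the `due_collectors` function are hypothetical; only the interval values come from the list. It assumes the agent loop ticks once per second.

```rust
use std::time::Duration;

/// Collector names are hypothetical; the intervals are the ones listed above.
pub const INTERVALS: [(&str, Duration); 5] = [
    ("cpu_memory", Duration::from_secs(2)),
    ("disk_usage", Duration::from_secs(300)),
    ("systemd", Duration::from_secs(10)),
    ("smart", Duration::from_secs(600)),
    ("backup", Duration::from_secs(60)),
];

/// Returns the collectors that are due `elapsed` after agent start,
/// assuming the caller ticks on whole seconds.
pub fn due_collectors(elapsed: Duration) -> Vec<&'static str> {
    INTERVALS
        .iter()
        .filter(|(_, every)| elapsed.as_secs() % every.as_secs() == 0)
        .map(|(name, _)| *name)
        .collect()
}
```

Ten seconds in, only the CPU/memory and systemd collectors would run; at the ten-minute mark all five line up.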

## License
MIT License - see LICENSE file for details