# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling
### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- The flag is cleared automatically when a service is restarted via the dashboard
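The override described above can be sketched as follows. This is an illustrative example only: `service_status` and this `Status` enum are hypothetical names, not the agent's actual API.

```rust
// Hypothetical sketch of the user-stopped override; `service_status`
// and `Status` are illustrative names, not the actual agent API.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Warning,
}

/// A service that is inactive but was stopped intentionally via the
/// dashboard should not raise a warning.
fn service_status(is_active: bool, user_stopped: bool) -> Status {
    if is_active || user_stopped {
        Status::Ok
    } else {
        Status::Warning
    }
}

fn main() {
    // Stopped by an operator via the dashboard: no alert.
    assert_eq!(service_status(false, true), Status::Ok);
    // Crashed or stopped outside the dashboard: warn.
    assert_eq!(service_status(false, false), Status::Warning);
    println!("ok");
}
```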
### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations
### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard
## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {
          "mount": "/tmp",
          "usage_percent": 15.0,
          "used_gb": 0.3,
          "total_gb": 2.0
        }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {
              "mount": "/",
              "usage_percent": 24.0,
              "used_gb": 224.9,
              "total_gb": 928.2
            }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
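The type-safe access path can be sketched with plain structs mirroring a slice of the JSON above. This is only an illustration of the shape: the real `AgentData` presumably derives serde's `Serialize`/`Deserialize` and carries many more fields.

```rust
// Illustrative sketch of the typed shapes implied by the JSON payload.
// The real `AgentData` likely uses serde derives; these hand-written
// structs only demonstrate compiler-checked field access.
#[allow(dead_code)]
struct Filesystem {
    mount: String,
    usage_percent: f64,
}

#[allow(dead_code)]
struct Drive {
    name: String,
    temperature_celsius: f64,
    filesystems: Vec<Filesystem>,
}

struct Storage {
    drives: Vec<Drive>,
}

struct System {
    storage: Storage,
}

struct AgentData {
    hostname: String,
    system: System,
}

fn main() {
    let data = AgentData {
        hostname: "cmbox".into(),
        system: System {
            storage: Storage {
                drives: vec![Drive {
                    name: "nvme0n1".into(),
                    temperature_celsius: 29.0,
                    filesystems: vec![Filesystem {
                        mount: "/".into(),
                        usage_percent: 24.0,
                    }],
                }],
            },
        },
    };
    // No string parsing: the compiler checks every field access.
    let temp = data.system.storage.drives[0].temperature_celsius;
    assert_eq!(temp, 29.0);
    println!("{} drive temp: {}", data.hostname, temp);
}
```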
### Cached Collector Architecture (🚧 PLANNED)

**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.

**Previous (Sequential Blocking):**

```
Every 1 second:
└─ collect_all_data() [BLOCKS for 2-10+ seconds]
   ├─ CPU (fast: 10ms)
   ├─ Memory (fast: 20ms)
   ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
   ├─ Service disk usage (slow: 2-8s per service)
   └─ Docker (medium: 500ms)
└─ send_via_zmq() [Only after ALL collection completes]

Result: If any collector takes >10s → "host offline" false alert
```

**New (Cached Independent Collectors):**

```
Shared Cache: Arc<RwLock<AgentData>>

Background Collectors (independent async tasks):
├─ Fast collectors (CPU, RAM, Network)
│  └─ Update cache every 1 second
├─ Medium collectors (Services, Docker)
│  └─ Update cache every 5 seconds
└─ Slow collectors (Disk usage, SMART data)
   └─ Update cache every 60 seconds

ZMQ Sender (separate async task):
Every 1 second:
└─ Read current cache
   └─ Send via ZMQ [Always instant, never blocked]
```

**Benefits:**

- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking

**Implementation:**

- Shared `AgentData` cache wrapped in `Arc<RwLock<>>`
- Each collector spawned as an independent tokio task
- Collectors update their section of the cache at their own rate
- ZMQ sender reads the cache every 1s and transmits
- Stale data is acceptable for slow-changing metrics (disk usage, SMART)
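A minimal sketch of the planned cache sharing, using `std::thread` and `std::sync::RwLock` to stand in for the tokio tasks and `tokio::sync::RwLock` a real agent would use. The `AgentData` fields here are placeholders.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

// Minimal stand-in for the real AgentData payload.
#[derive(Clone, Default)]
struct AgentData {
    cpu_load: f64,
    smart_ok: bool,
}

fn main() {
    let cache = Arc::new(RwLock::new(AgentData::default()));

    // Fast collector: updates only its slice of the cache.
    let fast = Arc::clone(&cache);
    let fast_task = thread::spawn(move || {
        fast.write().unwrap().cpu_load = 0.42;
    });

    // Slow collector: may take seconds, but only its final write
    // briefly holds the lock.
    let slow = Arc::clone(&cache);
    let slow_task = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulate a slow SMART query
        slow.write().unwrap().smart_ok = true;
    });

    fast_task.join().unwrap();
    slow_task.join().unwrap();

    // Sender: reads whatever is currently cached and transmits it,
    // without ever waiting on collection itself.
    let snapshot = cache.read().unwrap().clone();
    assert_eq!(snapshot.cpu_load, 0.42);
    assert!(snapshot.smart_ok);
    println!("send snapshot: load={}", snapshot.cpu_load);
}
```

In the real design the sender would tick on a 1-second interval and accept whatever snapshot is present, stale or not, which is what keeps ZMQ transmission decoupled from collector latency.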
### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```
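The check itself amounts to a file-existence gate in front of the notifier. In this sketch, `should_notify` and its parameters are illustrative names, not the agent's actual API.

```rust
use std::path::Path;

// Sketch of the notification gate; `should_notify` and its parameters
// are illustrative names, not the agent's actual API.
fn should_notify(maintenance_flag: &Path, status_is_bad: bool) -> bool {
    // Monitoring always runs; only outbound notifications are gated
    // on the flag file's presence.
    status_is_bad && !maintenance_flag.exists()
}

fn main() {
    let flag = Path::new("/tmp/cm-maintenance");
    if should_notify(flag, true) {
        println!("would send notification email");
    } else {
        println!("notification suppressed (maintenance mode or status OK)");
    }
}
```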
## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - the agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```
## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
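Steps 1-2 above can be sketched as a parser over one `/proc/mounts` line. Field positions follow the standard `<source> <mountpoint> <fstype> <options> ...` layout; `parse_mergerfs` is an illustrative name, and the real collector may structure this differently.

```rust
// Illustrative sketch of MergerFS detection from a /proc/mounts line;
// `parse_mergerfs` is a hypothetical name, not the agent's actual API.
fn parse_mergerfs(line: &str) -> Option<(String, Vec<String>)> {
    // /proc/mounts lines: "<source> <mountpoint> <fstype> <options> ..."
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 3 || fields[2] != "fuse.mergerfs" {
        return None;
    }
    // MergerFS encodes member branches as a colon-separated source,
    // e.g. "/mnt/disk1:/mnt/disk2".
    let members = fields[0].split(':').map(str::to_string).collect();
    Some((fields[1].to_string(), members))
}

fn main() {
    let line = "/mnt/disk1:/mnt/disk2 /srv/media fuse.mergerfs rw,relatime 0 0";
    let (mount, members) = parse_mergerfs(line).unwrap();
    assert_eq!(mount, "/srv/media");
    assert_eq!(members, vec!["/mnt/disk1", "/mnt/disk2"]);
    println!("pool at {} with {} members", mount, members.len());
}
```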
### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detected from `fuse.mergerfs` entries in `/proc/mounts`
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery
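The healthy/degraded/critical calculation mentioned above could look like the following. The aggregation rule here (any failed drive → critical, any warning → degraded) is an assumption for illustration, not the agent's documented thresholds.

```rust
// Hypothetical sketch of pool health aggregation; the rule below is an
// assumption for illustration, not the agent's actual thresholds.
#[derive(Debug, PartialEq)]
enum PoolHealth {
    Healthy,
    Degraded,
    Critical,
}

/// Aggregate member-drive states into a pool status: all drives OK is
/// healthy, any failed drive is critical, anything in between degraded.
fn pool_health(drives_ok: usize, drives_warn: usize, drives_failed: usize) -> PoolHealth {
    if drives_failed > 0 {
        PoolHealth::Critical
    } else if drives_warn > 0 {
        PoolHealth::Degraded
    } else if drives_ok > 0 {
        PoolHealth::Healthy
    } else {
        PoolHealth::Critical // no visible members: treat as critical
    }
}

fn main() {
    assert_eq!(pool_health(3, 0, 0), PoolHealth::Healthy);
    assert_eq!(pool_health(2, 1, 0), PoolHealth::Degraded);
    assert_eq!(pool_health(2, 0, 1), PoolHealth::Critical);
    println!("ok");
}
```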
### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```
### Display Format

```
Network:
● eno1:
  ├─ ip: 192.168.30.105
  └─ tailscale0: 100.125.108.16
● eno2:
  └─ ip: 192.168.32.105
CPU:
● Load: 0.23 0.21 0.13
  └─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
  ├─ ● /tmp: 2% 0.5GB/2GB
  └─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● 844B9A25 T: 25C W: 4%
  ├─ ● /: 55% 250.5GB/456.4GB
  └─ ● /boot: 26% 0.3GB/1.0GB
● mergerfs /srv/media:
  ├─ ● 63% 2355.2GB/3686.4GB
  ├─ ● Data_1: WDZQ8H8D T: 28°C
  ├─ ● Data_2: GGA04461 T: 28°C
  └─ ● Parity: WDZS8RY0 T: 29°C
Backup:
● WD-WCC7K1234567 T: 32°C W: 12%
  ├─ Last: 2h ago (12.3GB)
  ├─ Next: in 22h
  └─ ● Usage: 45% 678GB/1.5TB
```
## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices
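Rules 2-3 and the const-array convention can be sketched together as follows. The metric names, the `Status` ordering, and `widget_status` are assumptions for illustration, not the dashboard's actual code.

```rust
// Illustrative sketch of status aggregation plus the const-array
// convention; names and ordering are assumptions, not actual code.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Status {
    Ok,
    Warning,
    Critical,
}

// Widgets reference their metrics via a const array, never inline strings.
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_temperature"];

/// Widget status is the worst status among its subscribed metrics:
/// the agent computed each individual status, the dashboard only aggregates.
fn widget_status(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .fold(Status::Ok, |worst, s| if s > worst { s } else { worst })
}

fn main() {
    let incoming = [Status::Ok, Status::Warning];
    assert_eq!(widget_status(&incoming), Status::Warning);
    println!("{} metrics aggregated: {:?}", CPU_WIDGET_METRICS.len(), widget_status(&incoming));
}
```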