cm-dashboard/CLAUDE.md
Christoffer Martinsson da37e28b6a
All checks were successful
Build and Release / build-and-release (push) Successful in 2m5s
Integrate backup metrics into system widget with enhanced disk monitoring
Replace standalone backup widget with compact backup section in system widget displaying disk serial, temperature, wear level, timing, and usage information.

Changes:
- Remove standalone backup widget and integrate into system widget
- Update backup collector to read TOML format from backup script
- Add BackupDiskData structure with serial, usage, temperature, wear fields
- Implement compact backup display matching specification format
- Add time formatting utilities for backup timing display
- Update backup data extraction from TOML with disk space parsing

Version bump to v0.1.149
2025-11-24 23:55:35 +01:00

476 lines
15 KiB
Markdown

# CM Dashboard - Infrastructure Monitoring TUI
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.
## Current Features
### Core Functionality
- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling
### User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard
### Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in dashboard config:
```toml
[service_logs]
hostname1 = [
{ service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
{ service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
{ service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
### Service Management
- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
- `s` - Start service (sends UserStart command)
- `S` - Stop service (sends UserStop command)
- `J` - Show service logs (journalctl in tmux popup)
- `L` - Show custom log files (tail -f custom paths in tmux popup)
- `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations
### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard
## Core Architecture Principles
### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.
**Previous (String Metrics):**
- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic
**Current (Structured Data):**
```json
{
"hostname": "cmbox",
"agent_version": "v0.1.131",
"timestamp": 1763926877,
"system": {
"cpu": {
"load_1min": 3.5,
"load_5min": 3.57,
"load_15min": 3.58,
"frequency_mhz": 1500,
"temperature_celsius": 45.2
},
"memory": {
"usage_percent": 25.0,
"total_gb": 23.3,
"used_gb": 5.9,
"swap_total_gb": 10.7,
"swap_used_gb": 0.99,
"tmpfs": [
{
"mount": "/tmp",
"usage_percent": 15.0,
"used_gb": 0.3,
"total_gb": 2.0
}
]
},
"storage": {
"drives": [
{
"name": "nvme0n1",
"health": "PASSED",
"temperature_celsius": 29.0,
"wear_percent": 1.0,
"filesystems": [
{
"mount": "/",
"usage_percent": 24.0,
"used_gb": 224.9,
"total_gb": 928.2
}
]
}
],
"pools": [
{
"name": "srv_media",
"mount": "/srv/media",
"type": "mergerfs",
"health": "healthy",
"usage_percent": 63.0,
"used_gb": 2355.2,
"total_gb": 3686.4,
"data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
"parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
}
]
}
},
"services": [
{ "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
],
"backup": {
"status": "completed",
"last_run": 1763920000,
"next_scheduled": 1764006400,
"total_size_gb": 150.5,
"repository_health": "ok"
}
}
```
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
### Maintenance Mode
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while continuing monitoring
- Dashboard continues to show real status, only notifications are blocked
Usage:
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
```
## Development and Deployment Architecture
### Development Path
- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes
### Deployment Path
- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild
### Git Flow
```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```
## Automated Binary Release System
CM Dashboard uses automated binary releases instead of source builds.
### Creating New Releases
```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
This automatically:
- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
### NixOS Configuration Updates
Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:
```nix
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
```
### Get Release Hash
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
### Building
**Testing & Building:**
- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Enhanced Storage Pool Visualization
### Auto-Discovery Architecture
The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.
### Discovery Process
**At Agent Startup:**
1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships
**Continuous Monitoring:**
- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
### Supported Storage Types
**Single Disks:**
- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.
**MergerFS Pools:**
- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping
**Future Extensions Ready:**
- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery
### Configuration
```toml
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```
### Display Format
```
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│ ├─ ● sdb T: 24°C W: 5%
│ └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
Backup:
● WD-WCC7K1234567 T: 32°C W: 12%
├─ Last: 2h ago (12.3GB)
├─ Next: in 22h
└─ ● Usage: 45% 678GB/1.5TB
```
## Important Communication Guidelines
Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
## Commit Message Guidelines
**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
**Examples:**
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Completed Architecture Migration (v0.1.131)
## ✅ COMPLETE MONITORING SYSTEM RESTORATION (v0.1.141)
**🎉 SUCCESS: All Issues Fixed - Complete Functional Monitoring System**
### ✅ Completed Implementation (v0.1.141)
**All Major Issues Resolved:**
```
✅ Data Collection: Agent collects structured data correctly
✅ Storage Display: Perfect format with correct mount points and temperature/wear
✅ Status Evaluation: All metrics properly evaluated against thresholds
✅ Notifications: Working email alerts on status changes
✅ Thresholds: All collectors using configured thresholds for status calculation
✅ Build Information: NixOS version displayed correctly
✅ Mount Point Consistency: Stable, sorted display order
```
### ✅ All Phases Completed Successfully
#### ✅ Phase 1: Storage Display - COMPLETED
- ✅ Use `lsblk` instead of `findmnt` (eliminated `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions (SMART data collection working)
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:` fields)
- ✅ Consistent filesystem/tmpfs sorting (no more random order swapping)
-**VERIFIED**: Dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly
#### ✅ Phase 2: Status Evaluation System - COMPLETED
-**CPU Status**: Load averages and temperature evaluated against `HysteresisThresholds`
-**Memory Status**: Usage percentage evaluated against thresholds
-**Storage Status**: Drive temperature, health, and filesystem usage evaluated
-**Service Status**: Service states properly tracked and evaluated
-**Status Fields**: All AgentData structures include status information
-**Threshold Integration**: All collectors use their configured thresholds
#### ✅ Phase 3: Notification System - COMPLETED
-**Status Change Detection**: Agent tracks status between collection cycles
-**Email Notifications**: Alerts sent on degradation (OK→Warning/Critical, Warning→Critical)
-**Notification Content**: Detailed alerts with metric values and timestamps
-**NotificationManager Integration**: Fully restored and operational
-**Maintenance Mode**: `/tmp/cm-maintenance` file support maintained
#### ✅ Phase 4: Integration & Testing - COMPLETED
-**AgentData Status Fields**: All structured data includes status evaluation
-**Status Processing**: Agent applies thresholds at collection time
-**End-to-End Flow**: Collection → Evaluation → Notification → Display
-**Dynamic Versioning**: Agent version from `CARGO_PKG_VERSION`
-**Build Information**: NixOS generation display restored
### ✅ Final Architecture - WORKING
**Complete Operational Flow:**
```
Collectors → AgentData (with Status) → NotificationManager → Email Alerts
↘ ↗
ZMQ → Dashboard → Perfect Display
```
**Operational Components:**
1.**Collectors**: Populate AgentData with metrics AND status evaluation
2.**Status Evaluation**: `HysteresisThresholds.evaluate()` applied per collector
3.**Notifications**: Email alerts on status change detection
4.**Display**: Correct mount points, temperature, wear, and build information
### ✅ Success Criteria - ALL MET
**Display Requirements:**
- ✅ Dashboard shows `● nvme0n1 T: 28°C W: 1%` format perfectly
- ✅ Mount points show `/` and `/boot` (not `root`/`boot`)
- ✅ Build information shows actual NixOS version (not "unknown")
- ✅ Consistent sorting eliminates random order changes
**Monitoring Requirements:**
- ✅ High CPU load triggers Warning/Critical status and email alert
- ✅ High memory usage triggers Warning/Critical status and email alert
- ✅ High disk temperature triggers Warning/Critical status and email alert
- ✅ Failed services trigger Warning/Critical status and email alert
- ✅ Maintenance mode suppresses notifications as expected
### 🚀 Production Ready
**CM Dashboard v0.1.141 is a complete, functional infrastructure monitoring system:**
- **Real-time Monitoring**: All system components with 1-second intervals
- **Intelligent Alerting**: Email notifications on threshold violations
- **Perfect Display**: Accurate mount points, temperatures, and system information
- **Status-Aware**: All metrics evaluated against configurable thresholds
- **Production Ready**: Full monitoring capabilities restored
**The monitoring system is fully operational and ready for production use.**
## Implementation Rules
1. **Agent Status Authority**: Agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
**NEVER:**
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested
**ALWAYS:**
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices