CM Dashboard - Infrastructure Monitoring TUI
Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built on ZMQ-based metric collection with a structured JSON data architecture.
Current Features
Core Functionality
- Real-time Monitoring: CPU, RAM, Storage, and Service status
- Service Management: Start/stop services with user-stopped tracking
- Multi-host Support: Monitor multiple servers from single dashboard
- NixOS Integration: System rebuild via SSH + tmux popup
- Backup Monitoring: Borgbackup status and scheduling
User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard
Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press L on any service to view custom log files via tail -f
- Configuration format in dashboard config:
[service_logs]
hostname1 = [
{ service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
{ service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
{ service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
Service Management
- Direct Control: Arrow keys (↑↓) or vim keys (j/k) navigate services
- Service Actions:
  - s - Start service (sends UserStart command)
  - S - Stop service (sends UserStop command)
  - J - Show service logs (journalctl in tmux popup)
  - L - Show custom log files (tail -f custom paths in tmux popup)
  - R - Rebuild current host
- Visual Status: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- Transitional Icons: Blue arrows during operations
Navigation
- Tab: Switch between hosts
- ↑↓ or j/k: Select services
- s: Start selected service (UserStart)
- S: Stop selected service (UserStop)
- J: Show service logs (journalctl)
- L: Show custom log files
- R: Rebuild current host
- B: Run backup on current host
- q: Quit dashboard
Core Architecture Principles
Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.
Previous (String Metrics):
- ❌ Agent sent individual metrics with string names like disk_nvme0n1_temperature
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic
Current (Structured Data):
{
"hostname": "cmbox",
"agent_version": "v0.1.131",
"timestamp": 1763926877,
"system": {
"cpu": {
"load_1min": 3.5,
"load_5min": 3.57,
"load_15min": 3.58,
"frequency_mhz": 1500,
"temperature_celsius": 45.2
},
"memory": {
"usage_percent": 25.0,
"total_gb": 23.3,
"used_gb": 5.9,
"swap_total_gb": 10.7,
"swap_used_gb": 0.99,
"tmpfs": [
{
"mount": "/tmp",
"usage_percent": 15.0,
"used_gb": 0.3,
"total_gb": 2.0
}
]
},
"storage": {
"drives": [
{
"name": "nvme0n1",
"health": "PASSED",
"temperature_celsius": 29.0,
"wear_percent": 1.0,
"filesystems": [
{
"mount": "/",
"usage_percent": 24.0,
"used_gb": 224.9,
"total_gb": 928.2
}
]
}
],
"pools": [
{
"name": "srv_media",
"mount": "/srv/media",
"type": "mergerfs",
"health": "healthy",
"usage_percent": 63.0,
"used_gb": 2355.2,
"total_gb": 3686.4,
"data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
"parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
}
]
}
},
"services": [
{ "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
],
"backup": {
"status": "completed",
"last_run": 1763920000,
"next_scheduled": 1764006400,
"total_size_gb": 150.5,
"repository_health": "ok"
}
}
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: data.system.storage.drives[0].temperature_celsius
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
Maintenance Mode
- Agent checks for the /tmp/cm-maintenance file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status, only notifications are blocked
Usage:
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
Development and Deployment Architecture
Development Path
- Location: ~/projects/cm-dashboard
- Purpose: Development workflow only - for committing new code
- Access: Only for developers to commit changes
Deployment Path
- Location: /var/lib/cm-dashboard/nixos-config
- Purpose: Production deployment only - agent clones/pulls from git
- Workflow: git pull → /var/lib/cm-dashboard/nixos-config → nixos-rebuild
Git Flow
Development: ~/projects/cm-dashboard → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
Automated Binary Release System
CM Dashboard uses automated binary releases instead of source builds.
Creating New Releases
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
This automatically:
- Builds static binaries with RUSTFLAGS="-C target-feature=+crt-static"
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
NixOS Configuration Updates
Edit ~/projects/nixosbox/hosts/services/cm-dashboard.nix:
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
Get Release Hash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
Building
Testing & Building:
- Workspace builds: nix-shell -p openssl pkg-config --run "cargo build --workspace"
- Clean compilation: Remove target/ between major changes
Enhanced Storage Pool Visualization
Auto-Discovery Architecture
The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.
Discovery Process
At Agent Startup:
- Parse /proc/mounts to identify all mounted filesystems
- Detect MergerFS pools by analyzing fuse.mergerfs mount sources
- Identify member disks and potential parity relationships via heuristics
- Store discovered storage topology for continuous monitoring
- Generate pool-aware metrics with hierarchical relationships
Continuous Monitoring:
- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
Supported Storage Types
Single Disks:
- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.
MergerFS Pools:
- Auto-detect from /proc/mounts fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping
Future Extensions Ready:
- RAID arrays via /proc/mdstat parsing
- ZFS pools via zpool status integration
- LVM logical volumes via lvs discovery
Configuration
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
Display Format
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│ ├─ ● sdb T: 24°C W: 5%
│ └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
Backup:
● WD-WCC7K1234567 T: 32°C W: 12%
├─ Last: 2h ago (12.3GB)
├─ Next: in 22h
└─ ● Usage: 45% 678GB/1.5TB
Important Communication Guidelines
Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
Commit Message Guidelines
NEVER mention:
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
ALWAYS:
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
Examples:
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
Completed Architecture Migration (v0.1.131)
✅ COMPLETE MONITORING SYSTEM RESTORATION (v0.1.141)
🎉 SUCCESS: All Issues Fixed - Complete Functional Monitoring System
✅ Completed Implementation (v0.1.141)
All Major Issues Resolved:
✅ Data Collection: Agent collects structured data correctly
✅ Storage Display: Perfect format with correct mount points and temperature/wear
✅ Status Evaluation: All metrics properly evaluated against thresholds
✅ Notifications: Working email alerts on status changes
✅ Thresholds: All collectors using configured thresholds for status calculation
✅ Build Information: NixOS version displayed correctly
✅ Mount Point Consistency: Stable, sorted display order
✅ All Phases Completed Successfully
✅ Phase 1: Storage Display - COMPLETED
- ✅ Use lsblk instead of findmnt (eliminated /nix/store bind mount issue)
- ✅ Add sudo smartctl for permissions (SMART data collection working)
- ✅ Fix NVMe SMART parsing (Temperature: and Percentage Used: fields)
- ✅ Consistent filesystem/tmpfs sorting (no more random order swapping)
- ✅ VERIFIED: Dashboard shows ● nvme0n1 T: 28°C W: 1% correctly
✅ Phase 2: Status Evaluation System - COMPLETED
- ✅ CPU Status: Load averages and temperature evaluated against HysteresisThresholds
- ✅ Memory Status: Usage percentage evaluated against thresholds
- ✅ Storage Status: Drive temperature, health, and filesystem usage evaluated
- ✅ Service Status: Service states properly tracked and evaluated
- ✅ Status Fields: All AgentData structures include status information
- ✅ Threshold Integration: All collectors use their configured thresholds
✅ Phase 3: Notification System - COMPLETED
- ✅ Status Change Detection: Agent tracks status between collection cycles
- ✅ Email Notifications: Alerts sent on degradation (OK→Warning/Critical, Warning→Critical)
- ✅ Notification Content: Detailed alerts with metric values and timestamps
- ✅ NotificationManager Integration: Fully restored and operational
- ✅ Maintenance Mode:
/tmp/cm-maintenancefile support maintained
✅ Phase 4: Integration & Testing - COMPLETED
- ✅ AgentData Status Fields: All structured data includes status evaluation
- ✅ Status Processing: Agent applies thresholds at collection time
- ✅ End-to-End Flow: Collection → Evaluation → Notification → Display
- ✅ Dynamic Versioning: Agent version from CARGO_PKG_VERSION
- ✅ Build Information: NixOS generation display restored
✅ Final Architecture - WORKING
Complete Operational Flow:
Collectors → AgentData (with Status) → NotificationManager → Email Alerts
↘ ↗
ZMQ → Dashboard → Perfect Display
Operational Components:
- ✅ Collectors: Populate AgentData with metrics AND status evaluation
- ✅ Status Evaluation: HysteresisThresholds.evaluate() applied per collector
- ✅ Notifications: Email alerts on status change detection
- ✅ Display: Correct mount points, temperature, wear, and build information
✅ Success Criteria - ALL MET
Display Requirements:
- ✅ Dashboard shows the ● nvme0n1 T: 28°C W: 1% format perfectly
- ✅ Mount points show / and /boot (not root/boot)
- ✅ Build information shows actual NixOS version (not "unknown")
- ✅ Consistent sorting eliminates random order changes
Monitoring Requirements:
- ✅ High CPU load triggers Warning/Critical status and email alert
- ✅ High memory usage triggers Warning/Critical status and email alert
- ✅ High disk temperature triggers Warning/Critical status and email alert
- ✅ Failed services trigger Warning/Critical status and email alert
- ✅ Maintenance mode suppresses notifications as expected
🚀 Production Ready
CM Dashboard v0.1.141 is a complete, functional infrastructure monitoring system:
- Real-time Monitoring: All system components with 1-second intervals
- Intelligent Alerting: Email notifications on threshold violations
- Perfect Display: Accurate mount points, temperatures, and system information
- Status-Aware: All metrics evaluated against configurable thresholds
- Production Ready: Full monitoring capabilities restored
The monitoring system is fully operational and ready for production use.
Implementation Rules
- Agent Status Authority: Agent calculates status for each metric using thresholds
- Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
- Status Aggregation: Dashboard aggregates individual metric statuses for widget status
NEVER:
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested
ALWAYS:
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices