CM Dashboard - Infrastructure Monitoring TUI
Overview
A high-performance, Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built on ZMQ-based metric collection with a structured JSON data architecture.
Current Features
Core Functionality
- Real-time Monitoring: CPU, RAM, Storage, and Service status
- Service Management: Start/stop services with user-stopped tracking
- Multi-host Support: Monitor multiple servers from a single dashboard
- NixOS Integration: System rebuild via SSH + tmux popup
- Backup Monitoring: Borgbackup status and scheduling
User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard
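A minimal Rust sketch of how this tracking can work; the store type, file format, and Status variants are illustrative, not the agent's actual implementation:

use std::collections::HashSet;
use std::fs;
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum Status { Ok, Warning }

// Hypothetical persistent store: one service name per line on disk,
// so flags survive agent restarts.
struct UserStoppedStore {
    path: PathBuf,
    stopped: HashSet<String>,
}

impl UserStoppedStore {
    fn load(path: PathBuf) -> Self {
        let stopped = fs::read_to_string(&path)
            .map(|s| s.lines().map(str::to_owned).collect())
            .unwrap_or_default();
        Self { path, stopped }
    }

    fn persist(&self) {
        let lines: Vec<&str> = self.stopped.iter().map(String::as_str).collect();
        let _ = fs::write(&self.path, lines.join("\n"));
    }

    // UserStop command from the dashboard marks the service.
    fn mark_stopped(&mut self, service: &str) {
        self.stopped.insert(service.to_owned());
        self.persist();
    }

    // UserStart command clears the flag automatically.
    fn mark_started(&mut self, service: &str) {
        self.stopped.remove(service);
        self.persist();
    }

    // An intentionally stopped service reports Ok instead of Warning.
    fn evaluate(&self, service: &str, active: bool) -> Status {
        if active || self.stopped.contains(service) { Status::Ok } else { Status::Warning }
    }
}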
Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press L on any service to view custom log files via tail -f
- Configuration format in dashboard config:
[service_logs]
hostname1 = [
{ service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
{ service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
{ service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
Service Management
- Direct Control: Arrow keys (↑↓) or vim keys (j/k) navigate services
- Service Actions:
  - s: Start service (sends UserStart command)
  - S: Stop service (sends UserStop command)
  - J: Show service logs (journalctl in tmux popup)
  - L: Show custom log files (tail -f custom paths in tmux popup)
  - R: Rebuild current host
- Visual Status: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- Transitional Icons: Blue arrows during operations
Navigation
- Tab: Switch between hosts
- ↑↓ or j/k: Select services
- s: Start selected service (UserStart)
- S: Stop selected service (UserStop)
- J: Show service logs (journalctl)
- L: Show custom log files
- R: Rebuild current host
- B: Run backup on current host
- q: Quit dashboard
Core Architecture Principles
Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.
Previous (String Metrics):
- ❌ Agent sent individual metrics with string names like disk_nvme0n1_temperature
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic
Current (Structured Data):
{
"hostname": "cmbox",
"agent_version": "v0.1.131",
"timestamp": 1763926877,
"system": {
"cpu": {
"load_1min": 3.5,
"load_5min": 3.57,
"load_15min": 3.58,
"frequency_mhz": 1500,
"temperature_celsius": 45.2
},
"memory": {
"usage_percent": 25.0,
"total_gb": 23.3,
"used_gb": 5.9,
"swap_total_gb": 10.7,
"swap_used_gb": 0.99,
"tmpfs": [
{
"mount": "/tmp",
"usage_percent": 15.0,
"used_gb": 0.3,
"total_gb": 2.0
}
]
},
"storage": {
"drives": [
{
"name": "nvme0n1",
"health": "PASSED",
"temperature_celsius": 29.0,
"wear_percent": 1.0,
"filesystems": [
{
"mount": "/",
"usage_percent": 24.0,
"used_gb": 224.9,
"total_gb": 928.2
}
]
}
],
"pools": [
{
"name": "srv_media",
"mount": "/srv/media",
"type": "mergerfs",
"health": "healthy",
"usage_percent": 63.0,
"used_gb": 2355.2,
"total_gb": 3686.4,
"data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
"parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
}
]
}
},
"services": [
{ "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
],
"backup": {
"status": "completed",
"last_run": 1763920000,
"next_scheduled": 1764006400,
"total_size_gb": 150.5,
"repository_health": "ok"
}
}
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: data.system.storage.drives[0].temperature_celsius
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
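For illustration, a trimmed serde sketch of the typed access this enables; the struct names and field subset are assumptions based on the JSON above:

use serde::Deserialize;

#[derive(Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    system: System,
}

#[derive(Deserialize)]
struct System { storage: Storage }

#[derive(Deserialize)]
struct Storage { drives: Vec<Drive> }

#[derive(Deserialize)]
struct Drive {
    name: String,
    health: String,
    temperature_celsius: f64,
    wear_percent: f64,
}

// The ZMQ payload deserializes directly into typed structs:
// no underscore counting, no string splitting.
fn parse(payload: &str) -> serde_json::Result<AgentData> {
    serde_json::from_str(payload)
}

With this in place, data.system.storage.drives[0].temperature_celsius is a compile-time-checked field access rather than a parsed metric name.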
Maintenance Mode
- Agent checks for /tmp/cm-maintenance file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status, only notifications are blocked
Usage:
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
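In the agent, the gate can be as simple as the following sketch; the mail helper is hypothetical, only the flag path is prescribed:

use std::path::Path;

const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

fn maintenance_active() -> bool {
    Path::new(MAINTENANCE_FLAG).exists()
}

fn send_alert(subject: &str, body: &str) {
    // Monitoring and status display continue; only outbound email is blocked.
    if maintenance_active() {
        return;
    }
    // deliver_email(subject, body); // hypothetical mail transport
    println!("ALERT {subject}: {body}");
}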
Development and Deployment Architecture
Development Path
- Location: ~/projects/cm-dashboard
- Purpose: Development workflow only - for committing new code
- Access: Only for developers to commit changes
Deployment Path
- Location: /var/lib/cm-dashboard/nixos-config
- Purpose: Production deployment only - agent clones/pulls from git
- Workflow: git pull → /var/lib/cm-dashboard/nixos-config → nixos-rebuild
Git Flow
Development: ~/projects/cm-dashboard → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
Automated Binary Release System
CM Dashboard uses automated binary releases instead of source builds.
Creating New Releases
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
This automatically:
- Builds static binaries with RUSTFLAGS="-C target-feature=+crt-static"
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
NixOS Configuration Updates
Edit ~/projects/nixosbox/hosts/services/cm-dashboard.nix:
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
Get Release Hash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
Building
Testing & Building:
- Workspace builds: nix-shell -p openssl pkg-config --run "cargo build --workspace"
- Clean compilation: Remove target/ between major changes
Enhanced Storage Pool Visualization
Auto-Discovery Architecture
The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.
Discovery Process
At Agent Startup:
- Parse /proc/mounts to identify all mounted filesystems
- Detect MergerFS pools by analyzing fuse.mergerfs mount sources
- Identify member disks and potential parity relationships via heuristics
- Store discovered storage topology for continuous monitoring
- Generate pool-aware metrics with hierarchical relationships
Continuous Monitoring:
- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
Supported Storage Types
Single Disks:
- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.
MergerFS Pools:
- Auto-detect from /proc/mounts fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping
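A sketch of that detection pass, assuming the standard layout of /proc/mounts lines; the pool type and the parity heuristic shown are simplifications of the description above:

use std::fs;

#[derive(Debug)]
struct MergerPool {
    mount: String,
    data_disks: Vec<String>,
    parity_disks: Vec<String>,
}

fn discover_mergerfs_pools() -> std::io::Result<Vec<MergerPool>> {
    let mounts = fs::read_to_string("/proc/mounts")?;
    let mut pools = Vec::new();
    for line in mounts.lines() {
        // /proc/mounts fields: source mountpoint fstype options dump pass
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 3 || fields[2] != "fuse.mergerfs" {
            continue;
        }
        // Source lists member branches, e.g. "/mnt/disk1:/mnt/disk2:/mnt/parity1"
        let (mut data, mut parity) = (Vec::new(), Vec::new());
        for branch in fields[0].split(':') {
            // Heuristic: a branch path containing "parity" is a parity disk.
            if branch.contains("parity") {
                parity.push(branch.to_owned());
            } else {
                data.push(branch.to_owned());
            }
        }
        pools.push(MergerPool {
            mount: fields[1].to_owned(),
            data_disks: data,
            parity_disks: parity,
        });
    }
    Ok(pools)
}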
Future Extensions Ready:
- RAID arrays via /proc/mdstat parsing
- ZFS pools via zpool status integration
- LVM logical volumes via lvs discovery
Configuration
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
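Applied during discovery, the exclusions above amount to a filter like this sketch (the function name is illustrative):

fn keep_mount(mount_point: &str, fs_type: &str) -> bool {
    const EXCLUDE_MOUNT_POINTS: &[&str] = &["/tmp", "/proc", "/sys", "/dev"];
    const EXCLUDE_FS_TYPES: &[&str] = &["tmpfs", "devtmpfs", "sysfs", "proc"];
    // Drop an excluded mount point itself or anything beneath it.
    let excluded_mount = EXCLUDE_MOUNT_POINTS.iter()
        .any(|m| mount_point == *m || mount_point.starts_with(&format!("{m}/")));
    !excluded_mount && !EXCLUDE_FS_TYPES.contains(&fs_type)
}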
Display Format
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│ ├─ ● sdb T: 24°C W: 5%
│ └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
Important Communication Guidelines
Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
Commit Message Guidelines
NEVER mention:
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
ALWAYS:
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
Examples:
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
Completed Architecture Migration (v0.1.131)
Complete Fix Plan (v0.1.140)
🎯 Goal: Fix ALL Issues - Display AND Core Functionality
Current Broken State (v0.1.139)
❌ What's Broken:
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
Root Cause: During the atomic migration, core monitoring functionality was removed and only data collection was fixed, making the dashboard useless as a monitoring tool.
Complete Fix Plan - Do Everything Right
Phase 1: Fix Storage Display (CURRENT)
- ✅ Use lsblk instead of findmnt (eliminates /nix/store bind mount issue)
- ✅ Add sudo smartctl for permissions
- ✅ Fix NVMe SMART parsing (Temperature: and Percentage Used: fields)
- 🔄 Test that dashboard shows ● nvme0n1 T: 28°C W: 1% correctly
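The parsing fix targets the two NVMe fields named above. A sketch, assuming typical smartctl NVMe text output (exact spacing varies by smartctl version):

// Parses `sudo smartctl -a /dev/nvme0n1` text output.
fn parse_nvme_smart(output: &str) -> (Option<f64>, Option<f64>) {
    let mut temperature = None; // e.g. "Temperature:        28 Celsius"
    let mut wear = None;        // e.g. "Percentage Used:    1%"
    for line in output.lines() {
        if let Some(rest) = line.strip_prefix("Temperature:") {
            temperature = rest.split_whitespace().next().and_then(|v| v.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            wear = rest.trim().trim_end_matches('%').parse().ok();
        }
    }
    (temperature, wear)
}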
Phase 2: Restore Status Evaluation System
- CPU Status: Evaluate load averages against thresholds → Status::Warning/Critical
- Memory Status: Evaluate usage_percent against thresholds → Status::Warning/Critical
- Storage Status: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- Service Status: Evaluate service states → Status::Warning if inactive
- Overall Host Status: Aggregate component statuses → host-level status
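The shape this restoration targets, as a sketch; the threshold numbers are placeholders, not the production values:

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status { Ok, Warning, Critical }

struct Thresholds { warning: f64, critical: f64 }

fn evaluate(value: f64, t: &Thresholds) -> Status {
    if value >= t.critical { Status::Critical }
    else if value >= t.warning { Status::Warning }
    else { Status::Ok }
}

// Host-level status is the worst component status.
fn aggregate(components: &[Status]) -> Status {
    components.iter().copied().max().unwrap_or(Status::Ok)
}

// e.g. evaluate(memory_usage_percent, &Thresholds { warning: 80.0, critical: 95.0 })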
Phase 3: Restore Notification System
- Status Change Detection: Track when component status changes from OK→Warning/Critical
- Email Notifications: Send alerts when status degrades
- Notification Rate Limiting: Prevent spam (existing logic)
- Maintenance Mode: Honor /tmp/cm-maintenance to suppress alerts
- Batched Notifications: Group multiple alerts into single email
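A sketch of the status-change detection with a simple rate limit; the detector type and the window are assumptions, since the existing logic is being restored rather than redesigned:

use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status { Ok, Warning, Critical }

struct ChangeDetector {
    last_status: HashMap<String, Status>,
    last_sent: HashMap<String, Instant>,
    min_interval: Duration, // e.g. Duration::from_secs(3600)
}

impl ChangeDetector {
    // Returns true when an email should go out for this component.
    fn should_notify(&mut self, component: &str, current: Status) -> bool {
        let previous = self
            .last_status
            .insert(component.to_owned(), current)
            .unwrap_or(Status::Ok);
        // Only alert on degradation (OK→Warning/Critical, Warning→Critical).
        let degraded = current > previous;
        let rate_ok = self
            .last_sent
            .get(component)
            .map_or(true, |sent| sent.elapsed() >= self.min_interval);
        if degraded && rate_ok {
            self.last_sent.insert(component.to_owned(), Instant::now());
            true
        } else {
            false
        }
    }
}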
Phase 4: Integration & Testing
- AgentData Status Fields: Add status fields to structured data
- Dashboard Status Display: Show colored indicators based on actual status
- End-to-End Testing: Verify alerts fire when thresholds exceeded
- Verify All Thresholds: CPU load, memory usage, disk temperature, service states
Target Architecture (CORRECT)
Complete Flow:
Collectors → AgentData → StatusEvaluator → Notifications
↘ ↗
ZMQ → Dashboard → Status Display
Key Components:
- Collectors: Populate AgentData with raw metrics
- StatusEvaluator: Apply thresholds to AgentData → Status enum values
- Notifications: Send emails on status changes (OK→Warning/Critical)
- Dashboard: Display data with correct status colors/indicators
Implementation Rules
MUST COMPLETE ALL:
- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)
NO SHORTCUTS:
- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working
Success Criteria:
- Dashboard shows ● nvme0n1 T: 28°C W: 1% format
- High CPU load triggers Warning status and email alert
- High memory usage triggers Warning status and email alert
- High disk temperature triggers Warning status and email alert
- Failed services trigger Warning status and email alert
- Maintenance mode suppresses notifications as expected
Implementation Rules
- Agent Status Authority: Agent calculates status for each metric using thresholds
- Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
- Status Aggregation: Dashboard aggregates individual metric statuses for widget status
NEVER:
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested
ALWAYS:
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices