# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure, built on ZMQ-based metric collection and a structured-data metrics architecture.
## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, storage, and service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts (see the sketch below)
- The flag is cleared automatically when a service is restarted via the dashboard
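A minimal sketch of how that persistence could work, assuming the flags live in a JSON file under `/var/lib/cm-agent/` and that the agent can use the `serde_json` crate; the path and helper names are illustrative, not the actual implementation:

```rust
use std::collections::HashSet;
use std::fs;
use std::path::Path;

/// Illustrative location for the persisted flags; the real agent may store them elsewhere.
const STATE_FILE: &str = "/var/lib/cm-agent/user_stopped.json";

/// Load the set of user-stopped service names, or start empty if no state exists yet.
fn load_user_stopped() -> HashSet<String> {
    fs::read_to_string(STATE_FILE)
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or_default()
}

/// Persist the set so the flags survive agent restarts.
fn save_user_stopped(set: &HashSet<String>) -> std::io::Result<()> {
    if let Some(dir) = Path::new(STATE_FILE).parent() {
        fs::create_dir_all(dir)?;
    }
    fs::write(STATE_FILE, serde_json::to_string_pretty(set).unwrap())
}

/// Mark a service as user-stopped (dashboard sent UserStop).
fn mark_user_stopped(name: &str) -> std::io::Result<()> {
    let mut set = load_user_stopped();
    set.insert(name.to_string());
    save_user_stopped(&set)
}

/// Clear the flag when the dashboard starts the service again (UserStart).
fn clear_user_stopped(name: &str) -> std::io::Result<()> {
    let mut set = load_user_stopped();
    set.remove(name);
    save_user_stopped(&set)
}
```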
### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
    { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
    { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
    { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
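One way the dashboard could deserialize this table, sketched with the `serde` and `toml` crates; the struct names are illustrative and only the fields shown above are assumed:

```rust
use std::collections::HashMap;
use serde::Deserialize;

/// Illustrative config types mirroring the `[service_logs]` table above.
#[derive(Debug, Deserialize)]
struct ServiceLogEntry {
    service_name: String,
    log_file_path: String,
}

#[derive(Debug, Deserialize)]
struct DashboardConfig {
    /// hostname → per-service log file locations
    #[serde(default)]
    service_logs: HashMap<String, Vec<ServiceLogEntry>>,
}

fn load_config(path: &str) -> Result<DashboardConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}
```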
### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**
```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {
          "mount": "/tmp",
          "usage_percent": 15.0,
          "used_gb": 0.3,
          "total_gb": 2.0
        }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {
              "mount": "/",
              "usage_percent": 24.0,
              "used_gb": 224.9,
              "total_gb": 928.2
            }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```
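As an illustration of the type-safe access this enables, the payload can be deserialized into plain structs on the dashboard side. The sketch below uses `serde`/`serde_json`, covers only a subset of the fields shown above, and uses illustrative struct names; access such as `data.system.storage.drives[0].temperature_celsius` then becomes a compile-time-checked field path rather than a parsed metric-name string.

```rust
use serde::Deserialize;

// Minimal sketch of the structured payload; field names follow the JSON above,
// but the struct names and the subset of fields shown here are illustrative.
#[derive(Debug, Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    timestamp: u64,
    system: SystemData,
}

#[derive(Debug, Deserialize)]
struct SystemData {
    cpu: CpuData,
    storage: StorageData,
}

#[derive(Debug, Deserialize)]
struct CpuData {
    load_1min: f64,
    load_5min: f64,
    load_15min: f64,
}

#[derive(Debug, Deserialize)]
struct StorageData {
    drives: Vec<DriveData>,
}

#[derive(Debug, Deserialize)]
struct DriveData {
    name: String,
    temperature_celsius: f64,
    wear_percent: f64,
}

/// Unknown fields (memory, pools, services, backup, ...) are ignored by serde by default.
fn parse(json: &str) -> serde_json::Result<AgentData> {
    serde_json::from_str(json)
}
```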
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```
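On the agent side, the gate can be as small as a path-existence check before each notification is sent; a minimal sketch in which the helper and the `send_email` placeholder are illustrative:

```rust
use std::path::Path;

/// Notifications are suppressed while this sentinel file exists.
const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

fn maintenance_mode_active() -> bool {
    Path::new(MAINTENANCE_FLAG).exists()
}

fn maybe_send_alert(subject: &str, body: &str) {
    if maintenance_mode_active() {
        // Keep monitoring, but swallow the alert during maintenance windows.
        return;
    }
    send_email(subject, body);
}

/// Placeholder for the real notification transport.
fn send_email(_subject: &str, _body: &str) {}
```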
## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources (see the sketch after this list)
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships
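A minimal sketch of steps 1-2, assuming the standard `/proc/mounts` field order (`source mountpoint fstype options dump pass`) and that the mergerfs source column carries the colon-separated branch list (its default); the type and function names are illustrative:

```rust
use std::fs;

/// A discovered MergerFS pool: where it is mounted and which branch paths feed it.
#[derive(Debug)]
struct MergerfsPool {
    mount_point: String,
    branches: Vec<String>,
}

/// Scan /proc/mounts for fuse.mergerfs entries and split their colon-separated sources.
fn discover_mergerfs_pools() -> std::io::Result<Vec<MergerfsPool>> {
    let mounts = fs::read_to_string("/proc/mounts")?;
    let mut pools = Vec::new();
    for line in mounts.lines() {
        // Format: <source> <mount point> <fs type> <options> <dump> <pass>
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() >= 3 && fields[2] == "fuse.mergerfs" {
            pools.push(MergerfsPool {
                mount_point: fields[1].to_string(),
                branches: fields[0].split(':').map(str::to_string).collect(),
            });
        }
    }
    Ok(pools)
}
```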
**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical) - see the sketch below
- Hierarchical tree display with data/parity disk grouping
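How pool-level health might be derived from member drive status, as a sketch; the `DriveStatus` type and the healthy/degraded/critical policy below are illustrative assumptions, not the agent's actual rules:

```rust
/// Simplified per-drive health as seen by the collector (illustrative).
#[derive(Clone, Copy, PartialEq)]
enum DriveStatus {
    Ok,
    Warning, // e.g. high temperature or elevated wear
    Failed,  // SMART failure or missing drive
}

/// Aggregate member drives into a pool health label. The assumption here:
/// a single failed member is recoverable from parity (degraded), anything
/// more is critical, and warnings alone degrade the pool without failing it.
fn pool_health(data: &[DriveStatus], parity: &[DriveStatus]) -> &'static str {
    let all = || data.iter().chain(parity.iter());
    let failed = all().filter(|s| **s == DriveStatus::Failed).count();
    let warning = all().filter(|s| **s == DriveStatus::Warning).count();
    match failed {
        0 if warning == 0 => "healthy",
        0 | 1 => "degraded",
        _ => "critical",
    }
}
```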
**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true

# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```
### Display Format

```
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz

RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB

Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│  ├─ ● sdb T: 24°C W: 5%
│  └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media

● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
```
## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Completed Architecture Migration (v0.1.131)

## Complete Fix Plan (v0.1.140)

**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**

### Current Broken State (v0.1.139)

**❌ What's Broken:**

```
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
```

**Root Cause:**

During the atomic migration, I removed core monitoring functionality and fixed only data collection, which left the dashboard useless as a monitoring tool.

### Complete Fix Plan - Do Everything Right

#### Phase 1: Fix Storage Display (CURRENT)

- ✅ Use `lsblk` instead of `findmnt` (eliminates the `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`; see the parsing sketch below)
- 🔄 Test that the dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly
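The NVMe values come from `smartctl` health-log lines such as `Temperature: 29 Celsius` and `Percentage Used: 1%`; a minimal parsing sketch, where the function name and return shape are illustrative:

```rust
/// Parse drive temperature and wear from `smartctl -a` output for an NVMe device.
/// Assumes the smartmontools NVMe health log format, e.g.:
///   Temperature:                        29 Celsius
///   Percentage Used:                    1%
fn parse_nvme_smart(output: &str) -> (Option<f64>, Option<f64>) {
    let mut temperature = None;
    let mut wear = None;
    for line in output.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("Temperature:") {
            // Take the first numeric token and ignore the trailing "Celsius".
            temperature = rest.split_whitespace().next().and_then(|v| v.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            wear = rest.trim().trim_end_matches('%').trim().parse().ok();
        }
    }
    (temperature, wear)
}
```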
#### Phase 2: Restore Status Evaluation System

- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical
- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- **Service Status**: Evaluate service states → Status::Warning if inactive
- **Overall Host Status**: Aggregate component statuses → host-level status (see the sketch below)
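A sketch of the evaluator shape, assuming a simple `Status` enum and warning/critical threshold pairs taken from configuration; the numbers and names below are illustrative, not the existing production values:

```rust
/// Severity levels shared by agent and dashboard (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// A generic warning/critical threshold pair, e.g. loaded from the agent config.
struct Threshold {
    warning: f64,
    critical: f64,
}

impl Threshold {
    /// Map a raw value onto a Status: check critical first, then warning.
    fn evaluate(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

fn main() {
    // Example: a load average against illustrative thresholds.
    let cpu_load = Threshold { warning: 4.0, critical: 8.0 };
    assert_eq!(cpu_load.evaluate(3.58), Status::Ok);
    assert_eq!(cpu_load.evaluate(5.0), Status::Warning);

    // Overall host status is the worst component status.
    let components = [Status::Ok, Status::Warning, Status::Ok];
    let host = components.iter().copied().max().unwrap_or(Status::Ok);
    assert_eq!(host, Status::Warning);
}
```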
#### Phase 3: Restore Notification System

- **Status Change Detection**: Track when a component status changes from OK → Warning/Critical (see the sketch below)
- **Email Notifications**: Send alerts when status degrades
- **Notification Rate Limiting**: Prevent spam (existing logic)
- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
- **Batched Notifications**: Group multiple alerts into a single email
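Status-change detection can be a comparison against the last status recorded per component; a self-contained sketch in which the tracker and enum names are illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Remembers the last status per component and reports transitions worth alerting on.
#[derive(Default)]
struct StatusTracker {
    last: HashMap<String, Status>,
}

impl StatusTracker {
    /// Returns Some((old, new)) when the status degraded (e.g. Ok -> Warning),
    /// which is the signal to queue an email notification.
    fn update(&mut self, component: &str, new: Status) -> Option<(Status, Status)> {
        let old = self.last.insert(component.to_string(), new).unwrap_or(Status::Ok);
        if new > old {
            Some((old, new))
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = StatusTracker::default();
    assert!(tracker.update("cpu", Status::Ok).is_none());
    assert!(tracker.update("cpu", Status::Warning).is_some()); // degraded => alert
    assert!(tracker.update("cpu", Status::Warning).is_none()); // unchanged => no spam
}
```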
#### Phase 4: Integration & Testing

- **AgentData Status Fields**: Add status fields to the structured data
- **Dashboard Status Display**: Show colored indicators based on actual status
- **End-to-End Testing**: Verify alerts fire when thresholds are exceeded
- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states

### Target Architecture (CORRECT)

**Complete Flow:**

```
Collectors → AgentData → StatusEvaluator → Notifications
                  ↘                          ↗
                    ZMQ → Dashboard → Status Display
```
**Key Components:**

1. **Collectors**: Populate AgentData with raw metrics
2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values
3. **Notifications**: Send emails on status changes (OK → Warning/Critical)
4. **Dashboard**: Display data with correct status colors/indicators

### Implementation Rules

**MUST COMPLETE ALL:**

- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)

**NO SHORTCUTS:**

- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working

**Success Criteria:**

- Dashboard shows the `● nvme0n1 T: 28°C W: 1%` format
- High CPU load triggers Warning status and an email alert
- High memory usage triggers Warning status and an email alert
- High disk temperature triggers Warning status and an email alert
- Failed services trigger Warning status and an email alert
- Maintenance mode suppresses notifications as expected
## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status (see the sketch below)
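A sketch of rules 2 and 3: a widget declares the metric names it subscribes to in a const array and derives its own status as the worst of the agent-provided statuses; the metric strings, types, and function names here are illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// The metric names a widget subscribes to live in a const array instead of
/// being hardcoded inline throughout the widget code (illustrative names).
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_load_5min", "cpu_frequency"];

/// Widget status is the worst agent-calculated status among its metrics;
/// the dashboard never recomputes thresholds itself.
fn widget_status(agent_statuses: &HashMap<String, Status>, metrics: &[&str]) -> Status {
    metrics
        .iter()
        .filter_map(|name| agent_statuses.get(*name).copied())
        .max()
        .unwrap_or(Status::Ok)
}

fn main() {
    let mut statuses = HashMap::new();
    statuses.insert("cpu_load_1min".to_string(), Status::Warning);
    statuses.insert("cpu_load_5min".to_string(), Status::Ok);
    assert_eq!(widget_status(&statuses, CPU_WIDGET_METRICS), Status::Warning);
}
```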
**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices