Fix storage display issues and use dynamic versioning
All checks were successful
Build and Release / build-and-release (push) Successful in 1m7s

Phase 1 fixes for storage display:
- Replace findmnt with lsblk to eliminate bind mount issues (/nix/store)
- Add sudo to smartctl commands for permission access
- Fix NVMe SMART parsing for Temperature: and Percentage Used: fields
- Use dynamic version from CARGO_PKG_VERSION instead of hardcoded strings

Storage display should now show correct mount points and temperature/wear.
Status evaluation and notifications still need restoration in subsequent phases.
2025-11-24 19:26:09 +01:00
parent 2b2cb2da3e
commit fd7ad23205
7 changed files with 127 additions and 53 deletions

103
CLAUDE.md

@@ -357,53 +357,88 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl
## Completed Architecture Migration (v0.1.131)
## Agent Architecture Migration Plan (v0.1.139)
## Complete Fix Plan (v0.1.140)
**🎯 Goal: Eliminate String Metrics Bridge, Direct Structured Data Collection**
**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**
### Current Architecture (v0.1.138)
### Current Broken State (v0.1.139)
**Current Flow:**
**❌ What's Broken:**
```
Collectors → String Metrics → MetricManager.cache
process_metrics() → HostStatusManager → Notifications
broadcast_all_metrics() → Bridge Conversion → AgentData → ZMQ
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
```
**Issues:**
- Bridge conversion loses mount point information (`/` becomes `root`, `/boot` becomes `boot`)
- Tmpfs mounts not properly displayed in RAM section
- Unnecessary string parsing complexity and potential bugs
- String-to-JSON conversion introduces data transformation errors
**Root Cause:**
During the atomic migration (v0.1.139), I removed core monitoring functionality (status evaluation, threshold checks, notifications) and only fixed data collection, which left the dashboard useless as a monitoring tool.
### Target Architecture
### Complete Fix Plan - Do Everything Right
**Target Flow:**
#### Phase 1: Fix Storage Display (CURRENT)
- ✅ Use `lsblk` instead of `findmnt` (eliminates `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`)
- 🔄 Test that dashboard shows: `● nvme0n1 T: 28°C W: 1%` correctly
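
The NVMe parsing fix above can be sketched as follows. This is a minimal illustration, not the agent's actual collector code: the `NvmeHealth` struct and the `parse_nvme_smart` and `format_drive_line` functions are hypothetical names, and the sketch assumes `smartctl` prints the health fields as `Temperature: 28 Celsius` and `Percentage Used: 1%`, one per line.

```rust
/// Values pulled from the NVMe health section of `smartctl` output.
/// (Hypothetical type; the real AgentData fields may differ.)
#[derive(Debug, Default, PartialEq)]
pub struct NvmeHealth {
    pub temperature_celsius: Option<f32>,
    pub wear_percent: Option<f32>,
}

/// Scan smartctl text output for the `Temperature:` and `Percentage Used:` lines.
pub fn parse_nvme_smart(output: &str) -> NvmeHealth {
    let mut health = NvmeHealth::default();
    for line in output.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("Temperature:") {
            // "Temperature:   28 Celsius" -> take the first token after the colon.
            // "Temperature Sensor 1:" lines do not match this prefix.
            health.temperature_celsius = rest
                .split_whitespace()
                .next()
                .and_then(|v| v.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            // "Percentage Used:   1%" -> drop the trailing percent sign.
            health.wear_percent = rest.trim().trim_end_matches('%').parse().ok();
        }
    }
    health
}

/// Render one drive in the `● nvme0n1 T: 28°C W: 1%` dashboard format.
pub fn format_drive_line(name: &str, health: &NvmeHealth) -> String {
    let temp = health
        .temperature_celsius
        .map_or("?".to_string(), |t| format!("{:.0}°C", t));
    let wear = health
        .wear_percent
        .map_or("?".to_string(), |w| format!("{:.0}%", w));
    format!("● {} T: {} W: {}", name, temp, wear)
}
```

Keeping both values as `Option` lets the display fall back to `?` when a field is missing instead of showing a misleading zero.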
#### Phase 2: Restore Status Evaluation System
- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical
- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- **Service Status**: Evaluate service states → Status::Warning if inactive
- **Overall Host Status**: Aggregate component statuses → host-level status
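
A rough sketch of the threshold evaluation and aggregation described in Phase 2. The `Thresholds` struct, the `evaluate` and `aggregate` functions, and the numeric limits in the comments are assumptions for illustration only; the existing configuration defines the real thresholds.

```rust
/// Severity levels, declared in ascending order so the derived `Ord`
/// gives Ok < Warning < Critical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical threshold pair for a single metric
/// (for example warning at 60.0 and critical at 70.0 for disk temperature).
pub struct Thresholds {
    pub warning: f32,
    pub critical: f32,
}

/// Map a raw metric value to a status by comparing it against its thresholds.
pub fn evaluate(value: f32, thresholds: &Thresholds) -> Status {
    if value >= thresholds.critical {
        Status::Critical
    } else if value >= thresholds.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}

/// Host-level status is the worst status among all components.
pub fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Ok)
}
```

Taking the maximum means a single Critical component is enough to mark the whole host Critical, which matches the "aggregate component statuses" step above.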
#### Phase 3: Restore Notification System
- **Status Change Detection**: Track when a component's status changes from OK→Warning/Critical
- **Email Notifications**: Send alerts when status degrades
- **Notification Rate Limiting**: Prevent spam (existing logic)
- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
- **Batched Notifications**: Group multiple alerts into single email
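
One way the Phase 3 pieces could fit together, as a sketch under assumptions: `ChangeTracker`, `maintenance_active`, and `batch_alerts` are hypothetical names, unseen components are assumed to start at OK, and rate limiting and the actual email delivery are left out.

```rust
use std::collections::HashMap;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Remembers the last status seen for each component.
#[derive(Default)]
pub struct ChangeTracker {
    last: HashMap<String, Status>,
}

impl ChangeTracker {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record the new status; return `Some((old, new))` only when the
    /// status got worse. Components not seen before are treated as Ok.
    pub fn update(&mut self, component: &str, new: Status) -> Option<(Status, Status)> {
        let old = self
            .last
            .insert(component.to_string(), new)
            .unwrap_or(Status::Ok);
        if new > old {
            Some((old, new))
        } else {
            None
        }
    }
}

/// Maintenance mode is signalled by the presence of a flag file,
/// e.g. `/tmp/cm-maintenance`.
pub fn maintenance_active(flag: &Path) -> bool {
    flag.exists()
}

/// Group all pending degradations into one message body.
/// Returns `None` when there is nothing to send or alerts are suppressed.
pub fn batch_alerts(changes: &[(String, Status, Status)], maintenance: bool) -> Option<String> {
    if maintenance || changes.is_empty() {
        return None;
    }
    let lines: Vec<String> = changes
        .iter()
        .map(|(name, old, new)| format!("{}: {:?} -> {:?}", name, old, new))
        .collect();
    Some(lines.join("\n"))
}
```

A caller would collect the `Some` results of `update` during one collection cycle, then pass them to `batch_alerts` together with `maintenance_active(Path::new("/tmp/cm-maintenance"))`.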
#### Phase 4: Integration & Testing
- **AgentData Status Fields**: Add status fields to structured data
- **Dashboard Status Display**: Show colored indicators based on actual status
- **End-to-End Testing**: Verify alerts fire when thresholds are exceeded
- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states
### Target Architecture (CORRECT)
**Complete Flow:**
```
Collectors → AgentData → HostStatusManager → Notifications
Direct ZMQ Transmission
Collectors → AgentData → StatusEvaluator → Notifications
ZMQ → Dashboard → Status Display
```
### Implementation Plan
**Key Components:**
1. **Collectors**: Populate AgentData with raw metrics
2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values
3. **Notifications**: Send emails on status changes (OK→Warning/Critical)
4. **Dashboard**: Display data with correct status colors/indicators
#### Atomic Migration (v0.1.139) - Single Complete Rewrite
- **Complete removal** of string metrics system - no legacy support
- **Collectors output structured data directly** - populate `AgentData` with correct mount points
- **HostStatusManager operates on `AgentData`** - status evaluation on structured fields
- **Notifications process structured data** - preserve all notification logic
- **Direct ZMQ transmission** - no bridge conversion code
- **Service tracking preserved** - user-stopped flags, thresholds, all functionality intact
- **Zero backward compatibility** - clean break from string metric architecture
### Implementation Rules
### Benefits
- **Correct Display**: `/` and `/boot` mount points, proper tmpfs in RAM section
- **Performance**: Eliminate string parsing overhead
- **Maintainability**: Type-safe data flow, no string parsing bugs
- **Functionality Preserved**: Status evaluation, notifications, service tracking intact
- **Clean Architecture**: NO legacy fallback code, complete migration to structured data
**MUST COMPLETE ALL:**
- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)
**NO SHORTCUTS:**
- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working
**Success Criteria:**
- Dashboard shows `● nvme0n1 T: 28°C W: 1%` format
- High CPU load triggers Warning status and email alert
- High memory usage triggers Warning status and email alert
- High disk temperature triggers Warning status and email alert
- Failed services trigger Warning status and email alert
- Maintenance mode suppresses notifications as expected
## Implementation Rules