Fix storage display issues and use dynamic versioning

Phase 1 fixes for storage display: - Replace findmnt with lsblk to eliminate bind mount issues (/nix/store) - Add sudo to smartctl commands for permission access - Fix NVMe SMART parsing for Temperature: and Percentage Used: fields - Use dynamic version from CARGO_PKG_VERSION instead of hardcoded strings Storage display should now show correct mount points and temperature/wear. Status evaluation and notifications still need restoration in subsequent phases.
2025-11-24 19:26:09 +01:00
parent 2b2cb2da3e
commit fd7ad23205
7 changed files with 127 additions and 53 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -357,53 +357,88 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl

 ## Completed Architecture Migration (v0.1.131)

-## Agent Architecture Migration Plan (v0.1.139)
+## Complete Fix Plan (v0.1.140)

-**🎯 Goal: Eliminate String Metrics Bridge, Direct Structured Data Collection**
+**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**

-### Current Architecture (v0.1.138)
+### Current Broken State (v0.1.139)

-**Current Flow:**
+**❌ What's Broken:**
 ```
-Collectors → String Metrics → MetricManager.cache
-                           ↘
-                           process_metrics() → HostStatusManager → Notifications
-                           ↘  
-                           broadcast_all_metrics() → Bridge Conversion → AgentData → ZMQ
+✅ Data Collection: Agent collects structured data correctly
+❌ Storage Display: Shows wrong mount points, missing temperature/wear
+❌ Status Evaluation: Everything shows "OK" regardless of actual values
+❌ Notifications: Not working - can't send alerts when systems fail
+❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
 ```

-**Issues:**
- Bridge conversion loses mount point information (`/` becomes `root`, `/boot` becomes `boot`)
- Tmpfs mounts not properly displayed in RAM section
- Unnecessary string parsing complexity and potential bugs
- String-to-JSON conversion introduces data transformation errors
+**Root Cause:**
+During atomic migration, I removed core monitoring functionality and only fixed data collection, making the dashboard useless as a monitoring tool.

-### Target Architecture
+### Complete Fix Plan - Do Everything Right

-**Target Flow:**
+#### Phase 1: Fix Storage Display (CURRENT)
+- ✅ Use `lsblk` instead of `findmnt` (eliminates `/nix/store` bind mount issue)
+- ✅ Add `sudo smartctl` for permissions
+- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`)
+- 🔄 Test that dashboard shows: `● nvme0n1 T: 28°C W: 1%` correctly
+
+#### Phase 2: Restore Status Evaluation System
+- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
+- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical  
+- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
+- **Service Status**: Evaluate service states → Status::Warning if inactive
+- **Overall Host Status**: Aggregate component statuses → host-level status
+
+#### Phase 3: Restore Notification System
+- **Status Change Detection**: Track when component status changes from OK→Warning/Critical
+- **Email Notifications**: Send alerts when status degrades
+- **Notification Rate Limiting**: Prevent spam (existing logic)
+- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
+- **Batched Notifications**: Group multiple alerts into single email
+
+#### Phase 4: Integration & Testing
+- **AgentData Status Fields**: Add status fields to structured data
+- **Dashboard Status Display**: Show colored indicators based on actual status
+- **End-to-End Testing**: Verify alerts fire when thresholds exceeded
+- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states
+
+### Target Architecture (CORRECT)
+
+**Complete Flow:**
 ```
-Collectors → AgentData → HostStatusManager → Notifications
-                      ↘
-                      Direct ZMQ Transmission
+Collectors → AgentData → StatusEvaluator → Notifications
+                      ↘                 ↗  
+                      ZMQ → Dashboard → Status Display
 ```

-### Implementation Plan
+**Key Components:**
+1. **Collectors**: Populate AgentData with raw metrics
+2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values  
+3. **Notifications**: Send emails on status changes (OK→Warning/Critical)
+4. **Dashboard**: Display data with correct status colors/indicators

-#### Atomic Migration (v0.1.139) - Single Complete Rewrite
- **Complete removal** of string metrics system - no legacy support
- **Collectors output structured data directly** - populate `AgentData` with correct mount points
- **HostStatusManager operates on `AgentData`** - status evaluation on structured fields  
- **Notifications process structured data** - preserve all notification logic
- **Direct ZMQ transmission** - no bridge conversion code
- **Service tracking preserved** - user-stopped flags, thresholds, all functionality intact
- **Zero backward compatibility** - clean break from string metric architecture
+### Implementation Rules

-### Benefits
- **Correct Display**: `/` and `/boot` mount points, proper tmpfs in RAM section
- **Performance**: Eliminate string parsing overhead
- **Maintainability**: Type-safe data flow, no string parsing bugs
- **Functionality Preserved**: Status evaluation, notifications, service tracking intact
- **Clean Architecture**: NO legacy fallback code, complete migration to structured data
+**MUST COMPLETE ALL:**
+- Fix storage display to show correct mount points and temperature
+- Restore working status evaluation (thresholds → Status enum)
+- Restore working notifications (email alerts on status changes)  
+- Test that monitoring actually works (alerts fire when appropriate)
+
+**NO SHORTCUTS:**
+- Don't commit partial fixes
+- Don't claim functionality works when it doesn't
+- Test every component thoroughly
+- Keep existing configuration and thresholds working
+
+**Success Criteria:**
+- Dashboard shows `● nvme0n1 T: 28°C W: 1%` format
+- High CPU load triggers Warning status and email alert
+- High memory usage triggers Warning status and email alert  
+- High disk temperature triggers Warning status and email alert
+- Failed services trigger Warning status and email alert
+- Maintenance mode suppresses notifications as expected

 ## Implementation Rules