diff --git a/CLAUDE.md b/CLAUDE.md
index 36953e3..f81f3aa 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -357,53 +357,88 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl
 
 ## Completed Architecture Migration (v0.1.131)
 
-## Agent Architecture Migration Plan (v0.1.139)
+## Complete Fix Plan (v0.1.140)
 
-**🎯 Goal: Eliminate String Metrics Bridge, Direct Structured Data Collection**
+**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**
 
-### Current Architecture (v0.1.138)
+### Current Broken State (v0.1.139)
 
-**Current Flow:**
+**❌ What's Broken:**
 ```
-Collectors → String Metrics → MetricManager.cache
-                                   ↘
-                     process_metrics() → HostStatusManager → Notifications
-                                   ↘
-                     broadcast_all_metrics() → Bridge Conversion → AgentData → ZMQ
+✅ Data Collection: Agent collects structured data correctly
+❌ Storage Display: Shows wrong mount points, missing temperature/wear
+❌ Status Evaluation: Everything shows "OK" regardless of actual values
+❌ Notifications: Not working - can't send alerts when systems fail
+❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
 ```
 
-**Issues:**
-- Bridge conversion loses mount point information (`/` becomes `root`, `/boot` becomes `boot`)
-- Tmpfs mounts not properly displayed in RAM section
-- Unnecessary string parsing complexity and potential bugs
-- String-to-JSON conversion introduces data transformation errors
+**Root Cause:**
+During atomic migration, I removed core monitoring functionality and only fixed data collection, making the dashboard useless as a monitoring tool.
 
-### Target Architecture
+### Complete Fix Plan - Do Everything Right
 
-**Target Flow:**
+#### Phase 1: Fix Storage Display (CURRENT)
+- ✅ Use `lsblk` instead of `findmnt` (eliminates `/nix/store` bind mount issue)
+- ✅ Add `sudo smartctl` for permissions
+- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`)
+- 🔄 Test that dashboard shows: `● nvme0n1 T: 28°C W: 1%` correctly
+
+#### Phase 2: Restore Status Evaluation System
+- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
+- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical
+- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
+- **Service Status**: Evaluate service states → Status::Warning if inactive
+- **Overall Host Status**: Aggregate component statuses → host-level status
+
+#### Phase 3: Restore Notification System
+- **Status Change Detection**: Track when component status changes from OK→Warning/Critical
+- **Email Notifications**: Send alerts when status degrades
+- **Notification Rate Limiting**: Prevent spam (existing logic)
+- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
+- **Batched Notifications**: Group multiple alerts into single email
+
+#### Phase 4: Integration & Testing
+- **AgentData Status Fields**: Add status fields to structured data
+- **Dashboard Status Display**: Show colored indicators based on actual status
+- **End-to-End Testing**: Verify alerts fire when thresholds exceeded
+- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states
+
+### Target Architecture (CORRECT)
+
+**Complete Flow:**
 ```
-Collectors → AgentData → HostStatusManager → Notifications
-                 ↘
-                  Direct ZMQ Transmission
+Collectors → AgentData → StatusEvaluator → Notifications
+                 ↘                       ↗
+                  ZMQ → Dashboard → Status Display
 ```
 
-### Implementation Plan
+**Key Components:**
+1. **Collectors**: Populate AgentData with raw metrics
+2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values
+3. **Notifications**: Send emails on status changes (OK→Warning/Critical)
+4. **Dashboard**: Display data with correct status colors/indicators
 
-#### Atomic Migration (v0.1.139) - Single Complete Rewrite
-- **Complete removal** of string metrics system - no legacy support
-- **Collectors output structured data directly** - populate `AgentData` with correct mount points
-- **HostStatusManager operates on `AgentData`** - status evaluation on structured fields
-- **Notifications process structured data** - preserve all notification logic
-- **Direct ZMQ transmission** - no bridge conversion code
-- **Service tracking preserved** - user-stopped flags, thresholds, all functionality intact
-- **Zero backward compatibility** - clean break from string metric architecture
+### Implementation Rules
 
-### Benefits
-- **Correct Display**: `/` and `/boot` mount points, proper tmpfs in RAM section
-- **Performance**: Eliminate string parsing overhead
-- **Maintainability**: Type-safe data flow, no string parsing bugs
-- **Functionality Preserved**: Status evaluation, notifications, service tracking intact
-- **Clean Architecture**: NO legacy fallback code, complete migration to structured data
+**MUST COMPLETE ALL:**
+- Fix storage display to show correct mount points and temperature
+- Restore working status evaluation (thresholds → Status enum)
+- Restore working notifications (email alerts on status changes)
+- Test that monitoring actually works (alerts fire when appropriate)
+
+**NO SHORTCUTS:**
+- Don't commit partial fixes
+- Don't claim functionality works when it doesn't
+- Test every component thoroughly
+- Keep existing configuration and thresholds working
+
+**Success Criteria:**
+- Dashboard shows `● nvme0n1 T: 28°C W: 1%` format
+- High CPU load triggers Warning status and email alert
+- High memory usage triggers Warning status and email alert
+- High disk temperature triggers Warning status and email alert
+- Failed services trigger Warning status and email alert
+- Maintenance mode suppresses notifications as expected
 
 ## Implementation Rules
 
diff --git a/Cargo.lock b/Cargo.lock
index 28a70a8..d5c314a 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"
 
 [[package]]
 name = "cm-dashboard"
-version = "0.1.138"
+version = "0.1.140"
 dependencies = [
  "anyhow",
  "chrono",
@@ -301,7 +301,7 @@ dependencies = [
 
 [[package]]
 name = "cm-dashboard-agent"
-version = "0.1.138"
+version = "0.1.140"
 dependencies = [
  "anyhow",
  "async-trait",
@@ -324,7 +324,7 @@ dependencies = [
 
 [[package]]
 name = "cm-dashboard-shared"
-version = "0.1.138"
+version = "0.1.140"
 dependencies = [
  "chrono",
  "serde",
diff --git a/agent/Cargo.toml b/agent/Cargo.toml
index af4cc54..3cb1c8e 100644
--- a/agent/Cargo.toml
+++ b/agent/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-agent"
-version = "0.1.139"
+version = "0.1.140"
 edition = "2021"
 
 [dependencies]
diff --git a/agent/src/agent.rs b/agent/src/agent.rs
index 4cf6c88..24b236c 100644
--- a/agent/src/agent.rs
+++ b/agent/src/agent.rs
@@ -147,7 +147,7 @@ impl Agent {
         debug!("Starting structured data collection");
 
         // Initialize empty AgentData
-        let mut agent_data = AgentData::new(self.hostname.clone(), "v0.1.139".to_string());
+        let mut agent_data = AgentData::new(self.hostname.clone(), env!("CARGO_PKG_VERSION").to_string());
 
         // Collect data from all collectors
         for collector in &self.collectors {
diff --git a/agent/src/collectors/disk.rs b/agent/src/collectors/disk.rs
index b073586..2394817 100644
--- a/agent/src/collectors/disk.rs
+++ b/agent/src/collectors/disk.rs
@@ -105,13 +105,13 @@ impl DiskCollector {
         Ok(())
     }
 
-    /// Get mount devices mapping from /proc/mounts
+    /// Get block devices and their mount points using lsblk
     async fn get_mount_devices(&self) -> Result, CollectorError> {
-        let output = Command::new("findmnt")
-            .args(&["-rn", "-o", "TARGET,SOURCE"])
+        let output = Command::new("lsblk")
+            .args(&["-rn", "-o", "NAME,MOUNTPOINT"])
             .output()
             .map_err(|e| CollectorError::SystemRead {
-                path: "mount points".to_string(),
+                path: "block devices".to_string(),
                 error: e.to_string(),
             })?;
 
@@ -119,18 +119,21 @@ impl DiskCollector {
         for line in String::from_utf8_lossy(&output.stdout).lines() {
             let parts: Vec<&str> = line.split_whitespace().collect();
             if parts.len() >= 2 {
-                let mount_point = parts[0];
-                let device = parts[1];
+                let device_name = parts[0];
+                let mount_point = parts[1];
 
-                // Skip special filesystems
-                if !device.starts_with('/') || device.contains("loop") {
+                // Skip swap partitions and unmounted devices
+                if mount_point == "[SWAP]" || mount_point.is_empty() {
                     continue;
                 }
 
-                mount_devices.insert(mount_point.to_string(), device.to_string());
+                // Convert device name to full path
+                let device_path = format!("/dev/{}", device_name);
+                mount_devices.insert(mount_point.to_string(), device_path);
             }
         }
 
+        debug!("Found {} mounted block devices", mount_devices.len());
         Ok(mount_devices)
     }
 
@@ -319,8 +322,8 @@ impl DiskCollector {
 
     /// Get SMART data for a single drive
     async fn get_smart_data(&self, drive_name: &str) -> Result {
-        let output = Command::new("smartctl")
-            .args(&["-a", &format!("/dev/{}", drive_name)])
+        let output = Command::new("sudo")
+            .args(&["smartctl", "-a", &format!("/dev/{}", drive_name)])
             .output()
             .map_err(|e| CollectorError::SystemRead {
                 path: format!("SMART data for {}", drive_name),
@@ -328,6 +331,21 @@ impl DiskCollector {
             })?;
 
         let output_str = String::from_utf8_lossy(&output.stdout);
+        let error_str = String::from_utf8_lossy(&output.stderr);
+
+        // Debug logging for SMART command results
+        debug!("SMART output for {}: status={}, stdout_len={}, stderr={}",
+               drive_name, output.status, output_str.len(), error_str);
+
+        if !output.status.success() {
+            debug!("SMART command failed for {}: {}", drive_name, error_str);
+            // Return unknown data rather than failing completely
+            return Ok(SmartData {
+                health: "UNKNOWN".to_string(),
+                temperature_celsius: None,
+                wear_percent: None,
+            });
+        }
 
         let mut health = "UNKNOWN".to_string();
         let mut temperature = None;
@@ -342,13 +360,22 @@ impl DiskCollector {
                 }
             }
 
-            // Temperature parsing
+            // Temperature parsing for different drive types
             if line.contains("Temperature_Celsius") || line.contains("Airflow_Temperature_Cel") {
+                // Traditional SATA drives: attribute table format
                 if let Some(temp_str) = line.split_whitespace().nth(9) {
                     if let Ok(temp) = temp_str.parse::() {
                         temperature = Some(temp);
                     }
                 }
+            } else if line.starts_with("Temperature:") {
+                // NVMe drives: simple "Temperature: 27 Celsius" format
+                let parts: Vec<&str> = line.split_whitespace().collect();
+                if parts.len() >= 2 {
+                    if let Ok(temp) = parts[1].parse::() {
+                        temperature = Some(temp);
+                    }
+                }
             }
 
             // Wear level parsing for SSDs
@@ -359,6 +386,18 @@ impl DiskCollector {
                     }
                 }
             }
+            // NVMe wear parsing: "Percentage Used: 1%"
+            if line.contains("Percentage Used:") {
+                if let Some(percent_part) = line.split("Percentage Used:").nth(1) {
+                    if let Some(percent_str) = percent_part.split_whitespace().next() {
+                        if let Some(percent_clean) = percent_str.strip_suffix('%') {
+                            if let Ok(wear) = percent_clean.parse::() {
+                                wear_percent = Some(wear);
+                            }
+                        }
+                    }
+                }
+            }
         }
 
         Ok(SmartData {
diff --git a/dashboard/Cargo.toml b/dashboard/Cargo.toml
index 4d85575..2b5cb41 100644
--- a/dashboard/Cargo.toml
+++ b/dashboard/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard"
-version = "0.1.139"
+version = "0.1.140"
 edition = "2021"
 
 [dependencies]
diff --git a/shared/Cargo.toml b/shared/Cargo.toml
index d032fe6..289fe75 100644
--- a/shared/Cargo.toml
+++ b/shared/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-shared"
-version = "0.1.139"
+version = "0.1.140"
 edition = "2021"
 
 [dependencies]
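
Phase 2 and Phase 3 of the plan in CLAUDE.md (thresholds → Status enum, aggregate to a host-level status, detect OK→Warning/Critical changes) are not implemented by this diff. A minimal sketch of that logic follows, assuming a three-level status and a simple warning/critical threshold pair. The names `Thresholds`, `evaluate`, `aggregate`, and `is_degradation` are hypothetical and are not taken from the codebase; the real `Status` type and threshold configuration may look different.

```rust
/// Component status. Variants are declared from best to worst so that the
/// derived ordering makes the worst status compare greatest.
/// (Assumed shape; the project's real Status enum may have more variants.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Hypothetical threshold pair for one metric (CPU load, memory usage
/// percent, disk temperature, ...).
struct Thresholds {
    warning: f32,
    critical: f32,
}

impl Thresholds {
    /// Map a raw metric value to a status. Values at or above a threshold
    /// count as having reached it.
    fn evaluate(&self, value: f32) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

/// Host-level status is the worst component status; with no components
/// the host is treated as Ok.
fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Ok)
}

/// True when a component got worse since the previous evaluation, which is
/// the condition the plan names for sending an alert.
fn is_degradation(previous: Status, current: Status) -> bool {
    current > previous
}
```

Deriving `Ord` on the enum keeps aggregation to a single `max()` call. Rate limiting, maintenance mode, and batching would sit between `is_degradation` and the actual email send, and are left out of this sketch.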