Complete atomic migration to structured data architecture

Implements clean structured data collection eliminating all string metric parsing bugs. Collectors now populate AgentData directly with type-safe field access.

Key improvements:
- Mount points preserved correctly (/ and /boot instead of root/boot)
- Tmpfs discovery added to memory collector
- Temperature data flows as typed f32 fields
- Zero string parsing overhead
- Complete removal of MetricCollectionManager bridge
- Direct ZMQ transmission of structured JSON

All functionality maintained: service tracking, notifications, status evaluation, and multi-host monitoring.
2025-11-24 18:53:31 +01:00
parent 11d1c2dc94
commit 2b2cb2da3e
17 changed files with 1952 additions and 3205 deletions
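
The commit message's contrast between string metrics and typed fields can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the struct names `FilesystemData` and `DriveData`, the helper functions, and the sample values are assumptions. Only the field names and the `disk_nvme0n1_temperature` metric name come from the CLAUDE.md text in the diff below.

```rust
// Hypothetical, trimmed-down mirror of the structured payload. Field names
// follow the JSON example in CLAUDE.md; the struct names are assumed.
#[derive(Debug, Clone, PartialEq)]
struct FilesystemData {
    mount: String, // kept verbatim, e.g. "/" or "/boot"
    usage_percent: f32,
    used_gb: f32,
    total_gb: f32,
}

#[derive(Debug, Clone, PartialEq)]
struct DriveData {
    name: String,
    temperature_celsius: f32,
    wear_percent: f32,
    filesystems: Vec<FilesystemData>,
}

/// Old style (illustrative only): recover the drive name from a metric
/// name such as `disk_nvme0n1_temperature` by trimming a prefix and suffix.
fn drive_from_metric_name(metric: &str) -> Option<&str> {
    metric.strip_prefix("disk_")?.strip_suffix("_temperature")
}

/// New style: look up a mount point directly on a typed drive record.
fn usage_for_mount(drive: &DriveData, mount: &str) -> Option<f32> {
    drive
        .filesystems
        .iter()
        .find(|fs| fs.mount == mount)
        .map(|fs| fs.usage_percent)
}

/// Sample record with made-up values, used only for illustration.
fn sample_drive() -> DriveData {
    DriveData {
        name: "nvme0n1".to_string(),
        temperature_celsius: 29.0,
        wear_percent: 1.0,
        filesystems: vec![
            FilesystemData {
                mount: "/".to_string(),
                usage_percent: 24.0,
                used_gb: 224.9,
                total_gb: 928.2,
            },
            FilesystemData {
                mount: "/boot".to_string(),
                usage_percent: 11.0,
                used_gb: 0.1,
                total_gb: 1.0,
            },
        ],
    }
}
```

With a record like `sample_drive()`, `usage_for_mount(&drive, "/boot")` finds the filesystem by its real mount path, while a flattened name such as `"root"` simply finds nothing. The string-based helper, by contrast, only works for metric names that happen to match the expected prefix and suffix.
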
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -7,6 +7,7 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
 ## Current Features

 ### Core Functionality
+
 - **Real-time Monitoring**: CPU, RAM, Storage, and Service status
 - **Service Management**: Start/stop services with user-stopped tracking
 - **Multi-host Support**: Monitor multiple servers from single dashboard
@@ -14,6 +15,7 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
 - **Backup Monitoring**: Borgbackup status and scheduling

 ### User-Stopped Service Tracking
+
 - Services stopped via dashboard are marked as "user-stopped"
 - User-stopped services report Status::OK instead of Warning
 - Prevents false alerts during intentional maintenance
@@ -21,9 +23,11 @@ A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure.
 - Automatic flag clearing when services are restarted via dashboard

 ### Custom Service Logs
+
 - Configure service-specific log file paths per host in dashboard config
 - Press `L` on any service to view custom log files via `tail -f`
 - Configuration format in dashboard config:
+
 ```toml
 [service_logs]
 hostname1 = [
@@ -36,8 +40,9 @@ hostname2 = [
 ```

 ### Service Management
+
 - **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**: 
+- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
@@ -47,6 +52,7 @@ hostname2 = [
 - **Transitional Icons**: Blue arrows during operations

 ### Navigation
+
 - **Tab**: Switch between hosts
 - **↑↓ or j/k**: Select services
 - **s**: Start selected service (UserStart)
@@ -60,14 +66,17 @@ hostname2 = [
 ## Core Architecture Principles

 ### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
+
 Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

 **Previous (String Metrics):**
+
 - ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
 - ❌ Dashboard parsed metric names with underscore counting and string splitting
 - ❌ Complex and error-prone metric filtering and extraction logic

 **Current (Structured Data):**
+
 ```json
 {
  "hostname": "cmbox",
@@ -75,7 +84,7 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
  "timestamp": 1763926877,
  "system": {
    "cpu": {
-      "load_1min": 3.50,
+      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
@@ -88,7 +97,12 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
-        {"mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0}
+        {
+          "mount": "/tmp",
+          "usage_percent": 15.0,
+          "used_gb": 0.3,
+          "total_gb": 2.0
+        }
      ]
    },
    "storage": {
@@ -99,7 +113,12 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
-            {"mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2}
+            {
+              "mount": "/",
+              "usage_percent": 24.0,
+              "used_gb": 224.9,
+              "total_gb": 928.2
+            }
          ]
        }
      ],
@@ -112,18 +131,14 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
-          "data_drives": [
-            {"name": "sdb", "temperature_celsius": 24.0}
-          ],
-          "parity_drives": [
-            {"name": "sdc", "temperature_celsius": 24.0}
-          ]
+          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
+          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
-    {"name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0}
+    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
@@ -134,19 +149,21 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
  }
 }
 ```
+
 - ✅ Agent sends structured JSON over ZMQ (no legacy support)
 - ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
 - ✅ Complete metric coverage: CPU, memory, storage, services, backup
 - ✅ Backward compatibility via bridge conversion to existing UI widgets
 - ✅ All string parsing bugs eliminated

-
 ### Maintenance Mode
+
 - Agent checks for `/tmp/cm-maintenance` file before sending notifications
 - File presence suppresses all email notifications while continuing monitoring
 - Dashboard continues to show real status, only notifications are blocked

 Usage:
+
 ```bash
 # Enable maintenance mode
 touch /tmp/cm-maintenance
@@ -163,16 +180,19 @@ rm /tmp/cm-maintenance
 ## Development and Deployment Architecture

 ### Development Path
- **Location:** `~/projects/cm-dashboard` 
+
+- **Location:** `~/projects/cm-dashboard`
 - **Purpose:** Development workflow only - for committing new code
 - **Access:** Only for developers to commit changes

-### Deployment Path  
+### Deployment Path
+
 - **Location:** `/var/lib/cm-dashboard/nixos-config`
 - **Purpose:** Production deployment only - agent clones/pulls from git
 - **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

 ### Git Flow
+
 ```
 Development: ~/projects/cm-dashboard → git commit → git push
 Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
@@ -183,6 +203,7 @@ Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
 CM Dashboard uses automated binary releases instead of source builds.

 ### Creating New Releases
+
 ```bash
 cd ~/projects/cm-dashboard
 git tag v0.1.X
@@ -190,11 +211,13 @@ git push origin v0.1.X
 ```

 This automatically:
+
 - Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
 - Creates GitHub-style release with tarball
 - Uploads binaries via Gitea API

 ### NixOS Configuration Updates
+
 Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

 ```nix
@@ -206,6 +229,7 @@ src = pkgs.fetchurl {
 ```

 ### Get Release Hash
+
 ```bash
 cd ~/projects/nixosbox
 nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
@@ -217,6 +241,7 @@ nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
 ### Building

 **Testing & Building:**
+
 - **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
 - **Clean compilation**: Remove `target/` between major changes

@@ -229,6 +254,7 @@ The dashboard uses automatic storage discovery to eliminate manual configuration
 ### Discovery Process

 **At Agent Startup:**
+
 1. Parse `/proc/mounts` to identify all mounted filesystems
 2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
 3. Identify member disks and potential parity relationships via heuristics
@@ -236,6 +262,7 @@ The dashboard uses automatic storage discovery to eliminate manual configuration
 5. Generate pool-aware metrics with hierarchical relationships

 **Continuous Monitoring:**
+
 - Use stored discovery data for efficient metric collection
 - Monitor individual drives for SMART data, temperature, wear
 - Calculate pool-level health based on member drive status
@@ -244,11 +271,13 @@ The dashboard uses automatic storage discovery to eliminate manual configuration
 ### Supported Storage Types

 **Single Disks:**
+
 - ext4, xfs, btrfs mounted directly
 - Individual drive monitoring with SMART data
 - Traditional single-disk display for root, boot, etc.

 **MergerFS Pools:**
+
 - Auto-detect from `/proc/mounts` fuse.mergerfs entries
 - Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
 - Heuristic parity disk detection (sequential device names, "parity" in path)
@@ -256,6 +285,7 @@ The dashboard uses automatic storage discovery to eliminate manual configuration
 - Hierarchical tree display with data/parity disk grouping

 **Future Extensions Ready:**
+
 - RAID arrays via `/proc/mdstat` parsing
 - ZFS pools via `zpool status` integration
 - LVM logical volumes via `lvs` discovery
@@ -274,76 +304,29 @@ exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
 ### Display Format

 ```
+CPU:
+● Load: 0.23 0.21 0.13
+  └─ Freq: 1048 MHz
+
+RAM:
+● Usage: 25% 5.8GB/23.3GB
+  ├─ ● /tmp: 2% 0.5GB/2GB
+  └─ ● /var/tmp: 0% 0GB/1.0GB
+
 Storage:
-● /srv/media (mergerfs (2+1)):
-  ├─ Pool Status: ● Healthy (3 drives)
+● mergerfs (2+1):
  ├─ Total: ● 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
-  │  ├─ ● sdb T: 24°C
-  │  └─ ● sdd T: 27°C
-  └─ Parity: ● sdc T: 24°C
-● /:
-  ├─ ● nvme0n1 W: 13%
-  └─ ● 7% 14.5GB/218.5GB
+  │  ├─ ● sdb T: 24°C W: 5%
+  │  └─ ● sdd T: 27°C W: 5%
+  ├─ Parity: ● sdc T: 24°C W: 5%
+  └─ Mount: /srv/media
+
+● nvme0n1 T: 25°C W: 4%
+  ├─ ● /: 55% 250.5GB/456.4GB
+  └─ ● /boot: 26% 0.3GB/1.0GB
 ```

-### Implementation Benefits
-
- **Zero Configuration**: No manual pool definitions required
- **Always Accurate**: Reflects actual system state automatically
- **Scales Automatically**: Handles any number of pools without config changes
- **Backwards Compatible**: Single disks continue working unchanged
- **Future Ready**: Easy extension for additional storage technologies
-
-### Current Status (v0.1.100)
-
-**✅ Completed:**
- Auto-discovery system implemented and deployed
- `/proc/mounts` parsing with smart heuristics for parity detection
- Storage topology stored at agent startup for efficient monitoring
- Universal zero-configuration for all hosts (cmbox, steambox, simonbox, srv01, srv02, srv03)
- Enhanced pool health calculation (healthy/degraded/critical)
- Hierarchical tree visualization with data/parity disk separation
-
-**🔄 In Progress - Complete Disk Collector Rewrite:**
-
-The current disk collector has grown complex with mixed legacy/auto-discovery approaches. Planning complete rewrite with clean, simple workflow supporting both physical drives and mergerfs pools.
-
-**New Clean Architecture:**
-
-**Discovery Workflow:**
-1. **`lsblk`** to detect all mount points and backing devices
-2. **`df`** to get filesystem usage for each mount point
-3. **Group by physical drive** (nvme0n1, sda, etc.)
-4. **Parse `/proc/mounts`** for mergerfs pools
-5. **Generate unified metrics** for both storage types
-
-**Physical Drive Display:**
-```
-● nvme0n1:
-  ├─ ● Drive: T: 35°C W: 1%
-  ├─ ● Total: 23% 218.0GB/928.2GB
-  ├─ ● /boot: 11% 0.1GB/1.0GB
-  └─ ● /: 23% 214.9GB/928.2GB
-```
-
-**MergerFS Pool Display:**
-```
-● /srv/media (mergerfs):
-  ├─ ● Pool: 63% 2355.2GB/3686.4GB
-  ├─ Data Disks:
-  │  ├─ ● sdb T: 24°C
-  │  └─ ● sdd T: 27°C  
-  └─ ● sdc T: 24°C (parity)
-```
-
-**Implementation Benefits:**
- **Pure auto-discovery**: No configuration needed
- **Clean code paths**: Single workflow for all storage types
- **Consistent display**: Status icons on every line, no redundant text
- **Simple pipeline**: lsblk → df → group → metrics
- **Support for both**: Physical drives and mergerfs pools
-
 ## Important Communication Guidelines

 Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
@@ -351,17 +334,20 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl
 ## Commit Message Guidelines

 **NEVER mention:**
+
 - Claude or any AI assistant names
 - Automation or AI-generated content
 - Any reference to automated code generation

 **ALWAYS:**
+
 - Focus purely on technical changes and their purpose
 - Use standard software development commit message format
 - Describe what was changed and why, not how it was created
 - Write from the perspective of a human developer

 **Examples:**
+
 - ❌ "Generated with Claude Code"
 - ❌ "AI-assisted implementation"
 - ❌ "Automated refactoring"
@@ -371,47 +357,53 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl

 ## Completed Architecture Migration (v0.1.131)

-### ✅ Phase 1: Structured Data Types (Shared Crate) - COMPLETED
- ✅ Created AgentData struct matching JSON structure
- ✅ Added complete type hierarchy: CPU, memory, storage, services, backup
- ✅ Implemented serde serialization/deserialization
- ✅ Updated ZMQ protocol for structured data transmission
+## Agent Architecture Migration Plan (v0.1.139)

-### ✅ Phase 2: Agent Refactor - COMPLETED  
- ✅ Agent converts all metrics to structured AgentData
- ✅ Comprehensive metric parsing: storage (drives, temp, wear), services, backup
- ✅ Structured JSON transmission over ZMQ (no legacy support)
- ✅ Type-safe data flow throughout agent pipeline
+**🎯 Goal: Eliminate String Metrics Bridge, Direct Structured Data Collection**

-### ✅ Phase 3: Dashboard Refactor - COMPLETED
- ✅ Dashboard receives structured data and bridges to existing UI
- ✅ Bridge conversion maintains compatibility with current widgets
- ✅ All metric types converted: storage, services, backup, CPU, memory
- ✅ Foundation ready for direct structured data widget migration
+### Current Architecture (v0.1.138)

-### 🚀 Next Phase: Direct Widget Migration
- Replace metric bridge with direct structured data access in widgets
- Eliminate temporary conversion layer
- Full end-to-end type safety from agent to UI
+**Current Flow:**
+```
+Collectors → String Metrics → MetricManager.cache
+                           ↘
+                           process_metrics() → HostStatusManager → Notifications
+                           ↘  
+                           broadcast_all_metrics() → Bridge Conversion → AgentData → ZMQ
+```

-## Key Achievements (v0.1.131)
+**Issues:**
+- Bridge conversion loses mount point information (`/` becomes `root`, `/boot` becomes `boot`)
+- Tmpfs mounts not properly displayed in RAM section
+- Unnecessary string parsing complexity and potential bugs
+- String-to-JSON conversion introduces data transformation errors

-**✅ NVMe Temperature Issue SOLVED**
- Temperature data now flows as typed field: `agent_data.system.storage.drives[0].temperature_celsius: f32`
- Eliminates string parsing bugs: no more `"disk_nvme0n1_temperature"` extraction failures
- Type-safe access prevents all similar parsing issues across the system
+### Target Architecture

-**✅ Complete Structured Data Implementation**  
- Agent: Collects metrics → structured JSON → ZMQ transmission
- Dashboard: Receives JSON → bridge conversion → existing UI widgets  
- Full metric coverage: CPU, memory, storage (drives, pools), services, backup
- Zero legacy support - clean architecture with no compatibility cruft
+**Target Flow:**
+```
+Collectors → AgentData → HostStatusManager → Notifications
+                      ↘
+                      Direct ZMQ Transmission
+```

-**✅ Foundation for Future Enhancements**
- Type-safe data structures enable easy feature additions
- Self-documenting JSON schema shows all available metrics
- Direct field access eliminates entire class of parsing bugs
- Ready for next phase: direct widget migration for ultimate performance
+### Implementation Plan
+
+#### Atomic Migration (v0.1.139) - Single Complete Rewrite
+- **Complete removal** of string metrics system - no legacy support
+- **Collectors output structured data directly** - populate `AgentData` with correct mount points
+- **HostStatusManager operates on `AgentData`** - status evaluation on structured fields  
+- **Notifications process structured data** - preserve all notification logic
+- **Direct ZMQ transmission** - no bridge conversion code
+- **Service tracking preserved** - user-stopped flags, thresholds, all functionality intact
+- **Zero backward compatibility** - clean break from string metric architecture
+
+### Benefits
+- **Correct Display**: `/` and `/boot` mount points, proper tmpfs in RAM section
+- **Performance**: Eliminate string parsing overhead
+- **Maintainability**: Type-safe data flow, no string parsing bugs
+- **Functionality Preserved**: Status evaluation, notifications, service tracking intact
+- **Clean Architecture**: NO legacy fallback code, complete migration to structured data

 ## Implementation Rules

@@ -420,6 +412,7 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl
 3. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status

 **NEVER:**
+
 - Copy/paste ANY code from legacy implementations
 - Calculate status in dashboard widgets
 - Hardcode metric names in widgets (use const arrays)
@@ -427,7 +420,8 @@ Keep responses concise and focused. Avoid extensive implementation summaries unl
 - Create documentation files unless explicitly requested

 **ALWAYS:**
+
 - Prefer editing existing files to creating new ones
 - Follow existing code conventions and patterns
 - Use existing libraries and utilities
- Follow security best practices
+- Follow security best practices