cm-dashboard/CLAUDE.md

# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

## Current Features

### Core Functionality
- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard

### Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in dashboard config:
```toml
[service_logs]
hostname1 = [
  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```

### Service Management
- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**
- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**
```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.50,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {"mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0}
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {"mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2}
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [
            {"name": "sdb", "temperature_celsius": 24.0}
          ],
          "parity_drives": [
            {"name": "sdc", "temperature_celsius": 24.0}
          ]
        }
      ]
    }
  },
  "services": [
    {"name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0}
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated


### Maintenance Mode
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while continuing monitoring
- Dashboard continues to show real status, only notifications are blocked

Usage:
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Development and Deployment Architecture

### Development Path
- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path
- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow
```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases
```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:
- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API

### NixOS Configuration Updates
Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**
- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes

## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**
1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**
- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**
- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**
- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**
- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```

### Display Format

```
Storage:
● /srv/media (mergerfs (2+1)):
  ├─ Pool Status: ● Healthy (3 drives)
  ├─ Total: ● 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
  │  ├─ ● sdb T: 24°C
  │  └─ ● sdd T: 27°C
  └─ Parity: ● sdc T: 24°C
● /:
  ├─ ● nvme0n1 W: 13%
  └─ ● 7% 14.5GB/218.5GB
```

### Implementation Benefits

- **Zero Configuration**: No manual pool definitions required
- **Always Accurate**: Reflects actual system state automatically
- **Scales Automatically**: Handles any number of pools without config changes
- **Backwards Compatible**: Single disks continue working unchanged
- **Future Ready**: Easy extension for additional storage technologies

### Current Status (v0.1.100)

**✅ Completed:**
- Auto-discovery system implemented and deployed
- `/proc/mounts` parsing with smart heuristics for parity detection
- Storage topology stored at agent startup for efficient monitoring
- Universal zero-configuration for all hosts (cmbox, steambox, simonbox, srv01, srv02, srv03)
- Enhanced pool health calculation (healthy/degraded/critical)
- Hierarchical tree visualization with data/parity disk separation

**🔄 In Progress - Complete Disk Collector Rewrite:**

The current disk collector has grown complex with mixed legacy/auto-discovery approaches. Planning complete rewrite with clean, simple workflow supporting both physical drives and mergerfs pools.

**New Clean Architecture:**

**Discovery Workflow:**
1. **`lsblk`** to detect all mount points and backing devices
2. **`df`** to get filesystem usage for each mount point
3. **Group by physical drive** (nvme0n1, sda, etc.)
4. **Parse `/proc/mounts`** for mergerfs pools
5. **Generate unified metrics** for both storage types

**Physical Drive Display:**
```
● nvme0n1:
  ├─ ● Drive: T: 35°C W: 1%
  ├─ ● Total: 23% 218.0GB/928.2GB
  ├─ ● /boot: 11% 0.1GB/1.0GB
  └─ ● /: 23% 214.9GB/928.2GB
```

**MergerFS Pool Display:**
```
● /srv/media (mergerfs):
  ├─ ● Pool: 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
  │  ├─ ● sdb T: 24°C
  │  └─ ● sdd T: 27°C
  └─ ● sdc T: 24°C (parity)
```

**Implementation Benefits:**
- **Pure auto-discovery**: No configuration needed
- **Clean code paths**: Single workflow for all storage types
- **Consistent display**: Status icons on every line, no redundant text
- **Simple pipeline**: lsblk → df → group → metrics
- **Support for both**: Physical drives and mergerfs pools

## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## Completed Architecture Migration (v0.1.131)

### ✅ Phase 1: Structured Data Types (Shared Crate) - COMPLETED
- ✅ Created AgentData struct matching JSON structure
- ✅ Added complete type hierarchy: CPU, memory, storage, services, backup
- ✅ Implemented serde serialization/deserialization
- ✅ Updated ZMQ protocol for structured data transmission

### ✅ Phase 2: Agent Refactor - COMPLETED
- ✅ Agent converts all metrics to structured AgentData
- ✅ Comprehensive metric parsing: storage (drives, temp, wear), services, backup
- ✅ Structured JSON transmission over ZMQ (no legacy support)
- ✅ Type-safe data flow throughout agent pipeline

### ✅ Phase 3: Dashboard Refactor - COMPLETED
- ✅ Dashboard receives structured data and bridges to existing UI
- ✅ Bridge conversion maintains compatibility with current widgets
- ✅ All metric types converted: storage, services, backup, CPU, memory
- ✅ Foundation ready for direct structured data widget migration

### 🚀 Next Phase: Direct Widget Migration
- Replace metric bridge with direct structured data access in widgets
- Eliminate temporary conversion layer
- Full end-to-end type safety from agent to UI

## Key Achievements (v0.1.131)

**✅ NVMe Temperature Issue SOLVED**
- Temperature data now flows as typed field: `agent_data.system.storage.drives[0].temperature_celsius: f32`
- Eliminates string parsing bugs: no more `"disk_nvme0n1_temperature"` extraction failures
- Type-safe access prevents all similar parsing issues across the system

**✅ Complete Structured Data Implementation**
- Agent: Collects metrics → structured JSON → ZMQ transmission
- Dashboard: Receives JSON → bridge conversion → existing UI widgets
- Full metric coverage: CPU, memory, storage (drives, pools), services, backup
- Zero legacy support - clean architecture with no compatibility cruft

**✅ Foundation for Future Enhancements**
- Type-safe data structures enable easy feature additions
- Self-documenting JSON schema shows all available metrics
- Direct field access eliminates entire class of parsing bugs
- Ready for next phase: direct widget migration for ultimate performance

## Implementation Rules

1. **Agent Status Authority**: Agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status

**NEVER:**
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices