# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure, built on ZMQ-based metric collection and a structured-data metrics architecture.
## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, storage, and service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts (see the sketch below)
- The flag is cleared automatically when a service is restarted via the dashboard
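A minimal sketch of how that persistence could work, assuming the flags live in a JSON file under `/var/lib/cm-agent/` and that the agent can use the `serde_json` crate; the path and helper names are illustrative, not the actual implementation:

```rust
use std::collections::HashSet;
use std::fs;
use std::path::Path;

/// Illustrative location for the persisted flags; the real agent may store them elsewhere.
const STATE_FILE: &str = "/var/lib/cm-agent/user_stopped.json";

/// Load the set of user-stopped service names, or start empty if no state exists yet.
fn load_user_stopped() -> HashSet<String> {
    fs::read_to_string(STATE_FILE)
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or_default()
}

/// Persist the set so the flags survive agent restarts.
fn save_user_stopped(set: &HashSet<String>) -> std::io::Result<()> {
    if let Some(dir) = Path::new(STATE_FILE).parent() {
        fs::create_dir_all(dir)?;
    }
    fs::write(STATE_FILE, serde_json::to_string_pretty(set).unwrap())
}

/// Mark a service as user-stopped (dashboard sent UserStop).
fn mark_user_stopped(name: &str) -> std::io::Result<()> {
    let mut set = load_user_stopped();
    set.insert(name.to_string());
    save_user_stopped(&set)
}

/// Clear the flag when the dashboard starts the service again (UserStart).
fn clear_user_stopped(name: &str) -> std::io::Result<()> {
    let mut set = load_user_stopped();
    set.remove(name);
    save_user_stopped(&set)
}
```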
### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
    { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
    { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
    { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
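One way the dashboard could deserialize this table, sketched with the `serde` and `toml` crates; the struct names are illustrative and only the fields shown above are assumed:

```rust
use std::collections::HashMap;
use serde::Deserialize;

/// Illustrative config types mirroring the `[service_logs]` table above.
#[derive(Debug, Deserialize)]
struct ServiceLogEntry {
    service_name: String,
    log_file_path: String,
}

#[derive(Debug, Deserialize)]
struct DashboardConfig {
    /// hostname → per-service log file locations
    #[serde(default)]
    service_logs: HashMap<String, Vec<ServiceLogEntry>>,
}

fn load_config(path: &str) -> Result<DashboardConfig, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&text)?)
}
```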
### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**
```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {
          "mount": "/tmp",
          "usage_percent": 15.0,
          "used_gb": 0.3,
          "total_gb": 2.0
        }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {
              "mount": "/",
              "usage_percent": 24.0,
              "used_gb": 224.9,
              "total_gb": 928.2
            }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```
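As an illustration of the type-safe access this enables, the payload can be deserialized into plain structs on the dashboard side. The sketch below uses `serde`/`serde_json`, covers only a subset of the fields shown above, and uses illustrative struct names; access such as `data.system.storage.drives[0].temperature_celsius` then becomes a compile-time-checked field path rather than a parsed metric-name string.

```rust
use serde::Deserialize;

// Minimal sketch of the structured payload; field names follow the JSON above,
// but the struct names and the subset of fields shown here are illustrative.
#[derive(Debug, Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    timestamp: u64,
    system: SystemData,
}

#[derive(Debug, Deserialize)]
struct SystemData {
    cpu: CpuData,
    storage: StorageData,
}

#[derive(Debug, Deserialize)]
struct CpuData {
    load_1min: f64,
    load_5min: f64,
    load_15min: f64,
}

#[derive(Debug, Deserialize)]
struct StorageData {
    drives: Vec<DriveData>,
}

#[derive(Debug, Deserialize)]
struct DriveData {
    name: String,
    temperature_celsius: f64,
    wear_percent: f64,
}

/// Unknown fields (memory, pools, services, backup, ...) are ignored by serde by default.
fn parse(json: &str) -> serde_json::Result<AgentData> {
    serde_json::from_str(json)
}
```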
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```
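On the agent side, the gate can be as small as a path-existence check before each notification is sent; a minimal sketch in which the helper and the `send_email` placeholder are illustrative:

```rust
use std::path::Path;

/// Notifications are suppressed while this sentinel file exists.
const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

fn maintenance_mode_active() -> bool {
    Path::new(MAINTENANCE_FLAG).exists()
}

fn maybe_send_alert(subject: &str, body: &str) {
    if maintenance_mode_active() {
        // Keep monitoring, but swallow the alert during maintenance windows.
        return;
    }
    send_email(subject, body);
}

/// Placeholder for the real notification transport.
fn send_email(_subject: &str, _body: &str) {}
```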
## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources (see the sketch after this list)
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships
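A minimal sketch of steps 1-2, assuming the standard `/proc/mounts` field order (`source mountpoint fstype options dump pass`) and that the mergerfs source column carries the colon-separated branch list (its default); the type and function names are illustrative:

```rust
use std::fs;

/// A discovered MergerFS pool: where it is mounted and which branch paths feed it.
#[derive(Debug)]
struct MergerfsPool {
    mount_point: String,
    branches: Vec<String>,
}

/// Scan /proc/mounts for fuse.mergerfs entries and split their colon-separated sources.
fn discover_mergerfs_pools() -> std::io::Result<Vec<MergerfsPool>> {
    let mounts = fs::read_to_string("/proc/mounts")?;
    let mut pools = Vec::new();
    for line in mounts.lines() {
        // Format: <source> <mount point> <fs type> <options> <dump> <pass>
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() >= 3 && fields[2] == "fuse.mergerfs" {
            pools.push(MergerfsPool {
                mount_point: fields[1].to_string(),
                branches: fields[0].split(':').map(str::to_string).collect(),
            });
        }
    }
    Ok(pools)
}
```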
**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical) - see the sketch below
- Hierarchical tree display with data/parity disk grouping
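How pool-level health might be derived from member drive status, as a sketch; the `DriveStatus` type and the healthy/degraded/critical policy below are illustrative assumptions, not the agent's actual rules:

```rust
/// Simplified per-drive health as seen by the collector (illustrative).
#[derive(Clone, Copy, PartialEq)]
enum DriveStatus {
    Ok,
    Warning, // e.g. high temperature or elevated wear
    Failed,  // SMART failure or missing drive
}

/// Aggregate member drives into a pool health label. The assumption here:
/// a single failed member is recoverable from parity (degraded), anything
/// more is critical, and warnings alone degrade the pool without failing it.
fn pool_health(data: &[DriveStatus], parity: &[DriveStatus]) -> &'static str {
    let all = || data.iter().chain(parity.iter());
    let failed = all().filter(|s| **s == DriveStatus::Failed).count();
    let warning = all().filter(|s| **s == DriveStatus::Warning).count();
    match failed {
        0 if warning == 0 => "healthy",
        0 | 1 => "degraded",
        _ => "critical",
    }
}
```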
**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true

# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```
### Display Format

```
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz

RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB

Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│  ├─ ● sdb T: 24°C W: 5%
│  └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media

● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
```
## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Completed Architecture Migration (v0.1.131)

## Complete Fix Plan (v0.1.140)

**🎯 Goal: Fix ALL Issues - Display AND Core Functionality**

### Current Broken State (v0.1.139)

**❌ What's Broken:**

```
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
```

**Root Cause:**

During the atomic migration, I removed core monitoring functionality and fixed only data collection, which left the dashboard useless as a monitoring tool.

### Complete Fix Plan - Do Everything Right

#### Phase 1: Fix Storage Display (CURRENT)

- ✅ Use `lsblk` instead of `findmnt` (eliminates the `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:`; see the parsing sketch below)
- 🔄 Test that the dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly
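The NVMe values come from `smartctl` health-log lines such as `Temperature: 29 Celsius` and `Percentage Used: 1%`; a minimal parsing sketch, where the function name and return shape are illustrative:

```rust
/// Parse drive temperature and wear from `smartctl -a` output for an NVMe device.
/// Assumes the smartmontools NVMe health log format, e.g.:
///   Temperature:                        29 Celsius
///   Percentage Used:                    1%
fn parse_nvme_smart(output: &str) -> (Option<f64>, Option<f64>) {
    let mut temperature = None;
    let mut wear = None;
    for line in output.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("Temperature:") {
            // Take the first numeric token and ignore the trailing "Celsius".
            temperature = rest.split_whitespace().next().and_then(|v| v.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            wear = rest.trim().trim_end_matches('%').trim().parse().ok();
        }
    }
    (temperature, wear)
}
```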
#### Phase 2: Restore Status Evaluation System

- **CPU Status**: Evaluate load averages against thresholds → Status::Warning/Critical
- **Memory Status**: Evaluate usage_percent against thresholds → Status::Warning/Critical
- **Storage Status**: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- **Service Status**: Evaluate service states → Status::Warning if inactive
- **Overall Host Status**: Aggregate component statuses → host-level status (see the sketch below)
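A sketch of the evaluator shape, assuming a simple `Status` enum and warning/critical threshold pairs taken from configuration; the numbers and names below are illustrative, not the existing production values:

```rust
/// Severity levels shared by agent and dashboard (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// A generic warning/critical threshold pair, e.g. loaded from the agent config.
struct Threshold {
    warning: f64,
    critical: f64,
}

impl Threshold {
    /// Map a raw value onto a Status: check critical first, then warning.
    fn evaluate(&self, value: f64) -> Status {
        if value >= self.critical {
            Status::Critical
        } else if value >= self.warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}

fn main() {
    // Example: a load average against illustrative thresholds.
    let cpu_load = Threshold { warning: 4.0, critical: 8.0 };
    assert_eq!(cpu_load.evaluate(3.58), Status::Ok);
    assert_eq!(cpu_load.evaluate(5.0), Status::Warning);

    // Overall host status is the worst component status.
    let components = [Status::Ok, Status::Warning, Status::Ok];
    let host = components.iter().copied().max().unwrap_or(Status::Ok);
    assert_eq!(host, Status::Warning);
}
```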
#### Phase 3: Restore Notification System

- **Status Change Detection**: Track when a component status changes from OK → Warning/Critical (see the sketch below)
- **Email Notifications**: Send alerts when status degrades
- **Notification Rate Limiting**: Prevent spam (existing logic)
- **Maintenance Mode**: Honor `/tmp/cm-maintenance` to suppress alerts
- **Batched Notifications**: Group multiple alerts into a single email
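Status-change detection can be a comparison against the last status recorded per component; a self-contained sketch in which the tracker and enum names are illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Remembers the last status per component and reports transitions worth alerting on.
#[derive(Default)]
struct StatusTracker {
    last: HashMap<String, Status>,
}

impl StatusTracker {
    /// Returns Some((old, new)) when the status degraded (e.g. Ok -> Warning),
    /// which is the signal to queue an email notification.
    fn update(&mut self, component: &str, new: Status) -> Option<(Status, Status)> {
        let old = self.last.insert(component.to_string(), new).unwrap_or(Status::Ok);
        if new > old {
            Some((old, new))
        } else {
            None
        }
    }
}

fn main() {
    let mut tracker = StatusTracker::default();
    assert!(tracker.update("cpu", Status::Ok).is_none());
    assert!(tracker.update("cpu", Status::Warning).is_some()); // degraded => alert
    assert!(tracker.update("cpu", Status::Warning).is_none()); // unchanged => no spam
}
```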
#### Phase 4: Integration & Testing

- **AgentData Status Fields**: Add status fields to the structured data
- **Dashboard Status Display**: Show colored indicators based on actual status
- **End-to-End Testing**: Verify alerts fire when thresholds are exceeded
- **Verify All Thresholds**: CPU load, memory usage, disk temperature, service states

### Target Architecture (CORRECT)

**Complete Flow:**

```
Collectors → AgentData → StatusEvaluator → Notifications
                  ↘                          ↗
                    ZMQ → Dashboard → Status Display
```
**Key Components:**

1. **Collectors**: Populate AgentData with raw metrics
2. **StatusEvaluator**: Apply thresholds to AgentData → Status enum values
3. **Notifications**: Send emails on status changes (OK → Warning/Critical)
4. **Dashboard**: Display data with correct status colors/indicators

### Implementation Rules

**MUST COMPLETE ALL:**

- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)

**NO SHORTCUTS:**

- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working

**Success Criteria:**

- Dashboard shows the `● nvme0n1 T: 28°C W: 1%` format
- High CPU load triggers Warning status and an email alert
- High memory usage triggers Warning status and an email alert
- High disk temperature triggers Warning status and an email alert
- Failed services trigger Warning status and an email alert
- Maintenance mode suppresses notifications as expected
## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status (see the sketch below)
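A sketch of rules 2 and 3: a widget declares the metric names it subscribes to in a const array and derives its own status as the worst of the agent-provided statuses; the metric strings, types, and function names here are illustrative:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// The metric names a widget subscribes to live in a const array instead of
/// being hardcoded inline throughout the widget code (illustrative names).
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_load_5min", "cpu_frequency"];

/// Widget status is the worst agent-calculated status among its metrics;
/// the dashboard never recomputes thresholds itself.
fn widget_status(agent_statuses: &HashMap<String, Status>, metrics: &[&str]) -> Status {
    metrics
        .iter()
        .filter_map(|name| agent_statuses.get(*name).copied())
        .max()
        .unwrap_or(Status::Ok)
}

fn main() {
    let mut statuses = HashMap::new();
    statuses.insert("cpu_load_1min".to_string(), Status::Warning);
    statuses.insert("cpu_load_5min".to_string(), Status::Ok);
    assert_eq!(widget_status(&statuses, CPU_WIDGET_METRICS), Status::Warning);
}
```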
**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices