# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from a single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling
### User-Stopped Service Tracking

- Services stopped via the dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- The flag is cleared automatically when a service is restarted via the dashboard
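The override described above can be sketched as follows. This is an illustrative example only: `service_status` and this `Status` enum are hypothetical names, not the agent's actual API.

```rust
// Hypothetical sketch of the user-stopped override; `service_status`
// and `Status` are illustrative names, not the actual agent API.
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Warning,
}

/// A service that is inactive but was stopped intentionally via the
/// dashboard should not raise a warning.
fn service_status(is_active: bool, user_stopped: bool) -> Status {
    if is_active || user_stopped {
        Status::Ok
    } else {
        Status::Warning
    }
}

fn main() {
    // Stopped by an operator via the dashboard: no alert.
    assert_eq!(service_status(false, true), Status::Ok);
    // Crashed or stopped outside the dashboard: warn.
    assert_eq!(service_status(false, false), Status::Warning);
    println!("ok");
}
```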
### Custom Service Logs

- Configure service-specific log file paths per host in the dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in the dashboard config:

```toml
[service_logs]
hostname1 = [
  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations
### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard
## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {
          "mount": "/tmp",
          "usage_percent": 15.0,
          "used_gb": 0.3,
          "total_gb": 2.0
        }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {
              "mount": "/",
              "usage_percent": 24.0,
              "used_gb": 224.9,
              "total_gb": 928.2
            }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
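The type-safe access path can be sketched with plain structs mirroring a slice of the JSON above. This is only an illustration of the shape: the real `AgentData` presumably derives serde's `Serialize`/`Deserialize` and carries many more fields.

```rust
// Illustrative sketch of the typed shapes implied by the JSON payload.
// The real `AgentData` likely uses serde derives; these hand-written
// structs only demonstrate compiler-checked field access.
#[allow(dead_code)]
struct Filesystem {
    mount: String,
    usage_percent: f64,
}

#[allow(dead_code)]
struct Drive {
    name: String,
    temperature_celsius: f64,
    filesystems: Vec<Filesystem>,
}

struct Storage {
    drives: Vec<Drive>,
}

struct System {
    storage: Storage,
}

struct AgentData {
    hostname: String,
    system: System,
}

fn main() {
    let data = AgentData {
        hostname: "cmbox".into(),
        system: System {
            storage: Storage {
                drives: vec![Drive {
                    name: "nvme0n1".into(),
                    temperature_celsius: 29.0,
                    filesystems: vec![Filesystem {
                        mount: "/".into(),
                        usage_percent: 24.0,
                    }],
                }],
            },
        },
    };
    // No string parsing: the compiler checks every field access.
    let temp = data.system.storage.drives[0].temperature_celsius;
    assert_eq!(temp, 29.0);
    println!("{} drive temp: {}", data.hostname, temp);
}
```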
### Cached Collector Architecture (🚧 PLANNED)

**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.

**Previous (Sequential Blocking):**

```
Every 1 second:
└─ collect_all_data() [BLOCKS for 2-10+ seconds]
   ├─ CPU (fast: 10ms)
   ├─ Memory (fast: 20ms)
   ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
   ├─ Service disk usage (slow: 2-8s per service)
   └─ Docker (medium: 500ms)
└─ send_via_zmq() [Only after ALL collection completes]

Result: If any collector takes >10s → "host offline" false alert
```

**New (Cached Independent Collectors):**

```
Shared Cache: Arc<RwLock<AgentData>>

Background Collectors (independent async tasks):
├─ Fast collectors (CPU, RAM, Network)
│  └─ Update cache every 1 second
├─ Medium collectors (Services, Docker)
│  └─ Update cache every 5 seconds
└─ Slow collectors (Disk usage, SMART data)
   └─ Update cache every 60 seconds

ZMQ Sender (separate async task):
Every 1 second:
└─ Read current cache
   └─ Send via ZMQ [Always instant, never blocked]
```

**Benefits:**

- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking

**Implementation:**

- Shared `AgentData` cache wrapped in `Arc<RwLock<>>`
- Each collector spawned as an independent tokio task
- Collectors update their section of the cache at their own rate
- ZMQ sender reads the cache every 1s and transmits
- Stale data is acceptable for slow-changing metrics (disk usage, SMART)
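A minimal sketch of the planned cache sharing, using `std::thread` and `std::sync::RwLock` to stand in for the tokio tasks and `tokio::sync::RwLock` a real agent would use. The `AgentData` fields here are placeholders.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

// Minimal stand-in for the real AgentData payload.
#[derive(Clone, Default)]
struct AgentData {
    cpu_load: f64,
    smart_ok: bool,
}

fn main() {
    let cache = Arc::new(RwLock::new(AgentData::default()));

    // Fast collector: updates only its slice of the cache.
    let fast = Arc::clone(&cache);
    let fast_task = thread::spawn(move || {
        fast.write().unwrap().cpu_load = 0.42;
    });

    // Slow collector: may take seconds, but only its final write
    // briefly holds the lock.
    let slow = Arc::clone(&cache);
    let slow_task = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50)); // simulate a slow SMART query
        slow.write().unwrap().smart_ok = true;
    });

    fast_task.join().unwrap();
    slow_task.join().unwrap();

    // Sender: reads whatever is currently cached and transmits it,
    // without ever waiting on collection itself.
    let snapshot = cache.read().unwrap().clone();
    assert_eq!(snapshot.cpu_load, 0.42);
    assert!(snapshot.smart_ok);
    println!("send snapshot: load={}", snapshot.cpu_load);
}
```

In the real design the sender would tick on a 1-second interval and accept whatever snapshot is present, stale or not, which is what keeps ZMQ transmission decoupled from collector latency.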
### Maintenance Mode

- Agent checks for the `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```
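The check itself amounts to a file-existence gate in front of the notifier. In this sketch, `should_notify` and its parameters are illustrative names, not the agent's actual API.

```rust
use std::path::Path;

// Sketch of the notification gate; `should_notify` and its parameters
// are illustrative names, not the agent's actual API.
fn should_notify(maintenance_flag: &Path, status_is_bad: bool) -> bool {
    // Monitoring always runs; only outbound notifications are gated
    // on the flag file's presence.
    status_is_bad && !maintenance_flag.exists()
}

fn main() {
    let flag = Path::new("/tmp/cm-maintenance");
    if should_notify(flag, true) {
        println!("would send notification email");
    } else {
        println!("notification suppressed (maintenance mode or status OK)");
    }
}
```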
## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - the agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```
## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates a GitHub-style release with a tarball
- Uploads binaries via the Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store the discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, and wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
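Steps 1-2 above can be sketched as a parser over one `/proc/mounts` line. Field positions follow the standard `<source> <mountpoint> <fstype> <options> ...` layout; `parse_mergerfs` is an illustrative name, and the real collector may structure this differently.

```rust
// Illustrative sketch of MergerFS detection from a /proc/mounts line;
// `parse_mergerfs` is a hypothetical name, not the agent's actual API.
fn parse_mergerfs(line: &str) -> Option<(String, Vec<String>)> {
    // /proc/mounts lines: "<source> <mountpoint> <fstype> <options> ..."
    let fields: Vec<&str> = line.split_whitespace().collect();
    if fields.len() < 3 || fields[2] != "fuse.mergerfs" {
        return None;
    }
    // MergerFS encodes member branches as a colon-separated source,
    // e.g. "/mnt/disk1:/mnt/disk2".
    let members = fields[0].split(':').map(str::to_string).collect();
    Some((fields[1].to_string(), members))
}

fn main() {
    let line = "/mnt/disk1:/mnt/disk2 /srv/media fuse.mergerfs rw,relatime 0 0";
    let (mount, members) = parse_mergerfs(line).unwrap();
    assert_eq!(mount, "/srv/media");
    assert_eq!(members, vec!["/mnt/disk1", "/mnt/disk2"]);
    println!("pool at {} with {} members", mount, members.len());
}
```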
### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detected from `fuse.mergerfs` entries in `/proc/mounts`
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery
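The healthy/degraded/critical calculation mentioned above could look like the following. The aggregation rule here (any failed drive → critical, any warning → degraded) is an assumption for illustration, not the agent's documented thresholds.

```rust
// Hypothetical sketch of pool health aggregation; the rule below is an
// assumption for illustration, not the agent's actual thresholds.
#[derive(Debug, PartialEq)]
enum PoolHealth {
    Healthy,
    Degraded,
    Critical,
}

/// Aggregate member-drive states into a pool status: all drives OK is
/// healthy, any failed drive is critical, anything in between degraded.
fn pool_health(drives_ok: usize, drives_warn: usize, drives_failed: usize) -> PoolHealth {
    if drives_failed > 0 {
        PoolHealth::Critical
    } else if drives_warn > 0 {
        PoolHealth::Degraded
    } else if drives_ok > 0 {
        PoolHealth::Healthy
    } else {
        PoolHealth::Critical // no visible members: treat as critical
    }
}

fn main() {
    assert_eq!(pool_health(3, 0, 0), PoolHealth::Healthy);
    assert_eq!(pool_health(2, 1, 0), PoolHealth::Degraded);
    assert_eq!(pool_health(2, 0, 1), PoolHealth::Critical);
    println!("ok");
}
```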
### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```
### Display Format

```
Network:
● eno1:
  ├─ ip: 192.168.30.105
  └─ tailscale0: 100.125.108.16
● eno2:
  └─ ip: 192.168.32.105
CPU:
● Load: 0.23 0.21 0.13
  └─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
  ├─ ● /tmp: 2% 0.5GB/2GB
  └─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● 844B9A25 T: 25C W: 4%
  ├─ ● /: 55% 250.5GB/456.4GB
  └─ ● /boot: 26% 0.3GB/1.0GB
● mergerfs /srv/media:
  ├─ ● 63% 2355.2GB/3686.4GB
  ├─ ● Data_1: WDZQ8H8D T: 28°C
  ├─ ● Data_2: GGA04461 T: 28°C
  └─ ● Parity: WDZS8RY0 T: 29°C
Backup:
● WD-WCC7K1234567 T: 32°C W: 12%
  ├─ Last: 2h ago (12.3GB)
  ├─ Next: in 22h
  └─ ● Usage: 45% 678GB/1.5TB
```
## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Implementation Rules

1. **Agent Status Authority**: The agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: The dashboard aggregates individual metric statuses into a widget status

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices
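Rules 2-3 and the const-array convention can be sketched together as follows. The metric names, the `Status` ordering, and `widget_status` are assumptions for illustration, not the dashboard's actual code.

```rust
// Illustrative sketch of status aggregation plus the const-array
// convention; names and ordering are assumptions, not actual code.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
enum Status {
    Ok,
    Warning,
    Critical,
}

// Widgets reference their metrics via a const array, never inline strings.
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_temperature"];

/// Widget status is the worst status among its subscribed metrics:
/// the agent computed each individual status, the dashboard only aggregates.
fn widget_status(statuses: &[Status]) -> Status {
    statuses
        .iter()
        .copied()
        .fold(Status::Ok, |worst, s| if s > worst { s } else { worst })
}

fn main() {
    let incoming = [Status::Ok, Status::Warning];
    assert_eq!(widget_status(&incoming), Status::Warning);
    println!("{} metrics aggregated: {:?}", CPU_WIDGET_METRICS.len(), widget_status(&incoming));
}
```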