cm-dashboard/CLAUDE.md

# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

## Current Features

### Core Functionality

- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling

### User-Stopped Service Tracking

- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard

### Custom Service Logs

- Configure service-specific log file paths per host in dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in dashboard config:

```toml
[service_logs]
hostname1 = [
  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```

### Service Management

- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
  - `s` - Start service (sends UserStart command)
  - `S` - Stop service (sends UserStop command)
  - `J` - Show service logs (journalctl in tmux popup)
  - `L` - Show custom log files (tail -f custom paths in tmux popup)
  - `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations

### Navigation

- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **R**: Rebuild current host
- **B**: Run backup on current host
- **q**: Quit dashboard

## Core Architecture Principles

### Structured Data Architecture (✅ IMPLEMENTED v0.1.131)

Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.

**Previous (String Metrics):**

- ❌ Agent sent individual metrics with string names like `disk_nvme0n1_temperature`
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic

**Current (Structured Data):**

```json
{
  "hostname": "cmbox",
  "agent_version": "v0.1.131",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.5,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {
          "mount": "/tmp",
          "usage_percent": 15.0,
          "used_gb": 0.3,
          "total_gb": 2.0
        }
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {
              "mount": "/",
              "usage_percent": 24.0,
              "used_gb": 224.9,
              "total_gb": 928.2
            }
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
          "parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
        }
      ]
    }
  },
  "services": [
    { "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
```

- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: `data.system.storage.drives[0].temperature_celsius`
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated

### Maintenance Mode

- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while continuing monitoring
- Dashboard continues to show real status, only notifications are blocked

Usage:

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

## Development and Deployment Architecture

### Development Path

- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes

### Deployment Path

- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

### Git Flow

```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```

## Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

### Creating New Releases

```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```

This automatically:

- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API

### NixOS Configuration Updates

Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

```nix
version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};
```

### Get Release Hash

```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

### Building

**Testing & Building:**

- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes

## Enhanced Storage Pool Visualization

### Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

### Discovery Process

**At Agent Startup:**

1. Parse `/proc/mounts` to identify all mounted filesystems
2. Detect MergerFS pools by analyzing `fuse.mergerfs` mount sources
3. Identify member disks and potential parity relationships via heuristics
4. Store discovered storage topology for continuous monitoring
5. Generate pool-aware metrics with hierarchical relationships

**Continuous Monitoring:**

- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization

### Supported Storage Types

**Single Disks:**

- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.

**MergerFS Pools:**

- Auto-detect from `/proc/mounts` fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping

**Future Extensions Ready:**

- RAID arrays via `/proc/mdstat` parsing
- ZFS pools via `zpool status` integration
- LVM logical volumes via `lvs` discovery

### Configuration

```toml
[collectors.disk]
enabled = true
auto_discover = true  # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
```

### Display Format

```
CPU:
● Load: 0.23 0.21 0.13
  └─ Freq: 1048 MHz

RAM:
● Usage: 25% 5.8GB/23.3GB
  ├─ ● /tmp: 2% 0.5GB/2GB
  └─ ● /var/tmp: 0% 0GB/1.0GB

Storage:
● mergerfs (2+1):
  ├─ Total: ● 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
  │  ├─ ● sdb T: 24°C W: 5%
  │  └─ ● sdd T: 27°C W: 5%
  ├─ Parity: ● sdc T: 24°C W: 5%
  └─ Mount: /srv/media

● nvme0n1 T: 25C W: 4%
  ├─ ● /: 55% 250.5GB/456.4GB
  └─ ● /boot: 26% 0.3GB/1.0GB

Backup:
● WD-WCC7K1234567 T: 32°C W: 12%
  ├─ Last: 2h ago (12.3GB)
  ├─ Next: in 22h
  └─ ● Usage: 45% 678GB/1.5TB
```

## Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## Completed Architecture Migration (v0.1.131)

## ✅ COMPLETE MONITORING SYSTEM RESTORATION (v0.1.141)

**🎉 SUCCESS: All Issues Fixed - Complete Functional Monitoring System**

### ✅ Completed Implementation (v0.1.141)

**All Major Issues Resolved:**
```
✅ Data Collection: Agent collects structured data correctly
✅ Storage Display: Perfect format with correct mount points and temperature/wear
✅ Status Evaluation: All metrics properly evaluated against thresholds
✅ Notifications: Working email alerts on status changes
✅ Thresholds: All collectors using configured thresholds for status calculation
✅ Build Information: NixOS version displayed correctly
✅ Mount Point Consistency: Stable, sorted display order
```

### ✅ All Phases Completed Successfully

#### ✅ Phase 1: Storage Display - COMPLETED
- ✅ Use `lsblk` instead of `findmnt` (eliminated `/nix/store` bind mount issue)
- ✅ Add `sudo smartctl` for permissions (SMART data collection working)
- ✅ Fix NVMe SMART parsing (`Temperature:` and `Percentage Used:` fields)
- ✅ Consistent filesystem/tmpfs sorting (no more random order swapping)
- ✅ **VERIFIED**: Dashboard shows `● nvme0n1 T: 28°C W: 1%` correctly

#### ✅ Phase 2: Status Evaluation System - COMPLETED
- ✅ **CPU Status**: Load averages and temperature evaluated against `HysteresisThresholds`
- ✅ **Memory Status**: Usage percentage evaluated against thresholds
- ✅ **Storage Status**: Drive temperature, health, and filesystem usage evaluated
- ✅ **Service Status**: Service states properly tracked and evaluated
- ✅ **Status Fields**: All AgentData structures include status information
- ✅ **Threshold Integration**: All collectors use their configured thresholds

#### ✅ Phase 3: Notification System - COMPLETED
- ✅ **Status Change Detection**: Agent tracks status between collection cycles
- ✅ **Email Notifications**: Alerts sent on degradation (OK→Warning/Critical, Warning→Critical)
- ✅ **Notification Content**: Detailed alerts with metric values and timestamps
- ✅ **NotificationManager Integration**: Fully restored and operational
- ✅ **Maintenance Mode**: `/tmp/cm-maintenance` file support maintained

#### ✅ Phase 4: Integration & Testing - COMPLETED
- ✅ **AgentData Status Fields**: All structured data includes status evaluation
- ✅ **Status Processing**: Agent applies thresholds at collection time
- ✅ **End-to-End Flow**: Collection → Evaluation → Notification → Display
- ✅ **Dynamic Versioning**: Agent version from `CARGO_PKG_VERSION`
- ✅ **Build Information**: NixOS generation display restored

### ✅ Final Architecture - WORKING

**Complete Operational Flow:**
```
Collectors → AgentData (with Status) → NotificationManager → Email Alerts
                                    ↘                        ↗
                                     ZMQ → Dashboard → Perfect Display
```

**Operational Components:**
1. ✅ **Collectors**: Populate AgentData with metrics AND status evaluation
2. ✅ **Status Evaluation**: `HysteresisThresholds.evaluate()` applied per collector
3. ✅ **Notifications**: Email alerts on status change detection
4. ✅ **Display**: Correct mount points, temperature, wear, and build information

### ✅ Success Criteria - ALL MET

**Display Requirements:**
- ✅ Dashboard shows `● nvme0n1 T: 28°C W: 1%` format perfectly
- ✅ Mount points show `/` and `/boot` (not `root`/`boot`)
- ✅ Build information shows actual NixOS version (not "unknown")
- ✅ Consistent sorting eliminates random order changes

**Monitoring Requirements:**
- ✅ High CPU load triggers Warning/Critical status and email alert
- ✅ High memory usage triggers Warning/Critical status and email alert
- ✅ High disk temperature triggers Warning/Critical status and email alert
- ✅ Failed services trigger Warning/Critical status and email alert
- ✅ Maintenance mode suppresses notifications as expected

### 🚀 Production Ready

**CM Dashboard v0.1.141 is a complete, functional infrastructure monitoring system:**

- **Real-time Monitoring**: All system components with 1-second intervals
- **Intelligent Alerting**: Email notifications on threshold violations
- **Perfect Display**: Accurate mount points, temperatures, and system information
- **Status-Aware**: All metrics evaluated against configurable thresholds
- **Production Ready**: Full monitoring capabilities restored

**The monitoring system is fully operational and ready for production use.**

## Implementation Rules

1. **Agent Status Authority**: Agent calculates status for each metric using thresholds
2. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
3. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status

**NEVER:**

- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested

**ALWAYS:**

- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices