CM Dashboard - Infrastructure Monitoring TUI

Overview

A high-performance Rust TUI dashboard for monitoring CMTEC infrastructure, built on ZMQ-based metric collection with an individual-metrics architecture.

Current Features

Core Functionality

  • Real-time Monitoring: CPU, RAM, Storage, and Service status
  • Service Management: Start/stop services with user-stopped tracking
  • Multi-host Support: Monitor multiple servers from single dashboard
  • NixOS Integration: System rebuild via SSH + tmux popup
  • Backup Monitoring: Borgbackup status and scheduling

User-Stopped Service Tracking

  • Services stopped via dashboard are marked as "user-stopped"
  • User-stopped services report Status::OK instead of Warning
  • Prevents false alerts during intentional maintenance
  • Persistent storage survives agent restarts
  • Automatic flag clearing when services are restarted via dashboard
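
A minimal sketch of what the agent-side bookkeeping could look like; the file path, Status variants, and function names are illustrative assumptions, not the actual implementation:

use std::collections::HashSet;
use std::fs;

// Hypothetical persistence location so the flags survive agent restarts.
const USER_STOPPED_FILE: &str = "/var/lib/cm-agent/user-stopped.json";

enum Status {
    Ok,
    Warning,
}

fn load_user_stopped() -> HashSet<String> {
    fs::read_to_string(USER_STOPPED_FILE)
        .ok()
        .and_then(|s| serde_json::from_str::<HashSet<String>>(&s).ok())
        .unwrap_or_default()
}

fn save_user_stopped(set: &HashSet<String>) {
    if let Ok(json) = serde_json::to_string(set) {
        let _ = fs::write(USER_STOPPED_FILE, json);
    }
}

// An inactive service only warns if it was not stopped intentionally.
fn service_status(name: &str, active: bool, user_stopped: &HashSet<String>) -> Status {
    if active || user_stopped.contains(name) {
        Status::Ok
    } else {
        Status::Warning
    }
}

A UserStart command would remove the service from the set and call save_user_stopped again, which is what clears the flag automatically when the service is restarted via the dashboard.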

Custom Service Logs

  • Configure service-specific log file paths per host in dashboard config
  • Press L on any service to view custom log files via tail -f
  • Configuration format in dashboard config:
[service_logs]
hostname1 = [
  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]

Service Management

  • Direct Control: Arrow keys (↑↓) or vim keys (j/k) navigate services
  • Service Actions:
    • s - Start service (sends UserStart command)
    • S - Stop service (sends UserStop command)
    • J - Show service logs (journalctl in tmux popup)
    • L - Show custom log files (tail -f custom paths in tmux popup)
    • R - Rebuild current host
  • Visual Status: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
  • Transitional Icons: Blue arrows during operations

Navigation

  • Tab: Switch between hosts
  • ↑↓ or j/k: Select services
  • s: Start selected service (UserStart)
  • S: Stop selected service (UserStop)
  • J: Show service logs (journalctl)
  • L: Show custom log files
  • R: Rebuild current host
  • B: Run backup on current host
  • q: Quit dashboard

Core Architecture Principles

Structured Data Architecture (Planned Migration)

The current system uses string-based metrics with complex parsing. A migration to structured JSON data is planned to eliminate the fragile string manipulation.

Current (String Metrics):

  • Agent sends individual metrics with string names like disk_nvme0n1_temperature
  • Dashboard parses metric names with underscore counting and string splitting
  • Complex and error-prone metric filtering and extraction logic

Target (Structured Data):

{
  "hostname": "cmbox",
  "agent_version": "v0.1.130",
  "timestamp": 1763926877,
  "system": {
    "cpu": {
      "load_1min": 3.50,
      "load_5min": 3.57,
      "load_15min": 3.58,
      "frequency_mhz": 1500,
      "temperature_celsius": 45.2
    },
    "memory": {
      "usage_percent": 25.0,
      "total_gb": 23.3,
      "used_gb": 5.9,
      "swap_total_gb": 10.7,
      "swap_used_gb": 0.99,
      "tmpfs": [
        {"mount": "/tmp", "usage_percent": 15.0, "used_gb": 0.3, "total_gb": 2.0}
      ]
    },
    "storage": {
      "drives": [
        {
          "name": "nvme0n1",
          "health": "PASSED",
          "temperature_celsius": 29.0,
          "wear_percent": 1.0,
          "filesystems": [
            {"mount": "/", "usage_percent": 24.0, "used_gb": 224.9, "total_gb": 928.2}
          ]
        }
      ],
      "pools": [
        {
          "name": "srv_media",
          "mount": "/srv/media",
          "type": "mergerfs",
          "health": "healthy",
          "usage_percent": 63.0,
          "used_gb": 2355.2,
          "total_gb": 3686.4,
          "data_drives": [
            {"name": "sdb", "temperature_celsius": 24.0}
          ],
          "parity_drives": [
            {"name": "sdc", "temperature_celsius": 24.0}
          ]
        }
      ]
    }
  },
  "services": [
    {"name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0}
  ],
  "backup": {
    "status": "completed",
    "last_run": 1763920000,
    "next_scheduled": 1764006400,
    "total_size_gb": 150.5,
    "repository_health": "ok"
  }
}
  • Agent sends structured JSON over ZMQ
  • Dashboard accesses data directly: data.system.storage.drives[0].temperature_celsius
  • Type safety eliminates the string-parsing bugs
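
For the shared crate (Phase 1 below), the typed counterpart could look roughly like this; the struct and field names are an illustrative subset derived from the JSON above, not the final definitions:

use serde::{Deserialize, Serialize};

// Illustrative subset of the structured payload; the real shared-crate types
// would also cover memory, tmpfs, pools, services, and backup.
#[derive(Serialize, Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    timestamp: u64,
    system: SystemData,
}

#[derive(Serialize, Deserialize)]
struct SystemData {
    cpu: CpuData,
    storage: StorageData,
}

#[derive(Serialize, Deserialize)]
struct CpuData {
    load_1min: f64,
    load_5min: f64,
    load_15min: f64,
    frequency_mhz: u64,
    temperature_celsius: f64,
}

#[derive(Serialize, Deserialize)]
struct StorageData {
    drives: Vec<DriveData>,
}

#[derive(Serialize, Deserialize)]
struct DriveData {
    name: String,
    health: String,
    temperature_celsius: f64,
    wear_percent: f64,
    filesystems: Vec<FilesystemData>,
}

#[derive(Serialize, Deserialize)]
struct FilesystemData {
    mount: String,
    usage_percent: f64,
    used_gb: f64,
    total_gb: f64,
}

Widgets would then read data.system.storage.drives[0].temperature_celsius as plain field access, with serde rejecting malformed payloads at deserialization time.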

Maintenance Mode

  • Agent checks for /tmp/cm-maintenance file before sending notifications
  • File presence suppresses all email notifications while continuing monitoring
  • Dashboard continues to show real status; only notifications are blocked

Usage:

# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
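
On the agent side this can be a plain file-existence check in front of the notification path; send_email below stands in for whatever the agent actually uses:

use std::path::Path;

const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

fn maybe_notify(subject: &str, body: &str) {
    // Monitoring continues regardless; only the email notification is suppressed.
    if Path::new(MAINTENANCE_FLAG).exists() {
        return;
    }
    send_email(subject, body);
}

fn send_email(_subject: &str, _body: &str) {
    // placeholder for the agent's real notification path
}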

Development and Deployment Architecture

Development Path

  • Location: ~/projects/cm-dashboard
  • Purpose: Development workflow only - for committing new code
  • Access: Only for developers to commit changes

Deployment Path

  • Location: /var/lib/cm-dashboard/nixos-config
  • Purpose: Production deployment only - agent clones/pulls from git
  • Workflow: git pull → /var/lib/cm-dashboard/nixos-config → nixos-rebuild

Git Flow

Development: ~/projects/cm-dashboard → git commit → git push
Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild

Automated Binary Release System

CM Dashboard uses automated binary releases instead of source builds.

Creating New Releases

cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X

This automatically:

  • Builds static binaries with RUSTFLAGS="-C target-feature=+crt-static"
  • Creates GitHub-style release with tarball
  • Uploads binaries via Gitea API

NixOS Configuration Updates

Edit ~/projects/nixosbox/hosts/services/cm-dashboard.nix:

version = "v0.1.X";
src = pkgs.fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-NEW_HASH_HERE";
};

Get Release Hash

cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"

Building

Testing & Building:

  • Workspace builds: nix-shell -p openssl pkg-config --run "cargo build --workspace"
  • Clean compilation: Remove target/ between major changes

Enhanced Storage Pool Visualization

Auto-Discovery Architecture

The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.

Discovery Process

At Agent Startup:

  1. Parse /proc/mounts to identify all mounted filesystems
  2. Detect MergerFS pools by analyzing fuse.mergerfs mount sources
  3. Identify member disks and potential parity relationships via heuristics
  4. Store discovered storage topology for continuous monitoring
  5. Generate pool-aware metrics with hierarchical relationships

Continuous Monitoring:

  • Use stored discovery data for efficient metric collection
  • Monitor individual drives for SMART data, temperature, wear
  • Calculate pool-level health based on member drive status
  • Generate enhanced metrics for dashboard visualization

Supported Storage Types

Single Disks:

  • ext4, xfs, btrfs mounted directly
  • Individual drive monitoring with SMART data
  • Traditional single-disk display for root, boot, etc.

MergerFS Pools:

  • Auto-detect from /proc/mounts fuse.mergerfs entries
  • Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
  • Heuristic parity disk detection (sequential device names, "parity" in path)
  • Pool health calculation (healthy/degraded/critical)
  • Hierarchical tree display with data/parity disk grouping
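
A sketch of that detection step, assuming the standard /proc/mounts field order (source, mount point, fs type, options) and a colon-separated mergerfs source string as in the example above:

use std::fs;

struct MergerfsPool {
    mount: String,
    members: Vec<String>,
}

fn discover_mergerfs_pools() -> Vec<MergerfsPool> {
    let mounts = fs::read_to_string("/proc/mounts").unwrap_or_default();
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let source = fields.next()?;
            let mount = fields.next()?;
            let fs_type = fields.next()?;
            if fs_type != "fuse.mergerfs" {
                return None;
            }
            // The mount source lists member branches separated by ':'.
            let members = source.split(':').map(str::to_string).collect();
            Some(MergerfsPool { mount: mount.to_string(), members })
        })
        .collect()
}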

Future Extensions Ready:

  • RAID arrays via /proc/mdstat parsing
  • ZFS pools via zpool status integration
  • LVM logical volumes via lvs discovery

Configuration

[collectors.disk]
enabled = true
auto_discover = true  # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]

Display Format

Storage:
● /srv/media (mergerfs (2+1)):
  ├─ Pool Status: ● Healthy (3 drives)
  ├─ Total: ● 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
  │  ├─ ● sdb T: 24°C
  │  └─ ● sdd T: 27°C
  └─ Parity: ● sdc T: 24°C
● /:
  ├─ ● nvme0n1 W: 13%
  └─ ● 7% 14.5GB/218.5GB

Implementation Benefits

  • Zero Configuration: No manual pool definitions required
  • Always Accurate: Reflects actual system state automatically
  • Scales Automatically: Handles any number of pools without config changes
  • Backwards Compatible: Single disks continue working unchanged
  • Future Ready: Easy extension for additional storage technologies

Current Status (v0.1.100)

Completed:

  • Auto-discovery system implemented and deployed
  • /proc/mounts parsing with smart heuristics for parity detection
  • Storage topology stored at agent startup for efficient monitoring
  • Universal zero-configuration for all hosts (cmbox, steambox, simonbox, srv01, srv02, srv03)
  • Enhanced pool health calculation (healthy/degraded/critical)
  • Hierarchical tree visualization with data/parity disk separation

🔄 In Progress - Complete Disk Collector Rewrite:

The current disk collector has grown complex with mixed legacy and auto-discovery approaches. A complete rewrite is planned with a clean, simple workflow supporting both physical drives and mergerfs pools.

New Clean Architecture:

Discovery Workflow:

  1. lsblk to detect all mount points and backing devices
  2. df to get filesystem usage for each mount point
  3. Group by physical drive (nvme0n1, sda, etc.)
  4. Parse /proc/mounts for mergerfs pools
  5. Generate unified metrics for both storage types
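
A rough sketch of steps 1-2, shelling out to lsblk and df and keeping only the fields the collector needs (column choices and parsing are simplified assumptions):

use std::process::Command;

// Step 1: list block devices together with their mount points.
fn list_mounted_devices() -> Vec<(String, String)> {
    let out = Command::new("lsblk")
        .args(["-rno", "NAME,MOUNTPOINT"])
        .output()
        .expect("failed to run lsblk");
    String::from_utf8_lossy(&out.stdout)
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let name = fields.next()?.to_string();
            let mount = fields.next()?.to_string(); // unmounted devices are skipped
            Some((name, mount))
        })
        .collect()
}

// Step 2: filesystem usage for one mount point via `df -Pk` (values in KiB).
fn filesystem_usage_kib(mount: &str) -> Option<(u64, u64)> {
    let out = Command::new("df").args(["-Pk", mount]).output().ok()?;
    let text = String::from_utf8_lossy(&out.stdout);
    let fields: Vec<&str> = text.lines().nth(1)?.split_whitespace().collect();
    let total: u64 = fields.get(1)?.parse().ok()?;
    let used: u64 = fields.get(2)?.parse().ok()?;
    Some((total, used))
}

Steps 3-5 would then group the resulting mount points by their backing device and merge in the mergerfs pools discovered from /proc/mounts.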

Physical Drive Display:

● nvme0n1:
  ├─ ● Drive: T: 35°C W: 1%
  ├─ ● Total: 23% 218.0GB/928.2GB
  ├─ ● /boot: 11% 0.1GB/1.0GB
  └─ ● /: 23% 214.9GB/928.2GB

MergerFS Pool Display:

● /srv/media (mergerfs):
  ├─ ● Pool: 63% 2355.2GB/3686.4GB
  ├─ Data Disks:
  │  ├─ ● sdb T: 24°C
  │  └─ ● sdd T: 27°C  
  └─ ● sdc T: 24°C (parity)

Implementation Benefits:

  • Pure auto-discovery: No configuration needed
  • Clean code paths: Single workflow for all storage types
  • Consistent display: Status icons on every line, no redundant text
  • Simple pipeline: lsblk → df → group → metrics
  • Support for both: Physical drives and mergerfs pools

Important Communication Guidelines

Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

Commit Message Guidelines

NEVER mention:

  • Claude or any AI assistant names
  • Automation or AI-generated content
  • Any reference to automated code generation

ALWAYS:

  • Focus purely on technical changes and their purpose
  • Use standard software development commit message format
  • Describe what was changed and why, not how it was created
  • Write from the perspective of a human developer

Examples:

Avoid:

  • "Generated with Claude Code"
  • "AI-assisted implementation"
  • "Automated refactoring"

Prefer:

  • "Implement maintenance mode for backup operations"
  • "Restructure storage widget with improved layout"
  • "Update CPU thresholds to production values"

Planned Architecture Migration

Phase 1: Structured Data Types (Shared Crate)

  • Create Rust structs matching target JSON structure
  • Replace Metric enum with typed data structures
  • Add serde serialization/deserialization

Phase 2: Agent Refactor

  • Update collectors to return typed structs instead of Vec<Metric>
  • Remove string metric name generation
  • Send structured JSON over ZMQ
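
A sketch of that send path, assuming the rust-zmq crate and a PUB socket; the endpoint and socket pattern are illustrative, not confirmed project choices:

use std::error::Error;

use serde::Serialize;

// Serialize the typed payload once and push the JSON bytes onto the socket.
fn publish_agent_data<T: Serialize>(socket: &zmq::Socket, data: &T) -> Result<(), Box<dyn Error>> {
    let json = serde_json::to_string(data)?;
    socket.send(json.as_bytes(), 0)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn Error>> {
    let ctx = zmq::Context::new();
    let socket = ctx.socket(zmq::PUB)?;
    socket.bind("tcp://0.0.0.0:5555")?; // illustrative endpoint
    // publish_agent_data(&socket, &agent_data)?;
    Ok(())
}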

Phase 3: Dashboard Refactor

  • Replace metric parsing logic with direct field access
  • Remove extract_pool_name(), extract_drive_name(), underscore counting
  • Widgets access data.system.storage.drives[0].temperature_celsius

Phase 4: Migration & Cleanup

  • Support both formats during transition
  • Gradual rollout with backward compatibility
  • Remove legacy string metric system

Implementation Rules

  1. Agent Status Authority: Agent calculates status for each metric using thresholds
  2. Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
  3. Status Aggregation: Dashboard aggregates individual metric statuses for widget status
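
A minimal sketch of rules 2-3 on the dashboard side; the metric names and Status variants are placeholders:

// A widget declares the metrics it subscribes to as a const array (example names)
// and derives its own status from the agent-supplied per-metric statuses
// instead of recalculating thresholds itself.
const CPU_WIDGET_METRICS: &[&str] = &["cpu_load_1min", "cpu_load_5min", "cpu_temperature"];

#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum Status {
    Ok,
    Warning,
    Critical,
}

fn widget_status(statuses: &[Status]) -> Status {
    // Aggregate: the worst individual metric status wins.
    statuses
        .iter()
        .copied()
        .fold(Status::Ok, |worst, s| if s > worst { s } else { worst })
}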

NEVER:

  • Copy/paste ANY code from legacy implementations
  • Calculate status in dashboard widgets
  • Hardcode metric names in widgets (use const arrays)
  • Create files unless absolutely necessary for achieving goals
  • Create documentation files unless explicitly requested

ALWAYS:

  • Prefer editing existing files to creating new ones
  • Follow existing code conventions and patterns
  • Use existing libraries and utilities
  • Follow security best practices