CM Dashboard - Infrastructure Monitoring TUI
Overview
A high-performance, Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built on ZMQ-based metric collection with a structured JSON data architecture.
Current Features
Core Functionality
- Real-time Monitoring: CPU, RAM, Storage, and Service status
- Service Management: Start/stop services with user-stopped tracking
- Multi-host Support: Monitor multiple servers from a single dashboard
- NixOS Integration: System rebuild via SSH + tmux popup
- Backup Monitoring: Borgbackup status and scheduling
User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard
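A minimal Rust sketch of how this tracking can work; the store type, file format, and Status variants are illustrative, not the agent's actual implementation:

use std::collections::HashSet;
use std::fs;
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum Status { Ok, Warning }

// Hypothetical persistent store: one service name per line on disk,
// so flags survive agent restarts.
struct UserStoppedStore {
    path: PathBuf,
    stopped: HashSet<String>,
}

impl UserStoppedStore {
    fn load(path: PathBuf) -> Self {
        let stopped = fs::read_to_string(&path)
            .map(|s| s.lines().map(str::to_owned).collect())
            .unwrap_or_default();
        Self { path, stopped }
    }

    fn persist(&self) {
        let lines: Vec<&str> = self.stopped.iter().map(String::as_str).collect();
        let _ = fs::write(&self.path, lines.join("\n"));
    }

    // UserStop command from the dashboard marks the service.
    fn mark_stopped(&mut self, service: &str) {
        self.stopped.insert(service.to_owned());
        self.persist();
    }

    // UserStart command clears the flag automatically.
    fn mark_started(&mut self, service: &str) {
        self.stopped.remove(service);
        self.persist();
    }

    // An intentionally stopped service reports Ok instead of Warning.
    fn evaluate(&self, service: &str, active: bool) -> Status {
        if active || self.stopped.contains(service) { Status::Ok } else { Status::Warning }
    }
}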
Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press L on any service to view custom log files via tail -f
- Configuration format in dashboard config:
[service_logs]
hostname1 = [
{ service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
{ service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
{ service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
Service Management
- Direct Control: Arrow keys (↑↓) or vim keys (j/k) navigate services
- Service Actions:
  - s: Start service (sends UserStart command)
  - S: Stop service (sends UserStop command)
  - J: Show service logs (journalctl in tmux popup)
  - L: Show custom log files (tail -f custom paths in tmux popup)
  - R: Rebuild current host
- Visual Status: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- Transitional Icons: Blue arrows during operations
Navigation
- Tab: Switch between hosts
- ↑↓ or j/k: Select services
- s: Start selected service (UserStart)
- S: Stop selected service (UserStop)
- J: Show service logs (journalctl)
- L: Show custom log files
- R: Rebuild current host
- B: Run backup on current host
- q: Quit dashboard
Core Architecture Principles
Structured Data Architecture (✅ IMPLEMENTED v0.1.131)
Complete migration from string-based metrics to structured JSON data. Eliminates all string parsing bugs and provides type-safe data access.
Previous (String Metrics):
- ❌ Agent sent individual metrics with string names like disk_nvme0n1_temperature
- ❌ Dashboard parsed metric names with underscore counting and string splitting
- ❌ Complex and error-prone metric filtering and extraction logic
Current (Structured Data):
{
"hostname": "cmbox",
"agent_version": "v0.1.131",
"timestamp": 1763926877,
"system": {
"cpu": {
"load_1min": 3.5,
"load_5min": 3.57,
"load_15min": 3.58,
"frequency_mhz": 1500,
"temperature_celsius": 45.2
},
"memory": {
"usage_percent": 25.0,
"total_gb": 23.3,
"used_gb": 5.9,
"swap_total_gb": 10.7,
"swap_used_gb": 0.99,
"tmpfs": [
{
"mount": "/tmp",
"usage_percent": 15.0,
"used_gb": 0.3,
"total_gb": 2.0
}
]
},
"storage": {
"drives": [
{
"name": "nvme0n1",
"health": "PASSED",
"temperature_celsius": 29.0,
"wear_percent": 1.0,
"filesystems": [
{
"mount": "/",
"usage_percent": 24.0,
"used_gb": 224.9,
"total_gb": 928.2
}
]
}
],
"pools": [
{
"name": "srv_media",
"mount": "/srv/media",
"type": "mergerfs",
"health": "healthy",
"usage_percent": 63.0,
"used_gb": 2355.2,
"total_gb": 3686.4,
"data_drives": [{ "name": "sdb", "temperature_celsius": 24.0 }],
"parity_drives": [{ "name": "sdc", "temperature_celsius": 24.0 }]
}
]
}
},
"services": [
{ "name": "sshd", "status": "active", "memory_mb": 4.5, "disk_gb": 0.0 }
],
"backup": {
"status": "completed",
"last_run": 1763920000,
"next_scheduled": 1764006400,
"total_size_gb": 150.5,
"repository_health": "ok"
}
}
- ✅ Agent sends structured JSON over ZMQ (no legacy support)
- ✅ Type-safe data access: data.system.storage.drives[0].temperature_celsius
- ✅ Complete metric coverage: CPU, memory, storage, services, backup
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
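For illustration, a trimmed serde sketch of the typed access this enables; the struct names and field subset are assumptions based on the JSON above:

use serde::Deserialize;

#[derive(Deserialize)]
struct AgentData {
    hostname: String,
    agent_version: String,
    system: System,
}

#[derive(Deserialize)]
struct System { storage: Storage }

#[derive(Deserialize)]
struct Storage { drives: Vec<Drive> }

#[derive(Deserialize)]
struct Drive {
    name: String,
    health: String,
    temperature_celsius: f64,
    wear_percent: f64,
}

// The ZMQ payload deserializes directly into typed structs:
// no underscore counting, no string splitting.
fn parse(payload: &str) -> serde_json::Result<AgentData> {
    serde_json::from_str(payload)
}

With this in place, data.system.storage.drives[0].temperature_celsius is a compile-time-checked field access rather than a parsed metric name.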
Maintenance Mode
- Agent checks for /tmp/cm-maintenance file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status, only notifications are blocked
Usage:
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
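In the agent, the gate can be as simple as the following sketch; the mail helper is hypothetical, only the flag path is prescribed:

use std::path::Path;

const MAINTENANCE_FLAG: &str = "/tmp/cm-maintenance";

fn maintenance_active() -> bool {
    Path::new(MAINTENANCE_FLAG).exists()
}

fn send_alert(subject: &str, body: &str) {
    // Monitoring and status display continue; only outbound email is blocked.
    if maintenance_active() {
        return;
    }
    // deliver_email(subject, body); // hypothetical mail transport
    println!("ALERT {subject}: {body}");
}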
Development and Deployment Architecture
Development Path
- Location: ~/projects/cm-dashboard
- Purpose: Development workflow only - for committing new code
- Access: Only for developers to commit changes
Deployment Path
- Location: /var/lib/cm-dashboard/nixos-config
- Purpose: Production deployment only - agent clones/pulls from git
- Workflow: git pull → /var/lib/cm-dashboard/nixos-config → nixos-rebuild
Git Flow
Development: ~/projects/cm-dashboard → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
Automated Binary Release System
CM Dashboard uses automated binary releases instead of source builds.
Creating New Releases
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
This automatically:
- Builds static binaries with RUSTFLAGS="-C target-feature=+crt-static"
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
NixOS Configuration Updates
Edit ~/projects/nixosbox/hosts/services/cm-dashboard.nix:
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
Get Release Hash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
Building
Testing & Building:
- Workspace builds: nix-shell -p openssl pkg-config --run "cargo build --workspace"
- Clean compilation: Remove target/ between major changes
Enhanced Storage Pool Visualization
Auto-Discovery Architecture
The dashboard uses automatic storage discovery to eliminate manual configuration complexity while providing intelligent storage pool grouping.
Discovery Process
At Agent Startup:
- Parse /proc/mounts to identify all mounted filesystems
- Detect MergerFS pools by analyzing fuse.mergerfs mount sources
- Identify member disks and potential parity relationships via heuristics
- Store discovered storage topology for continuous monitoring
- Generate pool-aware metrics with hierarchical relationships
Continuous Monitoring:
- Use stored discovery data for efficient metric collection
- Monitor individual drives for SMART data, temperature, wear
- Calculate pool-level health based on member drive status
- Generate enhanced metrics for dashboard visualization
Supported Storage Types
Single Disks:
- ext4, xfs, btrfs mounted directly
- Individual drive monitoring with SMART data
- Traditional single-disk display for root, boot, etc.
MergerFS Pools:
- Auto-detect from /proc/mounts fuse.mergerfs entries
- Parse source paths to identify member disks (e.g., "/mnt/disk1:/mnt/disk2")
- Heuristic parity disk detection (sequential device names, "parity" in path)
- Pool health calculation (healthy/degraded/critical)
- Hierarchical tree display with data/parity disk grouping
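A sketch of that detection pass, assuming the standard layout of /proc/mounts lines; the pool type and the parity heuristic shown are simplifications of the description above:

use std::fs;

#[derive(Debug)]
struct MergerPool {
    mount: String,
    data_disks: Vec<String>,
    parity_disks: Vec<String>,
}

fn discover_mergerfs_pools() -> std::io::Result<Vec<MergerPool>> {
    let mounts = fs::read_to_string("/proc/mounts")?;
    let mut pools = Vec::new();
    for line in mounts.lines() {
        // /proc/mounts fields: source mountpoint fstype options dump pass
        let fields: Vec<&str> = line.split_whitespace().collect();
        if fields.len() < 3 || fields[2] != "fuse.mergerfs" {
            continue;
        }
        // Source lists member branches, e.g. "/mnt/disk1:/mnt/disk2:/mnt/parity1"
        let (mut data, mut parity) = (Vec::new(), Vec::new());
        for branch in fields[0].split(':') {
            // Heuristic: a branch path containing "parity" is a parity disk.
            if branch.contains("parity") {
                parity.push(branch.to_owned());
            } else {
                data.push(branch.to_owned());
            }
        }
        pools.push(MergerPool {
            mount: fields[1].to_owned(),
            data_disks: data,
            parity_disks: parity,
        });
    }
    Ok(pools)
}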
Future Extensions Ready:
- RAID arrays via /proc/mdstat parsing
- ZFS pools via zpool status integration
- LVM logical volumes via lvs discovery
Configuration
[collectors.disk]
enabled = true
auto_discover = true # Default: true
# Optional exclusions for special filesystems
exclude_mount_points = ["/tmp", "/proc", "/sys", "/dev"]
exclude_fs_types = ["tmpfs", "devtmpfs", "sysfs", "proc"]
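Applied during discovery, the exclusions above amount to a filter like this sketch (the function name is illustrative):

fn keep_mount(mount_point: &str, fs_type: &str) -> bool {
    const EXCLUDE_MOUNT_POINTS: &[&str] = &["/tmp", "/proc", "/sys", "/dev"];
    const EXCLUDE_FS_TYPES: &[&str] = &["tmpfs", "devtmpfs", "sysfs", "proc"];
    // Drop an excluded mount point itself or anything beneath it.
    let excluded_mount = EXCLUDE_MOUNT_POINTS.iter()
        .any(|m| mount_point == *m || mount_point.starts_with(&format!("{m}/")));
    !excluded_mount && !EXCLUDE_FS_TYPES.contains(&fs_type)
}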
Display Format
CPU:
● Load: 0.23 0.21 0.13
└─ Freq: 1048 MHz
RAM:
● Usage: 25% 5.8GB/23.3GB
├─ ● /tmp: 2% 0.5GB/2GB
└─ ● /var/tmp: 0% 0GB/1.0GB
Storage:
● mergerfs (2+1):
├─ Total: ● 63% 2355.2GB/3686.4GB
├─ Data Disks:
│ ├─ ● sdb T: 24°C W: 5%
│ └─ ● sdd T: 27°C W: 5%
├─ Parity: ● sdc T: 24°C W: 5%
└─ Mount: /srv/media
● nvme0n1 T: 25°C W: 4%
├─ ● /: 55% 250.5GB/456.4GB
└─ ● /boot: 26% 0.3GB/1.0GB
Important Communication Guidelines
Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
Commit Message Guidelines
NEVER mention:
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
ALWAYS:
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
Examples:
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
Completed Architecture Migration (v0.1.131)
Complete Fix Plan (v0.1.140)
🎯 Goal: Fix ALL Issues - Display AND Core Functionality
Current Broken State (v0.1.139)
❌ What's Broken:
✅ Data Collection: Agent collects structured data correctly
❌ Storage Display: Shows wrong mount points, missing temperature/wear
❌ Status Evaluation: Everything shows "OK" regardless of actual values
❌ Notifications: Not working - can't send alerts when systems fail
❌ Thresholds: Not being evaluated (CPU load, memory usage, disk temperature)
Root Cause: During the atomic migration, core monitoring functionality was removed and only data collection was fixed, making the dashboard useless as a monitoring tool.
Complete Fix Plan - Do Everything Right
Phase 1: Fix Storage Display (CURRENT)
- ✅ Use lsblk instead of findmnt (eliminates /nix/store bind mount issue)
- ✅ Add sudo smartctl for permissions
- ✅ Fix NVMe SMART parsing (Temperature: and Percentage Used: fields)
- 🔄 Test that dashboard shows ● nvme0n1 T: 28°C W: 1% correctly
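The parsing fix targets the two NVMe fields named above. A sketch, assuming typical smartctl NVMe text output (exact spacing varies by smartctl version):

// Parses `sudo smartctl -a /dev/nvme0n1` text output.
fn parse_nvme_smart(output: &str) -> (Option<f64>, Option<f64>) {
    let mut temperature = None; // e.g. "Temperature:        28 Celsius"
    let mut wear = None;        // e.g. "Percentage Used:    1%"
    for line in output.lines() {
        if let Some(rest) = line.strip_prefix("Temperature:") {
            temperature = rest.split_whitespace().next().and_then(|v| v.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            wear = rest.trim().trim_end_matches('%').parse().ok();
        }
    }
    (temperature, wear)
}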
Phase 2: Restore Status Evaluation System
- CPU Status: Evaluate load averages against thresholds → Status::Warning/Critical
- Memory Status: Evaluate usage_percent against thresholds → Status::Warning/Critical
- Storage Status: Evaluate temperature & usage against thresholds → Status::Warning/Critical
- Service Status: Evaluate service states → Status::Warning if inactive
- Overall Host Status: Aggregate component statuses → host-level status
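The shape this restoration targets, as a sketch; the threshold numbers are placeholders, not the production values:

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status { Ok, Warning, Critical }

struct Thresholds { warning: f64, critical: f64 }

fn evaluate(value: f64, t: &Thresholds) -> Status {
    if value >= t.critical { Status::Critical }
    else if value >= t.warning { Status::Warning }
    else { Status::Ok }
}

// Host-level status is the worst component status.
fn aggregate(components: &[Status]) -> Status {
    components.iter().copied().max().unwrap_or(Status::Ok)
}

// e.g. evaluate(memory_usage_percent, &Thresholds { warning: 80.0, critical: 95.0 })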
Phase 3: Restore Notification System
- Status Change Detection: Track when component status changes from OK→Warning/Critical
- Email Notifications: Send alerts when status degrades
- Notification Rate Limiting: Prevent spam (existing logic)
- Maintenance Mode: Honor /tmp/cm-maintenance to suppress alerts
- Batched Notifications: Group multiple alerts into single email
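A sketch of the status-change detection with a simple rate limit; the detector type and the window are assumptions, since the existing logic is being restored rather than redesigned:

use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status { Ok, Warning, Critical }

struct ChangeDetector {
    last_status: HashMap<String, Status>,
    last_sent: HashMap<String, Instant>,
    min_interval: Duration, // e.g. Duration::from_secs(3600)
}

impl ChangeDetector {
    // Returns true when an email should go out for this component.
    fn should_notify(&mut self, component: &str, current: Status) -> bool {
        let previous = self
            .last_status
            .insert(component.to_owned(), current)
            .unwrap_or(Status::Ok);
        // Only alert on degradation (OK→Warning/Critical, Warning→Critical).
        let degraded = current > previous;
        let rate_ok = self
            .last_sent
            .get(component)
            .map_or(true, |sent| sent.elapsed() >= self.min_interval);
        if degraded && rate_ok {
            self.last_sent.insert(component.to_owned(), Instant::now());
            true
        } else {
            false
        }
    }
}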
Phase 4: Integration & Testing
- AgentData Status Fields: Add status fields to structured data
- Dashboard Status Display: Show colored indicators based on actual status
- End-to-End Testing: Verify alerts fire when thresholds exceeded
- Verify All Thresholds: CPU load, memory usage, disk temperature, service states
Target Architecture (CORRECT)
Complete Flow:
Collectors → AgentData → StatusEvaluator → Notifications
↘ ↗
ZMQ → Dashboard → Status Display
Key Components:
- Collectors: Populate AgentData with raw metrics
- StatusEvaluator: Apply thresholds to AgentData → Status enum values
- Notifications: Send emails on status changes (OK→Warning/Critical)
- Dashboard: Display data with correct status colors/indicators
Implementation Rules
MUST COMPLETE ALL:
- Fix storage display to show correct mount points and temperature
- Restore working status evaluation (thresholds → Status enum)
- Restore working notifications (email alerts on status changes)
- Test that monitoring actually works (alerts fire when appropriate)
NO SHORTCUTS:
- Don't commit partial fixes
- Don't claim functionality works when it doesn't
- Test every component thoroughly
- Keep existing configuration and thresholds working
Success Criteria:
- Dashboard shows ● nvme0n1 T: 28°C W: 1% format
- High CPU load triggers Warning status and email alert
- High memory usage triggers Warning status and email alert
- High disk temperature triggers Warning status and email alert
- Failed services trigger Warning status and email alert
- Maintenance mode suppresses notifications as expected
Implementation Rules
- Agent Status Authority: Agent calculates status for each metric using thresholds
- Dashboard Composition: Dashboard widgets subscribe to specific metrics by name
- Status Aggregation: Dashboard aggregates individual metric statuses for widget status
NEVER:
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested
ALWAYS:
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices