Compare commits

...

22 Commits

Author SHA1 Message Date
e8e50ef9bb Replace empty panels with offline host message for better UX
All checks were successful
Build and Release / build-and-release (push) Successful in 2m33s
- Hide all system/backup/service panels when host is offline
- Show centered wake-up message with host status
- Display "Press 'w' to wake up host" if MAC address configured
- Provide clear visual indication when hosts are unreachable
- Improve user experience by removing confusing empty panels

Version 0.1.64
2025-11-08 18:28:28 +01:00
0faed9309e Improve host disconnection detection and fix notification exclusions
All checks were successful
Build and Release / build-and-release (push) Successful in 1m34s
- Add dedicated heartbeat transmission every 5 seconds independent of metric collection
- Fix host offline detection by clearing metrics for disconnected hosts
- Move exclude_email_metrics to NotificationConfig for better organization
- Add cleanup_offline_hosts method to remove stale metrics after heartbeat timeout
- Ensure offline hosts show proper status icons and visual indicators

Version 0.1.63
2025-11-08 11:33:32 +01:00
c980346d05 Fix heartbeat detection to properly detect offline hosts
All checks were successful
Build and Release / build-and-release (push) Successful in 2m34s
- Add independent heartbeat checking timer (1 second interval) separate from metric reception
- Move get_connected_hosts() call outside metric receive condition to run periodically
- Remove duplicate update_hosts() call from metric processing to avoid redundancy
- Ensure offline host detection works even when no new metrics are received
- Fix issue where hosts going offline were never detected due to conditional heartbeat check
- Heartbeat timeouts now properly detected within configured timeout + 1 second
- Bump version to 0.1.62
2025-11-07 14:27:03 +01:00
3e3d3f0c2b Fix Tab key 1-second delay by reverting ZMQ to non-blocking mode
All checks were successful
Build and Release / build-and-release (push) Successful in 1m10s
- Change receive_metrics() from blocking to DONTWAIT to prevent main loop freezing
- Eliminate 1-second ZMQ socket timeout that was blocking UI after Tab key press
- Main loop now continues immediately after immediate render instead of waiting
- Maintain heartbeat-based host detection while fixing visual responsiveness
- Fix blocking operation introduced when implementing heartbeat timeout mechanism
- Tab navigation now truly immediate without any network operation delays
- Bump version to 0.1.61
2025-11-06 12:04:49 +01:00
9eb7444d56 Cache localhost hostname to eliminate Tab key sluggishness
All checks were successful
Build and Release / build-and-release (push) Successful in 2m10s
- Add cached localhost field to TuiApp struct to avoid repeated gethostname() system calls
- Initialize localhost once in constructor instead of calling gethostname() on every navigation
- Replace gethostname() calls in update_hosts() and navigate_host() with cached value
- Eliminate expensive system call bottleneck causing Tab key responsiveness issues
- Reduce Tab navigation from 2+ system calls to zero system calls (memory access only)
- Fix performance regression introduced by immediate UI refresh implementation
- Bump version to 0.1.60
2025-11-06 11:53:49 +01:00
278d1763aa Fix Tab key responsiveness with immediate UI refresh
All checks were successful
Build and Release / build-and-release (push) Successful in 2m10s
- Add immediate terminal.draw() call after input handling in main loop
- Eliminate delay between Tab key press and visual host switching
- Provide instant visual feedback for all navigation inputs
- Maintain existing metric update render cycle without duplication
- Fix UI update timing issue where changes only appeared on metric intervals
- Bump version to 0.1.59
2025-11-06 11:30:26 +01:00
f874264e13 Optimize dashboard performance for responsive Tab key navigation
All checks were successful
Build and Release / build-and-release (push) Successful in 1m32s
- Replace 6 separate filter operations with single-pass metric categorization in update_metrics
- Reduce CPU overhead from 6x to 1x work per metric update cycle
- Fix Tab key sluggishness caused by competing expensive filtering operations
- Maintain exact same functionality with significantly better performance
- Improve UI responsiveness for host switching and navigation
- Bump version to 0.1.58
2025-11-06 11:18:39 +01:00
5f6e47ece5 Implement heartbeat-based host connectivity detection
All checks were successful
Build and Release / build-and-release (push) Successful in 2m8s
- Add agent_heartbeat metric to agent transmission for reliable host detection
- Update dashboard to track heartbeat timestamps per host instead of general metrics
- Add configurable heartbeat_timeout_seconds to dashboard ZMQ config (default 10s)
- Remove unused timeout_ms from agent config and revert to non-blocking command reception
- Remove unused heartbeat_interval_ms from agent configuration
- Host disconnect detection now uses dedicated heartbeat metrics for improved reliability
- Bump version to 0.1.57
2025-11-06 11:04:01 +01:00
0e7cf24dbb Add exclude_email_metrics configuration option
All checks were successful
Build and Release / build-and-release (push) Successful in 2m34s
- Add exclude_email_metrics field to AgentConfig for filtering email notifications
- Metrics matching excluded names skip notification processing but still appear in dashboard
- Optional field with serde(default) for backward compatibility
- Bump version to 0.1.56
2025-11-06 10:31:25 +01:00
2d080a2f51 Implement WakeOnLAN functionality and offline status handling
All checks were successful
Build and Release / build-and-release (push) Successful in 1m35s
- Add WakeOnLAN support for offline hosts using 'w' key
- Configure MAC addresses for all infrastructure hosts
- Implement Status::Offline for disconnected hosts
- Exclude offline hosts from status aggregation to prevent false alerts
- Update versions to 0.1.55
2025-10-31 09:28:31 +01:00
6179bd51a7 Implement WakeOnLAN functionality with simplified configuration
All checks were successful
Build and Release / build-and-release (push) Successful in 2m32s
- Add Status::Offline enum variant for disconnected hosts
- All configured hosts now always visible showing offline status when disconnected
- Add WakeOnLAN support using wake-on-lan Rust crate
- Implement w key binding to wake offline hosts with MAC addresses
- Simplify configuration to single [hosts] section with MAC addresses only
- Change critical status icon from ◯ to ! for better visibility
- Add proper MAC address parsing and error handling
- Silent WakeOnLAN operation with logging for success/failure

Configuration format:
[hosts]
hostname = { mac_address = "AA:BB:CC:DD:EE:FF" }
2025-10-31 09:03:01 +01:00
57de4c366a Bump version to 0.1.53
All checks were successful
Build and Release / build-and-release (push) Successful in 2m10s
2025-10-30 17:00:39 +01:00
e18778e962 Fix string syntax error in rebuild command
- Replace raw string with escaped string to fix compilation error
- Maintain same functionality with proper string formatting
2025-10-30 16:59:41 +01:00
e4469a0ebf Replace tmux popups with split windows for better log navigation
Some checks failed
Build and Release / build-and-release (push) Failing after 1m9s
- Change J/L log commands from popups to split windows for scrolling support
- Change rebuild command from popup to split window with consistent 30% height
- Add auto-close behavior with bash -c "command; exit" wrapper for logs
- Add "press any key to close" prompt with visual separators for rebuild
- Enable proper tmux copy mode and navigation in all split windows

Users can now scroll through logs, copy text, and resize windows while
maintaining clean auto-close behavior for all operations.
2025-10-30 15:30:58 +01:00
6fedf4c7fc Add sudo support and line count to log viewing commands
All checks were successful
Build and Release / build-and-release (push) Successful in 1m12s
- Add sudo to journalctl command for proper systemd log access
- Add sudo to tail command for system log file access
- Add -n 50 to tail command to match journalctl behavior
- Both J and L keys now show last 50 lines before following

Ensures consistent behavior and proper permissions for all log viewing.
2025-10-30 13:26:04 +01:00
3f6dffa66e Add custom service log file support with L key
All checks were successful
Build and Release / build-and-release (push) Successful in 2m7s
- Add ServiceLogConfig structure for per-host service log paths
- Implement L key handler for custom log file viewing via tmux popup
- Update dashboard config to support service_logs HashMap
- Add tail -f command execution over SSH for real-time log streaming
- Update status line to show L: Custom shortcut
- Document configuration format in CLAUDE.md

Each service can now have custom log file paths configured per host,
accessible via L key with same tmux popup interface as journalctl.
2025-10-30 13:12:36 +01:00
1b64fbde3d Fix tmux popup title flag for service logs feature
All checks were successful
Build and Release / build-and-release (push) Successful in 1m47s
Fix journalctl popup that was failing with 'can't find session' error:

Issue Resolution:
- Change tmux display-popup flag from -t to -T for setting popup title
- -t flag was incorrectly trying to target a session named 'Logs: servicename'
- -T flag correctly sets the popup window title

The J key (Shift+j) service logs feature now works properly, opening
an 80% tmux popup with journalctl -f for real-time log viewing.

Bump version to v0.1.49
2025-10-30 12:42:58 +01:00
4f4c3b0d6e Improve notification behavior during startup and recovery
All checks were successful
Build and Release / build-and-release (push) Successful in 2m9s
Fix notification issues for better operational experience:

Startup Notification Suppression:
- Suppress notifications for transitions from Status::Unknown during agent/server startup
- Prevents notification spam when services transition from Unknown to Warning/Critical on restart
- Only real status changes (not initial discovery) trigger notifications
- Maintains alerting for actual service state changes after startup

Recovery Notification Refinement:
- Recovery notifications only sent when ALL services reach OK status
- Individual service recoveries suppressed if other services still have problems
- Ensures recovery notifications indicate complete system health restoration
- Prevents premature celebration when partial recoveries occur

Result: Clean startup experience without false alerts and meaningful recovery
notifications that truly indicate full system health restoration.

Bump version to v0.1.48
2025-10-30 12:35:23 +01:00
bd20f0cae1 Fix user-stopped flag timing and service transition handling
All checks were successful
Build and Release / build-and-release (push) Successful in 2m9s
Correct user-stopped service behavior during startup transitions:

User-Stopped Flag Timing Fix:
- Clear user-stopped flag only when service actually becomes active, not when start command succeeds
- Remove premature flag clearing from service control handler
- Add automatic flag clearing when service status metrics show active state
- Services retain user-stopped status during activating/transitioning states

Service Transition Handling:
- User-stopped services in activating state now report Status::OK instead of Status::Pending
- Prevents host warnings during legitimate service startup transitions
- Maintains accurate status reporting throughout service lifecycle
- Failed service starts preserve user-stopped flags correctly

Journalctl Popup Fix:
- Fix terminal corruption when using J key for service logs
- Correct command quoting to prevent tmux popup interference
- Stable popup display without dashboard interface corruption

Result: Clean service startup experience with no false warnings and proper
user-stopped tracking throughout the entire service lifecycle.

Bump version to v0.1.47
2025-10-30 12:05:54 +01:00
11c9a5f9d2 Add service logs feature and improve tmux popup sizing
All checks were successful
Build and Release / build-and-release (push) Successful in 2m9s
New Features:
- Add journalctl service logs viewer via Shift+J key
- Opens tmux popup with real-time log streaming using journalctl -f
- Shows last 50 lines and follows new log entries for selected service
- Popup titled 'Logs: service.service' for clear context

Improvements:
- Increase tmux popup size to 80% width and height for better readability
- Applies to both rebuild (R) and logs (J) popups
- Compact status line text to fit new J: Logs shortcut
- Updated documentation with new key binding

Navigation Updates:
- J: Show service logs (journalctl in tmux popup)
- Status line: Tab: Host • ↑↓/jk: Select • r: Rebuild • s/S: Start/Stop • J: Logs • q: Quit

Bump version to v0.1.46
2025-10-30 11:21:14 +01:00
aeae60146d Fix user-stopped service display and flag timing issues
All checks were successful
Build and Release / build-and-release (push) Successful in 2m10s
Improve user-stopped service tracking behavior:

Service Display Fix:
- Services widget now shows actual systemctl status (active/inactive)
- Use info.status instead of hardcoded text based on widget_status
- User-stopped services correctly display 'inactive' with green OK icon
- Prevents misleading 'active' display for stopped services

User-Stopped Flag Timing Fix:
- Clear user-stopped flag AFTER successful service start, not when command sent
- Prevents warnings during service startup transition period
- Service remains Status::OK during 'activating' state for user-stopped services
- Flag only cleared when systemctl start command actually succeeds
- Failed start attempts preserve user-stopped flag

Result: Clean service state tracking with accurate display and no false alerts
during intentional user operations.

Bump version to v0.1.45
2025-10-30 11:11:39 +01:00
a82c81e8e3 Fix service control by adding .service suffix to systemctl commands
All checks were successful
Build and Release / build-and-release (push) Successful in 2m8s
Service stop/start operations were failing because systemctl commands
were missing the .service suffix. This caused the new user-stopped
tracking feature to mark services but not actually control them.

Changes:
- Add .service suffix to systemctl commands in service control handler
- Matches pattern used throughout systemd collector
- Fixes service start/stop functionality via dashboard

Clean up legacy documentation:
- Remove outdated TODO.md, AGENTS.md, and test files
- Update CLAUDE.md with current architecture and rules only
- Comprehensive README.md rewrite with technical documentation
- Document user-stopped service tracking feature

Bump version to v0.1.44
2025-10-30 11:00:36 +01:00
26 changed files with 842 additions and 1061 deletions

View File

@@ -1,3 +0,0 @@
# Agent Guide
Agents working in this repo must follow the instructions in `CLAUDE.md`.

473
CLAUDE.md
View File

@@ -2,277 +2,76 @@
## Overview
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.
## Implementation Strategy
## Current Features
### Current Implementation Status
### Core Functionality
- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
- **Service Management**: Start/stop services with user-stopped tracking
- **Multi-host Support**: Monitor multiple servers from single dashboard
- **NixOS Integration**: System rebuild via SSH + tmux popup
- **Backup Monitoring**: Borgbackup status and scheduling
**System Panel Enhancement - COMPLETED**
### User-Stopped Service Tracking
- Services stopped via dashboard are marked as "user-stopped"
- User-stopped services report Status::OK instead of Warning
- Prevents false alerts during intentional maintenance
- Persistent storage survives agent restarts
- Automatic flag clearing when services are restarted via dashboard
All system panel features successfully implemented:
- **NixOS Collector**: Created collector for version and active users
- **System Widget**: Unified widget combining NixOS, CPU, RAM, and Storage
- **Build Display**: Shows NixOS build information without codename
-**Active Users**: Displays currently logged in users
-**Tmpfs Monitoring**: Added /tmp usage to RAM section
-**Agent Deployment**: NixOS collector working in production
**Simplified Navigation and Service Management - COMPLETED**
All navigation and service management features successfully implemented:
-**Direct Service Control**: Up/Down (or j/k) arrows directly control service selection
-**Always Visible Selection**: Service selection highlighting always visible (no panel focus needed)
-**Complete Service Discovery**: All configured services visible regardless of state
-**Transitional Visual Feedback**: Service operations show directional arrows (↑ ↓ ↻)
-**Simplified Interface**: Removed panel switching complexity, uniform appearance
-**Vi-style Navigation**: Added j/k keys for vim users alongside arrow keys
**Current Status - October 28, 2025:**
- All service discovery and display features working correctly ✅
- Simplified navigation system implemented ✅
- Service selection always visible with direct control ✅
- Complete service visibility (all configured services show regardless of state) ✅
- Transitional service icons working with proper color handling ✅
- Build display working: "Build: 25.05.20251004.3bcc93c" ✅
- Agent version display working: "Agent: v0.1.33" ✅
- Cross-host version comparison implemented ✅
- Automated binary release system working ✅
- SMART data consolidated into disk collector ✅
**RESOLVED - Remote Rebuild Functionality:**
-**System Rebuild**: Now uses simple SSH + tmux popup approach
-**Process Isolation**: Rebuild runs independently via SSH, survives agent/dashboard restarts
-**Configuration**: SSH user and rebuild alias configurable in dashboard config
-**Service Control**: Works correctly for start/stop/restart of services
**Solution Implemented:**
- Replaced complex SystemRebuild command infrastructure with direct tmux popup
- Uses `tmux display-popup "ssh -tt {user}@{hostname} 'bash -ic {alias}'"`
- Configurable SSH user and rebuild alias in dashboard config
- Eliminates all agent crashes during rebuilds
- Simple, reliable, and follows standard tmux interface patterns
**Current Layout:**
```
NixOS:
Build: 25.05.20251004.3bcc93c
Agent: v0.1.17 # Shows agent version from Cargo.toml
Active users: cm, simon
CPU:
● Load: 0.02 0.31 0.86 • 3000MHz
RAM:
● Usage: 33% 2.6GB/7.6GB
● /tmp: 0% 0B/2.0GB
Storage:
● root (Single):
├─ ● nvme0n1 W: 1%
└─ ● 18% 167.4GB/928.2GB
### Custom Service Logs
- Configure service-specific log file paths per host in dashboard config
- Press `L` on any service to view custom log files via `tail -f`
- Configuration format in dashboard config:
```toml
[service_logs]
hostname1 = [
{ service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
{ service_name = "app", log_file_path = "/var/log/myapp/app.log" }
]
hostname2 = [
{ service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
]
```
**System panel layout fully implemented with blue tree symbols ✅**
**Tree symbols now use consistent blue theming across all panels ✅**
**Overflow handling restored for all widgets ("... and X more") ✅**
**Agent version display working correctly ✅**
**Cross-host version comparison logging warnings ✅**
**Backup panel visibility fixed - only shows when meaningful data exists ✅**
**SSH-based rebuild system fully implemented and working ✅**
### Service Management
- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
- **Service Actions**:
- `s` - Start service (sends UserStart command)
- `S` - Stop service (sends UserStop command)
- `J` - Show service logs (journalctl in tmux popup)
- `L` - Show custom log files (tail -f custom paths in tmux popup)
- `R` - Rebuild current host
- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
- **Transitional Icons**: Blue arrows during operations
### Current Simplified Navigation Implementation
**Navigation Controls:**
- **Tab**: Switch between hosts (cmbox, srv01, srv02, steambox, etc.)
- **↑↓ or j/k**: Move service selection cursor (always works)
### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Select services
- **J**: Show service logs (journalctl)
- **L**: Show custom log files
- **q**: Quit dashboard
**Service Control:**
- **s**: Start selected service
- **S**: Stop selected service
- **R**: Rebuild current host (works from any context)
**Visual Features:**
- **Service Selection**: Always visible blue background highlighting current service
- **Status Icons**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed), ? (unknown)
- **Transitional Icons**: Blue ↑ (starting), ↓ (stopping), ↻ (restarting) when not selected
- **Transitional Icons**: Dark gray arrows when service is selected (for visibility)
- **Uniform Interface**: All panels have consistent appearance (no focus borders)
### Service Discovery and Display - WORKING ✅
**All Issues Resolved (as of 2025-10-28):**
-**Complete Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all` for comprehensive service detection
-**All Services Visible**: Shows all configured services regardless of current state (active/inactive)
-**Proper Status Display**: Active services show green ●, inactive show yellow ◐, failed show red ◯
-**Transitional Icons**: Visual feedback during service operations with proper color handling
-**Simplified Navigation**: Removed panel complexity, direct service control always available
-**Service Control**: Start (s) and Stop (S) commands work from anywhere
-**System Rebuild**: SSH + tmux popup approach for reliable remote rebuilds
### Terminal Popup for Real-time Output - IMPLEMENTED ✅
**Status (as of 2025-10-26):**
-**Terminal Popup UI**: 80% screen coverage with terminal styling and color-coded output
-**ZMQ Streaming Protocol**: CommandOutputMessage for real-time output transmission
-**Keyboard Controls**: ESC/Q to close, ↑↓ to scroll, manual close (no auto-close)
-**Real-time Display**: Live streaming of command output as it happens
-**Version-based Agent Reporting**: Shows "Agent: v0.1.13" instead of nix store hash
**Current Implementation Issues:**
-**Agent Process Crashes**: Agent dies during nixos-rebuild execution
-**Inconsistent Output**: Different outputs each time 'R' is pressed
-**Limited Output Visibility**: Not capturing all nixos-rebuild progress
**PLANNED SOLUTION - Systemd Service Approach:**
**Problem**: Direct nixos-rebuild execution in agent causes process crashes and inconsistent output.
**Solution**: Create dedicated systemd service for rebuild operations.
**Implementation Plan:**
1. **NixOS Systemd Service**:
```nix
systemd.services.cm-rebuild = {
description = "CM Dashboard NixOS Rebuild";
serviceConfig = {
Type = "oneshot";
ExecStart = "${pkgs.nixos-rebuild}/bin/nixos-rebuild switch --flake . --option sandbox false";
WorkingDirectory = "/var/lib/cm-dashboard/nixos-config";
User = "root";
StandardOutput = "journal";
StandardError = "journal";
};
};
```
2. **Agent Modification**:
- Replace direct nixos-rebuild execution with: `systemctl start cm-rebuild`
- Stream output via: `journalctl -u cm-rebuild -f --no-pager`
- Monitor service status for completion detection
3. **Benefits**:
- **Process Isolation**: Service runs independently, won't crash agent
- **Consistent Output**: Always same deterministic rebuild process
- **Proper Logging**: systemd journal handles all output management
- **Resource Management**: systemd manages cleanup and resource limits
- **Status Tracking**: Can query service status (running/failed/success)
**Next Priority**: Implement systemd service approach for reliable rebuild operations.
**Keyboard Controls Status:**
- **Services Panel**:
- R (restart) ✅ Working
- s (start) ✅ Working
- S (stop) ✅ Working
- **System Panel**: R (nixos-rebuild) ✅ Working with --option sandbox false
- **Backup Panel**: B (trigger backup) ❓ Not implemented
**Visual Feedback Implementation - IN PROGRESS:**
Context-appropriate progress indicators for each panel:
**Services Panel** (Service status transitions):
```
● nginx active → ⏳ nginx restarting → ● nginx active
● docker active → ⏳ docker stopping → ● docker inactive
```
**System Panel** (Build progress in NixOS section):
```
NixOS:
Build: 25.05.20251004.3bcc93c → Build: [████████████ ] 65%
Active users: cm, simon Active users: cm, simon
```
**Backup Panel** (OnGoing status with progress):
```
Latest backup: → Latest backup:
● 2024-10-23 14:32:15 ● OnGoing
└─ Duration: 1.3m └─ [██████ ] 60%
```
**Critical Configuration Hash Fix - HIGH PRIORITY:**
**Problem:** Configuration hash currently shows git commit hash instead of actual deployed system hash.
**Current (incorrect):**
- Shows git hash: `db11f82` (source repository commit)
- Not accurate - doesn't reflect what's actually deployed
**Target (correct):**
- Show nix store hash: `d8ivwiar` (first 8 chars from deployed system)
- Source: `/nix/store/d8ivwiarhwhgqzskj6q2482r58z46qjf-nixos-system-cmbox-25.05.20251004.3bcc93c`
- Pattern: Extract hash from `/nix/store/HASH-nixos-system-HOSTNAME-VERSION`
**Benefits:**
1. **Deployment Verification:** Confirms rebuild actually succeeded
2. **Accurate Status:** Shows what's truly running, not just source
3. **Rebuild Completion Detection:** Hash change = rebuild completed
4. **Rollback Tracking:** Each deployment has unique identifier
**Implementation Required:**
1. Agent extracts nix store hash from `ls -la /run/current-system`
2. Reports this as `system_config_hash` metric instead of git hash
3. Dashboard displays first 8 characters: `Config: d8ivwiar`
**Next Session Priority Tasks:**
**Remaining Features:**
1. **Fix Configuration Hash Display (CRITICAL)**:
- Use nix store hash instead of git commit hash
- Extract from `/run/current-system` -> `/nix/store/HASH-nixos-system-*`
- Enables proper rebuild completion detection
2. **Command Response Protocol**:
- Agent sends command completion/failure back to dashboard via ZMQ
- Dashboard updates UI status from ⏳ to ● when commands complete
- Clear success/failure status after timeout
3. **Backup Panel Features**:
- Implement backup trigger functionality (B key)
- Complete visual feedback for backup operations
- Add backup progress indicators
**Enhancement Tasks:**
- Add confirmation dialogs for destructive actions (stop/restart/rebuild)
- Implement command history/logging
- Add keyboard shortcuts help overlay
**Future Enhanced Navigation:**
- Add Page Up/Down for faster scrolling through long service lists
- Implement search/filter functionality for services
- Add jump-to-service shortcuts (first letter navigation)
**Future Advanced Features:**
- Service dependency visualization
- Historical service status tracking
- Real-time log viewing integration
## Core Architecture Principles - CRITICAL
## Core Architecture Principles
### Individual Metrics Philosophy
**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.
- Agent collects individual metrics, dashboard composes widgets
- Each metric collected, transmitted, and stored individually
- Agent calculates status for each metric using thresholds
- Dashboard aggregates individual metric statuses for widget status
### Maintenance Mode
**Purpose:**
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped
**Implementation:**
- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while continuing monitoring
- Dashboard continues to show real status, only notifications are blocked
**Usage:**
Usage:
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance
# Run maintenance tasks (backups, service restarts, etc.)
# Run maintenance tasks
systemctl stop service
# ... maintenance work ...
systemctl start service
@@ -281,61 +80,84 @@ systemctl start service
rm /tmp/cm-maintenance
```
**NixOS Integration:**
## Development and Deployment Architecture
- Borgbackup script automatically creates/removes maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick
- All cinfiguration are shall be done from nixos config
### Development Path
- **Location:** `~/projects/cm-dashboard`
- **Purpose:** Development workflow only - for committing new code
- **Access:** Only for developers to commit changes
**ARCHITECTURE ENFORCEMENT**:
### Deployment Path
- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild
- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
- **Individual metrics only** - NO grouped metric structures
- **Reference-only legacy** - Study old functionality, implement new architecture
- **Clean slate mindset** - Build as if legacy codebase never existed
### Git Flow
```
Development: ~/projects/cm-dashboard → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```
**Implementation Rules**:
## Automated Binary Release System
1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
**Testing & Building**:
CM Dashboard uses automated binary releases instead of source builds.
- **Workspace builds**: `cargo build --workspace` for all testing
- **Clean compilation**: Remove `target/` between architecture changes
- **ZMQ testing**: Test agent-dashboard communication independently
- **Widget testing**: Verify UI layout matches legacy appearance exactly
### Creating New Releases
```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
**NEVER in New Implementation**:
This automatically:
- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
- Copy/paste ANY code from legacy backup
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
### NixOS Configuration Updates
Edit `~/projects/nixosbox/hosts/common/cm-dashboard.nix`:
# Important Communication Guidelines
```nix
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
```
NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
### Get Release Hash
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.
### Building
**Testing & Building:**
- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
- **Clean compilation**: Remove `target/` between major changes
## Important Communication Guidelines
Keep responses concise and focused. Avoid extensive implementation summaries unless requested.
## Commit Message Guidelines
**NEVER mention:**
- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation
**ALWAYS:**
- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer
**Examples:**
- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
@@ -343,83 +165,22 @@ NEVER implement code without first getting explicit user agreement on the approa
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"
## Development and Deployment Architecture
## Implementation Rules
**CRITICAL:** Development and deployment paths are completely separate:
1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
### Development Path
- **Location:** `~/projects/nixosbox`
- **Purpose:** Development workflow only - for committing new cm-dashboard code
- **Access:** Only for developers to commit changes
- **Code Access:** Running cm-dashboard code shall NEVER access this path
**NEVER:**
- Copy/paste ANY code from legacy implementations
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
- Create files unless absolutely necessary for achieving goals
- Create documentation files unless explicitly requested
### Deployment Path
- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Access:** Only cm-dashboard agent for deployment operations
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild
### Git Flow
```
Development: ~/projects/nixosbox → git commit → git push
Deployment: git pull → /var/lib/cm-dashboard/nixos-config → rebuild
```
## Automated Binary Release System
**IMPLEMENTED:** cm-dashboard now uses automated binary releases instead of source builds.
### Release Workflow
1. **Automated Release Creation**
- Gitea Actions workflow builds static binaries on tag push
- Creates release with `cm-dashboard-linux-x86_64.tar.gz` tarball
- No manual intervention required for binary generation
2. **Creating New Releases**
```bash
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
This automatically:
- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
- Creates GitHub-style release with tarball
- Uploads binaries via Gitea API
3. **NixOS Configuration Updates**
Edit `~/projects/nixosbox/hosts/common/cm-dashboard.nix`:
```nix
version = "v0.1.X";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-NEW_HASH_HERE";
};
```
4. **Get Release Hash**
```bash
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
5. **Commit and Deploy**
```bash
cd ~/projects/nixosbox
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard to v0.1.X with static binaries"
git push
```
### Benefits
- **No compilation overhead** on each host
- **Consistent static binaries** across all hosts
- **Faster deployments** - download vs compile
- **No library dependency issues** - static linking
- **Automated pipeline** - tag push triggers everything
**ALWAYS:**
- Prefer editing existing files to creating new ones
- Follow existing code conventions and patterns
- Use existing libraries and utilities
- Follow security best practices

13
Cargo.lock generated
View File

@@ -270,7 +270,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"
[[package]]
name = "cm-dashboard"
version = "0.1.43"
version = "0.1.63"
dependencies = [
"anyhow",
"chrono",
@@ -286,12 +286,13 @@ dependencies = [
"toml",
"tracing",
"tracing-subscriber",
"wake-on-lan",
"zmq",
]
[[package]]
name = "cm-dashboard-agent"
version = "0.1.43"
version = "0.1.63"
dependencies = [
"anyhow",
"async-trait",
@@ -314,7 +315,7 @@ dependencies = [
[[package]]
name = "cm-dashboard-shared"
version = "0.1.43"
version = "0.1.63"
dependencies = [
"chrono",
"serde",
@@ -2064,6 +2065,12 @@ version = "0.9.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"
[[package]]
name = "wake-on-lan"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1ccf60b60ad7e5b1b37372c5134cbcab4db0706c231d212e0c643a077462bc8f"
[[package]]
name = "walkdir"
version = "2.5.0"

513
README.md
View File

@@ -1,88 +1,106 @@
# CM Dashboard
A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.
A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.
## Current Implementation
## Features
This is a complete rewrite implementing an **individual metrics architecture** where:
### Core Monitoring
- **Real-time metrics**: CPU, RAM, Storage, and Service status
- **Multi-host support**: Monitor multiple servers from single dashboard
- **Service management**: Start/stop services with intelligent status tracking
- **NixOS integration**: System rebuild via SSH + tmux popup
- **Backup monitoring**: Borgbackup status and scheduling
- **Email notifications**: Intelligent batching prevents spam
- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart
### User-Stopped Service Tracking
Services stopped via the dashboard are intelligently tracked to prevent false alerts:
## Dashboard Interface
- **Smart status reporting**: User-stopped services show as Status::OK instead of Warning
- **Persistent storage**: Tracking survives agent restarts via JSON storage
- **Automatic management**: Flags cleared when services restarted via dashboard
- **Maintenance friendly**: No false alerts during intentional service operations
## Architecture
### Individual Metrics Philosophy
- **Agent**: Collects individual metrics, calculates status using thresholds
- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
- **ZMQ Communication**: Efficient real-time metric transmission
- **Status Aggregation**: Host-level status calculated from all service metrics
### Components
```
┌─────────────────┐ ZMQ ┌─────────────────┐
│ │◄──────────►│ │
│ Agent │ Metrics │ Dashboard │
│ - Collectors │ │ - TUI │
│ - Status │ │ - Widgets │
│ - Tracking │ │ - Commands │
│ │ │ │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ JSON Storage │ │ SSH + tmux │
│ - User-stopped │ │ - Remote rebuild│
│ - Cache │ │ - Process │
│ - State │ │ isolation │
└─────────────────┘ └─────────────────┘
```
### Service Control Flow
1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
2. **Agent Processing**:
- Marks service as user-stopped (if stopping)
- Executes `systemctl start/stop service`
- Syncs state to global tracker
3. **Status Calculation**:
- Systemd collector checks user-stopped flag
- Reports Status::OK for user-stopped inactive services
- Normal Warning status for system failures
## Interface
```
cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
CPU: ││Service: Status: RAM: Disk: │
● Load: 0.10 0.52 0.88 • 400.0 MHz ││● docker active 27M 496MB │
RAM: ││● docker-registry active 19M 496MB │
● Used: 30% 2.3GB/7.6GB ││● gitea active 579M 2.6GB
● tmp: 0.0% 0B/2.0GB ││● gitea-runner-default active 11M 2.6GB
Disk nvme0n1: ││● haasp-core active 9M 1MB
● Health: PASSED ││● haasp-mqtt active 3M 1MB
│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid active 10M 1MB
│● Usage @boot: 5.9% • 0.1/1.0 GB ││● immich-server active 240M 45.1GB
││● mosquitto active 1M 1MB
││● mysql active 38M 225MB
││● nginx active 28M 24MB
││ ├─ ● gitea.cmtec.se 51ms
│ ││ ├─ ● haasp.cmtec.se 43ms │
│ ││ ├─ ● haasp.net 43ms │
│ ││ ├─ ● pages.cmtec.se 45ms │
└────────────────────────────────────┘│ ├─ ● photos.cmtec.se 41ms │
┌backup──────────────────────────────┐│ ├─ ● unifi.cmtec.se 46ms │
│Latest backup: ││ ├─ ● vault.cmtec.se 47ms │
│● Status: OK ││ ├─ ● www.kryddorten.se 81ms │
│Duration: 54s • Last: 4h ago ││ ├─ ● www.mariehall2.se 86ms │
│Disk usage: 48.2GB/915.8GB ││● postgresql active 112M 357MB │
│P/N: Samsung SSD 870 QVO 1TB ││● redis-immich active 8M 45.1GB │
│S/N: S5RRNF0W800639Y ││● sshd active 2M 0 │
│● gitea 2 archives 2.7GB ││● unifi active 594M 495MB │
│● immich 2 archives 45.0GB ││● vaultwarden active 12M 1MB │
│● kryddorten 2 archives 67.6MB ││ │
│● mariehall2 2 archives 321.8MB ││ │
│● nixosbox 2 archives 4.5MB ││ │
│● unifi 2 archives 2.9MB ││ │
│● vaultwarden 2 archives 305kB ││ │
NixOS: ││Service: Status: RAM: Disk: │
Build: 25.05.20251004.3bcc93c ││● docker active 27M 496MB │
Agent: v0.1.43 ││● gitea active 579M 2.6GB │
Active users: cm, simon ││● nginx active 28M 24MB
CPU: ││ ├─ ● gitea.cmtec.se 51ms
● Load: 0.10 0.52 0.88 • 3000MHz ││ ├─ ● photos.cmtec.se 41ms
RAM: ││● postgresql active 112M 357MB
│● Usage: 33% 2.6GB/7.6GB ││● redis-immich user-stopped
│● /tmp: 0% 0B/2.0GB ││● sshd active 2M 0
Storage: ││● unifi active 594M 495MB
● root (Single): ││
├─ ● nvme0n1 W: 1% ││
└─ ● 18% 167.4GB/928.2GB ││
└────────────────────────────────────┘└─────────────────────────────────────────────────┘
```
**Navigation**: `←→` switch hosts, `r` refresh, `q` quit
### Navigation
- **Tab**: Switch between hosts
- **↑↓ or j/k**: Navigate services
- **s**: Start selected service (UserStart)
- **S**: Stop selected service (UserStop)
- **J**: Show service logs (journalctl in tmux popup)
- **R**: Rebuild current host
- **q**: Quit
## Features
- **Real-time monitoring** - Dashboard updates every 1-2 seconds
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
- **ZMQ communication** - Efficient agent-to-dashboard messaging
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators
## Architecture
### Core Components
- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts
### Status Levels
- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined
### Status Indicators
- **Green ●**: Active service
- **Yellow ◐**: Inactive service (system issue)
- **Red ◯**: Failed service
- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
- **"user-stopped"**: Service stopped via dashboard (Status::OK)
## Quick Start
### Build
### Building
```bash
# With Nix (recommended)
@@ -93,21 +111,20 @@ sudo apt install libssl-dev pkg-config # Ubuntu/Debian
cargo build --workspace
```
### Run
### Running
```bash
# Start agent (requires configuration file)
# Start agent (requires configuration)
./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml
# Start dashboard
./target/debug/cm-dashboard --config /path/to/dashboard.toml
# Start dashboard (inside tmux session)
tmux
./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
```
## Configuration
### Agent Configuration (`agent.toml`)
The agent requires a comprehensive TOML configuration file:
### Agent Configuration
```toml
collection_interval_seconds = 2
@@ -116,50 +133,27 @@ collection_interval_seconds = 2
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000
heartbeat_interval_ms = 30000
transmission_interval_seconds = 2
[collectors.cpu]
enabled = true
interval_seconds = 2
load_warning_threshold = 9.0
load_warning_threshold = 5.0
load_critical_threshold = 10.0
temperature_warning_threshold = 100.0
temperature_critical_threshold = 110.0
[collectors.memory]
enabled = true
interval_seconds = 2
usage_warning_percent = 80.0
usage_critical_percent = 95.0
[collectors.disk]
enabled = true
interval_seconds = 300
usage_warning_percent = 80.0
usage_critical_percent = 90.0
[[collectors.disk.filesystems]]
name = "root"
uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
mount_point = "/"
fs_type = "ext4"
monitor = true
[collectors.systemd]
enabled = true
interval_seconds = 10
memory_warning_mb = 1000.0
memory_critical_mb = 2000.0
service_name_filters = [
"nginx*", "postgresql*", "redis*", "docker*", "sshd*",
"gitea*", "immich*", "haasp*", "mosquitto*", "mysql*",
"unifi*", "vaultwarden*"
]
excluded_services = [
"nginx-config-reload", "sshd-keygen", "systemd-",
"getty@", "user@", "dbus-", "NetworkManager-"
]
service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
nginx_latency_critical_ms = 1000.0
http_timeout_seconds = 10
[notifications]
enabled = true
@@ -167,251 +161,202 @@ smtp_host = "localhost"
smtp_port = 25
from_email = "{hostname}@example.com"
to_email = "admin@example.com"
rate_limit_minutes = 0
trigger_on_warnings = true
trigger_on_failures = true
recovery_requires_all_ok = true
suppress_individual_recoveries = true
[status_aggregation]
enabled = true
aggregation_method = "worst_case"
notification_interval_seconds = 30
[cache]
persist_path = "/var/lib/cm-dashboard/cache.json"
aggregation_interval_seconds = 30
```
### Dashboard Configuration (`dashboard.toml`)
### Dashboard Configuration
```toml
[zmq]
hosts = [
{ name = "server1", address = "192.168.1.100", port = 6130 },
{ name = "server2", address = "192.168.1.101", port = 6130 }
]
connection_timeout_ms = 5000
reconnect_interval_ms = 10000
subscriber_ports = [6130]
[hosts]
predefined_hosts = ["cmbox", "srv01", "srv02"]
[ui]
refresh_interval_ms = 1000
theme = "dark"
ssh_user = "cm"
rebuild_alias = "nixos-rebuild-cmtec"
```
## Collectors
## Technical Implementation
The agent implements several specialized collectors:
### Collectors
### CPU Collector (`cpu.rs`)
#### Systemd Collector
- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
- **Status Calculation**: Checks user-stopped flag before assigning Warning status
- **Memory Tracking**: Per-service memory usage via `systemctl show`
- **Sub-services**: Nginx site latency, Docker containers
- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`
- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds
#### User-Stopped Service Tracker
- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
- **Thread Safety**: Global singleton with `Arc<Mutex<>>`
- **Persistence**: Automatic save on state changes
- **Global Access**: Static methods for collector integration
### Memory Collector (`memory.rs`)
#### Other Collectors
- **CPU**: Load average, temperature, frequency monitoring
- **Memory**: RAM/swap usage, tmpfs monitoring
- **Disk**: Filesystem usage, SMART health data
- **NixOS**: Build version, active users, agent version
- **Backup**: Borgbackup repository status and metrics
- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection
### ZMQ Protocol
### Disk Collector (`disk.rs`)
```rust
// Metric Message
#[derive(Serialize, Deserialize)]
pub struct MetricMessage {
pub hostname: String,
pub timestamp: u64,
pub metrics: Vec<Metric>,
}
- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring
// Service Commands
pub enum AgentCommand {
ServiceControl {
service_name: String,
action: ServiceAction,
},
SystemRebuild { /* SSH config */ },
CollectNow,
}
### Systemd Collector (`systemd.rs`)
pub enum ServiceAction {
Start, // System-initiated
Stop, // System-initiated
UserStart, // User via dashboard (clears user-stopped)
UserStop, // User via dashboard (marks user-stopped)
Status,
}
```
- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)
### Maintenance Mode
### Backup Collector (`backup.rs`)
Suppress notifications during planned maintenance:
- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
- Repository health monitoring
```bash
# Enable maintenance mode
touch /tmp/cm-maintenance
# Perform maintenance
systemctl stop service
# ... work ...
systemctl start service
# Disable maintenance mode
rm /tmp/cm-maintenance
```
## Email Notifications
### Intelligent Batching
- **Real-time dashboard**: Immediate status updates
- **Batched emails**: Aggregated every 30 seconds
- **Smart grouping**: Services organized by severity
- **Recovery suppression**: Reduces notification spam
The system implements smart notification batching to prevent email spam:
- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity
### Example Alert Email
### Example Alert
```
Subject: Status Alert: 2 critical, 1 warning, 15 started
Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries
Status Summary (30s duration)
Host Status: Ok → Warning
🔴 CRITICAL ISSUES (2):
postgresql: Ok → Critical
nginx: Warning → Critical
🔴 CRITICAL ISSUES (1):
postgresql: Ok → Critical (memory usage 95%)
🟡 WARNINGS (1):
redis: Ok → Warning (memory usage 85%)
🟡 WARNINGS (2):
nginx: Ok → Warning (high load 8.5)
redis: user-stopped → Warning (restarted by system)
✅ RECOVERIES (0):
🟢 SERVICE STARTUPS (15):
docker: Unknown → Ok
sshd: Unknown → Ok
...
--
CM Dashboard Agent
Generated at 2025-10-21 19:42:42 CET
CM Dashboard Agent v0.1.43
```
## Individual Metrics Architecture
The system follows a **metrics-first architecture**:
### Agent Side
```rust
// Agent collects individual metrics
vec![
Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
]
```
### Dashboard Side
```rust
// Widgets subscribe to specific metrics
impl Widget for CpuWidget {
fn update_from_metrics(&mut self, metrics: &[&Metric]) {
for metric in metrics {
match metric.name.as_str() {
"cpu_load_1min" => self.load_1min = metric.value.as_f32(),
"cpu_load_5min" => self.load_5min = metric.value.as_f32(),
"cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
_ => {}
}
}
}
}
```
## Persistent Cache
The cache system prevents false notifications:
- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications
## Development
### Project Structure
```
cm-dashboard/
├── agent/ # Metrics collection agent
├── agent/ # Metrics collection agent
│ ├── src/
│ │ ├── collectors/ # CPU, memory, disk, systemd, backup
│ │ ├── status/ # Status aggregation and notifications
│ │ ├── cache/ # Persistent metric caching
│ │ ├── config/ # TOML configuration loading
│ │ └── notifications/ # Email notification system
├── dashboard/ # TUI dashboard application
│ │ ├── collectors/ # CPU, memory, disk, systemd, backup, nixos
│ │ ├── service_tracker.rs # User-stopped service tracking
│ │ ├── status/ # Status aggregation and notifications
│ │ ├── config/ # TOML configuration loading
│ │ └── communication/ # ZMQ message handling
├── dashboard/ # TUI dashboard application
│ ├── src/
│ │ ├── ui/widgets/ # CPU, memory, services, backup widgets
│ │ ├── metrics/ # Metric storage and filtering
│ │ └── communication/ # ZMQ metric consumption
├── shared/ # Shared types and utilities
│ │ ├── ui/widgets/ # CPU, memory, services, backup, system
│ │ ├── communication/ # ZMQ consumption and commands
│ │ └── app.rs # Main application loop
├── shared/ # Shared types and utilities
│ └── src/
│ ├── metrics.rs # Metric, Status, and Value types
│ ├── protocol.rs # ZMQ message format
│ └── cache.rs # Cache configuration
└── README.md # This file
│ ├── metrics.rs # Metric, Status, StatusTracker types
│ ├── protocol.rs # ZMQ message format
│ └── cache.rs # Cache configuration
└── CLAUDE.md # Development guidelines and rules
```
### Building
### Testing
```bash
# Debug build
cargo build --workspace
# Build and test
nix-shell -p openssl pkg-config --run "cargo build --workspace"
nix-shell -p openssl pkg-config --run "cargo test --workspace"
# Release build
cargo build --workspace --release
# Run tests
cargo test --workspace
# Check code formatting
cargo fmt --all -- --check
# Run clippy linter
# Code quality
cargo fmt --all
cargo clippy --workspace -- -D warnings
```
### Dependencies
## Deployment
- **tokio** - Async runtime
- **zmq** - Message passing between agent and dashboard
- **ratatui** - Terminal user interface
- **serde** - Serialization for metrics and config
- **anyhow/thiserror** - Error handling
- **tracing** - Structured logging
- **lettre** - SMTP email notifications
- **clap** - Command-line argument parsing
- **toml** - Configuration file parsing
### Automated Binary Releases
```bash
# Create new release
cd ~/projects/cm-dashboard
git tag v0.1.X
git push origin v0.1.X
```
## NixOS Integration
This triggers automated:
- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
- GitHub-style release creation
- Tarball upload to Gitea
This project is designed for declarative deployment via NixOS:
### Configuration Generation
The NixOS module automatically generates the agent configuration:
### NixOS Integration
Update `~/projects/nixosbox/hosts/common/cm-dashboard.nix`:
```nix
# hosts/common/cm-dashboard.nix
services.cm-dashboard-agent = {
enable = true;
port = 6130;
version = "v0.1.43";
src = pkgs.fetchurl {
url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
sha256 = "sha256-HASH";
};
```
### Deployment
Get hash via:
```bash
# Update NixOS configuration
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard configuration"
git push
# Rebuild system (user-performed)
sudo nixos-rebuild switch --flake .
cd ~/projects/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
url = "URL_HERE";
sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```
## Monitoring Intervals
- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)
- **Metrics Collection**: 2 seconds (CPU, memory, services)
- **Metric Transmission**: 2 seconds (ZMQ publish)
- **Dashboard Updates**: 1 second (UI refresh)
- **Email Notifications**: 30 seconds (batched)
- **Disk Monitoring**: 300 seconds (5 minutes)
- **Service Discovery**: 300 seconds (5 minutes cache)
## License
MIT License - see LICENSE file for details
MIT License - see LICENSE file for details.

63
TODO.md
View File

@@ -1,63 +0,0 @@
# TODO
## Systemd filtering (agent)
- remove user systemd collection
- reduce number of systemctl call
- Cahnge so only services in include list are detected
- Filter on exact name
- Add support for "\*" in filtering
## System panel (agent/dashboard)
use following layout:
'''
NixOS:
Build: xxxxxx
Agen: xxxxxx
CPU:
● Load: 0.02 0.31 0.86
└─ Freq: 3000MHz
RAM:
● Usage: 33% 2.6GB/7.6GB
└─ ● /tmp: 0% 0B/2.0GB
Storage:
● /:
├─ ● nvme0n1 T: 40C • W: 4%
└─ ● 8% 75.0GB/906.2GB
'''
- Add support to show login/active users
- Add support to show timestamp/version for latest nixos rebuild
## Backup panel (dashboard)
use following layout:
'''
Latest backup:
● <timestamp>
└─ Duration: 1.3m
Disk:
● Samsung SSD 870 QVO 1TB
├─ S/N: S5RRNF0W800639Y
└─ Usage: 50.5GB/915.8GB
Repos:
● gitea (4) 5.1GB
● immich (4) 45.0GB
● kryddorten (4) 67.8MB
● mariehall2 (4) 322.7MB
● nixosbox (4) 5.5MB
● unifi (4) 5.7MB
● vaultwarden (4) 508kB
'''
## Keyboard navigation and scrolling (dashboard)
- Add keyboard navigation between panels "Shift-Tab"
- Add lower statusbar with dynamic updated shortcuts when switchng between panels
## Remote execution (agent/dashboard)
- Add support for send command via dashboard to agent to do nixos rebuid
- Add support for navigating services in dashboard and trigger start/stop/restart
- Add support for trigger backup

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-agent"
version = "0.1.43"
version = "0.1.64"
edition = "2021"
[dependencies]

View File

@@ -78,10 +78,11 @@ impl Agent {
info!("Initial metric collection completed - all data cached and ready");
}
// Separate intervals for collection, transmission, and email notifications
// Separate intervals for collection, transmission, heartbeat, and email notifications
let mut collection_interval =
interval(Duration::from_secs(self.config.collection_interval_seconds));
let mut transmission_interval = interval(Duration::from_secs(self.config.zmq.transmission_interval_seconds));
let mut heartbeat_interval = interval(Duration::from_secs(self.config.zmq.heartbeat_interval_seconds));
let mut notification_interval = interval(Duration::from_secs(self.config.notifications.aggregation_interval_seconds));
loop {
@@ -98,6 +99,12 @@ impl Agent {
error!("Failed to broadcast metrics: {}", e);
}
}
_ = heartbeat_interval.tick() => {
// Send standalone heartbeat for host connectivity detection
if let Err(e) = self.send_heartbeat().await {
error!("Failed to send heartbeat: {}", e);
}
}
_ = notification_interval.tick() => {
// Process batched email notifications (separate from dashboard updates)
if let Err(e) = self.host_status_manager.process_pending_notifications(&mut self.notification_manager).await {
@@ -180,6 +187,13 @@ impl Agent {
let version_metric = self.get_agent_version_metric();
metrics.push(version_metric);
// Add heartbeat metric for host connectivity detection
let heartbeat_metric = self.get_heartbeat_metric();
metrics.push(heartbeat_metric);
// Check for user-stopped services that are now active and clear their flags
self.clear_user_stopped_flags_for_active_services(&metrics);
if metrics.is_empty() {
debug!("No metrics to broadcast");
return Ok(());
@@ -198,6 +212,12 @@ impl Agent {
async fn process_metrics(&mut self, metrics: &[Metric]) -> bool {
let mut status_changed = false;
for metric in metrics {
// Filter excluded metrics from email notification processing only
if self.config.notifications.exclude_email_metrics.contains(&metric.name) {
debug!("Excluding metric '{}' from email notification processing", metric.name);
continue;
}
if self.host_status_manager.process_metric(metric, &mut self.notification_manager).await {
status_changed = true;
}
@@ -223,6 +243,35 @@ impl Agent {
format!("v{}", env!("CARGO_PKG_VERSION"))
}
/// Create heartbeat metric for host connectivity detection
fn get_heartbeat_metric(&self) -> Metric {
use std::time::{SystemTime, UNIX_EPOCH};
let timestamp = SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs();
Metric::new(
"agent_heartbeat".to_string(),
MetricValue::Integer(timestamp as i64),
Status::Ok,
)
}
/// Send standalone heartbeat for connectivity detection
async fn send_heartbeat(&mut self) -> Result<()> {
let heartbeat_metric = self.get_heartbeat_metric();
let message = MetricMessage::new(
self.hostname.clone(),
vec![heartbeat_metric],
);
self.zmq_handler.publish_metrics(&message).await?;
debug!("Sent standalone heartbeat for connectivity detection");
Ok(())
}
async fn handle_commands(&mut self) -> Result<()> {
// Try to receive commands (non-blocking)
match self.zmq_handler.try_receive_command() {
@@ -288,7 +337,7 @@ impl Agent {
info!("Executing systemctl {} {} (user action: {})", action_str, service_name, is_user_action);
// Handle user-stopped service tracking before systemctl execution
// Handle user-stopped service tracking before systemctl execution (stop only)
match action {
ServiceAction::UserStop => {
info!("Marking service '{}' as user-stopped", service_name);
@@ -299,22 +348,13 @@ impl Agent {
UserStoppedServiceTracker::update_global(&self.service_tracker);
}
}
ServiceAction::UserStart => {
info!("Clearing user-stopped flag for service '{}'", service_name);
if let Err(e) = self.service_tracker.clear_user_stopped(service_name) {
error!("Failed to clear user-stopped flag: {}", e);
} else {
// Sync to global tracker
UserStoppedServiceTracker::update_global(&self.service_tracker);
}
}
_ => {}
}
let output = tokio::process::Command::new("sudo")
.arg("systemctl")
.arg(action_str)
.arg(service_name)
.arg(format!("{}.service", service_name))
.output()
.await?;
@@ -323,6 +363,9 @@ impl Agent {
if !output.stdout.is_empty() {
debug!("stdout: {}", String::from_utf8_lossy(&output.stdout));
}
// Note: User-stopped flag will be cleared by systemd collector
// when service actually reaches 'active' state, not here
} else {
let stderr = String::from_utf8_lossy(&output.stderr);
error!("Service {} {} failed: {}", service_name, action_str, stderr);
@@ -342,4 +385,33 @@ impl Agent {
Ok(())
}
/// Check metrics for user-stopped services that are now active and clear their flags
fn clear_user_stopped_flags_for_active_services(&mut self, metrics: &[Metric]) {
for metric in metrics {
// Look for service status metrics that are active
if metric.name.starts_with("service_") && metric.name.ends_with("_status") {
if let MetricValue::String(status) = &metric.value {
if status == "active" {
// Extract service name from metric name (service_nginx_status -> nginx)
let service_name = metric.name
.strip_prefix("service_")
.and_then(|s| s.strip_suffix("_status"))
.unwrap_or("");
if !service_name.is_empty() && UserStoppedServiceTracker::is_service_user_stopped(service_name) {
info!("Service '{}' is now active - clearing user-stopped flag", service_name);
if let Err(e) = self.service_tracker.clear_user_stopped(service_name) {
error!("Failed to clear user-stopped flag for '{}': {}", service_name, e);
} else {
// Sync to global tracker
UserStoppedServiceTracker::update_global(&self.service_tracker);
debug!("Cleared user-stopped flag for service '{}'", service_name);
}
}
}
}
}
}
}
}

View File

@@ -140,6 +140,7 @@ impl Collector for BackupCollector {
Status::Warning => "warning".to_string(),
Status::Critical => "critical".to_string(),
Status::Unknown => "unknown".to_string(),
Status::Offline => "offline".to_string(),
}),
status: overall_status,
timestamp,
@@ -202,6 +203,7 @@ impl Collector for BackupCollector {
Status::Warning => "warning".to_string(),
Status::Critical => "critical".to_string(),
Status::Unknown => "unknown".to_string(),
Status::Offline => "offline".to_string(),
}),
status: service_status,
timestamp,

View File

@@ -357,7 +357,15 @@ impl SystemdCollector {
/// Calculate service status, taking user-stopped services into account
fn calculate_service_status(&self, service_name: &str, active_status: &str) -> Status {
match active_status.to_lowercase().as_str() {
"active" => Status::Ok,
"active" => {
// If service is now active and was marked as user-stopped, clear the flag
if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
debug!("Service '{}' is now active - clearing user-stopped flag", service_name);
// Note: We can't directly clear here because this is a read-only context
// The agent will need to handle this differently
}
Status::Ok
},
"inactive" | "dead" => {
// Check if this service was stopped by user action
if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
@@ -368,7 +376,15 @@ impl SystemdCollector {
}
},
"failed" | "error" => Status::Critical,
"activating" | "deactivating" | "reloading" | "start" | "stop" | "restart" => Status::Pending,
"activating" | "deactivating" | "reloading" | "start" | "stop" | "restart" => {
// For user-stopped services that are transitioning, keep them as OK during transition
if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
debug!("Service '{}' is transitioning but was user-stopped - treating as OK", service_name);
Status::Ok
} else {
Status::Pending
}
},
_ => Status::Unknown,
}
}

View File

@@ -66,8 +66,6 @@ impl ZmqHandler {
}
/// Send heartbeat (placeholder for future use)
/// Try to receive a command (non-blocking)
pub fn try_receive_command(&self) -> Result<Option<AgentCommand>> {
match self.command_receiver.recv_bytes(zmq::DONTWAIT) {

View File

@@ -25,9 +25,10 @@ pub struct ZmqConfig {
pub publisher_port: u16,
pub command_port: u16,
pub bind_address: String,
pub timeout_ms: u64,
pub heartbeat_interval_ms: u64,
pub transmission_interval_seconds: u64,
/// Heartbeat transmission interval in seconds for host connectivity detection
#[serde(default = "default_heartbeat_interval_seconds")]
pub heartbeat_interval_seconds: u64,
}
/// Collector configuration
@@ -146,9 +147,16 @@ pub struct NotificationConfig {
pub rate_limit_minutes: u64,
/// Email notification batching interval in seconds (default: 60)
pub aggregation_interval_seconds: u64,
/// List of metric names to exclude from email notifications
#[serde(default)]
pub exclude_email_metrics: Vec<String>,
}
fn default_heartbeat_interval_seconds() -> u64 {
5
}
impl AgentConfig {
pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
loader::load_config(path)

View File

@@ -19,10 +19,6 @@ pub fn validate_config(config: &AgentConfig) -> Result<()> {
bail!("ZMQ bind address cannot be empty");
}
if config.zmq.timeout_ms == 0 {
bail!("ZMQ timeout cannot be 0");
}
// Validate collection interval
if config.collection_interval_seconds == 0 {
bail!("Collection interval cannot be 0");

View File

@@ -272,11 +272,13 @@ impl HostStatusManager {
/// Check if a status change is significant enough for notification
fn is_significant_change(&self, old_status: Status, new_status: Status) -> bool {
match (old_status, new_status) {
// Always notify on problems
// Don't notify on transitions from Unknown (startup/restart scenario)
(Status::Unknown, _) => false,
// Always notify on problems (but not from Unknown)
(_, Status::Warning) | (_, Status::Critical) => true,
// Only notify on recovery if it's from a problem state to OK and all services are OK
(Status::Warning | Status::Critical, Status::Ok) => self.current_host_status == Status::Ok,
// Don't notify on startup or other transitions
// Don't notify on other transitions
_ => false,
}
}
@@ -374,8 +376,8 @@ impl HostStatusManager {
details.push('\n');
}
// Show recoveries
if !recovery_changes.is_empty() {
// Show recoveries only if host status is now OK (all services recovered)
if !recovery_changes.is_empty() && aggregated.host_status_final == Status::Ok {
details.push_str(&format!("✅ RECOVERIES ({}):\n", recovery_changes.len()));
for change in recovery_changes {
details.push_str(&format!(" {}\n", change));

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard"
version = "0.1.43"
version = "0.1.64"
edition = "2021"
[dependencies]
@@ -18,4 +18,5 @@ tracing-subscriber = { workspace = true }
ratatui = { workspace = true }
crossterm = { workspace = true }
toml = { workspace = true }
gethostname = { workspace = true }
gethostname = { workspace = true }
wake-on-lan = "0.2"

View File

@@ -22,7 +22,7 @@ pub struct Dashboard {
terminal: Option<Terminal<CrosstermBackend<io::Stdout>>>,
headless: bool,
initial_commands_sent: std::collections::HashSet<String>,
_config: DashboardConfig,
config: DashboardConfig,
}
impl Dashboard {
@@ -67,8 +67,8 @@ impl Dashboard {
}
};
// Connect to predefined hosts from configuration
let hosts = config.hosts.predefined_hosts.clone();
// Connect to configured hosts from configuration
let hosts: Vec<String> = config.hosts.keys().cloned().collect();
// Try to connect to hosts but don't fail if none are available
match zmq_consumer.connect_to_predefined_hosts(&hosts).await {
@@ -133,7 +133,7 @@ impl Dashboard {
terminal,
headless,
initial_commands_sent: std::collections::HashSet::new(),
_config: config,
config,
})
}
@@ -149,6 +149,8 @@ impl Dashboard {
let mut last_metrics_check = Instant::now();
let metrics_check_interval = Duration::from_millis(100); // Check for metrics every 100ms
let mut last_heartbeat_check = Instant::now();
let heartbeat_check_interval = Duration::from_secs(1); // Check for host connectivity every 1 second
loop {
// Handle terminal events (keyboard input) only if not headless
@@ -191,6 +193,17 @@ impl Dashboard {
break;
}
}
// Render UI immediately after handling input for responsive feedback
if let Some(ref mut terminal) = self.terminal {
if let Some(ref mut tui_app) = self.tui_app {
if let Err(e) = terminal.draw(|frame| {
tui_app.render(frame, &self.metric_store);
}) {
error!("Error rendering TUI after input: {}", e);
}
}
}
}
// Check for new metrics
@@ -243,14 +256,8 @@ impl Dashboard {
}
}
// Update TUI with new hosts and metrics (only if not headless)
// Update TUI with new metrics (only if not headless)
if let Some(ref mut tui_app) = self.tui_app {
let connected_hosts = self
.metric_store
.get_connected_hosts(Duration::from_secs(30));
tui_app.update_hosts(connected_hosts);
tui_app.update_metrics(&self.metric_store);
}
}
@@ -269,6 +276,20 @@ impl Dashboard {
last_metrics_check = Instant::now();
}
// Check for host connectivity changes (heartbeat timeouts) periodically
if last_heartbeat_check.elapsed() >= heartbeat_check_interval {
let timeout = Duration::from_secs(self.config.zmq.heartbeat_timeout_seconds);
// Clean up metrics for offline hosts
self.metric_store.cleanup_offline_hosts(timeout);
if let Some(ref mut tui_app) = self.tui_app {
let connected_hosts = self.metric_store.get_connected_hosts(timeout);
tui_app.update_hosts(connected_hosts);
}
last_heartbeat_check = Instant::now();
}
// Render TUI (only if not headless)
if !self.headless {
if let Some(ref mut terminal) = self.terminal {

View File

@@ -6,21 +6,29 @@ use std::path::Path;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DashboardConfig {
pub zmq: ZmqConfig,
pub hosts: HostsConfig,
pub hosts: std::collections::HashMap<String, HostDetails>,
pub system: SystemConfig,
pub ssh: SshConfig,
pub service_logs: std::collections::HashMap<String, Vec<ServiceLogConfig>>,
}
/// ZMQ consumer configuration
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ZmqConfig {
pub subscriber_ports: Vec<u16>,
/// Heartbeat timeout in seconds - hosts considered offline if no heartbeat received within this time
#[serde(default = "default_heartbeat_timeout_seconds")]
pub heartbeat_timeout_seconds: u64,
}
/// Hosts configuration
fn default_heartbeat_timeout_seconds() -> u64 {
10 // Default to 10 seconds - allows for multiple missed heartbeats
}
/// Individual host configuration details
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct HostsConfig {
pub predefined_hosts: Vec<String>,
pub struct HostDetails {
pub mac_address: Option<String>,
}
/// System configuration
@@ -39,6 +47,13 @@ pub struct SshConfig {
pub rebuild_alias: String,
}
/// Service log file configuration per host
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServiceLogConfig {
pub service_name: String,
pub log_file_path: String,
}
impl DashboardConfig {
pub fn load_from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
let path = path.as_ref();
@@ -60,8 +75,3 @@ impl Default for ZmqConfig {
}
}
impl Default for HostsConfig {
fn default() -> Self {
panic!("Dashboard configuration must be loaded from file - no hardcoded defaults allowed")
}
}

View File

@@ -11,8 +11,8 @@ pub struct MetricStore {
current_metrics: HashMap<String, HashMap<String, Metric>>,
/// Historical metrics for trending
historical_metrics: HashMap<String, Vec<MetricDataPoint>>,
/// Last update timestamp per host
last_update: HashMap<String, Instant>,
/// Last heartbeat timestamp per host
last_heartbeat: HashMap<String, Instant>,
/// Configuration
max_metrics_per_host: usize,
history_retention: Duration,
@@ -23,7 +23,7 @@ impl MetricStore {
Self {
current_metrics: HashMap::new(),
historical_metrics: HashMap::new(),
last_update: HashMap::new(),
last_heartbeat: HashMap::new(),
max_metrics_per_host,
history_retention: Duration::from_secs(history_retention_hours * 3600),
}
@@ -56,10 +56,13 @@ impl MetricStore {
// Add to history
host_history.push(MetricDataPoint { received_at: now });
}
// Update last update timestamp
self.last_update.insert(hostname.to_string(), now);
// Track heartbeat metrics for connectivity detection
if metric_name == "agent_heartbeat" {
self.last_heartbeat.insert(hostname.to_string(), now);
debug!("Updated heartbeat for host {}", hostname);
}
}
// Get metrics count before cleanup
let metrics_count = host_metrics.len();
@@ -88,22 +91,46 @@ impl MetricStore {
}
}
/// Get connected hosts (hosts with recent updates)
/// Get connected hosts (hosts with recent heartbeats)
pub fn get_connected_hosts(&self, timeout: Duration) -> Vec<String> {
let now = Instant::now();
self.last_update
self.last_heartbeat
.iter()
.filter_map(|(hostname, &last_update)| {
if now.duration_since(last_update) <= timeout {
.filter_map(|(hostname, &last_heartbeat)| {
if now.duration_since(last_heartbeat) <= timeout {
Some(hostname.clone())
} else {
debug!("Host {} considered offline - last heartbeat was {:?} ago",
hostname, now.duration_since(last_heartbeat));
None
}
})
.collect()
}
/// Clean up data for offline hosts
pub fn cleanup_offline_hosts(&mut self, timeout: Duration) {
let now = Instant::now();
let mut hosts_to_cleanup = Vec::new();
// Find hosts that are offline (no recent heartbeat)
for (hostname, &last_heartbeat) in &self.last_heartbeat {
if now.duration_since(last_heartbeat) > timeout {
hosts_to_cleanup.push(hostname.clone());
}
}
// Clear metrics for offline hosts
for hostname in hosts_to_cleanup {
if let Some(metrics) = self.current_metrics.remove(&hostname) {
info!("Cleared {} metrics for offline host: {}", metrics.len(), hostname);
}
// Keep heartbeat timestamp for reconnection detection
// Don't remove from last_heartbeat to track when host was last seen
}
}
/// Cleanup old data and enforce limits
fn cleanup_host_data(&mut self, hostname: &str) {
let now = Instant::now();

View File

@@ -9,6 +9,7 @@ use ratatui::{
use std::collections::HashMap;
use std::time::Instant;
use tracing::info;
use wake_on_lan::MagicPacket;
pub mod theme;
pub mod widgets;
@@ -89,19 +90,33 @@ pub struct TuiApp {
user_navigated_away: bool,
/// Dashboard configuration
config: DashboardConfig,
/// Cached localhost hostname to avoid repeated system calls
localhost: String,
}
impl TuiApp {
pub fn new(config: DashboardConfig) -> Self {
Self {
let localhost = gethostname::gethostname().to_string_lossy().to_string();
let mut app = Self {
host_widgets: HashMap::new(),
current_host: None,
available_hosts: Vec::new(),
available_hosts: config.hosts.keys().cloned().collect(),
host_index: 0,
should_quit: false,
user_navigated_away: false,
config,
localhost,
};
// Sort predefined hosts
app.available_hosts.sort();
// Initialize with first host if available
if !app.available_hosts.is_empty() {
app.current_host = Some(app.available_hosts[0].clone());
}
app
}
/// Get or create host widgets for the given hostname
@@ -120,31 +135,31 @@ impl TuiApp {
// Only update widgets if we have metrics for this host
let all_metrics = metric_store.get_metrics_for_host(&hostname);
if !all_metrics.is_empty() {
// Get metrics first while hostname is borrowed
let cpu_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| {
m.name.starts_with("cpu_")
|| m.name.contains("c_state_")
|| m.name.starts_with("process_top_")
})
.copied()
.collect();
let memory_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| m.name.starts_with("memory_") || m.name.starts_with("disk_tmp_"))
.copied()
.collect();
let service_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| m.name.starts_with("service_"))
.copied()
.collect();
let all_backup_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| m.name.starts_with("backup_"))
.copied()
.collect();
// Single pass metric categorization for better performance
let mut cpu_metrics = Vec::new();
let mut memory_metrics = Vec::new();
let mut service_metrics = Vec::new();
let mut backup_metrics = Vec::new();
let mut nixos_metrics = Vec::new();
let mut disk_metrics = Vec::new();
for metric in all_metrics {
if metric.name.starts_with("cpu_")
|| metric.name.contains("c_state_")
|| metric.name.starts_with("process_top_") {
cpu_metrics.push(metric);
} else if metric.name.starts_with("memory_") || metric.name.starts_with("disk_tmp_") {
memory_metrics.push(metric);
} else if metric.name.starts_with("service_") {
service_metrics.push(metric);
} else if metric.name.starts_with("backup_") {
backup_metrics.push(metric);
} else if metric.name == "system_nixos_build" || metric.name == "system_active_users" || metric.name == "agent_version" {
nixos_metrics.push(metric);
} else if metric.name.starts_with("disk_") {
disk_metrics.push(metric);
}
}
// Clear completed transitions first
self.clear_completed_transitions(&hostname, &service_metrics);
@@ -155,21 +170,7 @@ impl TuiApp {
// Collect all system metrics (CPU, memory, NixOS, disk/storage)
let mut system_metrics = cpu_metrics;
system_metrics.extend(memory_metrics);
// Add NixOS metrics - using exact matching for build display fix
let nixos_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| m.name == "system_nixos_build" || m.name == "system_active_users" || m.name == "agent_version")
.copied()
.collect();
system_metrics.extend(nixos_metrics);
// Add disk/storage metrics
let disk_metrics: Vec<&Metric> = all_metrics
.iter()
.filter(|m| m.name.starts_with("disk_"))
.copied()
.collect();
system_metrics.extend(disk_metrics);
host_widgets.system_widget.update_from_metrics(&system_metrics);
@@ -178,7 +179,7 @@ impl TuiApp {
.update_from_metrics(&service_metrics);
host_widgets
.backup_widget
.update_from_metrics(&all_backup_metrics);
.update_from_metrics(&backup_metrics);
host_widgets.last_update = Some(Instant::now());
}
@@ -186,30 +187,36 @@ impl TuiApp {
}
/// Update available hosts with localhost prioritization
pub fn update_hosts(&mut self, hosts: Vec<String>) {
// Sort hosts alphabetically
let mut sorted_hosts = hosts.clone();
pub fn update_hosts(&mut self, discovered_hosts: Vec<String>) {
// Start with configured hosts (always visible)
let mut all_hosts: Vec<String> = self.config.hosts.keys().cloned().collect();
// Add any discovered hosts that aren't already configured
for host in discovered_hosts {
if !all_hosts.contains(&host) {
all_hosts.push(host);
}
}
// Keep hosts that have pending transitions even if they're offline
for (hostname, host_widgets) in &self.host_widgets {
if !host_widgets.pending_service_transitions.is_empty() {
if !sorted_hosts.contains(hostname) {
sorted_hosts.push(hostname.clone());
if !all_hosts.contains(hostname) {
all_hosts.push(hostname.clone());
}
}
}
sorted_hosts.sort();
self.available_hosts = sorted_hosts;
all_hosts.sort();
self.available_hosts = all_hosts;
// Get the current hostname (localhost) for auto-selection
let localhost = gethostname::gethostname().to_string_lossy().to_string();
if !self.available_hosts.is_empty() {
if self.available_hosts.contains(&localhost) && !self.user_navigated_away {
if self.available_hosts.contains(&self.localhost) && !self.user_navigated_away {
// Localhost is available and user hasn't navigated away - switch to it
self.current_host = Some(localhost.clone());
self.current_host = Some(self.localhost.clone());
// Find the actual index of localhost in the sorted list
self.host_index = self.available_hosts.iter().position(|h| h == &localhost).unwrap_or(0);
self.host_index = self.available_hosts.iter().position(|h| h == &self.localhost).unwrap_or(0);
} else if self.current_host.is_none() {
// No current host - select first available (which is localhost if available)
self.current_host = Some(self.available_hosts[0].clone());
@@ -244,14 +251,9 @@ impl TuiApp {
KeyCode::Char('r') => {
// System rebuild command - works on any panel for current host
if let Some(hostname) = self.current_host.clone() {
// Create command that shows CM Dashboard logo and then rebuilds
// Create command that shows logo, rebuilds, and waits for user input
let logo_and_rebuild = format!(
r"cat << 'EOF'
NixOS System Rebuild
Target: {}
EOF
ssh -tt {}@{} 'bash -ic {}'",
"bash -c 'cat << \"EOF\"\nNixOS System Rebuild\nTarget: {}\n\nEOF\nssh -tt {}@{} \"bash -ic {}\"\necho\necho \"========================================\"\necho \"Rebuild completed. Press any key to close...\"\necho \"========================================\"\nread -n 1 -s\nexit'",
hostname,
self.config.ssh.rebuild_user,
hostname,
@@ -259,7 +261,10 @@ ssh -tt {}@{} 'bash -ic {}'",
);
std::process::Command::new("tmux")
.arg("display-popup")
.arg("split-window")
.arg("-v")
.arg("-p")
.arg("30")
.arg(&logo_and_rebuild)
.spawn()
.ok(); // Ignore errors, tmux will handle them
@@ -281,6 +286,51 @@ ssh -tt {}@{} 'bash -ic {}'",
}
}
}
KeyCode::Char('J') => {
// Show service logs via journalctl in tmux split window
if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
let journalctl_command = format!(
"bash -c \"ssh -tt {}@{} 'sudo journalctl -u {}.service -f --no-pager -n 50'; exit\"",
self.config.ssh.rebuild_user,
hostname,
service_name
);
std::process::Command::new("tmux")
.arg("split-window")
.arg("-v")
.arg("-p")
.arg("30")
.arg(&journalctl_command)
.spawn()
.ok(); // Ignore errors, tmux will handle them
}
}
KeyCode::Char('L') => {
// Show custom service log file in tmux split window
if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
// Check if this service has a custom log file configured
if let Some(host_logs) = self.config.service_logs.get(&hostname) {
if let Some(log_config) = host_logs.iter().find(|config| config.service_name == service_name) {
let tail_command = format!(
"bash -c \"ssh -tt {}@{} 'sudo tail -n 50 -f {}'; exit\"",
self.config.ssh.rebuild_user,
hostname,
log_config.log_file_path
);
std::process::Command::new("tmux")
.arg("split-window")
.arg("-v")
.arg("-p")
.arg("30")
.arg(&tail_command)
.spawn()
.ok(); // Ignore errors, tmux will handle them
}
}
}
}
KeyCode::Char('b') => {
// Trigger backup
if let Some(hostname) = self.current_host.clone() {
@@ -288,6 +338,33 @@ ssh -tt {}@{} 'bash -ic {}'",
return Ok(Some(UiCommand::TriggerBackup { hostname }));
}
}
KeyCode::Char('w') => {
// Wake on LAN for offline hosts
if let Some(hostname) = self.current_host.clone() {
// Check if host has MAC address configured
if let Some(host_details) = self.config.hosts.get(&hostname) {
if let Some(mac_address) = &host_details.mac_address {
// Parse MAC address and send WoL packet
let mac_bytes = Self::parse_mac_address(mac_address);
match mac_bytes {
Ok(mac) => {
match MagicPacket::new(&mac).send() {
Ok(_) => {
info!("WakeOnLAN packet sent successfully to {} ({})", hostname, mac_address);
}
Err(e) => {
tracing::error!("Failed to send WakeOnLAN packet to {}: {}", hostname, e);
}
}
}
Err(_) => {
tracing::error!("Invalid MAC address format for {}: {}", hostname, mac_address);
}
}
}
}
}
}
KeyCode::Tab => {
// Tab cycles to next host
self.navigate_host(1);
@@ -336,9 +413,8 @@ ssh -tt {}@{} 'bash -ic {}'",
self.current_host = Some(self.available_hosts[self.host_index].clone());
// Check if user navigated away from localhost
let localhost = gethostname::gethostname().to_string_lossy().to_string();
if let Some(ref current) = self.current_host {
if current != &localhost {
if current != &self.localhost {
self.user_navigated_away = true;
} else {
self.user_navigated_away = false; // User navigated back to localhost
@@ -482,6 +558,21 @@ ssh -tt {}@{} 'bash -ic {}'",
])
.split(main_chunks[1]); // main_chunks[1] is now the content area (between title and statusbar)
// Check if current host is offline
let current_host_offline = if let Some(hostname) = self.current_host.clone() {
self.calculate_host_status(&hostname, metric_store) == Status::Offline
} else {
true // No host selected is considered offline
};
// If host is offline, render wake-up message instead of panels
if current_host_offline {
self.render_offline_host_message(frame, main_chunks[1]);
self.render_btop_title(frame, main_chunks[0], metric_store);
self.render_statusbar(frame, main_chunks[2]);
return;
}
// Check if backup panel should be shown
let show_backup = if let Some(hostname) = self.current_host.clone() {
let host_widgets = self.get_or_create_host_widgets(&hostname);
@@ -549,11 +640,14 @@ ssh -tt {}@{} 'bash -ic {}'",
return;
}
// Calculate worst-case status across all hosts
// Calculate worst-case status across all hosts (excluding offline)
let mut worst_status = Status::Ok;
for host in &self.available_hosts {
let host_status = self.calculate_host_status(host, metric_store);
worst_status = Status::aggregate(&[worst_status, host_status]);
// Don't include offline hosts in status aggregation
if host_status != Status::Offline {
worst_status = Status::aggregate(&[worst_status, host_status]);
}
}
// Use the worst status color as background
@@ -631,7 +725,7 @@ ssh -tt {}@{} 'bash -ic {}'",
let metrics = metric_store.get_metrics_for_host(hostname);
if metrics.is_empty() {
return Status::Unknown;
return Status::Offline;
}
// First check if we have the aggregated host status summary from the agent
@@ -651,7 +745,8 @@ ssh -tt {}@{} 'bash -ic {}'",
Status::Warning => has_warning = true,
Status::Pending => has_pending = true,
Status::Ok => ok_count += 1,
Status::Unknown => {} // Ignore unknown for aggregation
Status::Unknown => {}, // Ignore unknown for aggregation
Status::Offline => {}, // Ignore offline for aggregation
}
}
@@ -686,10 +781,13 @@ ssh -tt {}@{} 'bash -ic {}'",
let mut shortcuts = Vec::new();
// Global shortcuts
shortcuts.push("Tab: Switch Host".to_string());
shortcuts.push("↑↓/jk: Select Service".to_string());
shortcuts.push("r: Rebuild Host".to_string());
shortcuts.push("s/S: Start/Stop Service".to_string());
shortcuts.push("Tab: Host".to_string());
shortcuts.push("↑↓/jk: Select".to_string());
shortcuts.push("r: Rebuild".to_string());
shortcuts.push("s/S: Start/Stop".to_string());
shortcuts.push("J: Logs".to_string());
shortcuts.push("L: Custom".to_string());
shortcuts.push("w: Wake".to_string());
// Always show quit
shortcuts.push("q: Quit".to_string());
@@ -728,5 +826,91 @@ ssh -tt {}@{} 'bash -ic {}'",
}
}
/// Render offline host message with wake-up option
fn render_offline_host_message(&self, frame: &mut Frame, area: Rect) {
use ratatui::layout::Alignment;
use ratatui::style::Modifier;
use ratatui::text::{Line, Span};
use ratatui::widgets::{Block, Borders, Paragraph};
// Get hostname for message
let hostname = self.current_host.as_ref()
.map(|h| h.as_str())
.unwrap_or("Unknown");
// Check if host has MAC address for wake-on-LAN
let has_mac = self.current_host.as_ref()
.and_then(|hostname| self.config.hosts.get(hostname))
.and_then(|details| details.mac_address.as_ref())
.is_some();
// Create message content
let mut lines = vec![
Line::from(Span::styled(
format!("Host '{}' is offline", hostname),
Style::default().fg(Theme::muted_text()).add_modifier(Modifier::BOLD),
)),
Line::from(""),
];
if has_mac {
lines.push(Line::from(Span::styled(
"Press 'w' to wake up host",
Style::default().fg(Theme::primary_text()).add_modifier(Modifier::BOLD),
)));
} else {
lines.push(Line::from(Span::styled(
"No MAC address configured - cannot wake up",
Style::default().fg(Theme::muted_text()),
)));
}
// Create centered message
let message = Paragraph::new(lines)
.block(Block::default()
.borders(Borders::ALL)
.border_style(Style::default().fg(Theme::muted_text()))
.title(" Offline Host ")
.title_style(Style::default().fg(Theme::muted_text()).add_modifier(Modifier::BOLD)))
.style(Style::default().bg(Theme::background()).fg(Theme::primary_text()))
.alignment(Alignment::Center);
// Center the message in the available area
let popup_area = ratatui::layout::Layout::default()
.direction(Direction::Vertical)
.constraints([
Constraint::Percentage(40),
Constraint::Length(6),
Constraint::Percentage(40),
])
.split(area)[1];
let popup_area = ratatui::layout::Layout::default()
.direction(Direction::Horizontal)
.constraints([
Constraint::Percentage(25),
Constraint::Percentage(50),
Constraint::Percentage(25),
])
.split(popup_area)[1];
frame.render_widget(message, popup_area);
}
/// Parse MAC address string (e.g., "AA:BB:CC:DD:EE:FF") to [u8; 6]
fn parse_mac_address(mac_str: &str) -> Result<[u8; 6], &'static str> {
let parts: Vec<&str> = mac_str.split(':').collect();
if parts.len() != 6 {
return Err("MAC address must have 6 parts separated by colons");
}
let mut mac = [0u8; 6];
for (i, part) in parts.iter().enumerate() {
match u8::from_str_radix(part, 16) {
Ok(byte) => mac[i] = byte,
Err(_) => return Err("Invalid hexadecimal byte in MAC address"),
}
}
Ok(mac)
}
}

View File

@@ -147,6 +147,7 @@ impl Theme {
Status::Warning => Self::warning(),
Status::Critical => Self::error(),
Status::Unknown => Self::muted_text(),
Status::Offline => Self::muted_text(), // Dark gray for offline
}
}
@@ -244,8 +245,9 @@ impl StatusIcons {
Status::Ok => "",
Status::Pending => "", // Hollow circle for pending
Status::Warning => "",
Status::Critical => "",
Status::Critical => "!",
Status::Unknown => "?",
Status::Offline => "", // Empty circle for offline
}
}
@@ -258,6 +260,7 @@ impl StatusIcons {
Status::Warning => Theme::warning(), // Yellow
Status::Critical => Theme::error(), // Red
Status::Unknown => Theme::muted_text(), // Gray
Status::Offline => Theme::muted_text(), // Dark gray for offline
};
vec![

View File

@@ -113,13 +113,10 @@ impl ServicesWidget {
name.to_string()
};
// Parent services always show active/inactive status
// Parent services always show actual systemctl status
let status_str = match info.widget_status {
Status::Ok => "active".to_string(),
Status::Pending => "pending".to_string(),
Status::Warning => "inactive".to_string(),
Status::Critical => "failed".to_string(),
Status::Unknown => "unknown".to_string(),
_ => info.status.clone(), // Use actual status from agent (active/inactive/failed)
};
format!(
@@ -149,6 +146,7 @@ impl ServicesWidget {
Status::Warning => Theme::warning(),
Status::Critical => Theme::error(),
Status::Unknown => Theme::muted_text(),
Status::Offline => Theme::muted_text(),
};
(icon.to_string(), info.status.clone(), status_color)

View File

@@ -1,88 +0,0 @@
# Hardcoded Values Removed - Configuration Summary
## ✅ All Hardcoded Values Converted to Configuration
### **1. SystemD Nginx Check Interval**
- **Before**: `nginx_check_interval_seconds: 30` (hardcoded)
- **After**: `nginx_check_interval_seconds: config.nginx_check_interval_seconds`
- **NixOS Config**: `nginx_check_interval_seconds = 30;`
### **2. ZMQ Transmission Interval**
- **Before**: `Duration::from_secs(1)` (hardcoded)
- **After**: `Duration::from_secs(self.config.zmq.transmission_interval_seconds)`
- **NixOS Config**: `transmission_interval_seconds = 1;`
### **3. HTTP Timeouts in SystemD Collector**
- **Before**:
```rust
.timeout(Duration::from_secs(10))
.connect_timeout(Duration::from_secs(10))
```
- **After**:
```rust
.timeout(Duration::from_secs(self.config.http_timeout_seconds))
.connect_timeout(Duration::from_secs(self.config.http_connect_timeout_seconds))
```
- **NixOS Config**:
```nix
http_timeout_seconds = 10;
http_connect_timeout_seconds = 10;
```
## **Configuration Structure Changes**
### **SystemdConfig** (agent/src/config/mod.rs)
```rust
pub struct SystemdConfig {
// ... existing fields ...
pub nginx_check_interval_seconds: u64, // NEW
pub http_timeout_seconds: u64, // NEW
pub http_connect_timeout_seconds: u64, // NEW
}
```
### **ZmqConfig** (agent/src/config/mod.rs)
```rust
pub struct ZmqConfig {
// ... existing fields ...
pub transmission_interval_seconds: u64, // NEW
}
```
## **NixOS Configuration Updates**
### **ZMQ Section** (hosts/common/cm-dashboard.nix)
```nix
zmq = {
# ... existing fields ...
transmission_interval_seconds = 1; # NEW
};
```
### **SystemD Section** (hosts/common/cm-dashboard.nix)
```nix
systemd = {
# ... existing fields ...
nginx_check_interval_seconds = 30; # NEW
http_timeout_seconds = 10; # NEW
http_connect_timeout_seconds = 10; # NEW
};
```
## **Benefits**
**No hardcoded values** - All timing/timeout values configurable
**Consistent configuration** - Everything follows NixOS config pattern
**Environment-specific tuning** - Can adjust timeouts per deployment
**Maintainability** - No magic numbers scattered in code
**Testing flexibility** - Can configure different values for testing
## **Runtime Behavior**
All previously hardcoded values now respect configuration:
- **Nginx latency checks**: Every 30s (configurable)
- **ZMQ transmission**: Every 1s (configurable)
- **HTTP requests**: 10s timeout (configurable)
- **HTTP connections**: 10s timeout (configurable)
The codebase is now **100% configuration-driven** with no hardcoded timing values.

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-shared"
version = "0.1.43"
version = "0.1.64"
edition = "2021"
[dependencies]

View File

@@ -87,6 +87,7 @@ pub enum Status {
Warning,
Critical,
Unknown,
Offline,
}
impl Status {
@@ -190,6 +191,16 @@ impl HysteresisThresholds {
Status::Ok
}
}
Status::Offline => {
// Host coming back online, use normal thresholds like first measurement
if value >= self.critical_high {
Status::Critical
} else if value >= self.warning_high {
Status::Warning
} else {
Status::Ok
}
}
}
}
}

View File

@@ -1,42 +0,0 @@
#!/bin/bash
# Test script to verify collector intervals are working correctly
# Expected behavior:
# - CPU/Memory: Every 2 seconds
# - Systemd/Network: Every 10 seconds
# - Backup/NixOS: Every 60 seconds
# - Disk: Every 300 seconds (5 minutes)
echo "=== Testing Collector Interval Implementation ==="
echo "Expected intervals from NixOS config:"
echo " CPU: 2s, Memory: 2s"
echo " Systemd: 10s, Network: 10s"
echo " Backup: 60s, NixOS: 60s"
echo " Disk: 300s (5m)"
echo ""
# Note: Cannot run actual agent without proper config, but we can verify the code logic
echo "✅ Code Implementation Status:"
echo " - TimedCollector struct with interval tracking: IMPLEMENTED"
echo " - Individual collector intervals from config: IMPLEMENTED"
echo " - collect_metrics_timed() respects intervals: IMPLEMENTED"
echo " - Debug logging shows interval compliance: IMPLEMENTED"
echo ""
echo "🔍 Key Implementation Details:"
echo " - MetricCollectionManager now tracks last_collection time per collector"
echo " - Each collector gets Duration::from_secs(config.{collector}.interval_seconds)"
echo " - Only collectors with elapsed >= interval are called"
echo " - Debug logs show actual collection with interval info"
echo ""
echo "📊 Expected Runtime Behavior:"
echo " At 0s: All collectors run (startup)"
echo " At 2s: CPU, Memory run"
echo " At 4s: CPU, Memory run"
echo " At 10s: CPU, Memory, Systemd, Network run"
echo " At 60s: CPU, Memory, Systemd, Network, Backup, NixOS run"
echo " At 300s: All collectors run including Disk"
echo ""
echo "✅ CONCLUSION: Codebase now follows NixOS configuration intervals correctly!"

View File

@@ -1,32 +0,0 @@
#!/usr/bin/env rust-script
use std::process;
/// Check if running inside tmux session
fn check_tmux_session() {
// Check for TMUX environment variable which is set when inside a tmux session
if std::env::var("TMUX").is_err() {
eprintln!("╭─────────────────────────────────────────────────────────────╮");
eprintln!("│ ⚠️ TMUX REQUIRED │");
eprintln!("├─────────────────────────────────────────────────────────────┤");
eprintln!("│ CM Dashboard must be run inside a tmux session for proper │");
eprintln!("│ terminal handling and remote operation functionality. │");
eprintln!("│ │");
eprintln!("│ Please start a tmux session first: │");
eprintln!("│ tmux new-session -d -s dashboard cm-dashboard │");
eprintln!("│ tmux attach-session -t dashboard │");
eprintln!("│ │");
eprintln!("│ Or simply: │");
eprintln!("│ tmux │");
eprintln!("│ cm-dashboard │");
eprintln!("╰─────────────────────────────────────────────────────────────╯");
process::exit(1);
} else {
println!("✅ Running inside tmux session - OK");
}
}
fn main() {
println!("Testing tmux check function...");
check_tmux_session();
}

View File

@@ -1,53 +0,0 @@
#!/bin/bash
echo "=== TMUX Check Implementation Test ==="
echo ""
echo "📋 Testing tmux check logic:"
echo ""
echo "1. Current environment:"
if [ -n "$TMUX" ]; then
echo " ✅ Running inside tmux session"
echo " TMUX variable: $TMUX"
else
echo " ❌ NOT running inside tmux session"
echo " TMUX variable: (not set)"
fi
echo ""
echo "2. Simulating dashboard tmux check logic:"
echo ""
# Simulate the Rust check logic
if [ -z "$TMUX" ]; then
echo " Dashboard would show:"
echo " ╭─────────────────────────────────────────────────────────────╮"
echo " │ ⚠️ TMUX REQUIRED │"
echo " ├─────────────────────────────────────────────────────────────┤"
echo " │ CM Dashboard must be run inside a tmux session for proper │"
echo " │ terminal handling and remote operation functionality. │"
echo " │ │"
echo " │ Please start a tmux session first: │"
echo " │ tmux new-session -d -s dashboard cm-dashboard │"
echo " │ tmux attach-session -t dashboard │"
echo " │ │"
echo " │ Or simply: │"
echo " │ tmux │"
echo " │ cm-dashboard │"
echo " ╰─────────────────────────────────────────────────────────────╯"
echo " Then exit with code 1"
else
echo " ✅ Dashboard tmux check would PASS - continuing normally"
fi
echo ""
echo "3. Implementation status:"
echo " ✅ check_tmux_session() function added to dashboard/src/main.rs"
echo " ✅ Called early in main() but only for TUI mode (not headless)"
echo " ✅ Uses std::env::var(\"TMUX\") to detect tmux session"
echo " ✅ Shows helpful error message with usage instructions"
echo " ✅ Exits with code 1 if not in tmux"
echo ""
echo "✅ TMUX check implementation complete!"