Collectors now respect their configured intervals instead of running
every transmission cycle (2s). This prevents disk SMART checks from
running every 2 seconds, which was causing constant disk activity.
- Add TimedCollector wrapper with interval tracking
- Only collect from collectors whose interval has elapsed
- Disk collector now properly runs every 300s instead of every 2s
- Bump version to v0.1.229
- Changed code to use zmq.transmission_interval_seconds instead of top-level collection_interval_seconds
- Removed collection_interval_seconds from AgentConfig
- Updated validation to check zmq.transmission_interval_seconds
- Improves config organization by grouping all ZMQ settings together
Bump version to v0.1.210
Co-Authored-By: Claude <noreply@anthropic.com>
Move network collection from NixOS collector to dedicated NetworkCollector. Add link status detection for physical interfaces (up/down). Group interfaces by physical/virtual, show status icons for physical NICs only. Down interfaces show as Inactive instead of Critical.
Version bump to 0.1.165
Fully restored CM Dashboard as a complete monitoring system with working
status evaluation and email notifications.
COMPLETED PHASES:
✅ Phase 1: Fixed storage display issues
- Use lsblk instead of findmnt (eliminates /nix/store bind mount)
- Fixed NVMe SMART parsing (Temperature: and Percentage Used:)
- Added sudo to smartctl for permissions
- Consistent filesystem and tmpfs sorting
✅ Phase 2a: Fixed missing NixOS build information
- Added build_version field to AgentData
- NixOS collector now populates build info
- Dashboard shows actual build instead of "unknown"
✅ Phase 2b: Restored status evaluation system
- Added status fields to all structured data types
- CPU: load and temperature status evaluation
- Memory: usage status evaluation
- Storage: temperature, health, and filesystem usage status
- All collectors now use their threshold configurations
✅ Phase 3: Restored notification system
- Status change detection between collection cycles
- Email alerts on status degradation (OK→Warning/Critical)
- Detailed notification content with metric values
- Full NotificationManager integration
CORE FUNCTIONALITY RESTORED:
- Real-time monitoring with proper status evaluation
- Email notifications on threshold violations
- Correct storage display (nvme0n1 T: 28°C W: 1%)
- Complete status-aware infrastructure monitoring
- Dashboard is now a monitoring system, not just data viewer
The CM Dashboard monitoring system is fully operational.
Phase 1 fixes for storage display:
- Replace findmnt with lsblk to eliminate bind mount issues (/nix/store)
- Add sudo to smartctl commands for permission access
- Fix NVMe SMART parsing for Temperature: and Percentage Used: fields
- Use dynamic version from CARGO_PKG_VERSION instead of hardcoded strings
Storage display should now show correct mount points and temperature/wear.
Status evaluation and notifications still need restoration in subsequent phases.
Implements clean structured data collection eliminating all string metric
parsing bugs. Collectors now populate AgentData directly with type-safe
field access.
Key improvements:
- Mount points preserved correctly (/ and /boot instead of root/boot)
- Tmpfs discovery added to memory collector
- Temperature data flows as typed f32 fields
- Zero string parsing overhead
- Complete removal of MetricCollectionManager bridge
- Direct ZMQ transmission of structured JSON
All functionality maintained: service tracking, notifications, status
evaluation, and multi-host monitoring.
Update storage display to match CLAUDE.md specification:
- Show drive temp/wear on main line: nvme0n1 T: 25°C W: 4%
- Display individual filesystems as sub-items: /: 55% 250.5GB/456.4GB
- Remove Total usage line in favor of filesystem breakdown
Clean up code warnings:
- Remove unused heartbeat methods and fields
- Remove unused backup widget fields and methods
- Add allow attributes for legacy methods
Agent heartbeat was sending empty AgentData every few seconds, causing
dashboard to display zero values for all metrics intermittently. Since
agent already transmits complete data every 1 second, heartbeat is
redundant. Dashboard will detect offline hosts via data timestamps.
Replace fragile string-based metrics with type-safe JSON data structures.
Agent converts all metrics to structured data, dashboard processes typed fields.
Changes:
- Add AgentData struct with CPU, memory, storage, services, backup fields
- Replace string parsing with direct field access throughout system
- Maintain UI compatibility via temporary metric bridge conversion
- Fix NVMe temperature display and eliminate string parsing bugs
- Update protocol to support structured data transmission over ZMQ
- Comprehensive metric type coverage: CPU, memory, storage, services, backup
Version bump to 0.1.131
Replace ZMQ-based service start/stop commands with SSH execution in tmux
popups. This provides better user feedback with real-time systemctl output
while eliminating blocking operations from the main message processing loop.
Changes:
- Service start/stop now use SSH with progress display
- Added backup functionality with 'B' key
- Preserved transitional icons (↑/↓) for immediate visual feedback
- Removed all ZMQ service control commands and handlers
- Updated configuration to include backup_alias setting
- All operations (rebuild, backup, services) now use consistent SSH interface
This ensures stable heartbeat processing while providing superior user
experience with live command output and service status feedback.
Simplifies host connection configuration by removing tailscale_ip field,
connection_type preferences, and fallback retry logic. Now uses only the
ip field or hostname as fallback. Eliminates blocking TCP connectivity
tests that interfered with heartbeat processing.
This resolves intermittent host lost/found issues by removing the
connection retry timeouts that blocked the ZMQ message processing loop.
- Add dedicated heartbeat transmission every 5 seconds independent of metric collection
- Fix host offline detection by clearing metrics for disconnected hosts
- Move exclude_email_metrics to NotificationConfig for better organization
- Add cleanup_offline_hosts method to remove stale metrics after heartbeat timeout
- Ensure offline hosts show proper status icons and visual indicators
Version 0.1.63
- Add agent_heartbeat metric to agent transmission for reliable host detection
- Update dashboard to track heartbeat timestamps per host instead of general metrics
- Add configurable heartbeat_timeout_seconds to dashboard ZMQ config (default 10s)
- Remove unused timeout_ms from agent config and revert to non-blocking command reception
- Remove unused heartbeat_interval_ms from agent configuration
- Host disconnect detection now uses dedicated heartbeat metrics for improved reliability
- Bump version to 0.1.57
- Add exclude_email_metrics field to AgentConfig for filtering email notifications
- Metrics matching excluded names skip notification processing but still appear in dashboard
- Optional field with serde(default) for backward compatibility
- Bump version to 0.1.56
Correct user-stopped service behavior during startup transitions:
User-Stopped Flag Timing Fix:
- Clear user-stopped flag only when service actually becomes active, not when start command succeeds
- Remove premature flag clearing from service control handler
- Add automatic flag clearing when service status metrics show active state
- Services retain user-stopped status during activating/transitioning states
Service Transition Handling:
- User-stopped services in activating state now report Status::OK instead of Status::Pending
- Prevents host warnings during legitimate service startup transitions
- Maintains accurate status reporting throughout service lifecycle
- Failed service starts preserve user-stopped flags correctly
Journalctl Popup Fix:
- Fix terminal corruption when using J key for service logs
- Correct command quoting to prevent tmux popup interference
- Stable popup display without dashboard interface corruption
Result: Clean service startup experience with no false warnings and proper
user-stopped tracking throughout the entire service lifecycle.
Bump version to v0.1.47
Improve user-stopped service tracking behavior:
Service Display Fix:
- Services widget now shows actual systemctl status (active/inactive)
- Use info.status instead of hardcoded text based on widget_status
- User-stopped services correctly display 'inactive' with green OK icon
- Prevents misleading 'active' display for stopped services
User-Stopped Flag Timing Fix:
- Clear user-stopped flag AFTER successful service start, not when command sent
- Prevents warnings during service startup transition period
- Service remains Status::OK during 'activating' state for user-stopped services
- Flag only cleared when systemctl start command actually succeeds
- Failed start attempts preserve user-stopped flag
Result: Clean service state tracking with accurate display and no false alerts
during intentional user operations.
Bump version to v0.1.45
Service stop/start operations were failing because systemctl commands
were missing the .service suffix. This caused the new user-stopped
tracking feature to mark services but not actually control them.
Changes:
- Add .service suffix to systemctl commands in service control handler
- Matches pattern used throughout systemd collector
- Fixes service start/stop functionality via dashboard
Clean up legacy documentation:
- Remove outdated TODO.md, AGENTS.md, and test files
- Update CLAUDE.md with current architecture and rules only
- Comprehensive README.md rewrite with technical documentation
- Document user-stopped service tracking feature
Bump version to v0.1.44
Add comprehensive tracking for services stopped via dashboard to prevent
false alerts when users intentionally stop services.
Features:
- User-stopped services report Status::Ok instead of Warning
- Persistent storage survives agent restarts
- Dashboard sends UserStart/UserStop commands
- Agent tracks and syncs user-stopped state globally
- Systemd collector respects user-stopped flags
Implementation:
- New service_tracker module with persistent JSON storage
- Enhanced ServiceAction enum with UserStart/UserStop variants
- Global singleton tracker accessible by collectors
- Service status logic updated to check user-stopped flag
- Dashboard version now uses CARGO_PKG_VERSION automatically
Bump version to v0.1.43
Simplified keyboard controls by removing service restart functionality:
- Removed 'r' key restart functionality from Services panel
- Made 'R' key always trigger system rebuild regardless of focused panel
- Updated context shortcuts to show 'R: Rebuild Host' globally
- Removed all ServiceRestart enum variants and associated code:
- UiCommand::ServiceRestart
- CommandType::ServiceRestart
- ServiceAction::Restart
- Cleaned up pending transition logic to only handle Start/Stop commands
The 'R' key now consistently rebuilds the current host from any panel,
while 's' and 'S' continue to handle service start/stop in Services panel.
- Fix /tmp usage status to use proper thresholds instead of hardcoded Ok status
- Fix wear level status to use configurable thresholds instead of hardcoded values
- Add dedicated tmp_status field to SystemWidget for proper /tmp status display
- Remove host-level hourglass icon during service operations
- Implement immediate service status updates after start/stop/restart commands
- Remove active users display and collection from NixOS section
- Fix immediate host status aggregation transmission to dashboard
- Separate dashboard updates from email notifications for immediate status aggregation
- Add metric caching to MetricCollectionManager for instant dashboard updates
- Dashboard now receives cached data every 1 second instead of waiting for collection intervals
- Fix transmission to use cached metrics rather than triggering fresh collection
- Email notifications maintain separate 60-second batching interval
- Update configurable email notification aggregation interval
- Fix email notification aggregation to send batched notifications instead of individual emails
- Fix startup data collection to properly process initial status without triggering change notifications
- Maintain event-driven transmission while preserving aggregated notification batching
- Update version from 0.1.19 to 0.1.20 across all components
Major architectural improvements:
CORE CHANGES:
- Remove notification_interval_seconds - status aggregation now immediate
- Status calculation moved to collection phase instead of transmission
- Event-driven transmission triggers immediately on status changes
- Dual transmission strategy: immediate on change + periodic backup
- Real-time notifications without batching delays
TECHNICAL IMPROVEMENTS:
- process_metric() now returns bool indicating status change
- Immediate ZMQ broadcast when status changes detected
- Status aggregation happens during metric collection, not later
- Legacy get_nixos_build_info() method removed (unused)
- All compilation warnings fixed
BEHAVIOR CHANGES:
- Critical alerts sent instantly instead of waiting for intervals
- Dashboard receives real-time status updates
- Notifications triggered immediately on status transitions
- Backup periodic transmission every 1s ensures heartbeat
This provides much more responsive monitoring with instant alerting
while maintaining the reliability of periodic transmission as backup.
- Implement per-collector interval timing respecting NixOS config
- Remove all hardcoded timeout/interval values and make configurable
- Add tmux session requirement check for TUI mode (bypassed for headless)
- Update agent to send config hash in Build field instead of nixos version
- Add nginx check interval, HTTP timeouts, and ZMQ transmission interval configs
- Update NixOS configuration with new configurable values
Breaking changes:
- Build field now shows nix store config hash (8 chars) instead of nixos version
- All intervals now follow individual collector configuration instead of global
New configuration fields:
- systemd.nginx_check_interval_seconds
- systemd.http_timeout_seconds
- systemd.http_connect_timeout_seconds
- zmq.transmission_interval_seconds
- Remove all SystemRebuild command infrastructure from agent and dashboard
- Replace with direct tmux popup execution: ssh {user}@{host} {alias}
- Add configurable SSH user and rebuild alias in dashboard config
- Eliminate agent process crashes during rebuilds
- Simplify architecture by removing ZMQ command streaming complexity
- Clean up all related dead code and fix compilation warnings
Benefits:
- Process isolation: rebuild runs independently via SSH
- Crash resilience: agent/dashboard can restart without affecting rebuilds
- Configuration flexibility: SSH user and alias configurable per deployment
- Operational simplicity: standard tmux popup interface
- Add cm-rebuild systemd service for process isolation
- Add sudo permissions for service control and journal access
- Remove verbose flag for cleaner output
- Ensures reliable rebuild operations without agent crashes
- Remove auto-close behavior from terminal popup for manual review
- Fix system panel to show correct NixOS section layout
- Add missing Active users line after Agent version
- Switch agent version from nix store hash to actual version number (v0.1.11)
- Display full version string without truncation for clear version tracking
- Replace simulated progress messages with actual stdout/stderr capture
- Stream all nixos-rebuild output line-by-line to terminal popup
- Show transparent build process including downloads, compilation, and activation
- Maintain real-time visibility into complete rebuild process
- Add terminal popup UI component with 80% screen coverage and terminal styling
- Extend ZMQ protocol with CommandOutputMessage for streaming output
- Implement real-time output streaming in agent system rebuild handler
- Add keyboard controls (ESC/Q to close, ↑↓ to scroll) for popup interaction
- Fix system panel Build display to show actual NixOS build instead of config hash
- Update service filters in README with wildcard patterns for better matching
- Add periodic progress updates during nixos-rebuild execution
- Integrate command output handling in dashboard main loop
- Agent reports version via agent_version metric using nix store hash
- Dashboard displays agent version in system widget
- Foundation for cross-host version comparison
- Both agent -V and dashboard show versions
Redirect nixos-rebuild stdout/stderr to /var/log/cm-dashboard/nixos-rebuild.log
while keeping the process detached. This allows monitoring rebuild progress
and debugging why cargo builds in /tmp aren't visible when agent runs.
Use: tail -f /var/log/cm-dashboard/nixos-rebuild.log to monitor progress.
Run nixos-rebuild with nohup in background to prevent the agent
from killing itself during system rebuild. The rebuild process
now runs independently, allowing the agent to return success
immediately and avoid crashes during binary updates.
This fixes the issue where agent would crash during rebuild
and restart with the old binary due to missing daemon-reload.
Fixes kernel namespace sandboxing issues when running as systemd service.
The --no-sandbox flag disables Nix build sandboxing which requires
kernel namespaces not available in restricted service environments.
Remove sudo -u cm wrapper that was causing git repository ownership
mismatch. Now cm-agent runs nixos-rebuild directly as root, avoiding
the ownership conflict between cm-agent (git clone) and cm user.
Updated sudo rules to allow cm-agent -> root nixos-rebuild access.
Use explicit /run/current-system/sw/bin/nixos-rebuild path instead of
'nixos-rebuild' command to match sudo rules exactly. This resolves
'command not allowed' errors when the command resolves to nix store paths.
- Add nixos_config_api_key_file option to NixOS configuration
- Support reading API token from file for private repositories
- Automatically inject token into HTTPS URLs (https://token@host/repo.git)
- Graceful fallback to original URL if key file missing/empty
- Default key file location: /var/lib/cm-dashboard/git-api-key
Usage: echo 'your-api-token' | sudo tee /var/lib/cm-dashboard/git-api-key
Replace direct directory access with git clone/pull approach:
- Add git configuration options (url, branch, working_dir) to NixOS module
- Update SystemConfig and AgentCommand to use git parameters
- Implement ensure_git_repository() method for clone/pull operations
- Agent clones nixosbox to /var/lib/cm-dashboard/nixos-config
- Maintains security while solving permission denied issues
The agent now manages its own copy of the configuration without
needing access to /home/cm directory.
- Add sudo permissions for systemctl and nixos-rebuild commands
- Use sudo in agent command execution for proper privileges
- Fix backup collector to handle missing status files gracefully
- Eliminate backup error spam when no backup system is configured
This implements the core functionality for executing remote commands through
the dashboard and providing real-time visual feedback to users.
Key Features:
- Remote service control (start/stop/restart) via existing keyboard shortcuts
- System rebuild command with maintenance mode integration
- Real-time visual feedback with service status transitions
- ZMQ command protocol extension for service and system operations
Implementation Details:
- Extended AgentCommand enum with ServiceControl and SystemRebuild variants
- Added agent-side handlers for systemctl and nixos-rebuild execution
- Implemented command status tracking system for visual feedback
- Enhanced services widget to show progress states (⏳ restarting)
- Integrated command execution with existing keyboard navigation
Keyboard Controls:
- Services Panel: Space (start/stop), R (restart)
- System Panel: R (nixos-rebuild switch)
- Backup Panel: B (trigger backup)
Technical Architecture:
- Command flow: UI → Dashboard → ZMQ → Agent → systemctl/nixos-rebuild
- Status tracking: InProgress/Success/Failed states with visual indicators
- Maintenance mode: Automatic /tmp/cm-maintenance file management
- Service feedback: Icon transitions (● → ⏳ → ● with status text)