- Change display from version string to build format: 'hash dd/mm/yy H:M:S'
- Parse nixos-version output to extract short hash and format date
- Update system widget to display 'Build:' instead of 'Version:'
- Remove version/build_date fields in favor of single build string
- Follow TODO.md specification for NixOS section layout
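A minimal sketch of the parsing step, assuming `nixos-version` output shaped like `25.05.20250101.abc1234 (Warbler)`; the helper name is hypothetical, and the time component would have to come from another source, since the version string itself only carries a date:

```rust
/// Extract the short hash and a dd/mm/yy date from `nixos-version` output.
/// Hypothetical helper; assumes the "major.minor.yyyymmdd.hash" layout.
fn parse_nixos_version(raw: &str) -> Option<(String, String)> {
    let version = raw.split_whitespace().next()?; // drop the "(Codename)" part
    let parts: Vec<&str> = version.split('.').collect();
    let (date, hash) = match parts.as_slice() {
        [_major, _minor, date, hash] if date.len() == 8 => (*date, *hash),
        _ => return None,
    };
    let (y, m, d) = (&date[0..4], &date[4..6], &date[6..8]);
    Some((hash.to_string(), format!("{d}/{m}/{}", &y[2..4])))
}
```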
- Add memory_tmp_usage_percent, memory_tmp_used_gb, memory_tmp_total_gb metric parsing
- Fix tmpfs display showing as —% —GB/—GB in dashboard
- System widget now properly receives and displays tmpfs metrics from memory collector
- Remove /tmp autodetection from disk collector (57 lines removed)
- Add tmpfs monitoring to memory collector with get_tmpfs_metrics() method
- Generate memory_tmp_* metrics for proper RAM-based tmpfs monitoring
- Fix type annotations in tmpfs parsing for compilation
- System widget now correctly displays tmpfs usage in RAM section
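A sketch of what `get_tmpfs_metrics()` might look like, shelling out to `df` so the example stays std-only; the real collector's approach (e.g. a statvfs binding) may differ:

```rust
use std::process::Command;

/// Sketch of a get_tmpfs_metrics() helper, assuming /tmp is a tmpfs mount.
/// Returns the memory_tmp_* triple described above.
fn get_tmpfs_metrics() -> Option<(f64, f64, f64)> {
    let out = Command::new("df").args(["-B1", "/tmp"]).output().ok()?;
    let text = String::from_utf8_lossy(&out.stdout);
    // Second line: "tmpfs  <total>  <used>  <avail>  <use%>  /tmp"
    let fields: Vec<&str> = text.lines().nth(1)?.split_whitespace().collect();
    let total: f64 = fields.get(1)?.parse().ok()?;
    let used: f64 = fields.get(2)?.parse().ok()?;
    let gb = 1024.0 * 1024.0 * 1024.0;
    // (memory_tmp_usage_percent, memory_tmp_used_gb, memory_tmp_total_gb)
    Some((used / total * 100.0, used / gb, total / gb))
}
```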
- Create NixOS collector for version and active users detection
- Add SystemWidget combining all system information in TODO.md layout
- Replace separate CPU/Memory widgets with unified system display
- Add tree structure for storage with drive temperature/wear info
- Support NixOS version, active users, load averages, memory usage
- Follow exact decimal formatting from specification
- Use config.interval_seconds instead of hardcoded 300 seconds
- Discovery now happens every 10 seconds (configurable) instead of 5 minutes
- Follows configuration-driven architecture requirements
- Restructure get_monitored_services to avoid nested write locks
- Split discover_services into discover_services_internal that returns data
- Update state in separate scope to prevent deadlock
- Fix borrow checker errors with clone() for status cache
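A condensed sketch of the locking restructure, shown with `std::sync::RwLock` (the agent may use an async lock); the names follow the commit, the bodies are stubs:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct ServiceState {
    services: RwLock<Vec<String>>,
    status_cache: RwLock<HashMap<String, String>>,
}

impl ServiceState {
    /// Pure discovery: returns data without touching any locks, so the
    /// caller controls exactly when and how long locks are held.
    fn discover_services_internal(&self) -> Vec<String> {
        // ... run systemctl and parse output (stubbed here) ...
        vec!["nginx.service".into(), "sshd.service".into()]
    }

    fn get_monitored_services(&self) -> Vec<String> {
        let discovered = self.discover_services_internal();
        // Write lock taken in its own scope and dropped immediately,
        // so no nested write locks can deadlock.
        {
            let mut services = self.services.write().unwrap();
            *services = discovered.clone();
        }
        // Clone the cache out instead of holding the read guard across
        // further locking (the borrow-checker fix mentioned above).
        let _cache = self.status_cache.read().unwrap().clone();
        discovered
    }
}
```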
Add detailed debug logging to track:
- Service discovery start
- Individual service parsing
- Final service count and list
- Empty results indication
This will help identify why cmbox disappeared from the dashboard.
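A sketch of the logging shape, assuming the `tracing` crate; the field names and parsing stub are illustrative:

```rust
use tracing::debug;

fn discover_services(raw: &str) -> Vec<String> {
    debug!("service discovery start");
    let services: Vec<String> = raw
        .lines()
        .filter_map(|line| {
            let unit = line.split_whitespace().next()?;
            let keep = unit.ends_with(".service");
            debug!(unit, keep, "parsed line");
            keep.then(|| unit.to_string())
        })
        .collect();
    if services.is_empty() {
        debug!("service discovery returned no services");
    }
    debug!(count = services.len(), ?services, "service discovery done");
    services
}
```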
Major performance optimization:
- Parse and cache service status during discovery from systemctl list-units
- Eliminate per-service systemctl is-active and show calls
- Reduce systemctl calls from 1+2N to just 1 call total
- For 10 services: 21 calls → 1 call (95% reduction)
- Add fallback to systemctl for cache misses
This completes the major systemctl call reduction goal from TODO.md.
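A sketch of the cache build and the fallback path, assuming the `UNIT LOAD ACTIVE SUB DESCRIPTION` column layout of `systemctl list-units`:

```rust
use std::collections::HashMap;

/// Build the status cache from one `systemctl list-units --type=service
/// --all` pass, replacing per-service `is-active`/`show` calls
/// (1 + 2N calls become 1). Real output also carries a leading "●"
/// marker on some rows, omitted here for brevity.
fn build_status_cache(list_units_output: &str) -> HashMap<String, String> {
    let mut cache = HashMap::new();
    for line in list_units_output.lines() {
        let cols: Vec<&str> = line.split_whitespace().collect();
        if let [unit, _load, active, ..] = cols.as_slice() {
            if unit.ends_with(".service") {
                cache.insert(unit.to_string(), active.to_string());
            }
        }
    }
    cache
}

/// Cache lookup with the per-service systemctl fallback for misses.
fn service_status(cache: &HashMap<String, String>, unit: &str) -> String {
    cache.get(unit).cloned().unwrap_or_else(|| {
        std::process::Command::new("systemctl")
            .args(["is-active", unit])
            .output()
            .map(|o| String::from_utf8_lossy(&o.stdout).trim().to_string())
            .unwrap_or_else(|_| "unknown".into())
    })
}
```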
Implement glob pattern matching for service filters:
- nginx* matches nginx, nginx-config-reload, etc.
- *backup matches any service ending with 'backup'
- docker*prune matches docker-weekly-prune, etc.
- Exact matches still work as before (backward compatible)
Addresses TODO.md requirement for '*' filtering support.
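A minimal matcher covering the filter forms above (prefix*, *suffix, prefix*suffix, and exact, with a single '*'); the real code might use the `glob` crate instead:

```rust
/// Match a service name against a filter pattern containing at most
/// one '*' wildcard. No '*' means exact match (backward compatible).
fn glob_match(pattern: &str, name: &str) -> bool {
    match pattern.split_once('*') {
        None => pattern == name,
        Some((prefix, suffix)) => {
            // Length check prevents e.g. "a*a" matching plain "a".
            name.len() >= prefix.len() + suffix.len()
                && name.starts_with(prefix)
                && name.ends_with(suffix)
        }
    }
}
```

For example, `glob_match("nginx*", "nginx-config-reload")` and `glob_match("docker*prune", "docker-weekly-prune")` both return true.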
Reduce from 2 systemctl commands to 1 by using only:
systemctl list-units --type=service --all
This captures all services (active, inactive, failed) in one call,
eliminating the redundant list-unit-files command.
Achieves the TODO.md goal of reducing systemctl calls.
Remove all sudo -u systemctl commands and user service processing.
Now only collects system services via systemctl list-units/list-unit-files.
Eliminates user service discovery completely as planned in TODO.md.
Change service matching logic from contains-based to exact equality.
Services now match only if service_name == pattern exactly.
This is the first step in the systemd collector optimization plan.
- Fix duplicate storage pool issue by clearing cache on agent startup
- Change storage pool header text to normal color for better readability
- Improve services panel tree icons with proper └─ symbols for last items
- Ensure fresh metrics data on each agent restart
Eliminate duplicate storage entries by removing old disk_count dependency.
Dashboard now uses pure auto-discovery of disk_{pool}_usage_percent metrics.
Fixes duplicate storage instances (Storage 0, Storage 1, Storage root)
being shown; only the proper tree structure format appears now.
Add proper hierarchical tree display for storage pools and drives:
- Pool headers with status icons and type indication (single/multi-drive)
- Individual drive lines with ├─ tree symbols and health status
- Usage summary with └─ end symbol and capacity status
- T: and W: prefixes for temperature and wear level metrics
- Themed status icons using StatusIcons::get_icon() with proper colors
- 2-space indentation for clean tree structure appearance
Replace flat storage display with beautiful tree format:
● Storage steampool (multi-drive):
├─ ● sdb T:35°C W:12%
├─ ● sdc T:38°C W:8%
└─ ● 78.1% 1250.3GB/1600.0GB
Uses agent-calculated status from NixOS-configured thresholds.
Update CLAUDE.md with complete implementation specification.
Restructure storage display to handle new individual metrics architecture:
- Parse disk_{pool}_* metrics instead of indexed disk_{index}_* format
- Support individual drive metrics disk_{pool}_{drive}_health/temperature/wear
- Display tree structure: "Storage {pool} ({type}): drive details"
- Show pool usage summary with individual drive health/temp/wear status
- Auto-discover storage pools and drives from metric patterns
- Maintain proper status aggregation from individual metrics
The dashboard now correctly displays the new enhanced disk collector output
with storage pools containing multiple drives and their individual metrics.
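A sketch of the metric-name parsing, assuming pool and drive names contain no underscores; the enum and metric names are illustrative:

```rust
#[derive(Debug)]
enum DiskMetric {
    /// disk_{pool}_usage_percent and other pool-level fields
    Pool { pool: String, field: String },
    /// disk_{pool}_{drive}_health/temperature/wear
    Drive { pool: String, drive: String, field: String },
}

/// Auto-discovery from names like "disk_steampool_usage_percent"
/// or "disk_steampool_sdb_temperature".
fn parse_disk_metric(name: &str) -> Option<DiskMetric> {
    let rest = name.strip_prefix("disk_")?;
    let parts: Vec<&str> = rest.splitn(3, '_').collect();
    match parts.as_slice() {
        // Drive-level fields carry a known suffix in the third slot.
        [pool, drive, field]
            if matches!(*field, "health" | "temperature" | "wear") =>
        {
            Some(DiskMetric::Drive {
                pool: pool.to_string(),
                drive: drive.to_string(),
                field: field.to_string(),
            })
        }
        // Everything else is pool-level, e.g. "usage_percent".
        [pool, a, b] => Some(DiskMetric::Pool {
            pool: pool.to_string(),
            field: format!("{a}_{b}"),
        }),
        [pool, field] => Some(DiskMetric::Pool {
            pool: pool.to_string(),
            field: field.to_string(),
        }),
        _ => None,
    }
}
```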
- Add StoragePool and DriveInfo structures for grouping drives by mount point
- Implement SMART data collection for individual drives (health, temperature, wear)
- Support for ext4, zfs, xfs, mergerfs, btrfs filesystem types
- Generate individual drive metrics: disk_{pool}_{drive}_health/temperature/wear
- Add storage_type and underlying_devices to filesystem configuration
- Move hardcoded service directory mappings to NixOS configuration
- Move hardcoded host-to-user mapping to NixOS configuration
- Remove all unused code and fix compilation warnings
- Clean implementation with zero warnings and no dead code
Individual drives now show health status per storage pool:
Storage root (ext4): nvme0n1 PASSED 42°C 5% wear
Storage steampool (mergerfs): sda/sdb/sdc with individual health data
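A sketch of the grouping structures; field names are illustrative guesses at the shapes described above, not the actual definitions:

```rust
#[derive(Debug)]
struct DriveInfo {
    device: String,             // e.g. "sdb"
    health: String,             // SMART overall result, e.g. "PASSED"
    temperature_c: Option<f64>, // from SMART data
    wear_percent: Option<f64>,  // SSD wear level, if reported
}

#[derive(Debug)]
struct StoragePool {
    name: String,          // derived from the mount point, e.g. "steampool"
    storage_type: String,  // ext4, zfs, xfs, mergerfs, btrfs
    mount_point: String,
    drives: Vec<DriveInfo>, // underlying_devices from NixOS config
    used_gb: f64,
    total_gb: f64,
}

impl StoragePool {
    /// Emit the per-drive metric names described above,
    /// e.g. "disk_steampool_sdb_temperature".
    fn drive_metric_names(&self) -> Vec<String> {
        self.drives
            .iter()
            .flat_map(|d| {
                ["health", "temperature", "wear"]
                    .map(|f| format!("disk_{}_{}_{}", self.name, d.device, f))
            })
            .collect()
    }
}
```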
- Dashboard now automatically looks for /etc/cm-dashboard/dashboard.toml
- No need to specify --config flag when using standard NixOS deployment
- Fall back to a manually specified --config path when the default is absent
- Update help text to reflect optional config parameter
- Simplifies dashboard usage - just run 'cm-dashboard' without arguments
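The lookup order, sketched (CLI parsing omitted; the path is the one named above):

```rust
use std::path::PathBuf;

/// Resolve the config path: an explicit --config value wins, otherwise
/// fall back to the standard NixOS deployment location.
fn resolve_config_path(cli_config: Option<PathBuf>) -> Option<PathBuf> {
    cli_config.or_else(|| {
        let default = PathBuf::from("/etc/cm-dashboard/dashboard.toml");
        default.exists().then_some(default)
    })
}
```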
- Remove all unused configuration options from dashboard config module
- Eliminate hardcoded defaults - dashboard now requires config file like agent
- Keep only actually used config: zmq.subscriber_ports and hosts.predefined_hosts
- Remove unused get_host_metrics function from metric store
- Clean up missing module imports (hosts, utils)
- Make dashboard fail fast if no configuration provided
- Align dashboard config approach with agent configuration pattern
- Remove all Default implementations from agent configuration structs
- Make configuration file required for agent startup
- Update NixOS module to generate complete agent.toml configuration
- Add comprehensive configuration options to NixOS module, including:
  - Service include/exclude patterns for systemd collector
  - All thresholds and intervals
  - ZMQ communication settings
  - Notification and cache configuration
- Agent now fails fast if no configuration provided
- Eliminates configuration drift between defaults and NixOS settings
- Use systemctl --user commands to discover user-level services
- Include both user unit files and loaded user units
- Gracefully handle cases where user commands fail (no user session)
- Treat user services same as system services in filtering
- Enables monitoring of user-level Docker, development servers, etc.
- Add ark-permissions to exclusion list (maintenance service)
- Add sunshine to service_name_filters (game streaming server)
- Improves service discovery for game streaming infrastructure
- Use systemctl list-unit-files and list-units --all to find inactive services
- Parse both outputs to ensure all services are discovered
- Remove special SSH detection logic since sshd is in service filters
- Rename interesting_services to service_name_filters for clarity
- Now detects services in any state: active, inactive, failed, dead, etc.
- Remove bloated last_meaningful_status tracking
- Treat any Unknown→Ok transition as recovery
- Reduce JSON persistence to only metric_statuses and metric_details
- Eliminate unnecessary status history complexity
Add comprehensive hysteresis support to prevent status oscillation near
threshold boundaries while maintaining responsive alerting (sketched below).
Key Features:
- HysteresisThresholds with configurable upper/lower limits
- StatusTracker for per-metric status history
- Default gaps: CPU load 10%, memory 5%, disk temp 5°C
Updated Components:
- CPU load collector (5-minute average with hysteresis)
- Memory usage collector (percentage-based thresholds)
- Disk temperature collector (SMART data monitoring)
- All collectors updated to support StatusTracker interface
Cache Interval Adjustments:
- Service status: 60s → 10s (faster response)
- Disk usage: 300s → 60s (more frequent checks)
- Backup status: 900s → 60s (quicker updates)
- SMART data: moved to 600s tier (10 minutes)
Architecture:
- Individual metric status calculation in collectors
- Centralized StatusTracker in MetricCollectionManager
- Status aggregation preserved in dashboard widgets
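A minimal sketch of the hysteresis rule under these names (the real `HysteresisThresholds`/`StatusTracker` carry more states and per-metric history):

```rust
#[derive(Clone, Copy, PartialEq)]
enum Status { Ok, Warning }

struct HysteresisThresholds {
    upper: f64, // degrade above this
    lower: f64, // recover below this (upper minus the configured gap)
}

impl HysteresisThresholds {
    /// Status only degrades after crossing the upper limit and only
    /// recovers after dropping below the lower limit, so values
    /// hovering near one boundary cannot oscillate.
    fn next_status(&self, previous: Status, value: f64) -> Status {
        match previous {
            Status::Ok if value > self.upper => Status::Warning,
            Status::Warning if value < self.lower => Status::Ok,
            _ => previous, // inside the band: hold the previous status
        }
    }
}
```

With a memory warning threshold of 90% and the default 5% gap, a value bouncing between 86% and 89% holds its current status instead of oscillating.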
- Remove proxy_pass backend checking
- All sites now checked using https://server_name format
- Maintains 10-second timeout for external site checks
- Simplifies monitoring to consistent external health checks
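A sketch of the per-site probe, assuming a blocking `reqwest` client; the URL shape and timeout follow the commit:

```rust
use std::time::Duration;

/// Check one nginx site externally via https://server_name.
/// Assumes the reqwest crate with its `blocking` feature enabled.
fn check_site(server_name: &str) -> bool {
    let url = format!("https://{server_name}");
    reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(10)) // 10-second timeout from above
        .build()
        .and_then(|client| client.get(&url).send())
        .map(|resp| resp.status().is_success())
        .unwrap_or(false)
}
```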
- Add support for both proxied and static nginx sites
- Proxied sites show 'P' prefix and check backend URLs
- Static sites check external HTTPS URLs
- Fix services panel column alignment for main services
- Keep 10-second timeout for all site checks
- Add full email notifications with lettre and Stockholm timezone
- Add status persistence to prevent notification spam on restart
- Change nginx monitoring to check backend proxy_pass URLs instead of frontend domains
- Increase nginx site timeout to 10 seconds for backend health checks
- Fix cache intervals: disk (5min), backup (10min), systemd (30s), cpu/memory (5s)
- Remove rate limiting for immediate notifications on all status changes
- Store metric status in /var/lib/cm-dashboard/last-status.json
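A sketch of the notification send, assuming the `lettre` and `chrono-tz` crates named above; addresses and relay host are placeholders:

```rust
use chrono::Utc;
use chrono_tz::Europe::Stockholm;
use lettre::{Message, SmtpTransport, Transport};

/// Send one status-change email with a Stockholm-local timestamp.
fn send_status_email(subject: &str, body: &str)
    -> Result<(), Box<dyn std::error::Error>>
{
    let ts = Utc::now().with_timezone(&Stockholm).format("%Y-%m-%d %H:%M:%S");
    let email = Message::builder()
        .from("cm-agent@example.com".parse()?) // placeholder addresses
        .to("admin@example.com".parse()?)
        .subject(format!("[{ts}] {subject}"))
        .body(body.to_string())?;
    SmtpTransport::relay("smtp.example.com")?.build().send(&email)?;
    Ok(())
}
```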
- Fix host_index calculation for localhost to use actual position in sorted list
- Remove incorrect assumption that localhost is always at index 0
- Host navigation (Tab key) now works correctly with all hosts in alphabetical order
Fixes issue where only 3 of 5 hosts were accessible via Tab navigation.
- Remove predefined host order that was causing random display order
- Sort hosts alphabetically for consistent title display
- Localhost is still auto-selected at startup but doesn't affect display order
- Title will now show: cmbox ● labbox ● simonbox ● srv01 ● srv02 ● steambox
Eliminates confusing random host order in dashboard title bar.
- Add has_data() method to BackupWidget to check if backup metrics exist
- Modify dashboard layout to conditionally show backup panel only when data exists
- When no backup data: system panel takes full left side height
- When backup data exists: system and backup panels share left side equally
Prevents empty backup panel from taking up screen space unnecessarily.
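The conditional split, sketched against a ratatui-style layout API:

```rust
use ratatui::layout::{Constraint, Direction, Layout, Rect};

/// Split the left column: with no backup data the system panel keeps
/// the whole height; otherwise the two panels share it equally.
fn left_panels(area: Rect, backup_has_data: bool) -> (Rect, Option<Rect>) {
    if !backup_has_data {
        return (area, None); // system panel takes the full height
    }
    let chunks = Layout::default()
        .direction(Direction::Vertical)
        .constraints([Constraint::Percentage(50), Constraint::Percentage(50)])
        .split(area);
    (chunks[0], Some(chunks[1]))
}
```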