Implement agent version tracking to diagnose deployment issues:
- Add get_agent_hash() method to extract Nix store hash from executable path
- Collect system_agent_hash metric in NixOS collector
- Display "Agent Hash" in system panel under NixOS section
- Update metric filtering to include agent hash
This helps identify which version of the agent is actually running
when troubleshooting deployment or metric collection issues.
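A minimal sketch of how get_agent_hash() could extract the hash, assuming the agent resolves its own executable path and that path lives under /nix/store/<hash>-<name>/ (the method name comes from this change; the parsing details are illustrative):

    use std::path::PathBuf;

    /// Extract the Nix store hash from the running executable's path,
    /// e.g. /nix/store/abc123...-cm-agent/bin/cm-agent -> "abc123...".
    fn get_agent_hash() -> Option<String> {
        let exe: PathBuf = std::env::current_exe().ok()?;
        let path = exe.to_string_lossy();
        // Only binaries deployed from the Nix store carry a content hash.
        let rest = path.strip_prefix("/nix/store/")?;
        // The store entry is "<hash>-<name>"; the hash precedes the first '-'.
        let entry = rest.split('/').next()?;
        let hash = entry.split('-').next()?;
        (!hash.is_empty()).then(|| hash.to_string())
    }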
- Strip codename part (e.g., '(Warbler)') from nixos-version output
- Display clean version format: '25.05.20251004.3bcc93c'
- Simplify parsing to use raw nixos-version output as requested
- Change the displayed value from version string to build format: 'hash dd/mm/yy H:M:S'
- Parse nixos-version output to extract short hash and format date
- Update system widget to display 'Build:' instead of 'Version:'
- Remove version/build_date fields in favor of a single build string
- Follow TODO.md specification for NixOS section layout
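Roughly, the parsing could look like the sketch below, assuming nixos-version prints something like '25.05.20251004.3bcc93c (Warbler)'; the source of the H:M:S portion is not described in these notes, so only the hash and date are shown, and the function name is illustrative:

    /// Turn raw `nixos-version` output such as "25.05.20251004.3bcc93c (Warbler)"
    /// into "3bcc93c 04/10/25" (short hash plus dd/mm/yy date).
    fn format_build(raw: &str) -> Option<String> {
        // Drop the codename, e.g. " (Warbler)".
        let version = raw.split_whitespace().next()?;
        let mut parts = version.split('.');
        let _release_major = parts.next()?; // "25"
        let _release_minor = parts.next()?; // "05"
        let date = parts.next()?;           // "20251004"
        let hash = parts.next()?;           // "3bcc93c"
        if date.len() != 8 {
            return None;
        }
        let (year, rest) = date.split_at(4);
        let (month, day) = rest.split_at(2);
        Some(format!("{} {}/{}/{}", hash, day, month, &year[2..]))
    }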
- Remove /tmp autodetection from disk collector (57 lines removed)
- Add tmpfs monitoring to memory collector with get_tmpfs_metrics() method
- Generate memory_tmp_* metrics for proper RAM-based tmpfs monitoring
- Fix type annotations in tmpfs parsing so the collector compiles
- System widget now correctly displays tmpfs usage in RAM section
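The tmpfs side could be sketched roughly as follows, assuming the nix crate's statvfs wrapper; the memory_tmp_* prefix comes from this change, while the exact metric suffixes and return type are illustrative:

    use nix::sys::statvfs::statvfs;

    /// Report /tmp usage as memory_tmp_* metrics, since a tmpfs lives in RAM
    /// rather than on disk.
    fn get_tmpfs_metrics() -> Option<Vec<(String, f64)>> {
        let stat = statvfs("/tmp").ok()?;
        let block = stat.fragment_size() as u64;
        let total = stat.blocks() as u64 * block;
        let avail = stat.blocks_available() as u64 * block;
        let used = total.saturating_sub(avail);
        if total == 0 {
            return None;
        }
        Some(vec![
            ("memory_tmp_total_bytes".into(), total as f64),
            ("memory_tmp_used_bytes".into(), used as f64),
            ("memory_tmp_usage_percent".into(), 100.0 * used as f64 / total as f64),
        ])
    }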
- Create NixOS collector for version and active users detection
- Add SystemWidget combining all system information in TODO.md layout
- Replace separate CPU/Memory widgets with unified system display
- Add tree structure for storage with drive temperature/wear info
- Support NixOS version, active users, load averages, memory usage
- Follow exact decimal formatting from specification
- Use config.interval_seconds instead of hardcoded 300 seconds
- Discovery now happens every 10 seconds (configurable) instead of 5 minutes
- Follows configuration-driven architecture requirements
- Restructure get_monitored_services to avoid nested write locks
- Split discover_services into discover_services_internal that returns data
- Update state in separate scope to prevent deadlock
- Fix borrow checker errors with clone() for status cache
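In outline, the fix is the pattern sketched below (names follow this change, the surrounding types are simplified): discovery returns plain data, and the write lock is confined to its own scope so nothing else can be locked while it is held.

    use std::collections::HashMap;
    use std::sync::RwLock;

    struct SystemdCollector {
        services: RwLock<HashMap<String, String>>, // name -> status
    }

    impl SystemdCollector {
        /// Pure discovery: takes no locks, just returns what was found.
        fn discover_services_internal(&self) -> HashMap<String, String> {
            // ... run systemctl, parse its output, build the map ...
            HashMap::new()
        }

        fn get_monitored_services(&self) -> Vec<String> {
            let discovered = self.discover_services_internal();
            // Update state in a separate scope so the write lock is released
            // before anything below takes another lock.
            {
                let mut services = self.services.write().unwrap();
                *services = discovered;
            }
            self.services.read().unwrap().keys().cloned().collect()
        }
    }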
Add detailed debug logging to track:
- Service discovery start
- Individual service parsing
- Final service count and list
- Empty results indication
This will help identify why cmbox disappeared from the dashboard.
Major performance optimization:
- Parse and cache service status during discovery from systemctl list-units
- Eliminate per-service systemctl is-active and show calls
- Reduce systemctl calls from 1+2N to just 1 call total
- For 10 services: 21 calls → 1 call (95% reduction)
- Add fallback to systemctl for cache misses
This completes the major systemctl call reduction goal from TODO.md.
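The cached discovery could look roughly like this (a sketch, not the actual collector code; the flags and column layout are standard systemctl behaviour). A service missing from this map would then fall back to a direct systemctl call as noted above.

    use std::collections::HashMap;
    use std::process::Command;

    /// Build a name -> active-state cache from a single `systemctl list-units`
    /// call, instead of calling `is-active`/`show` once per service.
    fn cache_service_status() -> std::io::Result<HashMap<String, String>> {
        let output = Command::new("systemctl")
            .args(["list-units", "--type=service", "--all", "--no-legend", "--plain", "--no-pager"])
            .output()?;
        let mut cache = HashMap::new();
        for line in String::from_utf8_lossy(&output.stdout).lines() {
            // Columns: UNIT LOAD ACTIVE SUB DESCRIPTION
            let mut cols = line.split_whitespace();
            if let (Some(unit), Some(_load), Some(active)) = (cols.next(), cols.next(), cols.next()) {
                if let Some(name) = unit.strip_suffix(".service") {
                    cache.insert(name.to_string(), active.to_string());
                }
            }
        }
        Ok(cache)
    }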
Implement glob pattern matching for service filters:
- nginx* matches nginx, nginx-config-reload, etc.
- *backup matches any service ending with 'backup'
- docker*prune matches docker-weekly-prune, etc.
- Exact matches still work as before (backward compatible)
Addresses TODO.md requirement for '*' filtering support.
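The matching needs no regex dependency; a self-contained sketch of the idea (not necessarily the exact implementation shipped) could be:

    /// Minimal '*' glob matcher: supports "nginx*", "*backup" and "docker*prune"
    /// style patterns; a pattern without '*' behaves as plain equality.
    fn matches_pattern(name: &str, pattern: &str) -> bool {
        if !pattern.contains('*') {
            return name == pattern;
        }
        let parts: Vec<&str> = pattern.split('*').collect();
        let mut rest = name;
        for (i, part) in parts.iter().enumerate() {
            if part.is_empty() {
                continue;
            }
            match rest.find(part) {
                // The first literal part must anchor at the start of the name.
                Some(pos) if i > 0 || pos == 0 => rest = &rest[pos + part.len()..],
                _ => return false,
            }
        }
        // Unless the pattern ends with '*', the last part must anchor at the end.
        pattern.ends_with('*') || parts.last().map_or(true, |last| name.ends_with(last))
    }

For example, matches_pattern("docker-weekly-prune", "docker*prune") is true, while matches_pattern("docker-prune-old", "docker*prune") is not.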
Reduce from 2 systemctl commands to 1 by using only:
systemctl list-units --type=service --all
This captures all services (active, inactive, failed) in one call,
eliminating the redundant list-unit-files command.
Achieves the TODO.md goal of reducing systemctl calls.
Remove all sudo -u systemctl commands and user service processing.
Now only collects system services via systemctl list-units/list-unit-files.
Eliminates user service discovery completely as planned in TODO.md.
Change service matching logic from contains-based to exact equality.
Services now match only if service_name == pattern exactly.
This is the first step in the systemd collector optimization plan.
- Fix duplicate storage pool issue by clearing cache on agent startup
- Change storage pool header text to normal color for better readability
- Improve services panel tree icons with proper └─ symbols for last items
- Ensure fresh metrics data on each agent restart
- Add StoragePool and DriveInfo structures for grouping drives by mount point
- Implement SMART data collection for individual drives (health, temperature, wear)
- Support for ext4, zfs, xfs, mergerfs, btrfs filesystem types
- Generate individual drive metrics: disk_[pool]_[drive]_health/temperature/wear
- Add storage_type and underlying_devices to filesystem configuration
- Move hardcoded service directory mappings to NixOS configuration
- Move hardcoded host-to-user mapping to NixOS configuration
- Remove all unused code and fix compilation warnings
- Clean implementation with zero warnings and no dead code
Individual drives now show health status per storage pool:
Storage root (ext4): nvme0n1 PASSED 42°C 5% wear
Storage steampool (mergerfs): sda/sdb/sdc with individual health data
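The grouping could be modelled roughly like the sketch below; the struct names and the disk_[pool]_[drive]_* metric scheme follow this change, while the field names and types are illustrative:

    /// One physical drive backing a pool, with its SMART summary.
    struct DriveInfo {
        device: String,              // e.g. "sda", "nvme0n1"
        health: String,              // SMART overall result, e.g. "PASSED"
        temperature_c: Option<f64>,
        wear_percent: Option<f64>,
    }

    /// Drives grouped by the mount point they back.
    struct StoragePool {
        name: String,                // e.g. "root", "steampool"
        storage_type: String,        // ext4, zfs, xfs, mergerfs, btrfs
        drives: Vec<DriveInfo>,
    }

    /// Per-drive metric names of the form disk_[pool]_[drive]_health etc.
    fn drive_metric_names(pool: &StoragePool, drive: &DriveInfo) -> Vec<String> {
        ["health", "temperature", "wear"]
            .iter()
            .map(|suffix| format!("disk_{}_{}_{}", pool.name, drive.device, suffix))
            .collect()
    }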
- Remove all unused configuration options from dashboard config module
- Eliminate hardcoded defaults; the dashboard now requires a config file, like the agent
- Keep only actually used config: zmq.subscriber_ports and hosts.predefined_hosts
- Remove unused get_host_metrics function from metric store
- Clean up missing module imports (hosts, utils)
- Make the dashboard fail fast if no configuration is provided
- Align dashboard config approach with agent configuration pattern
- Remove all Default implementations from agent configuration structs
- Make configuration file required for agent startup
- Update NixOS module to generate complete agent.toml configuration
- Add comprehensive configuration options to NixOS module including:
  - Service include/exclude patterns for systemd collector
  - All thresholds and intervals
  - ZMQ communication settings
  - Notification and cache configuration
- Agent now fails fast if no configuration is provided
- Eliminates configuration drift between defaults and NixOS settings
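Fail-fast loading without Default impls could look roughly like this sketch, assuming serde and the toml crate; the struct fields below are placeholders, not the real agent.toml schema:

    use serde::Deserialize;
    use std::process::exit;

    /// No Default impls anywhere: every value must come from the config file.
    #[derive(Deserialize)]
    struct AgentConfig {
        zmq: ZmqConfig,
        // ... collectors, thresholds, intervals, notifications ...
    }

    #[derive(Deserialize)]
    struct ZmqConfig {
        publisher_port: u16, // placeholder field
    }

    fn load_config(path: &str) -> AgentConfig {
        let raw = std::fs::read_to_string(path).unwrap_or_else(|e| {
            eprintln!("cannot read {path}: {e}");
            exit(1);
        });
        toml::from_str(&raw).unwrap_or_else(|e| {
            eprintln!("invalid configuration in {path}: {e}");
            exit(1);
        })
    }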
- Use systemctl --user commands to discover user-level services
- Include both user unit files and loaded user units
- Gracefully handle cases where user commands fail (no user session)
- Treat user services same as system services in filtering
- Enables monitoring of user-level Docker, development servers, etc.
- Add ark-permissions to exclusion list (maintenance service)
- Add sunshine to service_name_filters (game streaming server)
- Improves service discovery for game streaming infrastructure
- Use systemctl list-unit-files and list-units --all to find inactive services
- Parse both outputs to ensure all services are discovered
- Remove special SSH detection logic since sshd is in service filters
- Rename interesting_services to service_name_filters for clarity
- Now detects services in any state: active, inactive, failed, dead, etc.
- Remove bloated last_meaningful_status tracking
- Treat any Unknown→Ok transition as recovery
- Reduce JSON persistence to only metric_statuses and metric_details
- Eliminate unnecessary status history complexity
Add comprehensive hysteresis support to prevent status oscillation near
threshold boundaries while maintaining responsive alerting.
Key Features:
- HysteresisThresholds with configurable upper/lower limits
- StatusTracker for per-metric status history
- Default gaps: CPU load 10%, memory 5%, disk temp 5°C
Updated Components:
- CPU load collector (5-minute average with hysteresis)
- Memory usage collector (percentage-based thresholds)
- Disk temperature collector (SMART data monitoring)
- All collectors updated to support StatusTracker interface
Cache Interval Adjustments:
- Service status: 60s → 10s (faster response)
- Disk usage: 300s → 60s (more frequent checks)
- Backup status: 900s → 60s (quicker updates)
- SMART data: moved to 600s tier (10 minutes)
Architecture:
- Individual metric status calculation in collectors
- Centralized StatusTracker in MetricCollectionManager
- Status aggregation preserved in dashboard widgets
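The core of the hysteresis behaviour can be sketched as below (a simplified model, not the actual HysteresisThresholds/StatusTracker code): a status only escalates when the upper limit is reached and only de-escalates once the value drops back below the lower limit. With a CPU load warning threshold of 9.0 and the 10% gap, for instance, the status would not return to Ok until the load falls below roughly 8.1.

    #[derive(Clone, Copy)]
    enum Status { Ok, Warning, Critical }

    /// Upper limits trigger escalation; the matching lower limits must be
    /// crossed again before the status may drop back, damping oscillation
    /// around a single threshold.
    struct HysteresisThresholds {
        warning_upper: f64,
        warning_lower: f64,
        critical_upper: f64,
        critical_lower: f64,
    }

    impl HysteresisThresholds {
        fn evaluate(&self, value: f64, previous: Status) -> Status {
            match previous {
                Status::Critical if value > self.critical_lower => Status::Critical,
                _ if value >= self.critical_upper => Status::Critical,
                Status::Warning | Status::Critical if value > self.warning_lower => Status::Warning,
                _ if value >= self.warning_upper => Status::Warning,
                _ => Status::Ok,
            }
        }
    }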
- Remove proxy_pass backend checking
- All sites are now checked using the https://server_name format
- Maintains 10-second timeout for external site checks
- Simplifies monitoring to consistent external health checks
- Add support for both proxied and static nginx sites
- Proxied sites show 'P' prefix and check backend URLs
- Static sites check external HTTPS URLs
- Fix services panel column alignment for main services
- Keep 10-second timeout for all site checks
- Add full email notifications with lettre and Stockholm timezone
- Add status persistence to prevent notification spam on restart
- Change nginx monitoring to check backend proxy_pass URLs instead of frontend domains
- Increase nginx site timeout to 10 seconds for backend health checks
- Fix cache intervals: disk (5min), backup (10min), systemd (30s), cpu/memory (5s)
- Remove rate limiting for immediate notifications on all status changes
- Store metric status in /var/lib/cm-dashboard/last-status.json
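The persistence part could be sketched like this, using serde_json; the real file holds per-metric statuses and details, simplified here to a plain string map, and the helper names are illustrative:

    use std::collections::HashMap;
    use std::fs;

    const STATUS_FILE: &str = "/var/lib/cm-dashboard/last-status.json";

    /// Persist the last known status per metric so a restart does not re-send
    /// notifications for states that were already reported.
    fn save_statuses(statuses: &HashMap<String, String>) -> std::io::Result<()> {
        let json = serde_json::to_string_pretty(statuses)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e))?;
        fs::write(STATUS_FILE, json)
    }

    fn load_statuses() -> HashMap<String, String> {
        // A missing or unreadable file simply means "no previous state".
        fs::read_to_string(STATUS_FILE)
            .ok()
            .and_then(|raw| serde_json::from_str(&raw).ok())
            .unwrap_or_default()
    }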
Only the 5-minute load average should trigger warning/critical alerts.
1-minute and 15-minute load averages now always show Status::Ok.
Thresholds (Warning: 9.0, Critical: 10.0) apply only to cpu_load_5min metric.
- Replace complex multi-strategy detection with single deterministic method
- Remove estimate_service_disk_usage and all fallback strategies
- Use simple get_service_disk_usage method with clear logic:
  * Defined path exists → use only that path
  * Defined path fails → return None (shows as '-')
  * No defined path → use systemctl WorkingDirectory
  * No estimates or guessing ever
Fixes misleading 5MB estimates when defined paths fail due to permissions.
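In code, the deterministic lookup could look roughly like this sketch (the helper names and the du/systemctl invocations are illustrative, not the shipped implementation):

    use std::collections::HashMap;
    use std::process::Command;

    /// A configured path is authoritative; if it cannot be measured the result
    /// is None (shown as '-'). Only services without a configured path fall
    /// back to the systemd WorkingDirectory.
    fn get_service_disk_usage(
        service: &str,
        configured_paths: &HashMap<String, String>,
    ) -> Option<u64> {
        let path = match configured_paths.get(service) {
            Some(path) => path.clone(),
            None => working_directory(service)?,
        };
        du_bytes(&path)
    }

    fn du_bytes(path: &str) -> Option<u64> {
        // `du -sb` prints "<bytes>\t<path>"; restricted directories may still
        // fail here and therefore report None.
        let out = Command::new("du").args(["-sb", path]).output().ok()?;
        if !out.status.success() {
            return None;
        }
        String::from_utf8_lossy(&out.stdout)
            .split_whitespace()
            .next()?
            .parse()
            .ok()
    }

    fn working_directory(service: &str) -> Option<String> {
        let out = Command::new("systemctl")
            .args(["show", service, "--property=WorkingDirectory", "--value"])
            .output()
            .ok()?;
        let dir = String::from_utf8_lossy(&out.stdout).trim().to_string();
        (!dir.is_empty() && dir != "/").then_some(dir)
    }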
ARK service directories require elevated permissions to access. The NixOS
configuration already allows sudo du with NOPASSWD, so use sudo du instead
of direct du command to properly detect disk usage for restricted directories.
- Prioritize defined service directories over systemctl WorkingDirectory fallbacks
- Add ARK Survival Ascended server mappings to correct NixOS-configured paths
- Remove legacy get_service_disk_usage method to eliminate code duplication
- Ensure deterministic behavior with single-purpose detection logic
Fixes ARK service disk usage reporting on srv02 by using actual data paths
from NixOS configuration instead of systemctl working directory detection.
Map each ARK service to its specific data directory:
- ark-island -> /var/lib/ark-servers/island
- ark-scorched -> /var/lib/ark-servers/scorched
- ark-center -> /var/lib/ark-servers/center
- ark-aberration -> /var/lib/ark-servers/aberration
- ark-extinction -> /var/lib/ark-servers/extinction
- ark-ragnarok -> /var/lib/ark-servers/ragnarok
- ark-valguero -> /var/lib/ark-servers/valguero
Based on NixOS configuration in srv02/configuration.nix.
Replace df-based auto-discovery with UUID-based detection using NixOS
hardware configuration data. Each host now has predefined filesystem
configurations with predictable metric names.
- Add FilesystemConfig struct with UUID, mount point, and filesystem type
- Remove auto_discover and devices fields from DiskConfig
- Add host-specific UUID defaults for cmbox, srv01, srv02, simonbox, steambox
- Remove legacy get_mounted_disks() df-based detection method
- Update DiskCollector to use UUID resolution via /dev/disk/by-uuid/
- Generate predictable metric names: disk_root_*, disk_boot_*, etc.
- Maintain fallback for labbox/wslbox (no UUIDs configured yet)
Provides consistent metric names across reboots and reliable detection
aligned with NixOS deployments, without depending on mount order.
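A sketch of the configuration shape and UUID resolution, with field names following this change and types illustrative:

    use std::path::PathBuf;

    /// Predefined filesystem entry taken from the host's NixOS hardware config.
    struct FilesystemConfig {
        name: String,        // used in metric names: disk_root_*, disk_boot_*, ...
        uuid: String,
        mount_point: String,
        fs_type: String,
    }

    /// Resolve the stable /dev/disk/by-uuid symlink to the actual block device,
    /// independent of mount order.
    fn resolve_device(fs: &FilesystemConfig) -> Option<PathBuf> {
        std::fs::canonicalize(format!("/dev/disk/by-uuid/{}", fs.uuid)).ok()
    }

    /// Predictable metric names derived from the configured name.
    fn metric_name(fs: &FilesystemConfig, suffix: &str) -> String {
        format!("disk_{}_{}", fs.name, suffix) // e.g. disk_root_usage_percent
    }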
Removed unused widget subscription system, cache utilities, error variants,
theme functions, and struct fields. Replaced subscription-based widgets
with direct metric filtering. Build now completes with zero warnings.
- Add sudo to disk collector smartctl commands for proper SMART data access
- Add reqwest dependency with blocking feature for HTTP site checks
- Replace curl-based site latency with reqwest HTTP client implementation
- Maintain 2-second connect timeout and 5-second total timeout
- Fix disk health UNKNOWN status by enabling proper SMART permissions
- Fix nginx site timeout issues by using proper HTTP client with redirect support
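The site check could be sketched as follows with the blocking reqwest client; the timeouts match the values above, while the function shape and redirect limit are assumptions:

    use std::time::{Duration, Instant};

    /// Fetch one site and return the HTTP status plus response latency.
    fn check_site(url: &str) -> Result<(u16, Duration), reqwest::Error> {
        let client = reqwest::blocking::Client::builder()
            .connect_timeout(Duration::from_secs(2))
            .timeout(Duration::from_secs(5))
            // Follow redirects so sites behind rewrites still report healthy.
            .redirect(reqwest::redirect::Policy::limited(5))
            .build()?;
        let start = Instant::now();
        let response = client.get(url).send()?;
        Ok((response.status().as_u16(), start.elapsed()))
    }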
- Add BackupCollector for reading TOML status files with disk space metrics
- Implement BackupWidget with disk usage display and service status details
- Fix backup script disk space parsing by adding missing capture_output=True
- Update backup widget to show actual disk usage instead of repository size
- Fix timestamp parsing to use backup completion time instead of start time
- Resolve timezone issues by using UTC timestamps in backup script
- Add disk identification metrics (product name, serial number) to backup status
- Enhance UI layout with proper backup monitoring integration
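A sketch of how the BackupCollector might deserialize one status file, assuming serde plus the toml crate; the field names below are illustrative stand-ins for the values mentioned above (completion time, disk usage, product name, serial number):

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct BackupStatus {
        completed_at: String,        // UTC completion timestamp from the backup script
        disk_used_bytes: u64,
        disk_total_bytes: u64,
        disk_product: Option<String>,
        disk_serial: Option<String>,
    }

    fn read_backup_status(path: &str) -> Option<BackupStatus> {
        let raw = std::fs::read_to_string(path).ok()?;
        toml::from_str(&raw).ok()
    }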