160 Commits

Author SHA1 Message Date
c5ec529210 Add agent hash display to system panel
Implement agent version tracking to diagnose deployment issues:
- Add get_agent_hash() method to extract Nix store hash from executable path
- Collect system_agent_hash metric in NixOS collector
- Display "Agent Hash" in system panel under NixOS section
- Update metric filtering to include agent hash

This helps identify which version of the agent is actually running
when troubleshooting deployment or metric collection issues.
2025-10-23 17:33:45 +02:00
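
A minimal sketch of how a store hash could be pulled out of an executable path, assuming the usual Nix layout `/nix/store/<hash>-<name>/bin/<binary>`. The function name `agent_hash_from_path` is hypothetical; the commit only names `get_agent_hash()` and does not show its body.

```rust
use std::path::Path;

/// Extract the Nix store hash from an executable path.
/// Assumes the layout `/nix/store/<hash>-<name>/...`; returns `None`
/// for any path that is not under `/nix/store`.
fn agent_hash_from_path(exe: &Path) -> Option<String> {
    let rest = exe.strip_prefix("/nix/store").ok()?;
    // The first component after the store root is "<hash>-<name>".
    let entry = rest.components().next()?.as_os_str().to_str()?;
    let (hash, _name) = entry.split_once('-')?;
    if hash.is_empty() {
        None
    } else {
        Some(hash.to_string())
    }
}
```

In a running agent the path would presumably come from `std::env::current_exe()`, which may need `std::fs::canonicalize` first if the binary is reached through a symlink.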
3b1bda741b Remove codename from NixOS build display
- Strip codename part (e.g., '(Warbler)') from nixos-version output
- Display clean version format: '25.05.20251004.3bcc93c'
- Simplify parsing to use raw nixos-version output as requested
2025-10-23 14:55:18 +02:00
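
A small sketch of the codename stripping described above, assuming `nixos-version` prints the version followed by a space and a parenthesised codename, as in the '(Warbler)' example. The function name `strip_codename` is hypothetical.

```rust
/// Drop a trailing " (Codename)" part from `nixos-version` output.
/// Input without a codename is returned trimmed but otherwise unchanged.
fn strip_codename(raw: &str) -> &str {
    let trimmed = raw.trim();
    match trimmed.find(" (") {
        Some(idx) => trimmed[..idx].trim_end(),
        None => trimmed,
    }
}
```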
64af24dc40 Update NixOS display format to show build hash and timestamp
- Change from showing version to build format: 'hash dd/mm/yy H:M:S'
- Parse nixos-version output to extract short hash and format date
- Update system widget to display 'Build:' instead of 'Version:'
- Remove version/build_date fields in favor of single build string
- Follow TODO.md specification for NixOS section layout
2025-10-23 14:48:25 +02:00
9e80d6b654 Remove hardcoded /tmp autodetection and implement proper tmpfs monitoring
- Remove /tmp autodetection from disk collector (57 lines removed)
- Add tmpfs monitoring to memory collector with get_tmpfs_metrics() method
- Generate memory_tmp_* metrics for proper RAM-based tmpfs monitoring
- Fix type annotations in tmpfs parsing for compilation
- System widget now correctly displays tmpfs usage in RAM section
2025-10-23 14:26:15 +02:00
39fc9cd22f Implement unified system widget with NixOS info, CPU, RAM, and Storage
- Create NixOS collector for version and active users detection
- Add SystemWidget combining all system information in TODO.md layout
- Replace separate CPU/Memory widgets with unified system display
- Add tree structure for storage with drive temperature/wear info
- Support NixOS version, active users, load averages, memory usage
- Follow exact decimal formatting from specification
2025-10-23 14:01:14 +02:00
c99e0bd8ee Remove hardcoded discovery interval in systemd collector
- Use config.interval_seconds instead of hardcoded 300 seconds
- Discovery now happens every 10 seconds (configurable) instead of 5 minutes
- Follows configuration-driven architecture requirements
2025-10-23 13:20:48 +02:00
0f12438ab4 Fix RwLock deadlock in systemd collector Phase 4
- Restructure get_monitored_services to avoid nested write locks
- Split discover_services into discover_services_internal that returns data
- Update state in separate scope to prevent deadlock
- Fix borrow checker errors with clone() for status cache
2025-10-23 13:12:53 +02:00
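
The restructuring can be sketched as follows. This is an illustration of the pattern only, not the collector's actual code: the `State` and `Collector` types, their fields, and the stand-in discovery data are assumptions. Discovery returns plain data without touching the lock, and the write guard lives in its own scope so it is released before any further locking.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct State {
    services: Vec<String>,
    status_cache: HashMap<String, String>,
}

struct Collector {
    state: RwLock<State>,
}

impl Collector {
    fn new() -> Self {
        let state = State { services: Vec::new(), status_cache: HashMap::new() };
        Collector { state: RwLock::new(state) }
    }

    /// Returns discovered data and takes no lock, so the caller decides
    /// when (and for how long) to hold the write lock.
    fn discover_services_internal() -> (Vec<String>, HashMap<String, String>) {
        // Stand-in data; the real collector would query systemctl here.
        let services = vec!["nginx".to_string(), "sshd".to_string()];
        let cache = services.iter().map(|s| (s.clone(), "active".to_string())).collect();
        (services, cache)
    }

    fn get_monitored_services(&self) -> Vec<String> {
        let (services, cache) = Self::discover_services_internal();
        {
            // The write guard is confined to this block.
            let mut state = self.state.write().unwrap();
            state.services = services;
            state.status_cache = cache;
        } // guard dropped here, before the read lock below is taken
        self.state.read().unwrap().services.clone()
    }

    fn cached_status(&self, name: &str) -> Option<String> {
        self.state.read().unwrap().status_cache.get(name).cloned()
    }
}
```

Taking the read lock while the write guard from the same thread is still alive would block forever with `std::sync::RwLock`, which is the kind of nesting the scoped block avoids.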
7607e971b8 Add debug logging to diagnose Phase 4 service discovery issue
Add detailed debug logging to track:
- Service discovery start
- Individual service parsing
- Final service count and list
- Empty results indication

This will help identify why cmbox disappeared from dashboard.
2025-10-23 12:57:10 +02:00
da6f3c3855 Phase 4: Cache service status from discovery to eliminate per-service calls
Major performance optimization:
- Parse and cache service status during discovery from systemctl list-units
- Eliminate per-service systemctl is-active and show calls
- Reduce systemctl calls from 1+2N to just 1 call total
- For 10 services: 21 calls → 1 call (95% reduction)
- Add fallback to systemctl for cache misses

This completes the major systemctl call reduction goal from TODO.md.
2025-10-23 12:51:17 +02:00
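
The call count follows from one discovery call plus two per-service calls (is-active and show): 1 + 2N, so 1 + 2*10 = 21 for ten services, and 20 of those 21 calls (about 95%) disappear. A hedged sketch of the caching step is below. It assumes the text comes from `systemctl list-units --type=service --all --no-legend --plain`, whose columns are UNIT, LOAD, ACTIVE, SUB and DESCRIPTION; the `UnitStatus` type and `parse_list_units` name are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct UnitStatus {
    load: String,
    active: String,
    sub: String,
}

/// Build a name -> status cache from `systemctl list-units` text output.
/// Lines that do not describe a `.service` unit are skipped.
fn parse_list_units(output: &str) -> HashMap<String, UnitStatus> {
    let mut cache = HashMap::new();
    for line in output.lines() {
        let mut fields: Vec<&str> = line.split_whitespace().collect();
        // Without --plain, failed units are prefixed with a bullet marker.
        if fields.first() == Some(&"●") {
            fields.remove(0);
        }
        if fields.len() < 4 {
            continue;
        }
        if let Some(name) = fields[0].strip_suffix(".service") {
            cache.insert(
                name.to_string(),
                UnitStatus {
                    load: fields[1].to_string(),
                    active: fields[2].to_string(),
                    sub: fields[3].to_string(),
                },
            );
        }
    }
    cache
}
```

A lookup that misses this cache would then fall back to a direct systemctl query, as the commit describes.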
174b27f31a Phase 3: Add wildcard support for service pattern matching
Implement glob pattern matching for service filters:
- nginx* matches nginx, nginx-config-reload, etc.
- *backup matches any service ending with 'backup'
- docker*prune matches docker-weekly-prune, etc.
- Exact matches still work as before (backward compatible)

Addresses TODO.md requirement for '*' filtering support.
2025-10-23 12:37:16 +02:00
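
One way to get this behaviour without an external crate is sketched below. It is an assumption about the approach, not the collector's code, and `matches_pattern` is a hypothetical name. Only `*` is treated specially; a pattern without `*` must match exactly.

```rust
/// Match `name` against `pattern`, where `*` matches any run of characters
/// (including none). Patterns without `*` require exact equality.
fn matches_pattern(name: &str, pattern: &str) -> bool {
    if !pattern.contains('*') {
        return name == pattern;
    }
    let parts: Vec<&str> = pattern.split('*').collect();
    let first = parts[0];
    let last = parts[parts.len() - 1];
    // The fixed prefix and suffix must both fit without overlapping.
    if name.len() < first.len() + last.len() {
        return false;
    }
    if !name.starts_with(first) || !name.ends_with(last) {
        return false;
    }
    // Middle segments must appear in order between prefix and suffix.
    let mut rest = &name[first.len()..name.len() - last.len()];
    for &part in &parts[1..parts.len() - 1] {
        match rest.find(part) {
            Some(i) => rest = &rest[i + part.len()..],
            None => return false,
        }
    }
    true
}
```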
dc11538ae9 Phase 2b: Optimize to single systemctl command
Reduce from 2 systemctl commands to 1 by using only:
systemctl list-units --type=service --all

This captures all services (active, inactive, failed) in one call,
eliminating the redundant list-unit-files command.
Achieves the TODO.md goal of reducing systemctl calls.
2025-10-23 12:34:54 +02:00
9133e18090 Phase 2: Remove user service collection logic
Remove all sudo -u systemctl commands and user service processing.
Now only collects system services via systemctl list-units/list-unit-files.
Eliminates user service discovery completely as planned in TODO.md.
2025-10-23 12:32:19 +02:00
616fad2c5d Phase 1: Implement exact name filtering for service matching
Change service matching logic from contains-based to exact equality.
Services now match only if service_name == pattern exactly.
This is the first step in the systemd collector optimization plan.
2025-10-23 12:22:26 +02:00
08d3454683 Enhance disk collector with individual drive health monitoring
- Add StoragePool and DriveInfo structures for grouping drives by mount point
- Implement SMART data collection for individual drives (health, temperature, wear)
- Support for ext4, zfs, xfs, mergerfs, btrfs filesystem types
- Generate individual drive metrics: disk_[pool]_[drive]_health/temperature/wear
- Add storage_type and underlying_devices to filesystem configuration
- Move hardcoded service directory mappings to NixOS configuration
- Move hardcoded host-to-user mapping to NixOS configuration
- Remove all unused code and fix compilation warnings
- Clean implementation with zero warnings and no dead code

Individual drives now show health status per storage pool:
Storage root (ext4): nvme0n1 PASSED 42°C 5% wear
Storage steampool (mergerfs): sda/sdb/sdc with individual health data
2025-10-22 19:59:25 +02:00
34822bd835 Fix systemd collector to use Status::Pending for transitional states
2025-10-21 19:08:58 +02:00
41208aa2a0 Implement status aggregation with notification batching
2025-10-21 18:12:42 +02:00
a937032eb1 Remove hardcoded defaults, require configuration file
- Remove all Default implementations from agent configuration structs
- Make configuration file required for agent startup
- Update NixOS module to generate complete agent.toml configuration
- Add comprehensive configuration options to NixOS module including:
  - Service include/exclude patterns for systemd collector
  - All thresholds and intervals
  - ZMQ communication settings
  - Notification and cache configuration
- Agent now fails fast if no configuration provided
- Eliminates configuration drift between defaults and NixOS settings
2025-10-21 00:01:26 +02:00
1e8da8c187 Add user service discovery to systemd collector
- Use systemctl --user commands to discover user-level services
- Include both user unit files and loaded user units
- Gracefully handle cases where user commands fail (no user session)
- Treat user services same as system services in filtering
- Enables monitoring of user-level Docker, development servers, etc.
2025-10-20 23:11:11 +02:00
1cc31ec26a Update service filters for better discovery
- Add ark-permissions to exclusion list (maintenance service)
- Add sunshine to service_name_filters (game streaming server)
- Improves service discovery for game streaming infrastructure
2025-10-20 23:01:03 +02:00
b580cfde8c Add more services to exclusion list
- Add docker-prune (cleanup services don't need monitoring)
- Add sshd-unix-local@ and sshd@ (SSH instance services)
- Add docker-registry-gar (Google Artifact Registry services)
- Keep main sshd service monitored while excluding per-connection instances
2025-10-20 22:51:15 +02:00
5886426dac Fix service discovery to detect all services regardless of state
- Use systemctl list-unit-files and list-units --all to find inactive services
- Parse both outputs to ensure all services are discovered
- Remove special SSH detection logic since sshd is in service filters
- Rename interesting_services to service_name_filters for clarity
- Now detects services in any state: active, inactive, failed, dead, etc.
2025-10-20 22:41:21 +02:00
eb268922bd Remove all unused code and fix build warnings
- Remove unused struct fields: tier, config_name, last_collection_time
- Remove unused structs: PerformanceMetrics, PerfMonitor
- Remove unused methods: get_performance_metrics, get_collector_names, get_stats
- Remove unused utility functions and system helpers
- Remove unused config fields from CPU and Memory collectors
- Keep config fields that are actually used (DiskCollector, etc.)
- Remove unused proxy_pass_url variable and assignments
- Fix duplicate hostname variable declaration
- Achieve zero build warnings without functionality changes
2025-10-20 20:20:47 +02:00
00a8ed3da2 Implement hysteresis for metric status changes to prevent flapping
Add comprehensive hysteresis support to prevent status oscillation near
threshold boundaries while maintaining responsive alerting.

Key Features:
- HysteresisThresholds with configurable upper/lower limits
- StatusTracker for per-metric status history
- Default gaps: CPU load 10%, memory 5%, disk temp 5°C

Updated Components:
- CPU load collector (5-minute average with hysteresis)
- Memory usage collector (percentage-based thresholds)
- Disk temperature collector (SMART data monitoring)
- All collectors updated to support StatusTracker interface

Cache Interval Adjustments:
- Service status: 60s → 10s (faster response)
- Disk usage: 300s → 60s (more frequent checks)
- Backup status: 900s → 60s (quicker updates)
- SMART data: moved to 600s tier (10 minutes)

Architecture:
- Individual metric status calculation in collectors
- Centralized StatusTracker in MetricCollectionManager
- Status aggregation preserved in dashboard widgets
2025-10-20 18:45:41 +02:00
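
The hysteresis idea can be sketched like this: a status is entered at the upper limit and only left once the value drops below the lower limit, so values hovering near a threshold do not flip the status back and forth. The type name `HysteresisThresholds` comes from the commit, but the fields, the `next_status` method, and the exact comparison rules are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// Upper limits are used to enter a status, lower limits to leave it.
struct HysteresisThresholds {
    warning_high: f64,
    warning_low: f64,
    critical_high: f64,
    critical_low: f64,
}

impl HysteresisThresholds {
    /// Compute the new status from the previous status and the current value.
    fn next_status(&self, previous: Status, value: f64) -> Status {
        match previous {
            Status::Ok => {
                if value >= self.critical_high {
                    Status::Critical
                } else if value >= self.warning_high {
                    Status::Warning
                } else {
                    Status::Ok
                }
            }
            Status::Warning => {
                if value >= self.critical_high {
                    Status::Critical
                } else if value < self.warning_low {
                    Status::Ok
                } else {
                    Status::Warning
                }
            }
            Status::Critical => {
                if value >= self.critical_low {
                    Status::Critical
                } else if value >= self.warning_low {
                    Status::Warning
                } else {
                    Status::Ok
                }
            }
        }
    }
}
```

With a 5-point gap (for example warning at 80 with release at 75), a metric oscillating between 77 and 82 stays at Warning instead of alternating between Ok and Warning. A per-metric tracker would store the previous status and feed it back in on each collection.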
e998679901 Revert nginx monitoring to check all sites via public HTTPS URLs
- Remove proxy_pass backend checking
- All sites now checked using https://server_name format
- Maintains 10-second timeout for external site checks
- Simplifies monitoring to consistent external health checks
2025-10-20 15:06:42 +02:00
2ccfc4256a Fix nginx monitoring and services panel alignment
- Add support for both proxied and static nginx sites
- Proxied sites show 'P' prefix and check backend URLs
- Static sites check external HTTPS URLs
- Fix services panel column alignment for main services
- Keep 10-second timeout for all site checks
2025-10-20 14:56:26 +02:00
66a79574e0 Implement comprehensive monitoring improvements
- Add full email notifications with lettre and Stockholm timezone
- Add status persistence to prevent notification spam on restart
- Change nginx monitoring to check backend proxy_pass URLs instead of frontend domains
- Increase nginx site timeout to 10 seconds for backend health checks
- Fix cache intervals: disk (5min), backup (10min), systemd (30s), cpu/memory (5s)
- Remove rate limiting for immediate notifications on all status changes
- Store metric status in /var/lib/cm-dashboard/last-status.json
2025-10-20 14:32:44 +02:00
28896d0b1b Fix CPU load alerting to only trigger on 5-minute load average
Only the 5-minute load average should trigger warning/critical alerts.
1-minute and 15-minute load averages now always show Status::Ok.

Thresholds (Warning: 9.0, Critical: 10.0) apply only to cpu_load_5min metric.
2025-10-20 11:12:15 +02:00
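
A sketch of the rule, using the thresholds from the message. Only `cpu_load_5min` is named in the commit; the other metric names used below, the `load_status` function, and the use of `>=` at the thresholds are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

const LOAD_WARNING: f32 = 9.0;
const LOAD_CRITICAL: f32 = 10.0;

/// Only the 5-minute load average is compared against thresholds;
/// every other load metric always reports Ok.
fn load_status(metric_name: &str, value: f32) -> Status {
    if metric_name != "cpu_load_5min" {
        return Status::Ok;
    }
    if value >= LOAD_CRITICAL {
        Status::Critical
    } else if value >= LOAD_WARNING {
        Status::Warning
    } else {
        Status::Ok
    }
}
```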
47a7d5ae62 Simplify service disk usage detection - remove all estimation fallbacks
- Replace complex multi-strategy detection with single deterministic method
- Remove estimate_service_disk_usage and all fallback strategies
- Use simple get_service_disk_usage method with clear logic:
  * Defined path exists → use only that path
  * Defined path fails → return None (shows as '-')
  * No defined path → use systemctl WorkingDirectory
  * No estimates or guessing ever

Fixes misleading 5MB estimates when defined paths fail due to permissions.
2025-10-20 11:06:49 +02:00
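
The decision logic listed above can be sketched as a pure function. The name `resolve_service_disk_usage` and the `measure` callback (standing in for whatever actually measures a directory, such as a `du` call) are hypothetical.

```rust
/// Decide which directory to measure for a service.
/// - A defined path is used exclusively; if measuring it fails the result
///   is `None` (shown as '-'), with no fallback and no estimate.
/// - Without a defined path, the systemd WorkingDirectory is used.
fn resolve_service_disk_usage(
    defined_path: Option<&str>,
    working_dir: Option<&str>,
    measure: impl Fn(&str) -> Option<u64>,
) -> Option<u64> {
    match defined_path {
        Some(path) => measure(path),
        None => working_dir.and_then(|dir| measure(dir)),
    }
}
```

Returning `None` for a failed defined path, rather than trying the working directory, is what keeps a permission error from turning into a misleading number.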
fe18ace767 Fix service disk usage detection to use sudo du for permission access
ARK service directories require elevated permissions to access. The NixOS
configuration already allows sudo du with NOPASSWD, so use sudo du instead
of direct du command to properly detect disk usage for restricted directories.
2025-10-20 10:58:17 +02:00
a1c980ad31 Implement deterministic service disk usage detection with defined paths
- Prioritize defined service directories over systemctl WorkingDirectory fallbacks
- Add ARK Survival Ascended server mappings to correct NixOS-configured paths
- Remove legacy get_service_disk_usage method to eliminate code duplication
- Ensure deterministic behavior with single-purpose detection logic

Fixes ARK service disk usage reporting on srv02 by using actual data paths
from NixOS configuration instead of systemctl working directory detection.
2025-10-20 10:45:30 +02:00
a3c9ac3617 Add ARK server directory mappings for accurate disk usage detection
Map each ARK service to its specific data directory:
- ark-island -> /var/lib/ark-servers/island
- ark-scorched -> /var/lib/ark-servers/scorched
- ark-center -> /var/lib/ark-servers/center
- ark-aberration -> /var/lib/ark-servers/aberration
- ark-extinction -> /var/lib/ark-servers/extinction
- ark-ragnarok -> /var/lib/ark-servers/ragnarok
- ark-valguero -> /var/lib/ark-servers/valguero

Based on NixOS configuration in srv02/configuration.nix.
2025-10-20 10:15:30 +02:00
dfe9c11102 Fix disk metric naming to maintain dashboard compatibility
Keep numbered metric names (disk_0_*, disk_1_*) instead of named metrics
(disk_root_*, disk_boot_*) to ensure existing dashboard continues working.
UUID-based detection works internally but produces compatible metric names.
2025-10-20 10:07:34 +02:00
e7200fb1b0 Implement UUID-based disk detection for CMTEC infrastructure
Replace df-based auto-discovery with UUID-based detection using NixOS
hardware configuration data. Each host now has predefined filesystem
configurations with predictable metric names.

- Add FilesystemConfig struct with UUID, mount point, and filesystem type
- Remove auto_discover and devices fields from DiskConfig
- Add host-specific UUID defaults for cmbox, srv01, srv02, simonbox, steambox
- Remove legacy get_mounted_disks() df-based detection method
- Update DiskCollector to use UUID resolution via /dev/disk/by-uuid/
- Generate predictable metric names: disk_root_*, disk_boot_*, etc.
- Maintain fallback for labbox/wslbox (no UUIDs configured yet)

Provides consistent metric names across reboots and reliable detection
aligned with NixOS deployments without dependency on mount order.
2025-10-20 09:50:10 +02:00
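
On Linux, entries under /dev/disk/by-uuid/ are symlinks to the block device, so resolution can be as small as the sketch below. The function name `resolve_uuid` and the choice to pass the directory as a parameter are assumptions made to keep the example testable.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Resolve a filesystem UUID to its device node by following the symlink
/// in `by_uuid_dir` (normally `/dev/disk/by-uuid`).
/// Returns `None` if the entry does not exist or cannot be resolved.
fn resolve_uuid(by_uuid_dir: &Path, uuid: &str) -> Option<PathBuf> {
    fs::canonicalize(by_uuid_dir.join(uuid)).ok()
}
```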
f67779be9d Add ARK game servers to systemd service monitoring
2025-10-19 19:23:51 +02:00
0141a6e111 Remove unused code and eliminate build warnings
Removed unused widget subscription system, cache utilities, error variants,
theme functions, and struct fields. Replaced subscription-based widgets
with direct metric filtering. Build now completes with zero warnings.
2025-10-18 23:50:15 +02:00
7f85a6436e Clean up unused imports and fix build warnings
- Remove unused imports (Duration, HashMap, SharedError, DateTime, etc.)
- Fix unused variables by prefixing with underscore
- Remove redundant dashboard.toml config file
- Update theme imports to use only needed components
- Maintain all functionality while reducing warnings
- Add srv02 to predefined hosts configuration
- Remove unused broadcast_command methods
2025-10-18 23:12:07 +02:00
5d52c5b1aa Fix SMART data and site latency checking issues
- Add sudo to disk collector smartctl commands for proper SMART data access
- Add reqwest dependency with blocking feature for HTTP site checks
- Replace curl-based site latency with reqwest HTTP client implementation
- Maintain 2-second connect timeout and 5-second total timeout
- Fix disk health UNKNOWN status by enabling proper SMART permissions
- Fix nginx site timeout issues by using proper HTTP client with redirect support
2025-10-18 19:14:29 +02:00
125111ee99 Implement comprehensive backup monitoring and fix timestamp issues
- Add BackupCollector for reading TOML status files with disk space metrics
- Implement BackupWidget with disk usage display and service status details
- Fix backup script disk space parsing by adding missing capture_output=True
- Update backup widget to show actual disk usage instead of repository size
- Fix timestamp parsing to use backup completion time instead of start time
- Resolve timezone issues by using UTC timestamps in backup script
- Add disk identification metrics (product name, serial number) to backup status
- Enhance UI layout with proper backup monitoring integration
2025-10-18 18:33:41 +02:00
8a36472a3d Implement real-time process monitoring and fix UI hardcoded data
This commit addresses several key issues identified during development:

Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using ps command
- Fix disk metrics permission issues in systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information

Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (ps commands)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"

Service Collection Optimization:
- Removed CPU monitoring from systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy

UI and Dashboard Improvements:
- Reorganized dashboard layout with btop-inspired multi-panel design
- Updated system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface

Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details
2025-10-16 23:55:05 +02:00
3a959e55ed Fix critical JSON data extraction issue in SystemCollector
The MetricCollector implementation was returning JSON with null values
because it was incorrectly extracting Option<&Value> instead of the
actual values. Fixed by using .cloned().unwrap_or() to properly
extract and default the JSON values.

This should resolve the 'No data received' issue as the dashboard
will now receive properly formatted metric data instead of null values.
2025-10-16 00:10:17 +02:00
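
The pattern named in the message can be illustrated with the standard library alone. The `Value` enum below is a stand-in for a JSON value type and `extract` is a hypothetical helper; neither is taken from the agent's code. A map lookup yields `Option<&Value>`, and `.cloned().unwrap_or(default)` turns that into an owned value with a fallback.

```rust
use std::collections::HashMap;

/// Minimal stand-in for a JSON value type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Number(f64),
    Text(String),
}

/// Look up `key` and return an owned value, or `default` if it is absent.
fn extract(data: &HashMap<String, Value>, key: &str, default: Value) -> Value {
    // `get` returns Option<&Value>; `cloned` makes it Option<Value>.
    data.get(key).cloned().unwrap_or(default)
}
```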
ce2aeeff34 Implement metric-level caching architecture for granular CPU monitoring
Replace legacy SmartCache with MetricCollectionManager for precise control
over individual metric refresh intervals. CPU load and Service CPU usage
now update every 5 seconds as required, while other metrics use optimal
intervals based on volatility.

Key changes:
- ServiceCollector/SystemCollector implement MetricCollector trait
- Metric-specific cache tiers: RealTime(5s), Fast(30s), Medium(5min), Slow(15min)
- SmartAgent main loop uses metric-level scheduling instead of tier-based
- CPU metrics (load, temp, service CPU) refresh every 5 seconds
- Memory and processes refresh every 30 seconds
- Service status and C-states refresh every 5 minutes
- Disk usage refreshes every 15 minutes

The performance-optimized architecture maintains <2% CPU usage while ensuring
dashboard responsiveness with precise metric timing control.
2025-10-15 23:08:33 +02:00
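
A sketch of how per-metric tiers might drive scheduling. The tier names and intervals are taken from the message; the `CacheTier` enum shape, `interval`, and `is_due` are assumptions.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum CacheTier {
    RealTime, // 5 s: CPU load, temperature, service CPU
    Fast,     // 30 s: memory, processes
    Medium,   // 5 min: service status, C-states
    Slow,     // 15 min: disk usage
}

impl CacheTier {
    fn interval(self) -> Duration {
        match self {
            CacheTier::RealTime => Duration::from_secs(5),
            CacheTier::Fast => Duration::from_secs(30),
            CacheTier::Medium => Duration::from_secs(5 * 60),
            CacheTier::Slow => Duration::from_secs(15 * 60),
        }
    }
}

/// A metric is due if it was never collected or its tier interval has elapsed.
fn is_due(tier: CacheTier, last_collected: Option<Instant>, now: Instant) -> bool {
    match last_collected {
        None => true,
        Some(last) => now.saturating_duration_since(last) >= tier.interval(),
    }
}
```

A main loop could then tick at the shortest interval and collect only the metrics for which `is_due` returns true.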
b0112dd8ab Fix immich disk quota and usage detection
- Update quota from 200GB to 500GB (matches NixOS config)
- Fix disk usage path: /var/lib/immich-server -> /var/lib/immich
- Add service-to-directory mapping for accurate disk usage detection

This should resolve the "<1MB disk usage of 200GB" issue -
immich should now correctly show usage of /var/lib/immich with 500GB quota.
2025-10-15 11:59:07 +02:00
1b442be9ad Fix service disk quota detection to use actual systemd quotas
- Implement proper quota detection for services with known systemd configurations
- Set gitea quota to 100GB (matches NixOS tmpfiles configuration)
- Add service-specific quotas: postgres/mysql 50GB, immich 200GB, unifi 10GB
- Fall back to service-appropriate defaults for other services
2025-10-15 09:57:05 +02:00
efdd713f62 Improve dashboard display and fix service issues
- Remove unreachable descriptions from failed nginx sites
- Show complete site URLs instead of truncating at first dot
- Implement service-specific disk quotas (docker: 4GB, immich: 4GB, others: 1-2GB)
- Truncate process names to show only executable name without full path
- Display only highest C-state instead of all C-states for cleaner output
- Format system RAM as xxxMB/GB (totalGB) to match services format
2025-10-15 09:36:03 +02:00
a64464142c Remove nginx site accessibility filtering to monitor all sites
- Remove check_site_accessibility function and filtering logic
- Monitor ALL nginx sites from config regardless of current status
- Site status determined by measure_site_latency, not accessibility filter
- Fixes missing git.cmtec.se when backend is down (502 errors)
- Sites with errors now show as failed instead of being filtered out
2025-10-14 22:46:06 +02:00
0cb69ea8fa Consolidate HTTP checking and improve display formatting
- Change site latency timeout from 5s to 2s for faster error detection
- Replace curl with reqwest for external connectivity checks (consistent timeouts)
- Remove unused gitea-specific monitoring functionality
- Update dashboard: show 'unreachable' for latency > 2000ms, add arrows (→) between site and latency
- Add percentage signs to CPU metrics display
- All HTTP requests now use reqwest with 2-second timeouts
2025-10-14 22:24:22 +02:00
819ca4ad73 Fix SystemCollector method placement and remove duplicates
- Move get_top_cpu_process() and get_top_ram_process() methods inside SystemCollector impl block
- Remove duplicate method definitions that were placed after trait implementation
- Ensures methods are properly accessible during compilation
2025-10-14 22:05:44 +02:00
f3b6d12f68 Add top CPU and RAM process monitoring to System widget
- Implement get_top_cpu_process() and get_top_ram_process() functions in SystemCollector
- Add top_cpu_process and top_ram_process fields to SystemSummary data structure
- Update System widget to display top processes as description rows
- Show process name and percentage usage for highest CPU and RAM consumers
- Skip kernel threads and filter out processes with minimal usage (<0.1%)
2025-10-14 21:47:52 +02:00
2bffbaa000 Change nginx site monitoring from HEAD to GET requests
- Fix false negatives for sites that don't handle HEAD requests properly
- Resolves photos.cmtec.se showing error when it actually works fine
- Improves compatibility with modern web applications
2025-10-14 21:22:30 +02:00
355a986582 Fix nginx site monitoring to properly detect errors
- Return error status for HTTP 502/5xx responses instead of success
- Show 'error' description for sites with connectivity but wrong status codes
- Show 'unreachable' description for complete connection failures
- Each nginx site now has independent status based on actual health
- Sites with timeouts or server errors will trigger notifications
2025-10-14 20:53:07 +02:00
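
The classification can be sketched as a pure function over the check result. The 'error' and 'unreachable' descriptions come from the message; the `CheckOutcome` and `SiteStatus` types, the `classify_site` name, and the choice to treat 2xx and 3xx responses as healthy are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SiteStatus {
    Ok,
    Critical,
}

/// Result of one HTTP check against a site.
enum CheckOutcome {
    /// The server answered with this HTTP status code.
    Response(u16),
    /// No response at all (timeout, DNS failure, refused connection).
    ConnectionFailed,
}

/// Map a check outcome to a status and a short description.
fn classify_site(outcome: &CheckOutcome) -> (SiteStatus, &'static str) {
    match outcome {
        CheckOutcome::Response(code) if (200..400).contains(code) => (SiteStatus::Ok, ""),
        CheckOutcome::Response(_) => (SiteStatus::Critical, "error"),
        CheckOutcome::ConnectionFailed => (SiteStatus::Critical, "unreachable"),
    }
}
```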