16 Commits

Author SHA1 Message Date
00a8ed3da2 Implement hysteresis for metric status changes to prevent flapping
Add comprehensive hysteresis support to prevent status oscillation near
threshold boundaries while maintaining responsive alerting.

Key Features:
- HysteresisThresholds with configurable upper/lower limits
- StatusTracker for per-metric status history
- Default gaps: CPU load 10%, memory 5%, disk temp 5°C

Updated Components:
- CPU load collector (5-minute average with hysteresis)
- Memory usage collector (percentage-based thresholds)
- Disk temperature collector (SMART data monitoring)
- All collectors updated to support StatusTracker interface

Cache Interval Adjustments:
- Service status: 60s → 10s (faster response)
- Disk usage: 300s → 60s (more frequent checks)
- Backup status: 900s → 60s (quicker updates)
- SMART data: moved to 600s tier (10 minutes)

Architecture:
- Individual metric status calculation in collectors
- Centralized StatusTracker in MetricCollectionManager
- Status aggregation preserved in dashboard widgets
2025-10-20 18:45:41 +02:00
dcca5bbea3 Fix cache tier test to match actual configuration
- Update test expectations from 5s to 2s intervals for realtime tier
- Fix comment to reflect actual 2s interval instead of outdated 5s reference
- All tests now pass correctly
2025-10-18 18:44:13 +02:00
8a36472a3d Implement real-time process monitoring and fix UI hardcoded data
This commit addresses several key issues identified during development:

Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using ps command
- Fix disk metrics permission issues in systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information

Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (ps commands)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"

Service Collection Optimization:
- Removed CPU monitoring from systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy

UI and Dashboard Improvements:
- Reorganized dashboard layout with btop-inspired multi-panel design
- Updated system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface

Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details
2025-10-16 23:55:05 +02:00
6bc7f97375 Add refresh shortkey 'r' for on-demand metrics refresh
Implements ZMQ command protocol for dashboard-to-agent communication:
- Agents listen on port 6131 for REQ/REP commands
- Dashboard sends "refresh" command when 'r' key is pressed
- Agents force immediate collection of all metrics via force_refresh_all()
- Fresh data is broadcast immediately to dashboard
- Updated help text to show "r: Refresh all metrics"

Also includes metric-level caching architecture foundation for future
granular control over individual metric update frequencies.
2025-10-15 22:30:04 +02:00
1b572c5c1d Implement intelligent caching system for optimal CPU performance
Replace traditional 5-second polling with tiered collection strategy:
- RealTime (5s): CPU load, memory usage
- Medium (5min): Service status, disk usage
- Slow (15min): SMART data, backup status

Key improvements:
- Reduce CPU usage from 9.5% to <2%
- Cache warming for instant dashboard responsiveness
- Background refresh at 80% of tier intervals
- Thread-safe cache with automatic cleanup

Remove legacy polling code - smart caching is now the default and only mode.
Agent startup enhanced with parallel cache population for immediate data availability.

Architecture: SmartCache + CachedCollector + tiered CollectionScheduler
2025-10-15 11:21:36 +02:00
efdd713f62 Improve dashboard display and fix service issues
- Remove unreachable descriptions from failed nginx sites
- Show complete site URLs instead of truncating at first dot
- Implement service-specific disk quotas (docker: 4GB, immich: 4GB, others: 1-2GB)
- Truncate process names to show only executable name without full path
- Display only highest C-state instead of all C-states for cleaner output
- Format system RAM as xxxMB/GB (totalGB) to match services format
2025-10-15 09:36:03 +02:00
1ee398e648 Improve widget formatting and add logged-in users support
Services widget:
- Fix disk quota formatting with proper rounding instead of truncation
- Remove decimals from RAM quotas and use GB instead of G
- Change quota display to use GB consistently

Backups widget:
- Change GiB to GB for consistency
- Remove spaces between numbers and units
- Update disk usage format to match other widgets: used (totalGB)
- Remove percentage display for cleaner format

System widget:
- Add support for logged-in users in description lines
- Format C-states with "C-State:" prefix on first line, indent subsequent lines
- Add logged_in_users field to SystemSummary data structure

Documentation:
- Add example hash error output to NixOS update instructions
2025-10-14 18:59:31 +02:00
3e5e91f078 Remove SB column and improve widget formatting
Services widget:
- Remove SB (sandbox) column and related formatting function
- Fix quota formatting to show decimals when needed (1.5G not 1G)
- Remove spaces in unit display (128MB not 128 MB)

Storage widget:
- Change usage format to 23GB (932GB) for better readability

Documentation:
- Add NixOS configuration update process to CLAUDE.md
2025-10-14 18:40:12 +02:00
b0d7d5ce35 Testing 2025-10-13 11:23:49 +02:00
c68ccf023e Testing 2025-10-13 00:28:06 +02:00
57b676ad25 Testing 2025-10-13 00:16:24 +02:00
9e344fb66d Testing 2025-10-12 22:31:46 +02:00
c3dbaeead2 Optimize agent CPU usage with throttled service descriptions 2025-10-12 15:13:42 +02:00
2239badc8a Testing 2025-10-12 14:53:27 +02:00
2581435b10 Implement per-service disk usage monitoring
Replaced system-wide disk usage with accurate per-service tracking by scanning
service-specific directories. Services like sshd now correctly show minimal
disk usage instead of misleading system totals.

- Rename storage widget and add drive capacity/usage columns
- Move host display to main dashboard title for cleaner layout
- Replace separate alert displays with color-coded row highlighting
- Add per-service disk usage collection using du command
- Update services widget formatting to handle small disk values
- Restructure into workspace with dedicated agent and dashboard packages
2025-10-11 22:59:16 +02:00
656cb5943b Switch dashboard to ZMQ gossip data source 2025-10-11 13:36:46 +02:00