- Implement per-collector interval timing respecting NixOS config
- Remove all hardcoded timeout/interval values and make configurable
- Add tmux session requirement check for TUI mode (bypassed for headless)
- Update agent to send config hash in Build field instead of nixos version
- Add nginx check interval, HTTP timeouts, and ZMQ transmission interval configs
- Update NixOS configuration with new configurable values
Breaking changes:
- Build field now shows nix store config hash (8 chars) instead of nixos version
- All intervals now follow individual collector configuration instead of global
New configuration fields:
- systemd.nginx_check_interval_seconds
- systemd.http_timeout_seconds
- systemd.http_connect_timeout_seconds
- zmq.transmission_interval_seconds
- Add cm-rebuild systemd service for process isolation
- Add sudo permissions for service control and journal access
- Remove verbose flag for cleaner output
- Ensures reliable rebuild operations without agent crashes
- Update agent, dashboard, and shared package versions from 0.1.0 to 0.1.11
- Ensures agent version reporting shows correct v0.1.11 instead of v0.1.0
- Synchronize package versions with git tag for consistent version tracking
- Add full email notifications with lettre and Stockholm timezone
- Add status persistence to prevent notification spam on restart
- Change nginx monitoring to check backend proxy_pass URLs instead of frontend domains
- Increase nginx site timeout to 10 seconds for backend health checks
- Fix cache intervals: disk (5min), backup (10min), systemd (30s), cpu/memory (5s)
- Remove rate limiting for immediate notifications on all status changes
- Store metric status in /var/lib/cm-dashboard/last-status.json
- Add sudo to disk collector smartctl commands for proper SMART data access
- Add reqwest dependency with blocking feature for HTTP site checks
- Replace curl-based site latency with reqwest HTTP client implementation
- Maintain 2-second connect timeout and 5-second total timeout
- Fix disk health UNKNOWN status by enabling proper SMART permissions
- Fix nginx site timeout issues by using proper HTTP client with redirect support
This commit addresses several key issues identified during development:
Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using ps command
- Fix disk metrics permission issues in systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information
Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (ps commands)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"
Service Collection Optimization:
- Removed CPU monitoring from systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy
UI and Dashboard Improvements:
- Reorganized dashboard layout with btop-inspired multi-panel design
- Updated system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface
Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details
Agent improvements:
- Add reqwest dependency for HTTP latency testing
- Implement measure_site_latency() function for nginx sites
- Add latency_ms field to ServiceData structure
- Measure response times for nginx sites using HEAD requests
- Handle connection failures gracefully with 5-second timeout
- Use HTTPS for external sites, HTTP for localhost
Dashboard improvements:
- Add latency_ms field to ServiceInfo structure
- Display latency for nginx sites: "docker.cmtec.se 134ms"
- Only show latency for nginx sub-services, not other services
- Change disk usage "0" to "<1MB" for better readability
The Services widget now shows:
- Nginx sites with response times when measurable
- Cleaner disk usage formatting for small values
- Improved user experience with meaningful latency data
Agent Changes:
• Add CPU status thresholds (warning: ≥5.0, critical: ≥8.0)
• Add memory status thresholds (warning: ≥80%, critical: ≥95%)
• Add service status calculation (critical if failed>0, warning if degraded>0)
• All collectors now calculate and include status in output
Dashboard Changes:
• Update system widget to use agent-calculated cpu_status and memory_status
• Update services widget to use agent-calculated services_status
• Remove client-side status calculations in favor of agent status
• Add status_level_from_agent_status helper function
Notification System:
• Add SMTP email notification system using lettre crate
• Auto-configure notifications: hostname@cmtec.se → cm@cmtec.se
• Smart change detection with rate limiting (30min cooldown)
• Only notify on transitions to/from warning/critical states
• Rich email formatting with host, component, metric details
Replaced system-wide disk usage with accurate per-service tracking by scanning
service-specific directories. Services like sshd now correctly show minimal
disk usage instead of misleading system totals.
- Rename storage widget and add drive capacity/usage columns
- Move host display to main dashboard title for cleaner layout
- Replace separate alert displays with color-coded row highlighting
- Add per-service disk usage collection using du command
- Update services widget formatting to handle small disk values
- Restructure into workspace with dedicated agent and dashboard packages