Fixes random host disconnections caused by blocking operations that prevent timely ZMQ packet transmission.
Changes:
- Add run_command_with_timeout() wrapper using tokio for async command execution (see the sketch below)
- Apply 10s timeout to smartctl (prevents 30+ second hangs on failing drives)
- Apply 5s timeout to du, lsblk, systemctl list commands
- Apply 3s timeout to systemctl show/is-active, df, ip commands
- Apply 2s timeout to hostname command
- Use the system 'timeout' command for sync operations where async is not needed
Critical fixes:
- smartctl: Failing drives could block for 30+ seconds per drive
- du: Large directories (Docker, PostgreSQL) could block 10-30+ seconds
- systemctl/docker: Commands could block indefinitely during system issues
With a 1-second collection interval and a 10-second heartbeat timeout, any blocking operation longer than 10s causes false "host offline" alerts. These timeouts ensure collection completes quickly even during system degradation.
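A minimal sketch of what such a wrapper can look like, assuming tokio's process and time modules; the signature and error handling are illustrative, not the actual agent code:

```rust
use std::time::Duration;
use tokio::process::Command;
use tokio::time::timeout;

/// Run an external command, giving up if it exceeds `limit`.
/// Returns stdout on success, or an error string on failure/timeout.
async fn run_command_with_timeout(
    program: &str,
    args: &[&str],
    limit: Duration,
) -> Result<String, String> {
    let fut = Command::new(program)
        .args(args)
        .kill_on_drop(true) // reap the child if the future is dropped on timeout
        .output();

    match timeout(limit, fut).await {
        Ok(Ok(out)) if out.status.success() => {
            Ok(String::from_utf8_lossy(&out.stdout).into_owned())
        }
        Ok(Ok(out)) => Err(format!("{program} exited with {}", out.status)),
        Ok(Err(e)) => Err(format!("failed to spawn {program}: {e}")),
        Err(_) => Err(format!("{program} timed out after {limit:?}")),
    }
}
```

A smartctl call then becomes run_command_with_timeout("smartctl", &["-a", "/dev/nvme0n1"], Duration::from_secs(10)).await, matching the 10s budget above.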
Fully restored CM Dashboard as a complete monitoring system with working
status evaluation and email notifications.
COMPLETED PHASES:
✅ Phase 1: Fixed storage display issues
- Use lsblk instead of findmnt (eliminates /nix/store bind mount)
- Fixed NVMe SMART parsing (Temperature: and Percentage Used: lines; see the sketch after this list)
- Added sudo to smartctl for permissions
- Consistent filesystem and tmpfs sorting
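For NVMe devices, smartctl reports lines such as "Temperature: 28 Celsius" and "Percentage Used: 1%". A minimal sketch of the line-oriented parsing involved (the function name is hypothetical):

```rust
/// Extract temperature (°C) and wear (%) from `smartctl -a` output
/// for an NVMe device.
fn parse_nvme_smart(output: &str) -> (Option<f32>, Option<f32>) {
    let mut temp = None;
    let mut wear = None;
    for line in output.lines() {
        if let Some(rest) = line.strip_prefix("Temperature:") {
            // first numeric token; the trailing "Celsius" is ignored
            temp = rest.split_whitespace().next().and_then(|t| t.parse().ok());
        } else if let Some(rest) = line.strip_prefix("Percentage Used:") {
            wear = rest.trim().trim_end_matches('%').parse().ok();
        }
    }
    (temp, wear)
}
```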
✅ Phase 2a: Fixed missing NixOS build information
- Added build_version field to AgentData
- NixOS collector now populates build info
- Dashboard shows actual build instead of "unknown"
✅ Phase 2b: Restored status evaluation system
- Added status fields to all structured data types
- CPU: load and temperature status evaluation
- Memory: usage status evaluation
- Storage: temperature, health, and filesystem usage status
- All collectors now use their threshold configurations (sketched below)
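A rough sketch of the per-metric evaluation; the enum and the threshold fields are illustrative, not the actual configuration schema:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Ok,
    Warning,
    Critical,
}

struct Thresholds {
    warning: f32,
    critical: f32,
}

/// Map a metric value onto a status using its configured thresholds.
fn evaluate(value: f32, t: &Thresholds) -> Status {
    if value >= t.critical {
        Status::Critical
    } else if value >= t.warning {
        Status::Warning
    } else {
        Status::Ok
    }
}
```

For example, evaluate(85.0, &Thresholds { warning: 80.0, critical: 90.0 }) yields Status::Warning.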
✅ Phase 3: Restored notification system
- Status change detection between collection cycles (sketched below)
- Email alerts on status degradation (OK→Warning/Critical)
- Detailed notification content with metric values
- Full NotificationManager integration
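A sketch of the degradation check between cycles, reusing the Status enum from the sketch above; the actual NotificationManager wiring and email formatting are not shown:

```rust
use std::collections::HashMap;

/// Order statuses by severity so transitions can be compared.
fn severity(s: Status) -> u8 {
    match s {
        Status::Ok => 0,
        Status::Warning => 1,
        Status::Critical => 2,
    }
}

/// Report per-metric degradations (OK -> Warning/Critical,
/// Warning -> Critical) between two collection cycles.
fn detect_degradations(
    previous: &HashMap<String, Status>,
    current: &HashMap<String, Status>,
) -> Vec<String> {
    let mut alerts = Vec::new();
    for (metric, &now) in current {
        let before = previous.get(metric).copied().unwrap_or(Status::Ok);
        if severity(now) > severity(before) {
            alerts.push(format!("{metric}: {before:?} -> {now:?}"));
        }
    }
    alerts
}
```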
CORE FUNCTIONALITY RESTORED:
- Real-time monitoring with proper status evaluation
- Email notifications on threshold violations
- Correct storage display (nvme0n1 T: 28°C W: 1%)
- Complete status-aware infrastructure monitoring
- Dashboard is now a monitoring system, not just a data viewer
The CM Dashboard monitoring system is fully operational.
- Sort filesystems by mount point in disk collector for consistent display
- Sort tmpfs mounts by mount point in memory collector
- Eliminates random swapping of / and /boot order between refreshes
- Eliminates random swapping of tmpfs mount order in RAM section
Ensures predictable, alphabetical ordering for all mount points.
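The fix is a deterministic sort before display; a minimal sketch with an illustrative struct:

```rust
struct Filesystem {
    mount_point: String,
    // usage, temperature, and status fields omitted
}

/// Alphabetical order by mount point: "/" always precedes "/boot",
/// so rows stay put between refreshes.
fn sort_filesystems(filesystems: &mut [Filesystem]) {
    filesystems.sort_by(|a, b| a.mount_point.cmp(&b.mount_point));
}
```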
Implements clean structured data collection, eliminating all string-based metric parsing bugs. Collectors now populate AgentData directly with type-safe field access (illustrated below).
Key improvements:
- Mount points preserved correctly (/ and /boot instead of root/boot)
- Tmpfs discovery added to memory collector
- Temperature data flows as typed f32 fields
- Zero string parsing overhead
- Complete removal of MetricCollectionManager bridge
- Direct ZMQ transmission of structured JSON
All functionality maintained: service tracking, notifications, status
evaluation, and multi-host monitoring.
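A rough illustration of the typed-payload idea, assuming serde for the JSON (de)serialization; the field names are hypothetical, not the actual AgentData definition:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct FilesystemInfo {
    mount_point: String,        // preserved verbatim: "/", "/boot"
    used_percent: f32,
    temperature_c: Option<f32>, // typed f32, no string round-trips
}

#[derive(Serialize, Deserialize)]
struct AgentData {
    hostname: String,
    build_version: Option<String>,
    filesystems: Vec<FilesystemInfo>,
    tmpfs_mounts: Vec<FilesystemInfo>,
}
```

Collectors fill the struct, it is serialized once to JSON for the ZMQ socket, and the dashboard deserializes it back without ever re-parsing metric strings.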
- Fix /tmp usage status to use proper thresholds instead of hardcoded Ok status
- Fix wear level status to use configurable thresholds instead of hardcoded values
- Add dedicated tmp_status field to SystemWidget for proper /tmp status display
- Remove host-level hourglass icon during service operations
- Implement immediate service status updates after start/stop/restart commands
- Remove active users display and collection from NixOS section
- Fix immediate host status aggregation transmission to dashboard
- Remove /tmp autodetection from disk collector (57 lines removed)
- Add tmpfs monitoring to memory collector with a get_tmpfs_metrics() method (discovery sketched below)
- Generate memory_tmp_* metrics for proper RAM-based tmpfs monitoring
- Fix type annotations in tmpfs parsing so the code compiles
- System widget now correctly displays tmpfs usage in RAM section
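A sketch of the discovery half of a get_tmpfs_metrics()-style method, reading /proc/mounts; per-mount usage numbers would come from statvfs or a df call and are omitted here:

```rust
use std::fs;

/// List tmpfs mount points from /proc/mounts
/// (format: device mountpoint fstype options dump pass).
fn tmpfs_mount_points() -> Vec<String> {
    let Ok(mounts) = fs::read_to_string("/proc/mounts") else {
        return Vec::new();
    };
    let mut points: Vec<String> = mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fs_type = fields.next()?;
            (fs_type == "tmpfs").then(|| mount_point.to_string())
        })
        .collect();
    points.sort(); // stable display order, matching the disk collector
    points
}
```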
Add comprehensive hysteresis support to prevent status oscillation near
threshold boundaries while maintaining responsive alerting.
Key Features:
- HysteresisThresholds with configurable upper/lower limits
- StatusTracker for per-metric status history
- Default gaps: CPU load 10%, memory 5%, disk temp 5°C
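A minimal two-state sketch of the hysteresis rule (the real system has Ok/Warning/Critical and per-metric gaps, but each boundary works the same way): escalate at the upper bound, recover only below the lower one.

```rust
struct HysteresisThresholds {
    upper: f32, // escalate at or above this value
    lower: f32, // de-escalate only below this value (upper minus gap)
}

#[derive(Clone, Copy, PartialEq)]
enum Level {
    Ok,
    Alert,
}

/// One evaluation step: state changes only when a bound is crossed;
/// values inside the gap keep the previous state.
fn step(current: Level, value: f32, t: &HysteresisThresholds) -> Level {
    match current {
        Level::Ok if value >= t.upper => Level::Alert,
        Level::Alert if value < t.lower => Level::Ok,
        other => other,
    }
}
```

With a CPU load warning bound of 80% and the 10% gap, a host escalates at 80% but only recovers below 70%, so load hovering around 79-81% no longer flaps between statuses.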
Updated Components:
- CPU load collector (5-minute average with hysteresis)
- Memory usage collector (percentage-based thresholds)
- Disk temperature collector (SMART data monitoring)
- All collectors updated to support StatusTracker interface
Cache Interval Adjustments:
- Service status: 60s → 10s (faster response)
- Disk usage: 300s → 60s (more frequent checks)
- Backup status: 900s → 60s (quicker updates)
- SMART data: moved to 600s tier (10 minutes)
Architecture:
- Individual metric status calculation in collectors
- Centralized StatusTracker in MetricCollectionManager
- Status aggregation preserved in dashboard widgets
This commit addresses several key issues identified during development:
Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using the ps command
- Fix disk metrics permission issues in systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information
Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods (CPU variant sketched after this list)
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (ps commands)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"
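A sketch of the ps-based lookup, assuming GNU ps; the actual collect_top_cpu_process() implementation may differ:

```rust
use std::process::Command;

/// Highest-CPU process as (pid, command, cpu_percent).
fn top_cpu_process() -> Option<(u32, String, f32)> {
    let out = Command::new("ps")
        .args(["-eo", "pid,comm,%cpu", "--sort=-%cpu", "--no-headers"])
        .output()
        .ok()?;
    for line in String::from_utf8_lossy(&out.stdout).lines() {
        let mut f = line.split_whitespace();
        let (Some(pid), Some(comm), Some(cpu)) = (f.next(), f.next(), f.next()) else {
            continue;
        };
        if comm == "ps" {
            continue; // skip our own ps invocation (self-monitoring artifact)
        }
        if let (Ok(pid), Ok(cpu)) = (pid.parse::<u32>(), cpu.parse::<f32>()) {
            return Some((pid, comm.to_string(), cpu));
        }
    }
    None
}
```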
Service Collection Optimization:
- Removed CPU monitoring from systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy
UI and Dashboard Improvements:
- Reorganized dashboard layout with a btop-inspired multi-panel design
- Updated system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface
Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details