219 Commits

dfe9c11102 Fix disk metric naming to maintain dashboard compatibility
Keep numbered metric names (disk_0_*, disk_1_*) instead of named metrics
(disk_root_*, disk_boot_*) to ensure the existing dashboard continues to work.
UUID-based detection works internally but produces compatible metric names.
2025-10-20 10:07:34 +02:00
e7200fb1b0 Implement UUID-based disk detection for CMTEC infrastructure
Replace df-based auto-discovery with UUID-based detection using NixOS
hardware configuration data. Each host now has predefined filesystem
configurations with predictable metric names.

- Add FilesystemConfig struct with UUID, mount point, and filesystem type
- Remove auto_discover and devices fields from DiskConfig
- Add host-specific UUID defaults for cmbox, srv01, srv02, simonbox, steambox
- Remove legacy get_mounted_disks() df-based detection method
- Update DiskCollector to use UUID resolution via /dev/disk/by-uuid/
- Generate predictable metric names: disk_root_*, disk_boot_*, etc.
- Maintain fallback for labbox/wslbox (no UUIDs configured yet)

Provides consistent metric names across reboots and reliable detection
aligned with NixOS deployments, without depending on mount order.
2025-10-20 09:50:10 +02:00
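
A minimal sketch of what the UUID-to-device resolution described above could look like; the FilesystemConfig field names and the resolve_device helper are illustrative assumptions, not the repository's exact code.

```rust
use std::fs;
use std::path::PathBuf;

/// Illustrative shape of a per-filesystem entry (field names assumed).
struct FilesystemConfig {
    uuid: String,        // taken from the NixOS hardware configuration
    mount_point: String, // e.g. "/", "/boot"
    fs_type: String,     // e.g. "ext4", "vfat"
    metric_name: String, // e.g. "root", "boot" -> disk_root_*, disk_boot_*
}

/// Resolve a UUID to the underlying block device via the stable
/// /dev/disk/by-uuid/ symlinks, independent of mount order.
fn resolve_device(uuid: &str) -> Option<PathBuf> {
    let link = PathBuf::from("/dev/disk/by-uuid").join(uuid);
    fs::canonicalize(&link).ok() // follows the symlink to e.g. /dev/nvme0n1p2
}

fn main() {
    let fs_cfg = FilesystemConfig {
        uuid: "0000-0000".into(), // placeholder UUID
        mount_point: "/boot".into(),
        fs_type: "vfat".into(),
        metric_name: "boot".into(),
    };
    match resolve_device(&fs_cfg.uuid) {
        Some(dev) => println!("disk_{}_device = {}", fs_cfg.metric_name, dev.display()),
        None => println!("disk_{}_device not found ({})", fs_cfg.metric_name, fs_cfg.mount_point),
    }
}
```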
f67779be9d Add ARK game servers to systemd service monitoring 2025-10-19 19:23:51 +02:00
ca160c9627 Fix tab navigation to respect user choice and prevent jumping back to localhost
- Add user_navigated_away flag to track manual navigation
- Only auto-switch to localhost if user hasn't manually navigated away
- Reset flag when host disconnects to allow auto-selection
- Preserves user's tab navigation choices while still prioritizing localhost initially
2025-10-19 11:21:59 +02:00
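
A small sketch of the auto-selection guard this commit describes; the struct and method names are assumptions, not the dashboard's actual API.

```rust
// Sketch of the navigation guard: localhost is only auto-selected while the
// user has not manually navigated away.
struct HostSelection {
    selected: Option<String>,
    user_navigated_away: bool,
}

impl HostSelection {
    /// Called when the user presses Tab / navigates manually.
    fn user_select(&mut self, host: String) {
        self.user_navigated_away = true;
        self.selected = Some(host);
    }

    /// Called when a host connects; only auto-switch to localhost if the
    /// user has not navigated away on their own.
    fn on_host_connected(&mut self, host: &str, localhost: &str) {
        if host == localhost && !self.user_navigated_away {
            self.selected = Some(localhost.to_string());
        }
    }

    /// Reset the flag when the selected host disconnects so auto-selection
    /// can kick in again.
    fn on_host_disconnected(&mut self, host: &str) {
        if self.selected.as_deref() == Some(host) {
            self.user_navigated_away = false;
            self.selected = None;
        }
    }
}

fn main() {
    let mut sel = HostSelection { selected: None, user_navigated_away: false };
    sel.on_host_connected("localhost", "localhost"); // auto-select
    sel.user_select("srv01".into());                 // user takes over
    sel.on_host_connected("localhost", "localhost"); // no jump back
    assert_eq!(sel.selected.as_deref(), Some("srv01"));
}
```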
bf2f066029 Fix localhost prioritization to always switch when localhost connects
- Dashboard now switches to localhost even if another host is already selected
- Ensures localhost is always preferred regardless of connection order
- Resolves issue where srv01 connecting first would prevent localhost selection
2025-10-19 11:12:05 +02:00
07633e4e0e Implement localhost prioritization and status display in dashboard
- Always select localhost as default host at startup
- Order hosts with localhost first, then predefined sequence
- Display hostname status colors in title bar based on metric aggregation
- Add gethostname dependency for localhost detection
2025-10-19 10:56:42 +02:00
0141a6e111 Remove unused code and eliminate build warnings
Removed unused widget subscription system, cache utilities, error variants,
theme functions, and struct fields. Replaced subscription-based widgets
with direct metric filtering. Build now completes with zero warnings.
2025-10-18 23:50:15 +02:00
7f85a6436e Clean up unused imports and fix build warnings
- Remove unused imports (Duration, HashMap, SharedError, DateTime, etc.)
- Fix unused variables by prefixing with underscore
- Remove redundant dashboard.toml config file
- Update theme imports to use only needed components
- Maintain all functionality while reducing warnings
- Add srv02 to predefined hosts configuration
- Remove unused broadcast_command methods
2025-10-18 23:12:07 +02:00
f0eec38655 Fix SMART data collection and clean up configuration
- Restore sudo smartctl commands for proper SMART data collection
- Add srv02 to host configuration for dashboard discovery
- Remove redundant hosts.toml file, consolidate into dashboard.toml
- Clean up base_url fields that were unused in ZMQ architecture

SMART data collection now works properly under the systemd service
by using the sudo permissions configured in NixOS. The dashboard can now
discover and connect to srv02 alongside the existing hosts.
2025-10-18 22:22:02 +02:00
8cf8d37556 Add srv02 to predefined host list 2025-10-18 20:43:25 +02:00
792ad066c9 Fix per-host widget cache to prevent overwriting cached data
Only update widgets when metrics are available for the current host,
preventing immediate overwrite of cached widget states when switching hosts.
2025-10-18 20:20:58 +02:00
4b7d08153c Implement per-host widget cache for instant host switching
Resolves the widget data persistence issue where switching hosts left stale
data from the previous host displayed in the widgets.

Key improvements:
- Add Clone derives to all widget structs (CpuWidget, MemoryWidget,
  ServicesWidget, BackupWidget)
- Create HostWidgets struct to cache widget states per hostname
- Update TuiApp with HashMap<String, HostWidgets> for per-host storage
- Fix borrowing issues by cloning hostname before mutable self borrow
- Implement instant widget state restoration when switching hosts

Tab key host switching now displays cached widget data for each host,
without stale information persisting between switches.
2025-10-18 19:54:08 +02:00
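
A condensed sketch of the per-host cache idea, with simplified widget types standing in for the real CpuWidget/MemoryWidget/ServicesWidget/BackupWidget structs; the switch_host logic mirrors the clone-before-borrow fix mentioned above but is not the repository's exact code.

```rust
use std::collections::HashMap;

// Hypothetical widget state types standing in for the real widgets.
#[derive(Clone, Default)]
struct CpuWidget { load_line: String }
#[derive(Clone, Default)]
struct MemoryWidget { usage_line: String }

/// Cached widget state for one host, cloned in and out on Tab switches.
#[derive(Clone, Default)]
struct HostWidgets {
    cpu: CpuWidget,
    memory: MemoryWidget,
}

struct TuiApp {
    current_host: String,
    active: HostWidgets,
    cache: HashMap<String, HostWidgets>,
}

impl TuiApp {
    fn switch_host(&mut self, next: &str) {
        // Clone the hostname first so we don't hold a borrow of self while
        // mutating the cache (the borrowing issue the commit mentions).
        let previous = self.current_host.clone();
        self.cache.insert(previous, self.active.clone());
        self.active = self.cache.get(next).cloned().unwrap_or_default();
        self.current_host = next.to_string();
    }
}

fn main() {
    let mut app = TuiApp {
        current_host: "localhost".into(),
        active: HostWidgets::default(),
        cache: HashMap::new(),
    };
    app.active.cpu.load_line = "load 0.42".into();
    app.switch_host("srv01");     // srv01 starts from a clean state
    app.switch_host("localhost"); // localhost's cached widgets come back
    assert_eq!(app.active.cpu.load_line, "load 0.42");
}
```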
46cc813a68 Implement Tab key host switching functionality
- Add KeyCode::Tab support to main dashboard event loop
- Add Tab key handling to TuiApp handle_input method
- Tab key now cycles to next host using existing navigate_host logic
- Host switching infrastructure was already implemented; it just needed Tab key support
- Current host displayed in bold in title bar, other hosts shown normally
- Metrics filtered by selected host, full navigation working
2025-10-18 19:26:58 +02:00
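
A sketch of the Tab handling, assuming the dashboard's event loop uses crossterm (which the KeyCode::Tab reference suggests); the App type and host list are simplified placeholders.

```rust
use crossterm::event::{self, Event, KeyCode};
use std::time::Duration;

struct App {
    hosts: Vec<String>,
    selected: usize,
}

impl App {
    /// Same idea as the existing navigate_host logic: cycle to the next host.
    fn next_host(&mut self) {
        if !self.hosts.is_empty() {
            self.selected = (self.selected + 1) % self.hosts.len();
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut app = App { hosts: vec!["localhost".into(), "srv01".into()], selected: 0 };
    loop {
        // Poll so the UI can keep ticking even without input.
        if event::poll(Duration::from_millis(250))? {
            if let Event::Key(key) = event::read()? {
                match key.code {
                    KeyCode::Tab => app.next_host(),
                    KeyCode::Char('q') => break,
                    _ => {}
                }
            }
        }
        // ... draw the UI for app.hosts[app.selected] here ...
    }
    Ok(())
}
```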
5d52c5b1aa Fix SMART data and site latency checking issues
- Add sudo to disk collector smartctl commands for proper SMART data access
- Add reqwest dependency with blocking feature for HTTP site checks
- Replace curl-based site latency with reqwest HTTP client implementation
- Maintain 2-second connect timeout and 5-second total timeout
- Fix disk health UNKNOWN status by enabling proper SMART permissions
- Fix nginx site timeout issues by using proper HTTP client with redirect support
2025-10-18 19:14:29 +02:00
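
A sketch of the reqwest-based check with the timeouts named above; it needs reqwest with the blocking feature enabled (as this commit adds), and the URL and error formatting are illustrative only.

```rust
use std::time::{Duration, Instant};

fn check_site(client: &reqwest::blocking::Client, url: &str) -> Result<u128, String> {
    let start = Instant::now();
    match client.get(url).send() {
        Ok(resp) if resp.status().is_success() => Ok(start.elapsed().as_millis()),
        Ok(resp) => Err(format!("HTTP {}", resp.status())),
        Err(e) => Err(format!("unreachable: {e}")),
    }
}

fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::builder()
        .connect_timeout(Duration::from_secs(2)) // fail fast on dead backends
        .timeout(Duration::from_secs(5))         // total request budget
        .build()?;

    // Example URL; redirects are followed by default, matching the commit.
    match check_site(&client, "https://example.com/") {
        Ok(ms) => println!("ok in {ms}ms"),
        Err(e) => println!("failed: {e}"),
    }
    Ok(())
}
```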
dcca5bbea3 Fix cache tier test to match actual configuration
- Update test expectations from 5s to 2s intervals for realtime tier
- Fix comment to reflect actual 2s interval instead of outdated 5s reference
- All tests now pass correctly
2025-10-18 18:44:13 +02:00
125111ee99 Implement comprehensive backup monitoring and fix timestamp issues
- Add BackupCollector for reading TOML status files with disk space metrics
- Implement BackupWidget with disk usage display and service status details
- Fix backup script disk space parsing by adding missing capture_output=True
- Update backup widget to show actual disk usage instead of repository size
- Fix timestamp parsing to use backup completion time instead of start time
- Resolve timezone issues by using UTC timestamps in backup script
- Add disk identification metrics (product name, serial number) to backup status
- Enhance UI layout with proper backup monitoring integration
2025-10-18 18:33:41 +02:00
8a36472a3d Implement real-time process monitoring and fix UI hardcoded data
This commit addresses several key issues identified during development:

Major Changes:
- Replace hardcoded top CPU/RAM process display with real system data
- Add intelligent process monitoring to CpuCollector using ps command
- Fix disk metrics permission issues in systemd collector
- Optimize service collection to focus on status, memory, and disk only
- Update dashboard widgets to display live process information

Process Monitoring Implementation:
- Added collect_top_cpu_process() and collect_top_ram_process() methods
- Implemented ps-based monitoring with accurate CPU percentages
- Added filtering to prevent self-monitoring artifacts (the ps commands themselves)
- Enhanced error handling and validation for process data
- Dashboard now shows realistic values like "claude (PID 2974) 11.0%"

Service Collection Optimization:
- Removed CPU monitoring from systemd collector for efficiency
- Enhanced service directory permission error logging
- Simplified services widget to show essential metrics only
- Fixed service-to-directory mapping accuracy

UI and Dashboard Improvements:
- Reorganized dashboard layout with btop-inspired multi-panel design
- Updated system panel to include real top CPU/RAM process display
- Enhanced widget formatting and data presentation
- Removed placeholder/hardcoded data throughout the interface

Technical Details:
- Updated agent/src/collectors/cpu.rs with process monitoring
- Modified dashboard/src/ui/mod.rs for real-time process display
- Enhanced systemd collector error handling and disk metrics
- Updated CLAUDE.md documentation with implementation details
2025-10-16 23:55:05 +02:00
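
A rough sketch of the ps-based top-process sampling, including the self-filtering mentioned above; the exact ps columns and parsing are assumptions, not the collector's code.

```rust
use std::process::Command;

/// Returns (process name, pid, cpu percent) for the busiest process, skipping
/// the ps invocation itself to avoid self-monitoring artifacts.
fn top_cpu_process() -> Option<(String, u32, f32)> {
    let output = Command::new("ps")
        .args(["-eo", "comm,pid,%cpu", "--sort=-%cpu", "--no-headers"])
        .output()
        .ok()?;
    let text = String::from_utf8_lossy(&output.stdout);
    for line in text.lines() {
        let mut parts = line.split_whitespace();
        let (Some(name), Some(pid), Some(cpu)) = (parts.next(), parts.next(), parts.next()) else {
            continue; // malformed line, skip it
        };
        let (Ok(pid), Ok(cpu)) = (pid.parse::<u32>(), cpu.parse::<f32>()) else {
            continue;
        };
        if name == "ps" {
            continue; // filter out our own sampling command
        }
        return Some((name.to_string(), pid, cpu));
    }
    None
}

fn main() {
    if let Some((name, pid, cpu)) = top_cpu_process() {
        println!("{name} (PID {pid}) {cpu:.1}%");
    }
}
```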
7a664ef0fb Remove refresh functionality that causes dashboard to hang
- Remove 'r' key handler that was causing hang on refresh
- Remove RefreshRequested event and check_refresh_request method
- Remove send_refresh_commands function and ZMQ command protocol
- Remove refresh_requested field from App struct
- Clean up status line text (refresh -> tick)

The refresh functionality was causing the dashboard to become unresponsive
when the 'r' key was pressed. This commit removes all refresh-related code to fix the issue.
2025-10-16 01:00:39 +02:00
cfc89e7312 Implement metric-level caching system for optimal CPU performance
Replace legacy SmartCache with MetricCollectionManager for granular control:
- RealTime tier (5s): CPU load, CPU temperature, Service CPU usage
- Fast tier (30s): Memory usage, top processes
- Medium tier (5min): Service status, C-states, users
- Slow tier (15min): Disk usage

All CPU-related metrics now update consistently every 5 seconds as requested.
This eliminates the previous inconsistency where only CPU load updated at the
correct frequency while service CPU usage was on a 5-minute interval.
2025-10-16 00:44:15 +02:00
246973ebf6 Fix dashboard connectivity by aggregating metric fragments
The issue was that the metric-level system was sending individual
metric fragments (CPU load and temperature separately) instead of
the complete System/Service messages the dashboard expects.

Now aggregates individual metrics into complete messages:
- CPU load + temperature -> complete System message
- Memory + processes -> complete System message
- Service metrics remain as complete messages

This should resolve 'No data received' on srv01 while maintaining
the 5-second CPU metric update frequency.
2025-10-16 00:25:23 +02:00
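
A toy version of the aggregation step: individual metric fragments, each a JSON object, are merged into one System payload before being broadcast. The key names are invented for illustration.

```rust
use serde_json::{json, Map, Value};

/// Merge metric fragments (each a JSON object) into a single object.
fn aggregate(fragments: &[Value]) -> Value {
    let mut merged = Map::new();
    for fragment in fragments {
        if let Value::Object(obj) = fragment {
            for (k, v) in obj {
                merged.insert(k.clone(), v.clone());
            }
        }
    }
    Value::Object(merged)
}

fn main() {
    let cpu_load = json!({ "cpu_load": 0.42 });
    let cpu_temp = json!({ "cpu_temp_c": 48.5 });
    // One complete System message instead of two separate fragments.
    let system_message = aggregate(&[cpu_load, cpu_temp]);
    println!("{system_message}");
}
```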
3a959e55ed Fix critical JSON data extraction issue in SystemCollector
The MetricCollector implementation was returning JSON with null values
because it was incorrectly extracting Option<&Value> instead of the
actual values. Fixed by using .cloned().unwrap_or() to properly
extract and default the JSON values.

This should resolve the 'No data received' issue as the dashboard
will now receive properly formatted metric data instead of null values.
2025-10-16 00:10:17 +02:00
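
A tiny illustration of the extraction bug and the .cloned().unwrap_or() fix; the field names are examples only.

```rust
use serde_json::{json, Value};

fn main() {
    let metrics = json!({ "cpu_load": 1.23 });

    // Buggy shape: a missing key becomes None, which serializes to null.
    let missing: Option<&Value> = metrics.get("cpu_temp");
    assert!(serde_json::to_value(missing).unwrap().is_null());

    // Fix from the commit: clone the value out and fall back to a default.
    let temp: Value = metrics.get("cpu_temp").cloned().unwrap_or(json!(0.0));
    assert_eq!(temp, json!(0.0));
    let load: Value = metrics.get("cpu_load").cloned().unwrap_or(json!(0.0));
    assert_eq!(load, json!(1.23));
}
```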
925988896a Add ZMQ send debugging to identify data transmission issues
Added detailed logging for ZMQ data sending to see exactly what
data is being transmitted and whether sends are successful.
This will help identify whether the issue lies in the data format,
the send path, or dashboard reception.
2025-10-16 00:00:40 +02:00
6bc2ffd94b Add detailed error logging for metric collection debugging
Added comprehensive error logging to identify why metrics are not being
collected successfully. This will help diagnose the 'No data received'
issue on srv01 by showing exactly which metrics are failing and why.
2025-10-15 23:29:42 +02:00
10aa72816d Fix critical ZMQ command loop causing agent failure
The handle_commands() function was being called continuously in the main
tokio::select! loop, causing thousands of ZMQ state errors that prevented
the agent from functioning properly.

Temporarily disabled command handling to restore basic functionality.
Agent now properly collects and sends metrics without ZMQ errors.

Fixes 'No data received' issue on hosts running the new metric-level agent.
2025-10-15 23:19:44 +02:00
ce2aeeff34 Implement metric-level caching architecture for granular CPU monitoring
Replace legacy SmartCache with MetricCollectionManager for precise control
over individual metric refresh intervals. CPU load and Service CPU usage
now update every 5 seconds as required, while other metrics use optimal
intervals based on volatility.

Key changes:
- ServiceCollector/SystemCollector implement MetricCollector trait
- Metric-specific cache tiers: RealTime(5s), Fast(30s), Medium(5min), Slow(15min)
- SmartAgent main loop uses metric-level scheduling instead of tier-based
- CPU metrics (load, temp, service CPU) refresh every 5 seconds
- Memory and processes refresh every 30 seconds
- Service status and C-states refresh every 5 minutes
- Disk usage refreshes every 15 minutes

The performance-optimized architecture maintains <2% CPU usage while keeping
the dashboard responsive through precise metric timing control.
2025-10-15 23:08:33 +02:00
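
A sketch of the metric-level scheduling idea: each metric advertises a cache tier and is only re-collected once its interval has elapsed. The trait and type names follow the commit's wording, but the signatures are assumptions.

```rust
use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum CacheTier {
    RealTime, // 5s: CPU load, CPU temperature, service CPU
    Fast,     // 30s: memory usage, top processes
    Medium,   // 5min: service status, C-states, users
    Slow,     // 15min: disk usage
}

impl CacheTier {
    fn interval(self) -> Duration {
        match self {
            CacheTier::RealTime => Duration::from_secs(5),
            CacheTier::Fast => Duration::from_secs(30),
            CacheTier::Medium => Duration::from_secs(300),
            CacheTier::Slow => Duration::from_secs(900),
        }
    }
}

trait MetricCollector {
    fn name(&self) -> &'static str;
    fn tier(&self) -> CacheTier;
    fn collect(&mut self) -> serde_json::Value;
}

/// Per-metric schedule entry used by the main loop.
struct Scheduled<C: MetricCollector> {
    collector: C,
    last_run: Option<Instant>,
}

impl<C: MetricCollector> Scheduled<C> {
    fn due(&self, now: Instant) -> bool {
        match self.last_run {
            None => true,
            Some(t) => now.duration_since(t) >= self.collector.tier().interval(),
        }
    }
}

// Dummy collector to show the scheduling call site.
struct CpuLoad;
impl MetricCollector for CpuLoad {
    fn name(&self) -> &'static str { "cpu_load" }
    fn tier(&self) -> CacheTier { CacheTier::RealTime }
    fn collect(&mut self) -> serde_json::Value { serde_json::json!({ "cpu_load": 0.0 }) }
}

fn main() {
    let mut entry = Scheduled { collector: CpuLoad, last_run: None };
    let now = Instant::now();
    if entry.due(now) {
        let _value = entry.collector.collect();
        entry.last_run = Some(now);
    }
}
```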
6bc7f97375 Add refresh shortkey 'r' for on-demand metrics refresh
Implements ZMQ command protocol for dashboard-to-agent communication:
- Agents listen on port 6131 for REQ/REP commands
- Dashboard sends "refresh" command when 'r' key is pressed
- Agents force immediate collection of all metrics via force_refresh_all()
- Fresh data is broadcast immediately to dashboard
- Updated help text to show "r: Refresh all metrics"

Also includes metric-level caching architecture foundation for future
granular control over individual metric update frequencies.
2025-10-15 22:30:04 +02:00
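
A sketch of the agent side of the refresh channel, assuming the zmq crate (the commit does not name the binding): a REP socket bound on port 6131 that answers a "refresh" command.

```rust
fn main() -> Result<(), zmq::Error> {
    let ctx = zmq::Context::new();
    let rep = ctx.socket(zmq::REP)?;
    rep.bind("tcp://*:6131")?;

    loop {
        // Block until the dashboard's REQ socket sends a command.
        let msg = rep.recv_string(0)?;
        match msg.as_deref() {
            Ok("refresh") => {
                // The real agent would call force_refresh_all() here and
                // broadcast the fresh data before replying.
                rep.send("ok".as_bytes(), 0)?;
            }
            _ => {
                rep.send("unknown command".as_bytes(), 0)?;
            }
        }
    }
}
```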
244cade7d8 Fix critical ZMQ broadcast issue in smart agent
Root cause: Smart agent only sent data when tier intervals triggered:
- System (5s): sent data frequently ✓
- Services (5min): sent data only every 5 minutes ✗
- SMART (15min): sent data only every 15 minutes ✗

Dashboard needs continuous data flow every ~5 seconds.

Solution: Add broadcast_all_data() method that sends all available
cached data every 5 seconds, separate from collection intervals.

This ensures dashboard receives all collector data continuously while
maintaining smart caching benefits (reduced CPU from tier-based collection).

Expected result: All widgets (System/Services/SMART/Backup) should
populate immediately after agent restart and stay updated.
2025-10-15 21:21:34 +02:00
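
A toy version of the separation this commit describes, using tokio: collection updates a shared cache on its own schedule, while a 5-second loop broadcasts whatever is cached. The cache layout and send path are simplified placeholders.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Duration;

type Cache = Arc<Mutex<HashMap<String, serde_json::Value>>>;

async fn broadcast_all_data(cache: &Cache) {
    // Snapshot under the lock, then send without holding it.
    let snapshot: Vec<(String, serde_json::Value)> = {
        let guard = cache.lock().unwrap();
        guard.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    };
    for (name, value) in snapshot {
        // In the real agent this goes out over the ZMQ PUB socket.
        println!("broadcast {name}: {value}");
    }
}

#[tokio::main]
async fn main() {
    let cache: Cache = Arc::new(Mutex::new(HashMap::new()));
    cache
        .lock()
        .unwrap()
        .insert("system".into(), serde_json::json!({ "cpu_load": 0.1 }));

    let mut broadcast_tick = tokio::time::interval(Duration::from_secs(5));
    let mut collect_tick = tokio::time::interval(Duration::from_secs(300)); // e.g. Services tier

    loop {
        tokio::select! {
            _ = broadcast_tick.tick() => broadcast_all_data(&cache).await,
            _ = collect_tick.tick() => {
                // Slow collection updates the cache without gating the broadcast.
                cache.lock().unwrap().insert("services".into(), serde_json::json!({ "count": 12 }));
            }
        }
    }
}
```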
996b89aa47 Fix critical cache key mismatch in smart agent
Cache storage was using keys like 'hostname_service' but lookup was using
'hostname_CollectorName', causing all non-System collectors to fail.

Changes:
- Standardize cache keys to use collector names ('SystemCollector', 'ServiceCollector', etc.)
- Add cache_key() getter method to CachedCollector
- Fix cache lookup to use consistent keys

This should resolve the issue where srv01 only shows System data but no
Services/SMART/Backup data in the dashboard.
2025-10-15 12:12:45 +02:00
b0112dd8ab Fix immich disk quota and usage detection
- Update quota from 200GB to 500GB (matches NixOS config)
- Fix disk usage path: /var/lib/immich-server -> /var/lib/immich
- Add service-to-directory mapping for accurate disk usage detection

This should resolve the "<1MB disk usage of 200GB" issue: immich should now
correctly report usage of /var/lib/immich against the 500GB quota.
2025-10-15 11:59:07 +02:00
1b572c5c1d Implement intelligent caching system for optimal CPU performance
Replace traditional 5-second polling with tiered collection strategy:
- RealTime (5s): CPU load, memory usage
- Medium (5min): Service status, disk usage
- Slow (15min): SMART data, backup status

Key improvements:
- Reduce CPU usage from 9.5% to <2%
- Cache warming for instant dashboard responsiveness
- Background refresh at 80% of tier intervals
- Thread-safe cache with automatic cleanup

Remove legacy polling code; smart caching is now the default and only mode.
Agent startup enhanced with parallel cache population for immediate data availability.

Architecture: SmartCache + CachedCollector + tiered CollectionScheduler
2025-10-15 11:21:36 +02:00
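
A small sketch of the background-refresh threshold (80% of the tier interval) so cached entries are rebuilt before they expire; the struct shape is illustrative.

```rust
use std::time::{Duration, Instant};

struct CachedEntry {
    collected_at: Instant,
    tier_interval: Duration,
}

impl CachedEntry {
    /// Background refresh threshold: 80% of the tier interval.
    fn needs_background_refresh(&self, now: Instant) -> bool {
        let threshold = self.tier_interval.mul_f64(0.8);
        now.duration_since(self.collected_at) >= threshold
    }
}

fn main() {
    let entry = CachedEntry {
        collected_at: Instant::now() - Duration::from_secs(250),
        tier_interval: Duration::from_secs(300), // Medium tier (5min)
    };
    // 250s elapsed >= 240s (80% of 300s), so refresh ahead of expiry.
    assert!(entry.needs_background_refresh(Instant::now()));
}
```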
1b442be9ad Fix service disk quota detection to use actual systemd quotas
- Implement proper quota detection for services with known systemd configurations
- Set gitea quota to 100GB (matches NixOS tmpfiles configuration)
- Add service-specific quotas: postgres/mysql 50GB, immich 200GB, unifi 10GB
- Fallback to service-appropriate defaults for other services
2025-10-15 09:57:05 +02:00
efdd713f62 Improve dashboard display and fix service issues
- Remove 'unreachable' descriptions from failed nginx sites
- Show complete site URLs instead of truncating at first dot
- Implement service-specific disk quotas (docker: 4GB, immich: 4GB, others: 1-2GB)
- Truncate process names to show only executable name without full path
- Display only highest C-state instead of all C-states for cleaner output
- Format system RAM as xxxMB/GB (totalGB) to match services format
2025-10-15 09:36:03 +02:00
672c8bebc9 Fix recursive async function for notification system
- Convert recursive async function to synchronous with return values
- Collect all status changes first, then process them asynchronously
- Resolves Rust compiler error E0733 for recursive async functions
- Maintains same functionality without boxing requirement
- Verified with full workspace build matching NixOS configuration
2025-10-14 23:22:30 +02:00
407329657f Implement unified notification system for all agents
- Replace hardcoded agent-specific notification logic with generic scanner
- Automatically detect all '_status' fields across all collectors recursively
- Send email notifications from hostname@cmtec.se to cm@cmtec.se for any status changes
- Include agent name, component, and source path in notification description
- Works identically for System, Service, Smart, Backup, and future collectors
- Supports nested objects and arrays for comprehensive monitoring
2025-10-14 23:10:15 +02:00
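
A sketch of the generic '_status' scanner, written as a plain synchronous recursion over serde_json::Value (consistent with the E0733 fix above) that collects changes for later asynchronous notification; the StatusChange shape and paths are assumptions.

```rust
use serde_json::{json, Value};

#[derive(Debug)]
struct StatusChange {
    path: String,  // e.g. "services.nginx.sites[0].site_status"
    value: String, // e.g. "error"
}

/// Walk the JSON tree and record every field whose name ends in "_status".
fn scan_status_fields(value: &Value, path: &str, out: &mut Vec<StatusChange>) {
    match value {
        Value::Object(map) => {
            for (key, child) in map {
                let child_path = if path.is_empty() { key.clone() } else { format!("{path}.{key}") };
                if key.ends_with("_status") {
                    out.push(StatusChange {
                        path: child_path.clone(),
                        value: child.as_str().unwrap_or("?").to_string(),
                    });
                }
                scan_status_fields(child, &child_path, out);
            }
        }
        Value::Array(items) => {
            for (i, child) in items.iter().enumerate() {
                scan_status_fields(child, &format!("{path}[{i}]"), out);
            }
        }
        _ => {}
    }
}

fn main() {
    let metrics = json!({
        "services": { "nginx": { "service_status": "ok",
                                 "sites": [ { "url": "git.cmtec.se", "site_status": "error" } ] } }
    });
    let mut changes = Vec::new();
    scan_status_fields(&metrics, "", &mut changes);
    // The real agent would now compare against previous values and send the
    // notifications (hostname@cmtec.se -> cm@cmtec.se) asynchronously.
    for c in &changes {
        println!("{}: {}", c.path, c.value);
    }
}
```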
a64464142c Remove nginx site accessibility filtering to monitor all sites
- Remove check_site_accessibility function and filtering logic
- Monitor ALL nginx sites from config regardless of current status
- Site status determined by measure_site_latency, not accessibility filter
- Fixes missing git.cmtec.se when backend is down (502 errors)
- Sites with errors now show as failed instead of being filtered out
2025-10-14 22:46:06 +02:00
0cb69ea8fa Consolidate HTTP checking and improve display formatting
- Change site latency timeout from 5s to 2s for faster error detection
- Replace curl with reqwest for external connectivity checks (consistent timeouts)
- Remove unused gitea-specific monitoring functionality
- Update dashboard: show 'unreachable' for latency > 2000ms, add arrows (→) between site and latency
- Add percentage signs to CPU metrics display
- All HTTP requests now use reqwest with 2-second timeouts
2025-10-14 22:24:22 +02:00
819ca4ad73 Fix SystemCollector method placement and remove duplicates
- Move get_top_cpu_process() and get_top_ram_process() methods inside SystemCollector impl block
- Remove duplicate method definitions that were placed after trait implementation
- Ensures methods are properly accessible during compilation
2025-10-14 22:05:44 +02:00
f3b6d12f68 Add top CPU and RAM process monitoring to System widget
- Implement get_top_cpu_process() and get_top_ram_process() functions in SystemCollector
- Add top_cpu_process and top_ram_process fields to SystemSummary data structure
- Update System widget to display top processes as description rows
- Show process name and percentage usage for highest CPU and RAM consumers
- Skip kernel threads and filter out processes with minimal usage (<0.1%)
2025-10-14 21:47:52 +02:00
2bffbaa000 Change nginx site monitoring from HEAD to GET requests
- Fix false negatives for sites that don't handle HEAD requests properly
- Resolves photos.cmtec.se showing error when it actually works fine
- Improves compatibility with modern web applications
2025-10-14 21:22:30 +02:00
355a986582 Fix nginx site monitoring to properly detect errors
- Return error status for HTTP 502/5xx responses instead of success
- Show 'error' description for sites with connectivity but wrong status codes
- Show 'unreachable' description for complete connection failures
- Each nginx site now has independent status based on actual health
- Sites with timeouts or server errors will trigger notifications
2025-10-14 20:53:07 +02:00
e64527ce2f Fix compilation error in nginx site status handling 2025-10-14 20:30:47 +02:00
77795c44d3 Implement nginx site status monitoring with unreachable detection
- Show 'unreachable' status for nginx sites that fail connection tests
- Set service status to error (red) for unreachable sites
- Display latency in milliseconds for responsive sites
- Properly count failed sites in service summary statistics
- Improve nginx site monitoring reliability and visibility
2025-10-14 20:19:39 +02:00
f10a4e25e6 Update Cargo.lock for reqwest dependency 2025-10-14 19:42:49 +02:00
fd8aa0678e Implement nginx site latency monitoring and improve disk usage display
Agent improvements:
- Add reqwest dependency for HTTP latency testing
- Implement measure_site_latency() function for nginx sites
- Add latency_ms field to ServiceData structure
- Measure response times for nginx sites using HEAD requests
- Handle connection failures gracefully with 5-second timeout
- Use HTTPS for external sites, HTTP for localhost

Dashboard improvements:
- Add latency_ms field to ServiceInfo structure
- Display latency for nginx sites: "docker.cmtec.se 134ms"
- Only show latency for nginx sub-services, not other services
- Change disk usage "0" to "<1MB" for better readability

The Services widget now shows:
- Nginx sites with response times when measurable
- Cleaner disk usage formatting for small values
- Improved user experience with meaningful latency data
2025-10-14 19:38:36 +02:00
c6e8749ddd Implement logged-in users monitoring and improve widget formatting
Agent improvements:
- Add get_logged_in_users() function to SystemCollector using 'who' command
- Collect unique, sorted list of currently logged-in users
- Include logged_in_users field in system metrics JSON output
- Change C-state formatting to show 2 states per row instead of 4

Dashboard improvements:
- Update Backups widget to show "Archives: XX, ..." format
- System widget ready to display logged-in users with proper formatting

The System widget will now show:
- C-states formatted as 2 per row for better readability
- Logged-in users displayed as "Logged in: user" or "Logged in: X users (user1, user2)"
2025-10-14 19:23:26 +02:00
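
A simple sketch of the 'who'-based collection; it assumes the first whitespace-separated column of each line is the username.

```rust
use std::collections::BTreeSet;
use std::process::Command;

fn get_logged_in_users() -> Vec<String> {
    let output = match Command::new("who").output() {
        Ok(o) => o,
        Err(_) => return Vec::new(),
    };
    let text = String::from_utf8_lossy(&output.stdout);
    // BTreeSet gives the unique, sorted list the commit describes.
    let users: BTreeSet<String> = text
        .lines()
        .filter_map(|line| line.split_whitespace().next())
        .map(|user| user.to_string())
        .collect();
    users.into_iter().collect()
}

fn main() {
    println!("Logged in: {}", get_logged_in_users().join(", "));
}
```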
1ee398e648 Improve widget formatting and add logged-in users support
Services widget:
- Fix disk quota formatting with proper rounding instead of truncation
- Remove decimals from RAM quotas and use GB instead of G
- Change quota display to use GB consistently

Backups widget:
- Change GiB to GB for consistency
- Remove spaces between numbers and units
- Update disk usage format to match other widgets: used (totalGB)
- Remove percentage display for cleaner format

System widget:
- Add support for logged-in users in description lines
- Format C-states with "C-State:" prefix on first line, indent subsequent lines
- Add logged_in_users field to SystemSummary data structure

Documentation:
- Add example hash error output to NixOS update instructions
2025-10-14 18:59:31 +02:00
3e5e91f078 Remove SB column and improve widget formatting
Services widget:
- Remove SB (sandbox) column and related formatting function
- Fix quota formatting to show decimals when needed (1.5G not 1G)
- Remove spaces in unit display (128MB not 128 MB)

Storage widget:
- Change usage format to 23GB (932GB) for better readability

Documentation:
- Add NixOS configuration update process to CLAUDE.md
2025-10-14 18:40:12 +02:00
b0d3d85fb9 Improve services widget column headers and value formatting
- Update column headers to be more concise: RAM (GB) → RAM, CPU (%) → CPU, Disk (GB) → Disk
- Change sandbox column "no(ok)" to "-" for excluded services
- Implement smart unit formatting for memory and disk values (kB/MB/GB)
- Display quotas as (XG) format without decimals when limits exist
- Add format_bytes() helper for consistent unit display across metrics
2025-10-14 18:21:45 +02:00
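
One way the format_bytes() helper could look, matching the formatting rules described in these commits (kB/MB/GB, no space before the unit, decimals only when needed); the thresholds and rounding behavior are assumptions.

```rust
fn format_bytes(bytes: u64) -> String {
    const KB: f64 = 1_000.0;
    const MB: f64 = 1_000_000.0;
    const GB: f64 = 1_000_000_000.0;
    let b = bytes as f64;
    let (value, unit) = if b >= GB {
        (b / GB, "GB")
    } else if b >= MB {
        (b / MB, "MB")
    } else {
        (b / KB, "kB")
    };
    // Show one decimal only when it matters (1.5GB, but 128MB not 128.0MB).
    if (value - value.round()).abs() < 0.05 {
        format!("{}{}", value.round() as u64, unit)
    } else {
        format!("{value:.1}{unit}")
    }
}

fn main() {
    assert_eq!(format_bytes(128_000_000), "128MB");
    assert_eq!(format_bytes(1_500_000_000), "1.5GB");
    assert_eq!(format_bytes(23_000_000_000), "23GB");
}
```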
c6d5a3f2a5 Add sandbox exclusion list for system services
Implement an exclusion list for services that don't require sandboxing due
to their nature (SSH, Docker, system services). These services now show
"no(ok)" in the SB column and keep their green status instead of a warning.

Changes:
- Add is_sandbox_excluded field to ServiceData and ServiceInfo structs
- Add is_sandbox_excluded() method with system service exclusions:
  - sshd/ssh (needs system access for auth/shell)
  - docker (needs broad system access)
  - systemd services, dbus, NetworkManager, etc.
- Update status determination to accept excluded services as ok
- Update format_sandbox_value to show "no(ok)" for excluded services
- Update all ServiceData constructors with exclusion field

Service status logic:
- Sandboxed: Status=Running, SB="yes"
- Excluded: Status=Running, SB="no(ok)"
- Should be sandboxed but isn't: Status=Degraded, SB="no"

This provides a clear distinction between services that legitimately don't
need sandboxing and those requiring security attention.
2025-10-14 11:35:42 +02:00
4fa2b079f1 Add sandbox column and security-based service status
Add new "SB" column to services widget showing systemd sandboxing status.
Service status now reflects security posture with unsandboxed services
showing as degraded/warning status.

Changes:
- Add is_sandboxed field to ServiceData and ServiceInfo structs
- Add check_service_sandbox method detecting systemd hardening features
- Add format_sandbox_value function showing "yes"/"no" for sandboxing
- Update service status determination to consider sandbox status:
  - Sandboxed + Running = "Running" (green/ok)
  - Unsandboxed + Running = "Degraded" (yellow/warning)
  - Failed services = "Stopped" (red/critical)
- Add "SB" column header to services widget

Services without proper NixOS hardening (PrivateTmp, ProtectSystem, etc.)
now show warning status to highlight security concerns.
2025-10-14 11:18:07 +02:00
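
A sketch of how the hardening check could query unit properties via systemctl show; the property list and pass criterion are illustrative, not the collector's actual policy.

```rust
use std::process::Command;

/// Returns true if the unit has at least the basic sandboxing options enabled.
fn check_service_sandbox(unit: &str) -> bool {
    let output = match Command::new("systemctl")
        .args(["show", unit, "--property=PrivateTmp,ProtectSystem,ProtectHome"])
        .output()
    {
        Ok(o) => o,
        Err(_) => return false,
    };
    let text = String::from_utf8_lossy(&output.stdout);
    let mut private_tmp = false;
    let mut protect_system = false;
    // Output lines look like "PrivateTmp=yes".
    for line in text.lines() {
        match line.split_once('=') {
            Some(("PrivateTmp", v)) => private_tmp = v == "yes",
            Some(("ProtectSystem", v)) => protect_system = v == "strict" || v == "full" || v == "yes",
            _ => {}
        }
    }
    private_tmp && protect_system
}

fn main() {
    let unit = "nginx.service";
    let status = if check_service_sandbox(unit) { "yes" } else { "no" };
    println!("{unit} SB={status}");
}
```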