Bump version to v0.1.194

Fix data duplication in cached collector architecture
Critical bug fix: Collectors were appending to Vecs instead of replacing them, causing duplicate entries with each collection cycle. Fixed by adding .clear() calls before populating: - Memory collector: tmpfs Vec (was showing 11+ duplicates) - Disk collector: drives and pools Vecs - Systemd collector: services Vec - Network collector: Already correct (assigns new Vec) This prevents the exponential growth of duplicate entries in the dashboard UI.
2025-11-27 22:46:17 +01:00 · 2025-11-27 22:45:44 +01:00 · 2025-11-27 22:37:20 +01:00 · 2025-11-27 21:49:44 +01:00 · 2025-11-27 18:34:53 +01:00 · 2025-11-27 18:34:27 +01:00
13 changed files with 369 additions and 138 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -156,6 +156,86 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
 - ✅ Backward compatibility via bridge conversion to existing UI widgets
 - ✅ All string parsing bugs eliminated

+### Cached Collector Architecture (✅ IMPLEMENTED)
+
+**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.
+
+**Previous (Sequential Blocking):**
+```
+Every 1 second:
+  └─ collect_all_data() [BLOCKS for 2-10+ seconds]
+      ├─ CPU (fast: 10ms)
+      ├─ Memory (fast: 20ms)
+      ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
+      ├─ Service disk usage (slow: 2-8s per service)
+      └─ Docker (medium: 500ms)
+  └─ send_via_zmq()  [Only after ALL collection completes]
+
+Result: If any collector takes >10s → "host offline" false alert
+```
+
+**New (Cached Independent Collectors):**
+```
+Shared Cache: Arc<RwLock<AgentData>>
+
+Background Collectors (independent async tasks):
+├─ Fast collectors (CPU, RAM, Network)
+│   └─ Update cache every 1 second
+├─ Medium collectors (Services, Docker)
+│   └─ Update cache every 5 seconds
+└─ Slow collectors (Disk usage, SMART data)
+    └─ Update cache every 60 seconds
+
+ZMQ Sender (separate async task):
+Every 1 second:
+  └─ Read current cache
+  └─ Send via ZMQ [Always instant, never blocked]
+```
+
+**Benefits:**
+- ✅ ZMQ sends every 1 second regardless of collector speed
+- ✅ No false "host offline" alerts from slow collectors
+- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
+- ✅ System stays responsive even with slow operations
+- ✅ Slow collectors can use longer timeouts without blocking
+
+**Implementation Details:**
+- **Shared cache**: `Arc<RwLock<AgentData>>` initialized at agent startup
+- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
+  - Recommended: Fast (1-10s): CPU, Memory, Network
+  - Recommended: Medium (30-60s): Backup, NixOS
+  - Recommended: Slow (60-300s): Disk, Systemd
+- **Independent tasks**: Each collector spawned as separate tokio task in `Agent::new()`
+- **Cache updates**: Collectors acquire write lock → update → release immediately
+- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
+- **Notification check**: Runs every `notifications.check_interval_seconds`
+- **Lock strategy**: Short-lived write locks prevent blocking, read locks for transmission
+- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)
+
+**Configuration (NixOS):**
+All intervals and timeouts configurable in `services/cm-dashboard.nix`:
+
+Collection Intervals:
+- `collectors.cpu.interval_seconds` (default: 10s)
+- `collectors.memory.interval_seconds` (default: 2s)
+- `collectors.disk.interval_seconds` (default: 300s)
+- `collectors.systemd.interval_seconds` (default: 10s)
+- `collectors.backup.interval_seconds` (default: 60s)
+- `collectors.network.interval_seconds` (default: 10s)
+- `collectors.nixos.interval_seconds` (default: 60s)
+- `notifications.check_interval_seconds` (default: 30s)
+- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)
+
+Command Timeouts (prevent resource leaks from hung commands):
+- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
+- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
+- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr
+
+**Code Locations:**
+- agent/src/agent.rs:59-133 - Collector task spawning
+- agent/src/agent.rs:151-179 - Independent collector task runner
+- agent/src/agent.rs:199-207 - ZMQ sender in main loop
+
 ### Maintenance Mode

 - Agent checks for `/tmp/cm-maintenance` file before sending notifications
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"

 [[package]]
 name = "cm-dashboard"
-version = "0.1.188"
+version = "0.1.193"
 dependencies = [
 "anyhow",
 "chrono",
@@ -301,7 +301,7 @@ dependencies = [

 [[package]]
 name = "cm-dashboard-agent"
-version = "0.1.188"
+version = "0.1.193"
 dependencies = [
 "anyhow",
 "async-trait",
@@ -324,7 +324,7 @@ dependencies = [

 [[package]]
 name = "cm-dashboard-shared"
-version = "0.1.188"
+version = "0.1.193"
 dependencies = [
 "chrono",
 "serde",
--- a/agent/Cargo.toml
+++ b/agent/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-agent"
-version = "0.1.189"
+version = "0.1.194"
 edition = "2021"

 [dependencies]
--- a/agent/src/agent.rs
+++ b/agent/src/agent.rs
@@ -1,13 +1,14 @@
 use anyhow::Result;
 use gethostname::gethostname;
+use std::sync::Arc;
 use std::time::Duration;
+use tokio::sync::RwLock;
 use tokio::time::interval;
 use tracing::{debug, error, info};

 use crate::communication::{AgentCommand, ZmqHandler};
 use crate::config::AgentConfig;
 use crate::collectors::{
-    Collector,
    backup::BackupCollector,
    cpu::CpuCollector,
    disk::DiskCollector,
@@ -23,7 +24,7 @@ pub struct Agent {
    hostname: String,
    config: AgentConfig,
    zmq_handler: ZmqHandler,
-    collectors: Vec<Box<dyn Collector>>,
+    cache: Arc<RwLock<AgentData>>,
    notification_manager: NotificationManager,
    previous_status: Option<SystemStatus>,
 }
@@ -55,39 +56,94 @@ impl Agent {
            config.zmq.publisher_port
        );

-        // Initialize collectors
-        let mut collectors: Vec<Box<dyn Collector>> = Vec::new();
-        
-        // Add enabled collectors
+        // Initialize shared cache
+        let cache = Arc::new(RwLock::new(AgentData::new(
+            hostname.clone(),
+            env!("CARGO_PKG_VERSION").to_string()
+        )));
+        info!("Initialized shared agent data cache");
+
+        // Spawn independent collector tasks
+        let mut collector_count = 0;
+
+        // CPU collector
        if config.collectors.cpu.enabled {
-            collectors.push(Box::new(CpuCollector::new(config.collectors.cpu.clone())));
+            let cache_clone = cache.clone();
+            let collector = CpuCollector::new(config.collectors.cpu.clone());
+            let interval = config.collectors.cpu.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "CPU").await;
+            });
+            collector_count += 1;
        }
-        
+
+        // Memory collector
        if config.collectors.memory.enabled {
-            collectors.push(Box::new(MemoryCollector::new(config.collectors.memory.clone())));
-        }
-        
-        if config.collectors.disk.enabled {
-            collectors.push(Box::new(DiskCollector::new(config.collectors.disk.clone())));
-        }
-        
-        if config.collectors.systemd.enabled {
-            collectors.push(Box::new(SystemdCollector::new(config.collectors.systemd.clone())));
-        }
-        
-        if config.collectors.backup.enabled {
-            collectors.push(Box::new(BackupCollector::new()));
+            let cache_clone = cache.clone();
+            let collector = MemoryCollector::new(config.collectors.memory.clone());
+            let interval = config.collectors.memory.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Memory").await;
+            });
+            collector_count += 1;
        }

+        // Network collector
        if config.collectors.network.enabled {
-            collectors.push(Box::new(NetworkCollector::new(config.collectors.network.clone())));
+            let cache_clone = cache.clone();
+            let collector = NetworkCollector::new(config.collectors.network.clone());
+            let interval = config.collectors.network.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Network").await;
+            });
+            collector_count += 1;
        }

+        // Backup collector
+        if config.collectors.backup.enabled {
+            let cache_clone = cache.clone();
+            let collector = BackupCollector::new();
+            let interval = config.collectors.backup.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Backup").await;
+            });
+            collector_count += 1;
+        }
+
+        // NixOS collector
        if config.collectors.nixos.enabled {
-            collectors.push(Box::new(NixOSCollector::new(config.collectors.nixos.clone())));
+            let cache_clone = cache.clone();
+            let collector = NixOSCollector::new(config.collectors.nixos.clone());
+            let interval = config.collectors.nixos.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "NixOS").await;
+            });
+            collector_count += 1;
        }

-        info!("Initialized {} collectors", collectors.len());
+        // Disk collector
+        if config.collectors.disk.enabled {
+            let cache_clone = cache.clone();
+            let collector = DiskCollector::new(config.collectors.disk.clone());
+            let interval = config.collectors.disk.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Disk").await;
+            });
+            collector_count += 1;
+        }
+
+        // Systemd collector
+        if config.collectors.systemd.enabled {
+            let cache_clone = cache.clone();
+            let collector = SystemdCollector::new(config.collectors.systemd.clone());
+            let interval = config.collectors.systemd.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Systemd").await;
+            });
+            collector_count += 1;
+        }
+
+        info!("Spawned {} independent collector tasks", collector_count);

        // Initialize notification manager
        let notification_manager = NotificationManager::new(&config.notifications, &hostname)?;
@@ -97,45 +153,79 @@ impl Agent {
            hostname,
            config,
            zmq_handler,
-            collectors,
+            cache,
            notification_manager,
            previous_status: None,
        })
    }

-    /// Main agent loop with structured data collection
-    pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
-        info!("Starting agent main loop");
+    /// Independent collector task runner
+    async fn run_collector_task<C>(
+        cache: Arc<RwLock<AgentData>>,
+        collector: C,
+        interval_duration: Duration,
+        name: &str,
+    ) where
+        C: crate::collectors::Collector + Send + 'static,
+    {
+        let mut interval_timer = interval(interval_duration);
+        info!("{} collector task started (interval: {:?})", name, interval_duration);

-        // Initial collection
-        if let Err(e) = self.collect_and_broadcast().await {
-            error!("Initial metric collection failed: {}", e);
+        loop {
+            interval_timer.tick().await;
+
+            // Acquire write lock and update cache
+            {
+                let mut agent_data = cache.write().await;
+                match collector.collect_structured(&mut *agent_data).await {
+                    Ok(_) => {
+                        debug!("{} collector updated cache", name);
+                    }
+                    Err(e) => {
+                        error!("{} collector failed: {}", name, e);
+                    }
+                }
+            } // Release lock immediately after collection
        }
+    }

-        // Set up intervals
+    /// Main agent loop with cached data architecture
+    pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
+        info!("Starting agent main loop with cached collector architecture");
+
+        // Set up intervals from config
        let mut transmission_interval = interval(Duration::from_secs(
            self.config.collection_interval_seconds,
        ));
-        let mut notification_interval = interval(Duration::from_secs(30)); // Check notifications every 30s
+        let mut notification_interval = interval(Duration::from_secs(
+            self.config.notifications.check_interval_seconds,
+        ));
+        let mut command_interval = interval(Duration::from_millis(100));

-        // Skip initial ticks to avoid immediate execution
+        // Skip initial ticks
        transmission_interval.tick().await;
        notification_interval.tick().await;
+        command_interval.tick().await;

        loop {
            tokio::select! {
                _ = transmission_interval.tick() => {
-                    if let Err(e) = self.collect_and_broadcast().await {
-                        error!("Failed to collect and broadcast metrics: {}", e);
+                    // Read current cache state and broadcast via ZMQ
+                    let agent_data = self.cache.read().await.clone();
+                    if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
+                        error!("Failed to broadcast agent data: {}", e);
+                    } else {
+                        debug!("Successfully broadcast agent data");
                    }
                }
                _ = notification_interval.tick() => {
-                    // Process any pending notifications
-                    // NOTE: With structured data, we might need to implement status tracking differently
-                    // For now, we skip this until status evaluation is migrated
+                    // Read cache and check for status changes
+                    let agent_data = self.cache.read().await.clone();
+                    if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
+                        error!("Failed to check status changes: {}", e);
+                    }
                }
-                // Handle incoming commands (check periodically)
-                _ = tokio::time::sleep(Duration::from_millis(100)) => {
+                _ = command_interval.tick() => {
                    if let Err(e) = self.handle_commands().await {
                        error!("Error handling commands: {}", e);
                    }
@@ -151,35 +241,6 @@ impl Agent {
        Ok(())
    }

-    /// Collect structured data from all collectors and broadcast via ZMQ
-    async fn collect_and_broadcast(&mut self) -> Result<()> {
-        debug!("Starting structured data collection");
-
-        // Initialize empty AgentData
-        let mut agent_data = AgentData::new(self.hostname.clone(), env!("CARGO_PKG_VERSION").to_string());
-
-        // Collect data from all collectors
-        for collector in &self.collectors {
-            if let Err(e) = collector.collect_structured(&mut agent_data).await {
-                error!("Collector failed: {}", e);
-                // Continue with other collectors even if one fails
-            }
-        }
-
-        // Check for status changes and send notifications
-        if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
-            error!("Failed to check status changes: {}", e);
-        }
-
-        // Broadcast the structured data via ZMQ
-        if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
-            error!("Failed to broadcast agent data: {}", e);
-        } else {
-            debug!("Successfully broadcast structured agent data");
-        }
-
-        Ok(())
-    }

    /// Check for status changes and send notifications
    async fn check_status_changes_and_notify(&mut self, agent_data: &AgentData) -> Result<()> {
@@ -267,9 +328,12 @@ impl Agent {

            match command {
                AgentCommand::CollectNow => {
-                    info!("Received immediate collection request");
-                    if let Err(e) = self.collect_and_broadcast().await {
-                        error!("Failed to collect on demand: {}", e);
+                    info!("Received immediate transmission request");
+                    // With cached architecture, collectors run independently
+                    // Just send current cache state immediately
+                    let agent_data = self.cache.read().await.clone();
+                    if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
+                        error!("Failed to broadcast on demand: {}", e);
                    }
                }
                AgentCommand::SetInterval { seconds } => {
--- a/agent/src/collectors/disk.rs
+++ b/agent/src/collectors/disk.rs
@@ -117,7 +117,7 @@ impl DiskCollector {
        let mut cmd = Command::new("lsblk");
        cmd.args(&["-rn", "-o", "NAME,MOUNTPOINT"]);

-        let output = run_command_with_timeout(cmd, 2).await
+        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await
            .map_err(|e| CollectorError::SystemRead {
                path: "block devices".to_string(),
                error: e.to_string(),
@@ -530,6 +530,9 @@ impl DiskCollector {

    /// Populate drives data into AgentData
    fn populate_drives_data(&self, physical_drives: &[PhysicalDrive], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
+        // Clear existing drives data to prevent duplicates in cached architecture
+        agent_data.system.storage.drives.clear();
+
        for drive in physical_drives {
            let smart = smart_data.get(&drive.name);
            
@@ -567,6 +570,9 @@ impl DiskCollector {

    /// Populate pools data into AgentData
    fn populate_pools_data(&self, mergerfs_pools: &[MergerfsPool], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
+        // Clear existing pools data to prevent duplicates in cached architecture
+        agent_data.system.storage.pools.clear();
+
        for pool in mergerfs_pools {
            // Calculate pool health and statuses based on member drive health
            let (pool_health, health_status, usage_status, data_drive_data, parity_drive_data) = self.calculate_pool_health(pool, smart_data);
--- a/agent/src/collectors/memory.rs
+++ b/agent/src/collectors/memory.rs
@@ -97,9 +97,12 @@ impl MemoryCollector {

    /// Populate tmpfs data into AgentData
    async fn populate_tmpfs_data(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
+        // Clear existing tmpfs data to prevent duplicates in cached architecture
+        agent_data.system.memory.tmpfs.clear();
+
        // Discover all tmpfs mount points
        let tmpfs_mounts = self.discover_tmpfs_mounts()?;
-        
+
        if tmpfs_mounts.is_empty() {
            debug!("No tmpfs mounts found to monitor");
            return Ok(());
--- a/agent/src/collectors/network.rs
+++ b/agent/src/collectors/network.rs
@@ -8,12 +8,12 @@ use crate::config::NetworkConfig;

 /// Network interface collector with physical/virtual classification and link status
 pub struct NetworkCollector {
-    _config: NetworkConfig,
+    config: NetworkConfig,
 }

 impl NetworkCollector {
    pub fn new(config: NetworkConfig) -> Self {
-        Self { _config: config }
+        Self { config }
    }

    /// Check if interface is physical (not virtual)
@@ -50,8 +50,9 @@ impl NetworkCollector {
    }

    /// Get the primary physical interface (the one with default route)
-    fn get_primary_physical_interface() -> Option<String> {
-        match Command::new("timeout").args(["2", "ip", "route", "show", "default"]).output() {
+    fn get_primary_physical_interface(&self) -> Option<String> {
+        let timeout_str = self.config.command_timeout_seconds.to_string();
+        match Command::new("timeout").args([&timeout_str, "ip", "route", "show", "default"]).output() {
            Ok(output) if output.status.success() => {
                let output_str = String::from_utf8_lossy(&output.stdout);
                // Parse: "default via 192.168.1.1 dev eno1 ..."
@@ -110,7 +111,8 @@ impl NetworkCollector {
        // Parse VLAN configuration
        let vlan_map = Self::parse_vlan_config();

-        match Command::new("timeout").args(["2", "ip", "-j", "addr"]).output() {
+        let timeout_str = self.config.command_timeout_seconds.to_string();
+        match Command::new("timeout").args([&timeout_str, "ip", "-j", "addr"]).output() {
            Ok(output) if output.status.success() => {
                let json_str = String::from_utf8_lossy(&output.stdout);

@@ -195,7 +197,7 @@ impl NetworkCollector {
        }

        // Assign primary physical interface as parent to virtual interfaces without explicit parent
-        let primary_interface = Self::get_primary_physical_interface();
+        let primary_interface = self.get_primary_physical_interface();
        if let Some(primary) = primary_interface {
            for interface in interfaces.iter_mut() {
                // Only assign parent to virtual interfaces that don't already have one
--- a/agent/src/collectors/systemd.rs
+++ b/agent/src/collectors/systemd.rs
@@ -113,6 +113,7 @@ impl SystemdCollector {
                                name: site_name.clone(),
                                service_status: self.calculate_service_status(&site_name, &site_status),
                                metrics,
+                                service_type: "nginx_site".to_string(),
                            });
                        }
                    }
@@ -128,12 +129,13 @@ impl SystemdCollector {
                                name: container_name.clone(),
                                service_status: self.calculate_service_status(&container_name, &container_status),
                                metrics,
+                                service_type: "container".to_string(),
                            });
                        }

                        // Add Docker images
                        let docker_images = self.get_docker_images();
-                        for (image_name, image_status, image_size_str, image_size_mb) in docker_images {
+                        for (image_name, image_status, image_size_mb) in docker_images {
                            let mut metrics = Vec::new();
                            metrics.push(SubServiceMetric {
                                label: "size".to_string(),
@@ -142,9 +144,10 @@ impl SystemdCollector {
                            });

                            sub_services.push(SubServiceData {
-                                name: format!("{} ({})", image_name, image_size_str),
+                                name: image_name.to_string(),
                                service_status: self.calculate_service_status(&image_name, &image_status),
                                metrics,
+                                service_type: "image".to_string(),
                            });
                        }
                    }
@@ -251,18 +254,19 @@ impl SystemdCollector {

    /// Auto-discover interesting services to monitor
    fn discover_services_internal(&self) -> Result<(Vec<String>, std::collections::HashMap<String, ServiceStatusInfo>)> {
-        // First: Get all service unit files (with 3 second timeout)
+        // First: Get all service unit files
+        let timeout_str = self.config.command_timeout_seconds.to_string();
        let unit_files_output = Command::new("timeout")
-            .args(&["3", "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
+            .args(&[&timeout_str, "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
            .output()?;

        if !unit_files_output.status.success() {
            return Err(anyhow::anyhow!("systemctl list-unit-files command failed"));
        }

-        // Second: Get runtime status of all units (with 3 second timeout)
+        // Second: Get runtime status of all units
        let units_status_output = Command::new("timeout")
-            .args(&["3", "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
+            .args(&[&timeout_str, "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
            .output()?;

        if !units_status_output.status.success() {
@@ -358,16 +362,17 @@ impl SystemdCollector {
            }
        }

-        // Fallback to systemctl if not in cache (with 2 second timeout)
+        // Fallback to systemctl if not in cache
+        let timeout_str = self.config.command_timeout_seconds.to_string();
        let output = Command::new("timeout")
-            .args(&["2", "systemctl", "is-active", &format!("{}.service", service)])
+            .args(&[&timeout_str, "systemctl", "is-active", &format!("{}.service", service)])
            .output()?;

        let active_status = String::from_utf8(output.stdout)?.trim().to_string();

-        // Get more detailed info (with 2 second timeout)
+        // Get more detailed info
        let output = Command::new("timeout")
-            .args(&["2", "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
+            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
            .output()?;

        let detailed_info = String::from_utf8(output.stdout)?;
@@ -427,9 +432,10 @@ impl SystemdCollector {
            return Ok(0.0);
        }

-        // No configured path - try to get WorkingDirectory from systemctl (with 2 second timeout)
+        // No configured path - try to get WorkingDirectory from systemctl
+        let timeout_str = self.config.command_timeout_seconds.to_string();
        let output = Command::new("timeout")
-            .args(&["2", "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
+            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
            .output()
            .map_err(|e| CollectorError::SystemRead {
                path: format!("WorkingDirectory for {}", service_name),
@@ -449,15 +455,15 @@ impl SystemdCollector {
        Ok(0.0)
    }
    
-    /// Get size of a directory in GB (with 2 second timeout)
+    /// Get size of a directory in GB
    async fn get_directory_size(&self, path: &str) -> Option<f32> {
        use super::run_command_with_timeout;

-        // Use -s (summary) and --apparent-size for speed, 2 second timeout
+        // Use -s (summary) and --apparent-size for speed
        let mut cmd = Command::new("sudo");
        cmd.args(&["du", "-s", "--apparent-size", "--block-size=1", path]);

-        let output = run_command_with_timeout(cmd, 2).await.ok()?;
+        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await.ok()?;

        if !output.status.success() {
            // Log permission errors for debugging but don't spam logs
@@ -783,9 +789,10 @@ impl SystemdCollector {
        let mut containers = Vec::new();

        // Check if docker is available (cm-agent user is in docker group)
-        // Use -a to show ALL containers (running and stopped) with 3 second timeout
+        // Use -a to show ALL containers (running and stopped)
+        let timeout_str = self.config.command_timeout_seconds.to_string();
        let output = Command::new("timeout")
-            .args(&["3", "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
+            .args(&[&timeout_str, "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
            .output();

        let output = match output {
@@ -824,11 +831,12 @@ impl SystemdCollector {
    }

    /// Get docker images as sub-services
-    fn get_docker_images(&self) -> Vec<(String, String, String, f32)> {
+    fn get_docker_images(&self) -> Vec<(String, String, f32)> {
        let mut images = Vec::new();
-        // Check if docker is available (cm-agent user is in docker group) with 3 second timeout
+        // Check if docker is available (cm-agent user is in docker group)
+        let timeout_str = self.config.command_timeout_seconds.to_string();
        let output = Command::new("timeout")
-            .args(&["3", "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
+            .args(&[&timeout_str, "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
            .output();

        let output = match output {
@@ -865,9 +873,8 @@ impl SystemdCollector {
                let size_mb = self.parse_docker_size(size_str);

                images.push((
-                    format!("I {}", image_name),
+                    image_name.to_string(),
                    "inactive".to_string(), // Images are informational - use inactive for neutral display
-                    size_str.to_string(),
                    size_mb
                ));
            }
@@ -908,6 +915,9 @@ impl SystemdCollector {
 #[async_trait]
 impl Collector for SystemdCollector {
    async fn collect_structured(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
+        // Clear existing services data to prevent duplicates in cached architecture
+        agent_data.services.clear();
+
        // Use cached complete data if available and fresh
        if let Some(cached_complete_services) = self.get_cached_complete_services() {
            for service_data in cached_complete_services {
--- a/agent/src/config/mod.rs
+++ b/agent/src/config/mod.rs
@@ -79,6 +79,9 @@ pub struct DiskConfig {
    pub temperature_critical_celsius: f32,
    pub wear_warning_percent: f32,
    pub wear_critical_percent: f32,
+    /// Command timeout in seconds for lsblk, smartctl, etc.
+    #[serde(default = "default_disk_command_timeout")]
+    pub command_timeout_seconds: u64,
 }

 /// Filesystem configuration entry
@@ -108,6 +111,9 @@ pub struct SystemdConfig {
    pub http_timeout_seconds: u64,
    pub http_connect_timeout_seconds: u64,
    pub nginx_latency_critical_ms: f32,
+    /// Command timeout in seconds for systemctl, docker, du commands
+    #[serde(default = "default_systemd_command_timeout")]
+    pub command_timeout_seconds: u64,
 }


@@ -132,6 +138,9 @@ pub struct BackupConfig {
 pub struct NetworkConfig {
    pub enabled: bool,
    pub interval_seconds: u64,
+    /// Command timeout in seconds for ip route, ip addr commands
+    #[serde(default = "default_network_command_timeout")]
+    pub command_timeout_seconds: u64,
 }

 /// Notification configuration
@@ -145,6 +154,9 @@ pub struct NotificationConfig {
    pub rate_limit_minutes: u64,
    /// Email notification batching interval in seconds (default: 60)
    pub aggregation_interval_seconds: u64,
+    /// Status check interval in seconds for detecting changes (default: 30)
+    #[serde(default = "default_notification_check_interval")]
+    pub check_interval_seconds: u64,
    /// List of metric names to exclude from email notifications
    #[serde(default)]
    pub exclude_email_metrics: Vec<String>,
@@ -158,10 +170,26 @@ fn default_heartbeat_interval_seconds() -> u64 {
    5
 }

+fn default_notification_check_interval() -> u64 {
+    30
+}
+
 fn default_maintenance_mode_file() -> String {
    "/tmp/cm-maintenance".to_string()
 }

+fn default_disk_command_timeout() -> u64 {
+    30
+}
+
+fn default_systemd_command_timeout() -> u64 {
+    15
+}
+
+fn default_network_command_timeout() -> u64 {
+    10
+}
+
 impl AgentConfig {
    pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
        loader::load_config(path)
--- a/dashboard/Cargo.toml
+++ b/dashboard/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard"
-version = "0.1.189"
+version = "0.1.194"
 edition = "2021"

 [dependencies]
--- a/dashboard/src/ui/widgets/services.rs
+++ b/dashboard/src/ui/widgets/services.rs
@@ -32,6 +32,7 @@ struct ServiceInfo {
    disk_gb: Option<f32>,
    metrics: Vec<(String, f32, Option<String>)>, // (label, value, unit)
    widget_status: Status,
+    service_type: String, // "nginx_site", "container", "image", or empty for parent services
 }

 impl ServicesWidget {
@@ -169,7 +170,7 @@ impl ServicesWidget {
            // Convert Status enum to display text for sub-services
            match info.widget_status {
                Status::Ok => "active",
-                Status::Inactive => "inactive", 
+                Status::Inactive => "inactive",
                Status::Critical => "failed",
                Status::Pending => "pending",
                Status::Warning => "warning",
@@ -179,32 +180,62 @@ impl ServicesWidget {
        };
        let tree_symbol = if is_last { "└─" } else { "├─" };

-        vec![
-            // Indentation and tree prefix
-            ratatui::text::Span::styled(
-                format!("  {} ", tree_symbol),
-                Typography::tree(),
-            ),
-            // Status icon
-            ratatui::text::Span::styled(
-                format!("{} ", icon),
-                Style::default().fg(status_color).bg(Theme::background()),
-            ),
-            // Service name
-            ratatui::text::Span::styled(
-                format!("{:<18} ", short_name),
-                Style::default()
-                    .fg(Theme::secondary_text())
-                    .bg(Theme::background()),
-            ),
-            // Status/latency text
-            ratatui::text::Span::styled(
-                status_str,
-                Style::default()
-                    .fg(Theme::secondary_text())
-                    .bg(Theme::background()),
-            ),
-        ]
+        // Docker images use docker whale icon
+        if info.service_type == "image" {
+            vec![
+                // Indentation and tree prefix
+                ratatui::text::Span::styled(
+                    format!("  {} ", tree_symbol),
+                    Typography::tree(),
+                ),
+                // Docker icon (simple character for performance)
+                ratatui::text::Span::styled(
+                    "D ".to_string(),
+                    Style::default().fg(Theme::highlight()).bg(Theme::background()),
+                ),
+                // Service name
+                ratatui::text::Span::styled(
+                    format!("{:<18} ", short_name),
+                    Style::default()
+                        .fg(Theme::secondary_text())
+                        .bg(Theme::background()),
+                ),
+                // Status/metrics text
+                ratatui::text::Span::styled(
+                    status_str,
+                    Style::default()
+                        .fg(Theme::secondary_text())
+                        .bg(Theme::background()),
+                ),
+            ]
+        } else {
+            vec![
+                // Indentation and tree prefix
+                ratatui::text::Span::styled(
+                    format!("  {} ", tree_symbol),
+                    Typography::tree(),
+                ),
+                // Status icon
+                ratatui::text::Span::styled(
+                    format!("{} ", icon),
+                    Style::default().fg(status_color).bg(Theme::background()),
+                ),
+                // Service name
+                ratatui::text::Span::styled(
+                    format!("{:<18} ", short_name),
+                    Style::default()
+                        .fg(Theme::secondary_text())
+                        .bg(Theme::background()),
+                ),
+                // Status/latency text
+                ratatui::text::Span::styled(
+                    status_str,
+                    Style::default()
+                        .fg(Theme::secondary_text())
+                        .bg(Theme::background()),
+                ),
+            ]
+        }
    }

    /// Move selection up
@@ -282,9 +313,10 @@ impl Widget for ServicesWidget {
                disk_gb: Some(service.disk_gb),
                metrics: Vec::new(), // Parent services don't have custom metrics
                widget_status: service.service_status,
+                service_type: String::new(), // Parent services have no type
            };
            self.parent_services.insert(service.name.clone(), parent_info);
-            
+
            // Process sub-services if any
            if !service.sub_services.is_empty() {
                let mut sub_list = Vec::new();
@@ -293,12 +325,13 @@ impl Widget for ServicesWidget {
                    let metrics: Vec<(String, f32, Option<String>)> = sub_service.metrics.iter()
                        .map(|m| (m.label.clone(), m.value, m.unit.clone()))
                        .collect();
-                    
+
                    let sub_info = ServiceInfo {
                        memory_mb: None, // Not used for sub-services
                        disk_gb: None,   // Not used for sub-services
                        metrics,
                        widget_status: sub_service.service_status,
+                        service_type: sub_service.service_type.clone(),
                    };
                    sub_list.push((sub_service.name.clone(), sub_info));
                }
@@ -342,6 +375,7 @@ impl ServicesWidget {
                                    disk_gb: None,
                                    metrics: Vec::new(),
                                    widget_status: Status::Unknown,
+                                    service_type: String::new(),
                                });

                        if metric.name.ends_with("_status") {
@@ -377,6 +411,7 @@ impl ServicesWidget {
                                    disk_gb: None,
                                    metrics: Vec::new(),
                                    widget_status: Status::Unknown,
+                                    service_type: String::new(), // Unknown type in legacy path
                                },
                            ));
                            &mut sub_service_list.last_mut().unwrap().1
--- a/shared/Cargo.toml
+++ b/shared/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-shared"
-version = "0.1.189"
+version = "0.1.194"
 edition = "2021"

 [dependencies]
--- a/shared/src/agent_data.rs
+++ b/shared/src/agent_data.rs
@@ -149,6 +149,9 @@ pub struct SubServiceData {
    pub name: String,
    pub service_status: Status,
    pub metrics: Vec<SubServiceMetric>,
+    /// Type of sub-service: "nginx_site", "container", "image"
+    #[serde(default)]
+    pub service_type: String,
 }

 /// Individual metric for a sub-service
Author	SHA1	Message	Date
Christoffer Martinsson	ed6399b914	Bump version to v0.1.194 All checks were successful Build and Release / build-and-release (push) Successful in 1m20s Details	2025-11-27 22:46:17 +01:00
Christoffer Martinsson	14618c59c6	Fix data duplication in cached collector architecture Critical bug fix: Collectors were appending to Vecs instead of replacing them, causing duplicate entries with each collection cycle. Fixed by adding .clear() calls before populating: - Memory collector: tmpfs Vec (was showing 11+ duplicates) - Disk collector: drives and pools Vecs - Systemd collector: services Vec - Network collector: Already correct (assigns new Vec) This prevents the exponential growth of duplicate entries in the dashboard UI.	2025-11-27 22:45:44 +01:00
Christoffer Martinsson	2740de9b54	Implement cached collector architecture with configurable timeouts All checks were successful Build and Release / build-and-release (push) Successful in 1m20s Details Major architectural refactor to eliminate false "host offline" alerts: - Replace sequential blocking collectors with independent async tasks - Each collector runs at configurable interval and updates shared cache - ZMQ sender reads cache every 1-2s regardless of collector speed - Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s) All intervals now configurable via NixOS config: - collectors..interval_seconds (collection frequency per collector) - collectors..command_timeout_seconds (timeout for shell commands) - notifications.check_interval_seconds (status change detection rate) Command timeouts increased from hardcoded 2-3s to configurable 10-30s: - Disk collector: 30s (SMART operations, lsblk) - Systemd collector: 15s (systemctl, docker, du commands) - Network collector: 10s (ip route, ip addr) Benefits: - No false "offline" alerts when slow collectors take >10s - Different update rates for different metric types - Better resource management with longer timeouts - Full NixOS configuration control Bump version to v0.1.193	2025-11-27 22:37:20 +01:00
Christoffer Martinsson	37f2650200	Document cached collector architecture plan Add architectural plan for separating ZMQ sending from data collection to prevent false 'host offline' alerts caused by slow collectors. Key concepts: - Shared cache (Arc<RwLock<AgentData>>) - Independent async collector tasks with different update rates - ZMQ sender always sends every 1s from cache - Fast collectors (1s), medium (5s), slow (60s) - No blocking regardless of collector speed	2025-11-27 21:49:44 +01:00
Christoffer Martinsson	833010e270	Bump version to v0.1.192 All checks were successful Build and Release / build-and-release (push) Successful in 1m8s Details	2025-11-27 18:34:53 +01:00
Christoffer Martinsson	549d9d1c72	Replace whale emoji with ASCII 'D' for performance Emoji rendering in terminals can be very slow, especially when rendered in the hot path (every frame for every docker image). The whale emoji 🐋 was causing significant rendering delays. Temporary change to ASCII 'D' to test if emoji was the performance issue.	2025-11-27 18:34:27 +01:00
Christoffer Martinsson	9b84b70581	Bump version to v0.1.191 All checks were successful Build and Release / build-and-release (push) Successful in 1m8s Details	2025-11-27 18:16:49 +01:00
Christoffer Martinsson	92c3ee3f2a	Add Docker whale icon for docker images Docker images now display with distinctive 🐋 whale icon in blue (highlight color) instead of status icons. This provides clear visual identification that these are docker images while not implying operational status.	2025-11-27 18:16:33 +01:00
Christoffer Martinsson	1be55f765d	Bump version to v0.1.190 All checks were successful Build and Release / build-and-release (push) Successful in 1m9s Details	2025-11-27 18:09:49 +01:00
Christoffer Martinsson	2f94a4b853	Add service_type field to separate data from presentation Changes: - Add service_type field to SubServiceData: 'nginx_site', 'container', 'image' - Agent sends pure data without display formatting - Dashboard checks service_type to decide presentation - Docker images now display without status icon (service_type='image') - Remove unused image_size_str from docker images tuple Clean separation: agent provides data, dashboard handles display logic.	2025-11-27 18:09:20 +01:00