Implement cached collector architecture with configurable timeouts
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s

Major architectural refactor to eliminate false "host offline" alerts:

- Replace sequential blocking collectors with independent async tasks
- Each collector runs at configurable interval and updates shared cache
- ZMQ sender reads cache every 1-2s regardless of collector speed
- Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s)
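
In miniature, the pattern is a shared cache behind `Arc<RwLock<...>>`, one spawned task per collector, and a sender loop that only ever reads. A hedged sketch with a hypothetical `Cache` type standing in for the real `AgentData` (the actual implementation is in the agent/src/agent.rs diff below):

```rust
use std::sync::Arc;
use tokio::sync::RwLock;
use tokio::time::{interval, Duration};

// Hypothetical stand-in for the agent's AgentData cache.
#[derive(Clone, Default)]
struct Cache {
    cpu_percent: f32,
}

#[tokio::main]
async fn main() {
    let cache = Arc::new(RwLock::new(Cache::default()));

    // Independent collector task: updates its slice of the cache at its own rate.
    let c = Arc::clone(&cache);
    tokio::spawn(async move {
        let mut tick = interval(Duration::from_secs(10));
        loop {
            tick.tick().await;
            let value = 42.0; // stand-in for a possibly slow collection
            c.write().await.cpu_percent = value; // short-lived write lock
        }
    });

    // Sender loop: snapshots the cache every 2s, never blocked by slow collectors.
    let mut tick = interval(Duration::from_secs(2));
    loop {
        tick.tick().await;
        let snapshot = cache.read().await.clone();
        println!("would publish: cpu={}", snapshot.cpu_percent);
    }
}
```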

All intervals now configurable via NixOS config:
- collectors.*.interval_seconds (collection frequency per collector)
- collectors.*.command_timeout_seconds (timeout for shell commands)
- notifications.check_interval_seconds (status change detection rate)

Command timeouts increased from hardcoded 2-3s to configurable 10-30s:
- Disk collector: 30s (SMART operations, lsblk)
- Systemd collector: 15s (systemctl, docker, du commands)
- Network collector: 10s (ip route, ip addr)
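
These values feed into the agent's `run_command_with_timeout` helper, which is not part of this diff; a minimal sketch of what such a helper can look like, assuming `tokio::process::Command` and the `anyhow` crate (both already in the agent's dependency list):

```rust
use anyhow::{anyhow, Result};
use std::process::Output;
use tokio::process::Command;
use tokio::time::{timeout, Duration};

/// Run a command, giving up after `timeout_seconds` (sketch only).
async fn run_command_with_timeout(mut cmd: Command, timeout_seconds: u64) -> Result<Output> {
    // Ensure a hung child is killed when the timed-out future is dropped.
    cmd.kill_on_drop(true);
    match timeout(Duration::from_secs(timeout_seconds), cmd.output()).await {
        Ok(result) => Ok(result?),
        Err(_) => Err(anyhow!("command timed out after {}s", timeout_seconds)),
    }
}
```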

Benefits:
- No false "offline" alerts when slow collectors take >10s
- Different update rates for different metric types
- Better resource management with longer timeouts
- Full NixOS configuration control

Bump version to v0.1.193
Christoffer Martinsson 2025-11-27 22:37:20 +01:00
parent 37f2650200
commit 2740de9b54
10 changed files with 239 additions and 110 deletions

View File

@@ -156,7 +156,7 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
 - ✅ Backward compatibility via bridge conversion to existing UI widgets
 - ✅ All string parsing bugs eliminated
-### Cached Collector Architecture (🚧 PLANNED)
+### Cached Collector Architecture (✅ IMPLEMENTED)
 **Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.
@@ -199,12 +199,42 @@ Every 1 second:
 - ✅ System stays responsive even with slow operations
 - ✅ Slow collectors can use longer timeouts without blocking
-**Implementation:**
-- Shared `AgentData` cache wrapped in `Arc<RwLock<>>`
-- Each collector spawned as independent tokio task
-- Collectors update their section of cache at their own rate
-- ZMQ sender reads cache every 1s and transmits
-- Stale data acceptable for slow-changing metrics (disk usage, SMART)
+**Implementation Details:**
+- **Shared cache**: `Arc<RwLock<AgentData>>` initialized at agent startup
+- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
+  - Recommended: Fast (1-10s): CPU, Memory, Network
+  - Recommended: Medium (30-60s): Backup, NixOS
+  - Recommended: Slow (60-300s): Disk, Systemd
+- **Independent tasks**: Each collector spawned as separate tokio task in `Agent::new()`
+- **Cache updates**: Collectors acquire write lock → update → release immediately
+- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
+- **Notification check**: Runs every `notifications.check_interval_seconds`
+- **Lock strategy**: Short-lived write locks prevent blocking, read locks for transmission
+- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)
+
+**Configuration (NixOS):**
+All intervals and timeouts configurable in `services/cm-dashboard.nix`:
+
+Collection Intervals:
+- `collectors.cpu.interval_seconds` (default: 10s)
+- `collectors.memory.interval_seconds` (default: 2s)
+- `collectors.disk.interval_seconds` (default: 300s)
+- `collectors.systemd.interval_seconds` (default: 10s)
+- `collectors.backup.interval_seconds` (default: 60s)
+- `collectors.network.interval_seconds` (default: 10s)
+- `collectors.nixos.interval_seconds` (default: 60s)
+- `notifications.check_interval_seconds` (default: 30s)
+- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)
+
+Command Timeouts (prevent resource leaks from hung commands):
+- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
+- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
+- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr
+
+**Code Locations:**
+- agent/src/agent.rs:59-133 - Collector task spawning
+- agent/src/agent.rs:151-179 - Independent collector task runner
+- agent/src/agent.rs:199-207 - ZMQ sender in main loop
 ### Maintenance Mode

Cargo.lock (generated)
View File

@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"
 [[package]]
 name = "cm-dashboard"
-version = "0.1.191"
+version = "0.1.192"
 dependencies = [
  "anyhow",
  "chrono",
@@ -301,7 +301,7 @@ dependencies = [
 [[package]]
 name = "cm-dashboard-agent"
-version = "0.1.191"
+version = "0.1.192"
 dependencies = [
  "anyhow",
  "async-trait",
@@ -324,7 +324,7 @@ dependencies = [
 [[package]]
 name = "cm-dashboard-shared"
-version = "0.1.191"
+version = "0.1.192"
 dependencies = [
  "chrono",
  "serde",

View File

@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-agent"
-version = "0.1.192"
+version = "0.1.193"
 edition = "2021"

 [dependencies]

View File

@@ -1,13 +1,14 @@
 use anyhow::Result;
 use gethostname::gethostname;
+use std::sync::Arc;
 use std::time::Duration;
+use tokio::sync::RwLock;
 use tokio::time::interval;
 use tracing::{debug, error, info};

 use crate::communication::{AgentCommand, ZmqHandler};
 use crate::config::AgentConfig;
 use crate::collectors::{
-    Collector,
     backup::BackupCollector,
     cpu::CpuCollector,
     disk::DiskCollector,
@@ -23,7 +24,7 @@ pub struct Agent {
     hostname: String,
     config: AgentConfig,
     zmq_handler: ZmqHandler,
-    collectors: Vec<Box<dyn Collector>>,
+    cache: Arc<RwLock<AgentData>>,
     notification_manager: NotificationManager,
     previous_status: Option<SystemStatus>,
 }
@@ -55,39 +56,94 @@ impl Agent {
             config.zmq.publisher_port
         );

-        // Initialize collectors
-        let mut collectors: Vec<Box<dyn Collector>> = Vec::new();
-
-        // Add enabled collectors
+        // Initialize shared cache
+        let cache = Arc::new(RwLock::new(AgentData::new(
+            hostname.clone(),
+            env!("CARGO_PKG_VERSION").to_string()
+        )));
+        info!("Initialized shared agent data cache");
+
+        // Spawn independent collector tasks
+        let mut collector_count = 0;
+
+        // CPU collector
         if config.collectors.cpu.enabled {
-            collectors.push(Box::new(CpuCollector::new(config.collectors.cpu.clone())));
+            let cache_clone = cache.clone();
+            let collector = CpuCollector::new(config.collectors.cpu.clone());
+            let interval = config.collectors.cpu.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "CPU").await;
+            });
+            collector_count += 1;
         }

+        // Memory collector
         if config.collectors.memory.enabled {
-            collectors.push(Box::new(MemoryCollector::new(config.collectors.memory.clone())));
-        }
-
-        if config.collectors.disk.enabled {
-            collectors.push(Box::new(DiskCollector::new(config.collectors.disk.clone())));
-        }
-
-        if config.collectors.systemd.enabled {
-            collectors.push(Box::new(SystemdCollector::new(config.collectors.systemd.clone())));
-        }
-
-        if config.collectors.backup.enabled {
-            collectors.push(Box::new(BackupCollector::new()));
+            let cache_clone = cache.clone();
+            let collector = MemoryCollector::new(config.collectors.memory.clone());
+            let interval = config.collectors.memory.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Memory").await;
+            });
+            collector_count += 1;
         }

+        // Network collector
         if config.collectors.network.enabled {
-            collectors.push(Box::new(NetworkCollector::new(config.collectors.network.clone())));
+            let cache_clone = cache.clone();
+            let collector = NetworkCollector::new(config.collectors.network.clone());
+            let interval = config.collectors.network.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Network").await;
+            });
+            collector_count += 1;
         }

+        // Backup collector
+        if config.collectors.backup.enabled {
+            let cache_clone = cache.clone();
+            let collector = BackupCollector::new();
+            let interval = config.collectors.backup.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Backup").await;
+            });
+            collector_count += 1;
+        }
+
+        // NixOS collector
         if config.collectors.nixos.enabled {
-            collectors.push(Box::new(NixOSCollector::new(config.collectors.nixos.clone())));
+            let cache_clone = cache.clone();
+            let collector = NixOSCollector::new(config.collectors.nixos.clone());
+            let interval = config.collectors.nixos.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "NixOS").await;
+            });
+            collector_count += 1;
         }

-        info!("Initialized {} collectors", collectors.len());
+        // Disk collector
+        if config.collectors.disk.enabled {
+            let cache_clone = cache.clone();
+            let collector = DiskCollector::new(config.collectors.disk.clone());
+            let interval = config.collectors.disk.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Disk").await;
+            });
+            collector_count += 1;
+        }
+
+        // Systemd collector
+        if config.collectors.systemd.enabled {
+            let cache_clone = cache.clone();
+            let collector = SystemdCollector::new(config.collectors.systemd.clone());
+            let interval = config.collectors.systemd.interval_seconds;
+            tokio::spawn(async move {
+                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Systemd").await;
+            });
+            collector_count += 1;
+        }
+
+        info!("Spawned {} independent collector tasks", collector_count);

         // Initialize notification manager
         let notification_manager = NotificationManager::new(&config.notifications, &hostname)?;
@@ -97,45 +153,79 @@ impl Agent {
             hostname,
             config,
             zmq_handler,
-            collectors,
+            cache,
             notification_manager,
             previous_status: None,
         })
     }

-    /// Main agent loop with structured data collection
-    pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
-        info!("Starting agent main loop");
+    /// Independent collector task runner
+    async fn run_collector_task<C>(
+        cache: Arc<RwLock<AgentData>>,
+        collector: C,
+        interval_duration: Duration,
+        name: &str,
+    ) where
+        C: crate::collectors::Collector + Send + 'static,
+    {
+        let mut interval_timer = interval(interval_duration);
+        info!("{} collector task started (interval: {:?})", name, interval_duration);

-        // Initial collection
-        if let Err(e) = self.collect_and_broadcast().await {
-            error!("Initial metric collection failed: {}", e);
+        loop {
+            interval_timer.tick().await;
+
+            // Acquire write lock and update cache
+            {
+                let mut agent_data = cache.write().await;
+                match collector.collect_structured(&mut *agent_data).await {
+                    Ok(_) => {
+                        debug!("{} collector updated cache", name);
+                    }
+                    Err(e) => {
+                        error!("{} collector failed: {}", name, e);
+                    }
+                }
+            } // Release lock immediately after collection
         }
+    }

-        // Set up intervals
+    /// Main agent loop with cached data architecture
+    pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
+        info!("Starting agent main loop with cached collector architecture");
+
+        // Set up intervals from config
         let mut transmission_interval = interval(Duration::from_secs(
             self.config.collection_interval_seconds,
         ));
-        let mut notification_interval = interval(Duration::from_secs(30)); // Check notifications every 30s
+        let mut notification_interval = interval(Duration::from_secs(
+            self.config.notifications.check_interval_seconds,
+        ));
+        let mut command_interval = interval(Duration::from_millis(100));

-        // Skip initial ticks to avoid immediate execution
+        // Skip initial ticks
         transmission_interval.tick().await;
         notification_interval.tick().await;
+        command_interval.tick().await;

         loop {
             tokio::select! {
                 _ = transmission_interval.tick() => {
-                    if let Err(e) = self.collect_and_broadcast().await {
-                        error!("Failed to collect and broadcast metrics: {}", e);
+                    // Read current cache state and broadcast via ZMQ
+                    let agent_data = self.cache.read().await.clone();
+                    if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
+                        error!("Failed to broadcast agent data: {}", e);
+                    } else {
+                        debug!("Successfully broadcast agent data");
                     }
                 }
                 _ = notification_interval.tick() => {
-                    // Process any pending notifications
-                    // NOTE: With structured data, we might need to implement status tracking differently
-                    // For now, we skip this until status evaluation is migrated
+                    // Read cache and check for status changes
+                    let agent_data = self.cache.read().await.clone();
+                    if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
+                        error!("Failed to check status changes: {}", e);
+                    }
                 }
-                // Handle incoming commands (check periodically)
-                _ = tokio::time::sleep(Duration::from_millis(100)) => {
+                _ = command_interval.tick() => {
                     if let Err(e) = self.handle_commands().await {
                         error!("Error handling commands: {}", e);
                     }
@@ -151,35 +241,6 @@ impl Agent {
         Ok(())
     }

-    /// Collect structured data from all collectors and broadcast via ZMQ
-    async fn collect_and_broadcast(&mut self) -> Result<()> {
-        debug!("Starting structured data collection");
-
-        // Initialize empty AgentData
-        let mut agent_data = AgentData::new(self.hostname.clone(), env!("CARGO_PKG_VERSION").to_string());
-
-        // Collect data from all collectors
-        for collector in &self.collectors {
-            if let Err(e) = collector.collect_structured(&mut agent_data).await {
-                error!("Collector failed: {}", e);
-                // Continue with other collectors even if one fails
-            }
-        }
-
-        // Check for status changes and send notifications
-        if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
-            error!("Failed to check status changes: {}", e);
-        }
-
-        // Broadcast the structured data via ZMQ
-        if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
-            error!("Failed to broadcast agent data: {}", e);
-        } else {
-            debug!("Successfully broadcast structured agent data");
-        }
-
-        Ok(())
-    }
-
     /// Check for status changes and send notifications
     async fn check_status_changes_and_notify(&mut self, agent_data: &AgentData) -> Result<()> {
@@ -267,9 +328,12 @@ impl Agent {
         match command {
             AgentCommand::CollectNow => {
-                info!("Received immediate collection request");
-                if let Err(e) = self.collect_and_broadcast().await {
-                    error!("Failed to collect on demand: {}", e);
+                info!("Received immediate transmission request");
+                // With cached architecture, collectors run independently
+                // Just send current cache state immediately
+                let agent_data = self.cache.read().await.clone();
+                if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
+                    error!("Failed to broadcast on demand: {}", e);
                 }
             }
             AgentCommand::SetInterval { seconds } => {

View File

@@ -117,7 +117,7 @@ impl DiskCollector {
         let mut cmd = Command::new("lsblk");
         cmd.args(&["-rn", "-o", "NAME,MOUNTPOINT"]);
-        let output = run_command_with_timeout(cmd, 2).await
+        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await
             .map_err(|e| CollectorError::SystemRead {
                 path: "block devices".to_string(),
                 error: e.to_string(),

View File

@@ -8,12 +8,12 @@ use crate::config::NetworkConfig;
 /// Network interface collector with physical/virtual classification and link status
 pub struct NetworkCollector {
-    _config: NetworkConfig,
+    config: NetworkConfig,
 }

 impl NetworkCollector {
     pub fn new(config: NetworkConfig) -> Self {
-        Self { _config: config }
+        Self { config }
     }

     /// Check if interface is physical (not virtual)
@@ -50,8 +50,9 @@
     }

     /// Get the primary physical interface (the one with default route)
-    fn get_primary_physical_interface() -> Option<String> {
-        match Command::new("timeout").args(["2", "ip", "route", "show", "default"]).output() {
+    fn get_primary_physical_interface(&self) -> Option<String> {
+        let timeout_str = self.config.command_timeout_seconds.to_string();
+        match Command::new("timeout").args([&timeout_str, "ip", "route", "show", "default"]).output() {
             Ok(output) if output.status.success() => {
                 let output_str = String::from_utf8_lossy(&output.stdout);
                 // Parse: "default via 192.168.1.1 dev eno1 ..."
@@ -110,7 +111,8 @@
         // Parse VLAN configuration
         let vlan_map = Self::parse_vlan_config();

-        match Command::new("timeout").args(["2", "ip", "-j", "addr"]).output() {
+        let timeout_str = self.config.command_timeout_seconds.to_string();
+        match Command::new("timeout").args([&timeout_str, "ip", "-j", "addr"]).output() {
             Ok(output) if output.status.success() => {
                 let json_str = String::from_utf8_lossy(&output.stdout);
@@ -195,7 +197,7 @@
         }

         // Assign primary physical interface as parent to virtual interfaces without explicit parent
-        let primary_interface = Self::get_primary_physical_interface();
+        let primary_interface = self.get_primary_physical_interface();
         if let Some(primary) = primary_interface {
             for interface in interfaces.iter_mut() {
                 // Only assign parent to virtual interfaces that don't already have one

View File

@@ -254,18 +254,19 @@
     /// Auto-discover interesting services to monitor
     fn discover_services_internal(&self) -> Result<(Vec<String>, std::collections::HashMap<String, ServiceStatusInfo>)> {
-        // First: Get all service unit files (with 3 second timeout)
+        // First: Get all service unit files
+        let timeout_str = self.config.command_timeout_seconds.to_string();
         let unit_files_output = Command::new("timeout")
-            .args(&["3", "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
+            .args(&[&timeout_str, "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
             .output()?;

         if !unit_files_output.status.success() {
             return Err(anyhow::anyhow!("systemctl list-unit-files command failed"));
         }

-        // Second: Get runtime status of all units (with 3 second timeout)
+        // Second: Get runtime status of all units
         let units_status_output = Command::new("timeout")
-            .args(&["3", "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
+            .args(&[&timeout_str, "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
             .output()?;

         if !units_status_output.status.success() {
@@ -361,16 +362,17 @@
             }
         }

-        // Fallback to systemctl if not in cache (with 2 second timeout)
+        // Fallback to systemctl if not in cache
+        let timeout_str = self.config.command_timeout_seconds.to_string();
         let output = Command::new("timeout")
-            .args(&["2", "systemctl", "is-active", &format!("{}.service", service)])
+            .args(&[&timeout_str, "systemctl", "is-active", &format!("{}.service", service)])
             .output()?;

         let active_status = String::from_utf8(output.stdout)?.trim().to_string();

-        // Get more detailed info (with 2 second timeout)
+        // Get more detailed info
         let output = Command::new("timeout")
-            .args(&["2", "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
+            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
             .output()?;

         let detailed_info = String::from_utf8(output.stdout)?;
@@ -430,9 +432,10 @@
             return Ok(0.0);
         }

-        // No configured path - try to get WorkingDirectory from systemctl (with 2 second timeout)
+        // No configured path - try to get WorkingDirectory from systemctl
+        let timeout_str = self.config.command_timeout_seconds.to_string();
         let output = Command::new("timeout")
-            .args(&["2", "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
+            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
             .output()
             .map_err(|e| CollectorError::SystemRead {
                 path: format!("WorkingDirectory for {}", service_name),
@@ -452,15 +455,15 @@
         Ok(0.0)
     }

-    /// Get size of a directory in GB (with 2 second timeout)
+    /// Get size of a directory in GB
     async fn get_directory_size(&self, path: &str) -> Option<f32> {
         use super::run_command_with_timeout;

-        // Use -s (summary) and --apparent-size for speed, 2 second timeout
+        // Use -s (summary) and --apparent-size for speed
         let mut cmd = Command::new("sudo");
         cmd.args(&["du", "-s", "--apparent-size", "--block-size=1", path]);

-        let output = run_command_with_timeout(cmd, 2).await.ok()?;
+        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await.ok()?;

         if !output.status.success() {
             // Log permission errors for debugging but don't spam logs
@@ -786,9 +789,10 @@
         let mut containers = Vec::new();

         // Check if docker is available (cm-agent user is in docker group)
-        // Use -a to show ALL containers (running and stopped) with 3 second timeout
+        // Use -a to show ALL containers (running and stopped)
+        let timeout_str = self.config.command_timeout_seconds.to_string();
         let output = Command::new("timeout")
-            .args(&["3", "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
+            .args(&[&timeout_str, "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
             .output();

         let output = match output {
@@ -829,9 +833,10 @@
     /// Get docker images as sub-services
     fn get_docker_images(&self) -> Vec<(String, String, f32)> {
         let mut images = Vec::new();

-        // Check if docker is available (cm-agent user is in docker group) with 3 second timeout
+        // Check if docker is available (cm-agent user is in docker group)
+        let timeout_str = self.config.command_timeout_seconds.to_string();
         let output = Command::new("timeout")
-            .args(&["3", "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
+            .args(&[&timeout_str, "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
             .output();

         let output = match output {

View File

@@ -79,6 +79,9 @@ pub struct DiskConfig {
     pub temperature_critical_celsius: f32,
     pub wear_warning_percent: f32,
     pub wear_critical_percent: f32,
+    /// Command timeout in seconds for lsblk, smartctl, etc.
+    #[serde(default = "default_disk_command_timeout")]
+    pub command_timeout_seconds: u64,
 }

 /// Filesystem configuration entry
@@ -108,6 +111,9 @@ pub struct SystemdConfig {
     pub http_timeout_seconds: u64,
     pub http_connect_timeout_seconds: u64,
     pub nginx_latency_critical_ms: f32,
+    /// Command timeout in seconds for systemctl, docker, du commands
+    #[serde(default = "default_systemd_command_timeout")]
+    pub command_timeout_seconds: u64,
 }
@@ -132,6 +138,9 @@
 pub struct NetworkConfig {
     pub enabled: bool,
     pub interval_seconds: u64,
+    /// Command timeout in seconds for ip route, ip addr commands
+    #[serde(default = "default_network_command_timeout")]
+    pub command_timeout_seconds: u64,
 }

 /// Notification configuration
@@ -145,6 +154,9 @@ pub struct NotificationConfig {
     pub rate_limit_minutes: u64,
     /// Email notification batching interval in seconds (default: 60)
     pub aggregation_interval_seconds: u64,
+    /// Status check interval in seconds for detecting changes (default: 30)
+    #[serde(default = "default_notification_check_interval")]
+    pub check_interval_seconds: u64,
     /// List of metric names to exclude from email notifications
     #[serde(default)]
     pub exclude_email_metrics: Vec<String>,
@@ -158,10 +170,26 @@ fn default_heartbeat_interval_seconds() -> u64 {
     5
 }

+fn default_notification_check_interval() -> u64 {
+    30
+}
+
 fn default_maintenance_mode_file() -> String {
     "/tmp/cm-maintenance".to_string()
 }

+fn default_disk_command_timeout() -> u64 {
+    30
+}
+
+fn default_systemd_command_timeout() -> u64 {
+    15
+}
+
+fn default_network_command_timeout() -> u64 {
+    10
+}
+
 impl AgentConfig {
     pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
         loader::load_config(path)
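
The `#[serde(default = ...)]` attributes above are what keep existing deployments working when the new options are absent from a config file. A self-contained illustration of the mechanism (reduced, hypothetical struct; TOML used only for brevity, via the `serde` and `toml` crates):

```rust
use serde::Deserialize;

fn default_disk_command_timeout() -> u64 {
    30
}

#[derive(Debug, Deserialize)]
struct DiskConfig {
    enabled: bool,
    interval_seconds: u64,
    // Missing from older configs; serde falls back to 30.
    #[serde(default = "default_disk_command_timeout")]
    command_timeout_seconds: u64,
}

fn main() {
    let cfg: DiskConfig =
        toml::from_str("enabled = true\ninterval_seconds = 300\n").unwrap();
    assert_eq!(cfg.command_timeout_seconds, 30);
}
```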

View File

@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard"
-version = "0.1.192"
+version = "0.1.193"
 edition = "2021"

 [dependencies]

View File

@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-shared"
-version = "0.1.192"
+version = "0.1.193"
 edition = "2021"

 [dependencies]