Compare commits

..

10 Commits

Author SHA1 Message Date
ed6399b914 Bump version to v0.1.194
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
2025-11-27 22:46:17 +01:00
14618c59c6 Fix data duplication in cached collector architecture
Critical bug fix: Collectors were appending to Vecs instead of replacing them,
causing duplicate entries with each collection cycle.

Fixed by adding .clear() calls before populating:
- Memory collector: tmpfs Vec (was showing 11+ duplicates)
- Disk collector: drives and pools Vecs
- Systemd collector: services Vec
- Network collector: Already correct (assigns new Vec)

This prevents the exponential growth of duplicate entries in the dashboard UI.
2025-11-27 22:45:44 +01:00
2740de9b54 Implement cached collector architecture with configurable timeouts
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
Major architectural refactor to eliminate false "host offline" alerts:

- Replace sequential blocking collectors with independent async tasks
- Each collector runs at configurable interval and updates shared cache
- ZMQ sender reads cache every 1-2s regardless of collector speed
- Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s)

All intervals now configurable via NixOS config:
- collectors.*.interval_seconds (collection frequency per collector)
- collectors.*.command_timeout_seconds (timeout for shell commands)
- notifications.check_interval_seconds (status change detection rate)

Command timeouts increased from hardcoded 2-3s to configurable 10-30s:
- Disk collector: 30s (SMART operations, lsblk)
- Systemd collector: 15s (systemctl, docker, du commands)
- Network collector: 10s (ip route, ip addr)

Benefits:
- No false "offline" alerts when slow collectors take >10s
- Different update rates for different metric types
- Better resource management with longer timeouts
- Full NixOS configuration control

Bump version to v0.1.193
2025-11-27 22:37:20 +01:00
37f2650200 Document cached collector architecture plan
Add architectural plan for separating ZMQ sending from data collection to prevent false 'host offline' alerts caused by slow collectors.

Key concepts:
- Shared cache (Arc<RwLock<AgentData>>)
- Independent async collector tasks with different update rates
- ZMQ sender always sends every 1s from cache
- Fast collectors (1s), medium (5s), slow (60s)
- No blocking regardless of collector speed
2025-11-27 21:49:44 +01:00
833010e270 Bump version to v0.1.192
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-27 18:34:53 +01:00
549d9d1c72 Replace whale emoji with ASCII 'D' for performance
Emoji rendering in terminals can be very slow, especially when rendered in the hot path (every frame for every docker image). The whale emoji 🐋 was causing significant rendering delays.

Temporary change to ASCII 'D' to test if emoji was the performance issue.
2025-11-27 18:34:27 +01:00
9b84b70581 Bump version to v0.1.191
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-27 18:16:49 +01:00
92c3ee3f2a Add Docker whale icon for docker images
Docker images now display with distinctive 🐋 whale icon in blue (highlight color) instead of status icons. This provides clear visual identification that these are docker images while not implying operational status.
2025-11-27 18:16:33 +01:00
1be55f765d Bump version to v0.1.190
All checks were successful
Build and Release / build-and-release (push) Successful in 1m9s
2025-11-27 18:09:49 +01:00
2f94a4b853 Add service_type field to separate data from presentation
Changes:
- Add service_type field to SubServiceData: 'nginx_site', 'container', 'image'
- Agent sends pure data without display formatting
- Dashboard checks service_type to decide presentation
- Docker images now display without status icon (service_type='image')
- Remove unused image_size_str from docker images tuple

Clean separation: agent provides data, dashboard handles display logic.
2025-11-27 18:09:20 +01:00
13 changed files with 369 additions and 138 deletions

View File

@@ -156,6 +156,86 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated
### Cached Collector Architecture (✅ IMPLEMENTED)
**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.
**Previous (Sequential Blocking):**
```
Every 1 second:
└─ collect_all_data() [BLOCKS for 2-10+ seconds]
├─ CPU (fast: 10ms)
├─ Memory (fast: 20ms)
├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
├─ Service disk usage (slow: 2-8s per service)
└─ Docker (medium: 500ms)
└─ send_via_zmq() [Only after ALL collection completes]
Result: If any collector takes >10s → "host offline" false alert
```
**New (Cached Independent Collectors):**
```
Shared Cache: Arc<RwLock<AgentData>>
Background Collectors (independent async tasks):
├─ Fast collectors (CPU, RAM, Network)
│ └─ Update cache every 1 second
├─ Medium collectors (Services, Docker)
│ └─ Update cache every 5 seconds
└─ Slow collectors (Disk usage, SMART data)
└─ Update cache every 60 seconds
ZMQ Sender (separate async task):
Every 1 second:
└─ Read current cache
└─ Send via ZMQ [Always instant, never blocked]
```
**Benefits:**
- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking
**Implementation Details:**
- **Shared cache**: `Arc<RwLock<AgentData>>` initialized at agent startup
- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
- Recommended: Fast (1-10s): CPU, Memory, Network
- Recommended: Medium (30-60s): Backup, NixOS
- Recommended: Slow (60-300s): Disk, Systemd
- **Independent tasks**: Each collector spawned as separate tokio task in `Agent::new()`
- **Cache updates**: Collectors acquire write lock → update → release immediately
- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
- **Notification check**: Runs every `notifications.check_interval_seconds`
- **Lock strategy**: Short-lived write locks prevent blocking, read locks for transmission
- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)
**Configuration (NixOS):**
All intervals and timeouts configurable in `services/cm-dashboard.nix`:
Collection Intervals:
- `collectors.cpu.interval_seconds` (default: 10s)
- `collectors.memory.interval_seconds` (default: 2s)
- `collectors.disk.interval_seconds` (default: 300s)
- `collectors.systemd.interval_seconds` (default: 10s)
- `collectors.backup.interval_seconds` (default: 60s)
- `collectors.network.interval_seconds` (default: 10s)
- `collectors.nixos.interval_seconds` (default: 60s)
- `notifications.check_interval_seconds` (default: 30s)
- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)
Command Timeouts (prevent resource leaks from hung commands):
- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr
**Code Locations:**
- agent/src/agent.rs:59-133 - Collector task spawning
- agent/src/agent.rs:151-179 - Independent collector task runner
- agent/src/agent.rs:199-207 - ZMQ sender in main loop
### Maintenance Mode
- Agent checks for `/tmp/cm-maintenance` file before sending notifications

6
Cargo.lock generated
View File

@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"
[[package]]
name = "cm-dashboard"
version = "0.1.188"
version = "0.1.193"
dependencies = [
"anyhow",
"chrono",
@@ -301,7 +301,7 @@ dependencies = [
[[package]]
name = "cm-dashboard-agent"
version = "0.1.188"
version = "0.1.193"
dependencies = [
"anyhow",
"async-trait",
@@ -324,7 +324,7 @@ dependencies = [
[[package]]
name = "cm-dashboard-shared"
version = "0.1.188"
version = "0.1.193"
dependencies = [
"chrono",
"serde",

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-agent"
version = "0.1.189"
version = "0.1.194"
edition = "2021"
[dependencies]

View File

@@ -1,13 +1,14 @@
use anyhow::Result;
use gethostname::gethostname;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::interval;
use tracing::{debug, error, info};
use crate::communication::{AgentCommand, ZmqHandler};
use crate::config::AgentConfig;
use crate::collectors::{
Collector,
backup::BackupCollector,
cpu::CpuCollector,
disk::DiskCollector,
@@ -23,7 +24,7 @@ pub struct Agent {
hostname: String,
config: AgentConfig,
zmq_handler: ZmqHandler,
collectors: Vec<Box<dyn Collector>>,
cache: Arc<RwLock<AgentData>>,
notification_manager: NotificationManager,
previous_status: Option<SystemStatus>,
}
@@ -55,39 +56,94 @@ impl Agent {
config.zmq.publisher_port
);
// Initialize collectors
let mut collectors: Vec<Box<dyn Collector>> = Vec::new();
// Add enabled collectors
// Initialize shared cache
let cache = Arc::new(RwLock::new(AgentData::new(
hostname.clone(),
env!("CARGO_PKG_VERSION").to_string()
)));
info!("Initialized shared agent data cache");
// Spawn independent collector tasks
let mut collector_count = 0;
// CPU collector
if config.collectors.cpu.enabled {
collectors.push(Box::new(CpuCollector::new(config.collectors.cpu.clone())));
let cache_clone = cache.clone();
let collector = CpuCollector::new(config.collectors.cpu.clone());
let interval = config.collectors.cpu.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "CPU").await;
});
collector_count += 1;
}
// Memory collector
if config.collectors.memory.enabled {
collectors.push(Box::new(MemoryCollector::new(config.collectors.memory.clone())));
}
if config.collectors.disk.enabled {
collectors.push(Box::new(DiskCollector::new(config.collectors.disk.clone())));
}
if config.collectors.systemd.enabled {
collectors.push(Box::new(SystemdCollector::new(config.collectors.systemd.clone())));
}
if config.collectors.backup.enabled {
collectors.push(Box::new(BackupCollector::new()));
let cache_clone = cache.clone();
let collector = MemoryCollector::new(config.collectors.memory.clone());
let interval = config.collectors.memory.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Memory").await;
});
collector_count += 1;
}
// Network collector
if config.collectors.network.enabled {
collectors.push(Box::new(NetworkCollector::new(config.collectors.network.clone())));
let cache_clone = cache.clone();
let collector = NetworkCollector::new(config.collectors.network.clone());
let interval = config.collectors.network.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Network").await;
});
collector_count += 1;
}
// Backup collector
if config.collectors.backup.enabled {
let cache_clone = cache.clone();
let collector = BackupCollector::new();
let interval = config.collectors.backup.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Backup").await;
});
collector_count += 1;
}
// NixOS collector
if config.collectors.nixos.enabled {
collectors.push(Box::new(NixOSCollector::new(config.collectors.nixos.clone())));
let cache_clone = cache.clone();
let collector = NixOSCollector::new(config.collectors.nixos.clone());
let interval = config.collectors.nixos.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "NixOS").await;
});
collector_count += 1;
}
info!("Initialized {} collectors", collectors.len());
// Disk collector
if config.collectors.disk.enabled {
let cache_clone = cache.clone();
let collector = DiskCollector::new(config.collectors.disk.clone());
let interval = config.collectors.disk.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Disk").await;
});
collector_count += 1;
}
// Systemd collector
if config.collectors.systemd.enabled {
let cache_clone = cache.clone();
let collector = SystemdCollector::new(config.collectors.systemd.clone());
let interval = config.collectors.systemd.interval_seconds;
tokio::spawn(async move {
Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Systemd").await;
});
collector_count += 1;
}
info!("Spawned {} independent collector tasks", collector_count);
// Initialize notification manager
let notification_manager = NotificationManager::new(&config.notifications, &hostname)?;
@@ -97,45 +153,79 @@ impl Agent {
hostname,
config,
zmq_handler,
collectors,
cache,
notification_manager,
previous_status: None,
})
}
/// Main agent loop with structured data collection
pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
info!("Starting agent main loop");
/// Independent collector task runner
async fn run_collector_task<C>(
cache: Arc<RwLock<AgentData>>,
collector: C,
interval_duration: Duration,
name: &str,
) where
C: crate::collectors::Collector + Send + 'static,
{
let mut interval_timer = interval(interval_duration);
info!("{} collector task started (interval: {:?})", name, interval_duration);
// Initial collection
if let Err(e) = self.collect_and_broadcast().await {
error!("Initial metric collection failed: {}", e);
loop {
interval_timer.tick().await;
// Acquire write lock and update cache
{
let mut agent_data = cache.write().await;
match collector.collect_structured(&mut *agent_data).await {
Ok(_) => {
debug!("{} collector updated cache", name);
}
Err(e) => {
error!("{} collector failed: {}", name, e);
}
}
} // Release lock immediately after collection
}
}
// Set up intervals
/// Main agent loop with cached data architecture
pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
info!("Starting agent main loop with cached collector architecture");
// Set up intervals from config
let mut transmission_interval = interval(Duration::from_secs(
self.config.collection_interval_seconds,
));
let mut notification_interval = interval(Duration::from_secs(30)); // Check notifications every 30s
let mut notification_interval = interval(Duration::from_secs(
self.config.notifications.check_interval_seconds,
));
let mut command_interval = interval(Duration::from_millis(100));
// Skip initial ticks to avoid immediate execution
// Skip initial ticks
transmission_interval.tick().await;
notification_interval.tick().await;
command_interval.tick().await;
loop {
tokio::select! {
_ = transmission_interval.tick() => {
if let Err(e) = self.collect_and_broadcast().await {
error!("Failed to collect and broadcast metrics: {}", e);
// Read current cache state and broadcast via ZMQ
let agent_data = self.cache.read().await.clone();
if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
error!("Failed to broadcast agent data: {}", e);
} else {
debug!("Successfully broadcast agent data");
}
}
_ = notification_interval.tick() => {
// Process any pending notifications
// NOTE: With structured data, we might need to implement status tracking differently
// For now, we skip this until status evaluation is migrated
// Read cache and check for status changes
let agent_data = self.cache.read().await.clone();
if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
error!("Failed to check status changes: {}", e);
}
}
// Handle incoming commands (check periodically)
_ = tokio::time::sleep(Duration::from_millis(100)) => {
_ = command_interval.tick() => {
if let Err(e) = self.handle_commands().await {
error!("Error handling commands: {}", e);
}
@@ -151,35 +241,6 @@ impl Agent {
Ok(())
}
/// Collect structured data from all collectors and broadcast via ZMQ
async fn collect_and_broadcast(&mut self) -> Result<()> {
debug!("Starting structured data collection");
// Initialize empty AgentData
let mut agent_data = AgentData::new(self.hostname.clone(), env!("CARGO_PKG_VERSION").to_string());
// Collect data from all collectors
for collector in &self.collectors {
if let Err(e) = collector.collect_structured(&mut agent_data).await {
error!("Collector failed: {}", e);
// Continue with other collectors even if one fails
}
}
// Check for status changes and send notifications
if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
error!("Failed to check status changes: {}", e);
}
// Broadcast the structured data via ZMQ
if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
error!("Failed to broadcast agent data: {}", e);
} else {
debug!("Successfully broadcast structured agent data");
}
Ok(())
}
/// Check for status changes and send notifications
async fn check_status_changes_and_notify(&mut self, agent_data: &AgentData) -> Result<()> {
@@ -267,9 +328,12 @@ impl Agent {
match command {
AgentCommand::CollectNow => {
info!("Received immediate collection request");
if let Err(e) = self.collect_and_broadcast().await {
error!("Failed to collect on demand: {}", e);
info!("Received immediate transmission request");
// With cached architecture, collectors run independently
// Just send current cache state immediately
let agent_data = self.cache.read().await.clone();
if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
error!("Failed to broadcast on demand: {}", e);
}
}
AgentCommand::SetInterval { seconds } => {

View File

@@ -117,7 +117,7 @@ impl DiskCollector {
let mut cmd = Command::new("lsblk");
cmd.args(&["-rn", "-o", "NAME,MOUNTPOINT"]);
let output = run_command_with_timeout(cmd, 2).await
let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await
.map_err(|e| CollectorError::SystemRead {
path: "block devices".to_string(),
error: e.to_string(),
@@ -530,6 +530,9 @@ impl DiskCollector {
/// Populate drives data into AgentData
fn populate_drives_data(&self, physical_drives: &[PhysicalDrive], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Clear existing drives data to prevent duplicates in cached architecture
agent_data.system.storage.drives.clear();
for drive in physical_drives {
let smart = smart_data.get(&drive.name);
@@ -567,6 +570,9 @@ impl DiskCollector {
/// Populate pools data into AgentData
fn populate_pools_data(&self, mergerfs_pools: &[MergerfsPool], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Clear existing pools data to prevent duplicates in cached architecture
agent_data.system.storage.pools.clear();
for pool in mergerfs_pools {
// Calculate pool health and statuses based on member drive health
let (pool_health, health_status, usage_status, data_drive_data, parity_drive_data) = self.calculate_pool_health(pool, smart_data);

View File

@@ -97,9 +97,12 @@ impl MemoryCollector {
/// Populate tmpfs data into AgentData
async fn populate_tmpfs_data(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Clear existing tmpfs data to prevent duplicates in cached architecture
agent_data.system.memory.tmpfs.clear();
// Discover all tmpfs mount points
let tmpfs_mounts = self.discover_tmpfs_mounts()?;
if tmpfs_mounts.is_empty() {
debug!("No tmpfs mounts found to monitor");
return Ok(());

View File

@@ -8,12 +8,12 @@ use crate::config::NetworkConfig;
/// Network interface collector with physical/virtual classification and link status
pub struct NetworkCollector {
_config: NetworkConfig,
config: NetworkConfig,
}
impl NetworkCollector {
pub fn new(config: NetworkConfig) -> Self {
Self { _config: config }
Self { config }
}
/// Check if interface is physical (not virtual)
@@ -50,8 +50,9 @@ impl NetworkCollector {
}
/// Get the primary physical interface (the one with default route)
fn get_primary_physical_interface() -> Option<String> {
match Command::new("timeout").args(["2", "ip", "route", "show", "default"]).output() {
fn get_primary_physical_interface(&self) -> Option<String> {
let timeout_str = self.config.command_timeout_seconds.to_string();
match Command::new("timeout").args([&timeout_str, "ip", "route", "show", "default"]).output() {
Ok(output) if output.status.success() => {
let output_str = String::from_utf8_lossy(&output.stdout);
// Parse: "default via 192.168.1.1 dev eno1 ..."
@@ -110,7 +111,8 @@ impl NetworkCollector {
// Parse VLAN configuration
let vlan_map = Self::parse_vlan_config();
match Command::new("timeout").args(["2", "ip", "-j", "addr"]).output() {
let timeout_str = self.config.command_timeout_seconds.to_string();
match Command::new("timeout").args([&timeout_str, "ip", "-j", "addr"]).output() {
Ok(output) if output.status.success() => {
let json_str = String::from_utf8_lossy(&output.stdout);
@@ -195,7 +197,7 @@ impl NetworkCollector {
}
// Assign primary physical interface as parent to virtual interfaces without explicit parent
let primary_interface = Self::get_primary_physical_interface();
let primary_interface = self.get_primary_physical_interface();
if let Some(primary) = primary_interface {
for interface in interfaces.iter_mut() {
// Only assign parent to virtual interfaces that don't already have one

View File

@@ -113,6 +113,7 @@ impl SystemdCollector {
name: site_name.clone(),
service_status: self.calculate_service_status(&site_name, &site_status),
metrics,
service_type: "nginx_site".to_string(),
});
}
}
@@ -128,12 +129,13 @@ impl SystemdCollector {
name: container_name.clone(),
service_status: self.calculate_service_status(&container_name, &container_status),
metrics,
service_type: "container".to_string(),
});
}
// Add Docker images
let docker_images = self.get_docker_images();
for (image_name, image_status, image_size_str, image_size_mb) in docker_images {
for (image_name, image_status, image_size_mb) in docker_images {
let mut metrics = Vec::new();
metrics.push(SubServiceMetric {
label: "size".to_string(),
@@ -142,9 +144,10 @@ impl SystemdCollector {
});
sub_services.push(SubServiceData {
name: format!("{} ({})", image_name, image_size_str),
name: image_name.to_string(),
service_status: self.calculate_service_status(&image_name, &image_status),
metrics,
service_type: "image".to_string(),
});
}
}
@@ -251,18 +254,19 @@ impl SystemdCollector {
/// Auto-discover interesting services to monitor
fn discover_services_internal(&self) -> Result<(Vec<String>, std::collections::HashMap<String, ServiceStatusInfo>)> {
// First: Get all service unit files (with 3 second timeout)
// First: Get all service unit files
let timeout_str = self.config.command_timeout_seconds.to_string();
let unit_files_output = Command::new("timeout")
.args(&["3", "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
.args(&[&timeout_str, "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
.output()?;
if !unit_files_output.status.success() {
return Err(anyhow::anyhow!("systemctl list-unit-files command failed"));
}
// Second: Get runtime status of all units (with 3 second timeout)
// Second: Get runtime status of all units
let units_status_output = Command::new("timeout")
.args(&["3", "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
.args(&[&timeout_str, "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
.output()?;
if !units_status_output.status.success() {
@@ -358,16 +362,17 @@ impl SystemdCollector {
}
}
// Fallback to systemctl if not in cache (with 2 second timeout)
// Fallback to systemctl if not in cache
let timeout_str = self.config.command_timeout_seconds.to_string();
let output = Command::new("timeout")
.args(&["2", "systemctl", "is-active", &format!("{}.service", service)])
.args(&[&timeout_str, "systemctl", "is-active", &format!("{}.service", service)])
.output()?;
let active_status = String::from_utf8(output.stdout)?.trim().to_string();
// Get more detailed info (with 2 second timeout)
// Get more detailed info
let output = Command::new("timeout")
.args(&["2", "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
.args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
.output()?;
let detailed_info = String::from_utf8(output.stdout)?;
@@ -427,9 +432,10 @@ impl SystemdCollector {
return Ok(0.0);
}
// No configured path - try to get WorkingDirectory from systemctl (with 2 second timeout)
// No configured path - try to get WorkingDirectory from systemctl
let timeout_str = self.config.command_timeout_seconds.to_string();
let output = Command::new("timeout")
.args(&["2", "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
.args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
.output()
.map_err(|e| CollectorError::SystemRead {
path: format!("WorkingDirectory for {}", service_name),
@@ -449,15 +455,15 @@ impl SystemdCollector {
Ok(0.0)
}
/// Get size of a directory in GB (with 2 second timeout)
/// Get size of a directory in GB
async fn get_directory_size(&self, path: &str) -> Option<f32> {
use super::run_command_with_timeout;
// Use -s (summary) and --apparent-size for speed, 2 second timeout
// Use -s (summary) and --apparent-size for speed
let mut cmd = Command::new("sudo");
cmd.args(&["du", "-s", "--apparent-size", "--block-size=1", path]);
let output = run_command_with_timeout(cmd, 2).await.ok()?;
let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await.ok()?;
if !output.status.success() {
// Log permission errors for debugging but don't spam logs
@@ -783,9 +789,10 @@ impl SystemdCollector {
let mut containers = Vec::new();
// Check if docker is available (cm-agent user is in docker group)
// Use -a to show ALL containers (running and stopped) with 3 second timeout
// Use -a to show ALL containers (running and stopped)
let timeout_str = self.config.command_timeout_seconds.to_string();
let output = Command::new("timeout")
.args(&["3", "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
.args(&[&timeout_str, "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
.output();
let output = match output {
@@ -824,11 +831,12 @@ impl SystemdCollector {
}
/// Get docker images as sub-services
fn get_docker_images(&self) -> Vec<(String, String, String, f32)> {
fn get_docker_images(&self) -> Vec<(String, String, f32)> {
let mut images = Vec::new();
// Check if docker is available (cm-agent user is in docker group) with 3 second timeout
// Check if docker is available (cm-agent user is in docker group)
let timeout_str = self.config.command_timeout_seconds.to_string();
let output = Command::new("timeout")
.args(&["3", "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
.args(&[&timeout_str, "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
.output();
let output = match output {
@@ -865,9 +873,8 @@ impl SystemdCollector {
let size_mb = self.parse_docker_size(size_str);
images.push((
format!("I {}", image_name),
image_name.to_string(),
"inactive".to_string(), // Images are informational - use inactive for neutral display
size_str.to_string(),
size_mb
));
}
@@ -908,6 +915,9 @@ impl SystemdCollector {
#[async_trait]
impl Collector for SystemdCollector {
async fn collect_structured(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Clear existing services data to prevent duplicates in cached architecture
agent_data.services.clear();
// Use cached complete data if available and fresh
if let Some(cached_complete_services) = self.get_cached_complete_services() {
for service_data in cached_complete_services {

View File

@@ -79,6 +79,9 @@ pub struct DiskConfig {
pub temperature_critical_celsius: f32,
pub wear_warning_percent: f32,
pub wear_critical_percent: f32,
/// Command timeout in seconds for lsblk, smartctl, etc.
#[serde(default = "default_disk_command_timeout")]
pub command_timeout_seconds: u64,
}
/// Filesystem configuration entry
@@ -108,6 +111,9 @@ pub struct SystemdConfig {
pub http_timeout_seconds: u64,
pub http_connect_timeout_seconds: u64,
pub nginx_latency_critical_ms: f32,
/// Command timeout in seconds for systemctl, docker, du commands
#[serde(default = "default_systemd_command_timeout")]
pub command_timeout_seconds: u64,
}
@@ -132,6 +138,9 @@ pub struct BackupConfig {
pub struct NetworkConfig {
pub enabled: bool,
pub interval_seconds: u64,
/// Command timeout in seconds for ip route, ip addr commands
#[serde(default = "default_network_command_timeout")]
pub command_timeout_seconds: u64,
}
/// Notification configuration
@@ -145,6 +154,9 @@ pub struct NotificationConfig {
pub rate_limit_minutes: u64,
/// Email notification batching interval in seconds (default: 60)
pub aggregation_interval_seconds: u64,
/// Status check interval in seconds for detecting changes (default: 30)
#[serde(default = "default_notification_check_interval")]
pub check_interval_seconds: u64,
/// List of metric names to exclude from email notifications
#[serde(default)]
pub exclude_email_metrics: Vec<String>,
@@ -158,10 +170,26 @@ fn default_heartbeat_interval_seconds() -> u64 {
5
}
fn default_notification_check_interval() -> u64 {
30
}
fn default_maintenance_mode_file() -> String {
"/tmp/cm-maintenance".to_string()
}
fn default_disk_command_timeout() -> u64 {
30
}
fn default_systemd_command_timeout() -> u64 {
15
}
fn default_network_command_timeout() -> u64 {
10
}
impl AgentConfig {
pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
loader::load_config(path)

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard"
version = "0.1.189"
version = "0.1.194"
edition = "2021"
[dependencies]

View File

@@ -32,6 +32,7 @@ struct ServiceInfo {
disk_gb: Option<f32>,
metrics: Vec<(String, f32, Option<String>)>, // (label, value, unit)
widget_status: Status,
service_type: String, // "nginx_site", "container", "image", or empty for parent services
}
impl ServicesWidget {
@@ -169,7 +170,7 @@ impl ServicesWidget {
// Convert Status enum to display text for sub-services
match info.widget_status {
Status::Ok => "active",
Status::Inactive => "inactive",
Status::Inactive => "inactive",
Status::Critical => "failed",
Status::Pending => "pending",
Status::Warning => "warning",
@@ -179,32 +180,62 @@ impl ServicesWidget {
};
let tree_symbol = if is_last { "└─" } else { "├─" };
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Status icon
ratatui::text::Span::styled(
format!("{} ", icon),
Style::default().fg(status_color).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/latency text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
// Docker images use docker whale icon
if info.service_type == "image" {
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Docker icon (simple character for performance)
ratatui::text::Span::styled(
"D ".to_string(),
Style::default().fg(Theme::highlight()).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/metrics text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
} else {
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Status icon
ratatui::text::Span::styled(
format!("{} ", icon),
Style::default().fg(status_color).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/latency text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
}
}
/// Move selection up
@@ -282,9 +313,10 @@ impl Widget for ServicesWidget {
disk_gb: Some(service.disk_gb),
metrics: Vec::new(), // Parent services don't have custom metrics
widget_status: service.service_status,
service_type: String::new(), // Parent services have no type
};
self.parent_services.insert(service.name.clone(), parent_info);
// Process sub-services if any
if !service.sub_services.is_empty() {
let mut sub_list = Vec::new();
@@ -293,12 +325,13 @@ impl Widget for ServicesWidget {
let metrics: Vec<(String, f32, Option<String>)> = sub_service.metrics.iter()
.map(|m| (m.label.clone(), m.value, m.unit.clone()))
.collect();
let sub_info = ServiceInfo {
memory_mb: None, // Not used for sub-services
disk_gb: None, // Not used for sub-services
metrics,
widget_status: sub_service.service_status,
service_type: sub_service.service_type.clone(),
};
sub_list.push((sub_service.name.clone(), sub_info));
}
@@ -342,6 +375,7 @@ impl ServicesWidget {
disk_gb: None,
metrics: Vec::new(),
widget_status: Status::Unknown,
service_type: String::new(),
});
if metric.name.ends_with("_status") {
@@ -377,6 +411,7 @@ impl ServicesWidget {
disk_gb: None,
metrics: Vec::new(),
widget_status: Status::Unknown,
service_type: String::new(), // Unknown type in legacy path
},
));
&mut sub_service_list.last_mut().unwrap().1

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-shared"
version = "0.1.189"
version = "0.1.194"
edition = "2021"
[dependencies]

View File

@@ -149,6 +149,9 @@ pub struct SubServiceData {
pub name: String,
pub service_status: Status,
pub metrics: Vec<SubServiceMetric>,
/// Type of sub-service: "nginx_site", "container", "image"
#[serde(default)]
pub service_type: String,
}
/// Individual metric for a sub-service