Compare commits

...

48 Commits

Author SHA1 Message Date
5e08b34280 Move C-state name cleaning to agent for smaller JSON
All checks were successful
Build and Release / build-and-release (push) Successful in 1m32s
- Agent now extracts "C" + digits pattern (C3, C10) using char parsing
- Removes suffixes like "_ACPI", "_MWAIT" at source
- Reduces JSON payload size over ZMQ
- No regex dependency - uses fast char iteration (~1μs overhead)
- Robust fallback to original name if pattern not found
- Dashboard simplified to use clean names directly

Bump version to v0.1.212

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 14:05:55 +01:00
0d8284b69c Clean C-state display to show only CX format
All checks were successful
Build and Release / build-and-release (push) Successful in 1m18s
- Strip suffixes like "_ACPI" from C-state names
- Display changes from "C3_ACPI:51%" to "C3:51%"
- Cleaner, more concise presentation

Bump version to v0.1.211

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 13:34:01 +01:00
d84690cb3b Move transmission interval to ZMQ config section
All checks were successful
Build and Release / build-and-release (push) Successful in 1m43s
- Changed code to use zmq.transmission_interval_seconds instead of top-level collection_interval_seconds
- Removed collection_interval_seconds from AgentConfig
- Updated validation to check zmq.transmission_interval_seconds
- Improves config organization by grouping all ZMQ settings together

Bump version to v0.1.210

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 13:31:39 +01:00
7c030b33d6 Show top 3 C-states with usage percentages
All checks were successful
Build and Release / build-and-release (push) Successful in 1m21s
- Changed CpuData.cstate from String to Vec<CStateInfo>
- Added CStateInfo struct with name and percent fields
- Collector calculates percentage for each C-state based on accumulated time
- Sorts and returns top 3 C-states by usage
- Dashboard displays: "C10:79% C8:10% C6:8%"

Provides better visibility into CPU idle state distribution.

Bump version to v0.1.209

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 23:45:46 +01:00
c6817537a8 Replace CPU frequency with C-state monitoring
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
- Changed CpuData.frequency_mhz to CpuData.cstate (String)
- Implemented collect_cstate() to read CPU idle depth from sysfs
- Finds deepest C-state with most accumulated time (C0-C10)
- Updated dashboard to display C-state instead of frequency
- More accurate indicator of CPU activity vs power management

Bump version to v0.1.208

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 23:30:14 +01:00
2189d34b16 Bump version to v0.1.207
All checks were successful
Build and Release / build-and-release (push) Successful in 1m9s
2025-11-28 23:16:33 +01:00
28cfd5758f Fix service metrics not showing - remove cache check
The service_status_cache from discovery only has active_state with
all detailed metrics set to None. During collection, get_service_status()
was returning cached data instead of fetching fresh systemctl show data.

Now always fetch fresh data to populate memory_bytes, restart_count,
and uptime_seconds properly.
2025-11-28 23:15:51 +01:00
5deb8cf8d8 Bump version to v0.1.206
All checks were successful
Build and Release / build-and-release (push) Successful in 1m10s
2025-11-28 23:07:20 +01:00
0e01813ff5 Add service metrics from systemctl (memory, uptime, restarts)
Shared:
- Add memory_bytes, restart_count, uptime_seconds to ServiceData

Agent:
- Add new fields to ServiceStatusInfo struct
- Fetch MemoryCurrent, NRestarts, ExecMainStartTimestamp from systemctl show
- Calculate uptime from start timestamp
- Parse and populate new fields in ServiceData
- Remove unused load_state and sub_state fields

Dashboard:
- Add memory_bytes, restart_count, uptime_seconds to ServiceInfo
- Update header: Service, Status, RAM, Uptime, ↻ (restarts)
- Format memory as MB/GB
- Format uptime as Xd Xh, Xh Xm, or Xm
- Show restart count with ! prefix if > 0 to indicate instability

All metrics obtained from single systemctl show call - zero overhead.
2025-11-28 23:06:13 +01:00
c3c9507a42 Bump version to v0.1.205
All checks were successful
Build and Release / build-and-release (push) Successful in 1m22s
2025-11-28 22:37:28 +01:00
4d77ffe17e Remove RAM and Disk columns from services widget header
Changed header from 4 columns to 2 columns:
- Before: Service, Status, RAM, Disk
- After: Service, Status

Matches the removal of memory_mb and disk_gb fields.
2025-11-28 22:37:14 +01:00
14f74b4cac Bump version to v0.1.204
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
2025-11-28 14:33:19 +01:00
67b686f8c7 Remove RAM and disk collection for services
Complete removal of service resource metrics:

Agent:
- Remove memory_mb and disk_gb fields from ServiceData struct
- Remove get_service_memory_usage() method
- Remove get_service_disk_usage() method
- Remove get_directory_size() method
- Remove unused warn import

Dashboard:
- Remove memory_mb and disk_gb from ServiceInfo struct
- Remove memory/disk display from format_parent_service_line
- Remove memory/disk parsing in legacy metric path
- Remove unused format_disk_size() function

Service resource metrics were slow, unreliable, and never worked
properly since structured data migration. Will be handled differently
in the future.
2025-11-28 14:25:12 +01:00
e3996fdb84 Fix compilation errors from command receiver removal
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
- Remove AgentCommand import from agent.rs
- Remove handle_commands() method
- Remove command handling from main loop
- Remove command_port validation checks
2025-11-28 13:01:36 +01:00
f94ca60e69 Bump version to v0.1.203
Some checks failed
Build and Release / build-and-release (push) Failing after 1m36s
2025-11-28 12:53:56 +01:00
c19ff56df8 Remove unused ZMQ command receiver (port 6131)
Service control migrated to SSH, command receiver no longer needed.
- Remove command_receiver Socket from ZmqHandler
- Remove try_receive_command method
- Remove AgentCommand enum
- Remove command_port from ZmqConfig
2025-11-28 12:52:43 +01:00
fe2f604703 Bump version to v0.1.202
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-28 12:45:25 +01:00
8bfd416327 Revert to v0.1.192 - fix agent hang issue
Some checks failed
Build and Release / build-and-release (push) Failing after 1m8s
2025-11-28 12:42:24 +01:00
85c6c624fb Revert D-Bus usage, use systemctl commands only
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
- Remove zbus dependency from agent
- Replace D-Bus Connection calls with systemctl show commands
- Fix agent hang by eliminating blocking D-Bus operations
- get_unit_property now uses systemctl show with property flags
- Memory, disk usage, and nginx config queries use systemctl
- Simpler, more reliable service monitoring
2025-11-28 12:15:04 +01:00
eab3f17428 Fix agent hang by reverting service discovery to systemctl
All checks were successful
Build and Release / build-and-release (push) Successful in 1m31s
The D-Bus ListUnits call in discover_services_internal() was causing
the agent to hang on startup.

**Root cause:**
- D-Bus ListUnits call with complex tuple destructuring hung indefinitely
- Agent never completed first collection cycle
- No collector output in logs

**Fix:**
- Revert discover_services_internal() to use systemctl list-units/list-unit-files
- Keep D-Bus-based property queries (WorkingDirectory, MemoryCurrent, ExecStart)
- Hybrid approach: systemctl for discovery, D-Bus for individual queries

**External commands still used:**
- systemctl list-units, list-unit-files (service discovery)
- smartctl (SMART data)
- sudo du (directory sizes)
- nginx -T (config fallback)

Version bump: 0.1.198 → 0.1.199
2025-11-28 11:57:31 +01:00
7ad149bbe4 Replace all systemctl commands with zbus D-Bus API
All checks were successful
Build and Release / build-and-release (push) Successful in 1m31s
Complete migration from systemctl subprocess calls to native D-Bus communication:

**Removed systemctl commands:**
- systemctl is-active (fallback) - use D-Bus cache from ListUnits
- systemctl show --property=LoadState,ActiveState,SubState - use D-Bus cache
- systemctl show --property=WorkingDirectory - use D-Bus Properties.Get
- systemctl show --property=MemoryCurrent - use D-Bus Properties.Get
- systemctl show nginx --property=ExecStart - use D-Bus Properties.Get

**Implementation details:**
- Added get_unit_property() helper for D-Bus property access
- Made get_nginx_site_metrics() async to support D-Bus calls
- Made get_nginx_sites_internal() async
- Made discover_nginx_sites() async
- Made get_nginx_config_from_systemd() async
- Fixed RwLock guard Send issues by using scoped locks

**Remaining external commands:**
- smartctl (disk.rs) - No Rust alternative for SMART data
- sudo du (systemd.rs) - Directory size measurement
- nginx -T (systemd.rs) - Nginx config fallback
- timeout hostname (nixos.rs) - Rare fallback only

Version bump: 0.1.197 → 0.1.198
2025-11-28 11:46:28 +01:00
b444c88ea0 Replace external commands with native Rust APIs
All checks were successful
Build and Release / build-and-release (push) Successful in 1m54s
Significant performance improvements by eliminating subprocess spawning:

- Replace 'ip' commands with rtnetlink for network interface discovery
- Replace 'docker ps/images' with bollard Docker API client
- Replace 'systemctl list-units' with zbus D-Bus for systemd interaction
- Replace 'df' with statvfs() syscall for filesystem statistics
- Replace 'lsblk' with /proc/mounts parsing

Add interval-based caching to collectors:
- DiskCollector now respects interval_seconds configuration
- SystemdCollector now respects interval_seconds configuration
- CpuCollector now respects interval_seconds configuration

Remove unused command communication infrastructure:
- Remove port 6131 ZMQ command receiver
- Clean up unused AgentCommand types

Dependencies added:
- rtnetlink = "0.14"
- netlink-packet-route = "0.19"
- bollard = "0.17"
- zbus = "4.0"
- nix (fs features for statvfs)
2025-11-28 11:27:33 +01:00
317cf76bd1 Bump version to v0.1.196
All checks were successful
Build and Release / build-and-release (push) Successful in 1m19s
2025-11-27 23:16:40 +01:00
0db1a165b9 Revert "Implement cached collector architecture with configurable timeouts"
This reverts commit 2740de9b54.
2025-11-27 23:12:08 +01:00
3c2955376d Revert "Fix ZMQ sender blocking - move to independent thread with try_read"
This reverts commit 01e1f33b66.
2025-11-27 23:10:55 +01:00
f09ccabc7f Revert "Fix data duplication in cached collector architecture"
This reverts commit 14618c59c6.
2025-11-27 23:09:40 +01:00
43dd5a901a Update CLAUDE.md with correct ZMQ sender architecture 2025-11-27 22:59:38 +01:00
01e1f33b66 Fix ZMQ sender blocking - move to independent thread with try_read
All checks were successful
Build and Release / build-and-release (push) Successful in 1m21s
CRITICAL FIX: The previous cached collector architecture still had ZMQ sending
in the main event loop, where it could block waiting for RwLock when collectors
were writing. This caused the 3-8 second delays you observed.

Changes:
- Move ZMQ publisher to dedicated std::thread (ZMQ sockets aren't thread-safe)
- Use try_read() instead of read() to avoid blocking on write locks
- Send previous data if cache is locked by collector
- ZMQ now sends every 2s regardless of collector timing
- Remove publisher from ZmqHandler (now only handles commands)

Architecture:
- Collectors: Independent tokio tasks updating shared cache
- ZMQ Sender: Dedicated OS thread with its own publisher socket
- Main Loop: Only handles commands and notifications

This ensures ZMQ transmission is NEVER blocked by slow collectors.

Bump version to v0.1.195
2025-11-27 22:56:58 +01:00
ed6399b914 Bump version to v0.1.194
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
2025-11-27 22:46:17 +01:00
14618c59c6 Fix data duplication in cached collector architecture
Critical bug fix: Collectors were appending to Vecs instead of replacing them,
causing duplicate entries with each collection cycle.

Fixed by adding .clear() calls before populating:
- Memory collector: tmpfs Vec (was showing 11+ duplicates)
- Disk collector: drives and pools Vecs
- Systemd collector: services Vec
- Network collector: Already correct (assigns new Vec)

This prevents the exponential growth of duplicate entries in the dashboard UI.
2025-11-27 22:45:44 +01:00
2740de9b54 Implement cached collector architecture with configurable timeouts
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
Major architectural refactor to eliminate false "host offline" alerts:

- Replace sequential blocking collectors with independent async tasks
- Each collector runs at configurable interval and updates shared cache
- ZMQ sender reads cache every 1-2s regardless of collector speed
- Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s)

All intervals now configurable via NixOS config:
- collectors.*.interval_seconds (collection frequency per collector)
- collectors.*.command_timeout_seconds (timeout for shell commands)
- notifications.check_interval_seconds (status change detection rate)

Command timeouts increased from hardcoded 2-3s to configurable 10-30s:
- Disk collector: 30s (SMART operations, lsblk)
- Systemd collector: 15s (systemctl, docker, du commands)
- Network collector: 10s (ip route, ip addr)

Benefits:
- No false "offline" alerts when slow collectors take >10s
- Different update rates for different metric types
- Better resource management with longer timeouts
- Full NixOS configuration control

Bump version to v0.1.193
2025-11-27 22:37:20 +01:00
37f2650200 Document cached collector architecture plan
Add architectural plan for separating ZMQ sending from data collection to prevent false 'host offline' alerts caused by slow collectors.

Key concepts:
- Shared cache (Arc<RwLock<AgentData>>)
- Independent async collector tasks with different update rates
- ZMQ sender always sends every 1s from cache
- Fast collectors (1s), medium (5s), slow (60s)
- No blocking regardless of collector speed
2025-11-27 21:49:44 +01:00
833010e270 Bump version to v0.1.192
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-27 18:34:53 +01:00
549d9d1c72 Replace whale emoji with ASCII 'D' for performance
Emoji rendering in terminals can be very slow, especially when rendered in the hot path (every frame for every docker image). The whale emoji 🐋 was causing significant rendering delays.

Temporary change to ASCII 'D' to test if emoji was the performance issue.
2025-11-27 18:34:27 +01:00
9b84b70581 Bump version to v0.1.191
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-27 18:16:49 +01:00
92c3ee3f2a Add Docker whale icon for docker images
Docker images now display with distinctive 🐋 whale icon in blue (highlight color) instead of status icons. This provides clear visual identification that these are docker images while not implying operational status.
2025-11-27 18:16:33 +01:00
1be55f765d Bump version to v0.1.190
All checks were successful
Build and Release / build-and-release (push) Successful in 1m9s
2025-11-27 18:09:49 +01:00
2f94a4b853 Add service_type field to separate data from presentation
Changes:
- Add service_type field to SubServiceData: 'nginx_site', 'container', 'image'
- Agent sends pure data without display formatting
- Dashboard checks service_type to decide presentation
- Docker images now display without status icon (service_type='image')
- Remove unused image_size_str from docker images tuple

Clean separation: agent provides data, dashboard handles display logic.
2025-11-27 18:09:20 +01:00
ff2b43827a Bump version to v0.1.189
All checks were successful
Build and Release / build-and-release (push) Successful in 1m19s
2025-11-27 17:57:38 +01:00
fac0188c6f Change docker image display format and status
Changes:
- Rename docker images from 'image_node:18...' to 'I node:18...' for conciseness
- Change image status from 'active' to 'inactive' for neutral informational display
- Images now show with gray empty circle ○ instead of green filled circle ●

Docker images are static artifacts without meaningful operational status, so using inactive status provides neutral gray display that won't trigger alerts or affect service status aggregation.
2025-11-27 17:57:24 +01:00
6bb350f016 Bump version to v0.1.188
All checks were successful
Build and Release / build-and-release (push) Successful in 1m8s
2025-11-27 16:39:46 +01:00
374b126446 Reduce all command timeouts to 2-3 seconds max
With 10-second host heartbeat timeout, all command timeouts must be significantly lower to ensure total collection time stays under 10 seconds.

Changed timeouts:
- smartctl: 10s → 3s (critical: multiple drives queried sequentially)
- du: 5s → 2s
- lsblk: 5s → 2s
- systemctl list commands: 5s → 3s
- systemctl show/is-active: 3s → 2s
- docker commands: 5s → 3s
- df, ip commands: 3s → 2s

Total worst-case collection time now capped at more reasonable levels, preventing false host offline alerts from blocking operations.
2025-11-27 16:38:54 +01:00
76c04633b5 Bump version to v0.1.187
All checks were successful
Build and Release / build-and-release (push) Successful in 1m19s
2025-11-27 16:34:42 +01:00
1e0510be81 Add comprehensive timeouts to all blocking system commands
Fixes random host disconnections caused by blocking operations preventing timely ZMQ packet transmission.

Changes:
- Add run_command_with_timeout() wrapper using tokio for async command execution
- Apply 10s timeout to smartctl (prevents 30+ second hangs on failing drives)
- Apply 5s timeout to du, lsblk, systemctl list commands
- Apply 3s timeout to systemctl show/is-active, df, ip commands
- Apply 2s timeout to hostname command
- Use system 'timeout' command for sync operations where async not needed

Critical fixes:
- smartctl: Failing drives could block for 30+ seconds per drive
- du: Large directories (Docker, PostgreSQL) could block 10-30+ seconds
- systemctl/docker: Commands could block indefinitely during system issues

With 1-second collection interval and 10-second heartbeat timeout, any blocking operation >10s causes false "host offline" alerts. These timeouts ensure collection completes quickly even during system degradation.
2025-11-27 16:34:08 +01:00
9a2df906ea Add ZMQ communication statistics tracking and display
All checks were successful
Build and Release / build-and-release (push) Successful in 1m10s
2025-11-27 16:14:45 +01:00
6d6beb207d Parse Docker image sizes to MB and sort services alphabetically
All checks were successful
Build and Release / build-and-release (push) Successful in 1m18s
2025-11-27 15:57:38 +01:00
7a68da01f5 Remove debug logging for NVMe SMART collection
All checks were successful
Build and Release / build-and-release (push) Successful in 1m9s
2025-11-27 15:40:16 +01:00
5be67fed64 Add debug logging for NVMe SMART data collection
All checks were successful
Build and Release / build-and-release (push) Successful in 1m19s
2025-11-27 15:00:48 +01:00
21 changed files with 462 additions and 416 deletions

6
Cargo.lock generated
View File

@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"
[[package]]
name = "cm-dashboard"
version = "0.1.181"
version = "0.1.211"
dependencies = [
"anyhow",
"chrono",
@@ -301,7 +301,7 @@ dependencies = [
[[package]]
name = "cm-dashboard-agent"
version = "0.1.181"
version = "0.1.211"
dependencies = [
"anyhow",
"async-trait",
@@ -324,7 +324,7 @@ dependencies = [
[[package]]
name = "cm-dashboard-shared"
version = "0.1.181"
version = "0.1.211"
dependencies = [
"chrono",
"serde",

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-agent"
version = "0.1.182"
version = "0.1.212"
edition = "2021"
[dependencies]

View File

@@ -4,7 +4,7 @@ use std::time::Duration;
use tokio::time::interval;
use tracing::{debug, error, info};
use crate::communication::{AgentCommand, ZmqHandler};
use crate::communication::ZmqHandler;
use crate::config::AgentConfig;
use crate::collectors::{
Collector,
@@ -114,7 +114,7 @@ impl Agent {
// Set up intervals
let mut transmission_interval = interval(Duration::from_secs(
self.config.collection_interval_seconds,
self.config.zmq.transmission_interval_seconds,
));
let mut notification_interval = interval(Duration::from_secs(30)); // Check notifications every 30s
@@ -134,12 +134,6 @@ impl Agent {
// NOTE: With structured data, we might need to implement status tracking differently
// For now, we skip this until status evaluation is migrated
}
// Handle incoming commands (check periodically)
_ = tokio::time::sleep(Duration::from_millis(100)) => {
if let Err(e) = self.handle_commands().await {
error!("Error handling commands: {}", e);
}
}
_ = &mut shutdown_rx => {
info!("Shutdown signal received, stopping agent loop");
break;
@@ -259,36 +253,4 @@ impl Agent {
Ok(())
}
/// Handle incoming commands from dashboard
async fn handle_commands(&mut self) -> Result<()> {
// Try to receive a command (non-blocking)
if let Ok(Some(command)) = self.zmq_handler.try_receive_command() {
info!("Received command: {:?}", command);
match command {
AgentCommand::CollectNow => {
info!("Received immediate collection request");
if let Err(e) = self.collect_and_broadcast().await {
error!("Failed to collect on demand: {}", e);
}
}
AgentCommand::SetInterval { seconds } => {
info!("Received interval change request: {}s", seconds);
// Note: This would require more complex handling to update the interval
// For now, just acknowledge
}
AgentCommand::ToggleCollector { name, enabled } => {
info!("Received collector toggle request: {} -> {}", name, enabled);
// Note: This would require more complex handling to enable/disable collectors
// For now, just acknowledge
}
AgentCommand::Ping => {
info!("Received ping command");
// Maybe send back a pong or status
}
}
}
Ok(())
}
}

View File

@@ -119,36 +119,69 @@ impl CpuCollector {
utils::parse_u64(content.trim())
}
/// Collect CPU frequency and populate AgentData
async fn collect_frequency(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Try scaling frequency first (more accurate for current frequency)
if let Ok(freq) =
utils::read_proc_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")
{
if let Ok(freq_khz) = utils::parse_u64(freq.trim()) {
let freq_mhz = freq_khz as f32 / 1000.0;
agent_data.system.cpu.frequency_mhz = freq_mhz;
return Ok(());
}
}
/// Collect CPU C-state (idle depth) and populate AgentData with top 3 C-states by usage
async fn collect_cstate(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
// Read C-state usage from first CPU (representative of overall system)
// C-states indicate CPU idle depth: C1=light sleep, C6=deep sleep, C10=deepest
// Fallback: parse /proc/cpuinfo for base frequency
if let Ok(content) = utils::read_proc_file("/proc/cpuinfo") {
for line in content.lines() {
if line.starts_with("cpu MHz") {
if let Some(freq_str) = line.split(':').nth(1) {
if let Ok(freq_mhz) = utils::parse_f32(freq_str) {
agent_data.system.cpu.frequency_mhz = freq_mhz;
return Ok(());
let mut cstate_times: Vec<(String, u64)> = Vec::new();
let mut total_time: u64 = 0;
// Collect all C-state times from CPU0
for state_num in 0..=10 {
let time_path = format!("/sys/devices/system/cpu/cpu0/cpuidle/state{}/time", state_num);
let name_path = format!("/sys/devices/system/cpu/cpu0/cpuidle/state{}/name", state_num);
if let Ok(time_str) = utils::read_proc_file(&time_path) {
if let Ok(time) = utils::parse_u64(time_str.trim()) {
if let Ok(name) = utils::read_proc_file(&name_path) {
let state_name = name.trim();
// Skip POLL state (not real idle)
if state_name != "POLL" && time > 0 {
// Extract "C" + digits pattern (C3, C10, etc.) to reduce JSON size
// Handles formats like "C3_ACPI", "C10_MWAIT", etc.
let clean_name = if let Some(c_pos) = state_name.find('C') {
let rest = &state_name[c_pos + 1..];
let digit_count = rest.chars().take_while(|c| c.is_ascii_digit()).count();
if digit_count > 0 {
state_name[c_pos..c_pos + 1 + digit_count].to_string()
} else {
state_name.to_string()
}
} else {
state_name.to_string()
};
cstate_times.push((clean_name, time));
total_time += time;
}
}
break; // Only need first CPU entry
}
} else {
// No more states available
break;
}
}
debug!("CPU frequency not available");
// Leave frequency as 0.0 if not available
// Sort by time descending to get top 3
cstate_times.sort_by(|a, b| b.1.cmp(&a.1));
// Calculate percentages for top 3 and populate AgentData
agent_data.system.cpu.cstates = cstate_times
.iter()
.take(3)
.map(|(name, time)| {
let percent = if total_time > 0 {
(*time as f32 / total_time as f32) * 100.0
} else {
0.0
};
cm_dashboard_shared::CStateInfo {
name: name.clone(),
percent,
}
})
.collect();
Ok(())
}
}
@@ -165,8 +198,8 @@ impl Collector for CpuCollector {
// Collect temperature (optional)
self.collect_temperature(agent_data).await?;
// Collect frequency (optional)
self.collect_frequency(agent_data).await?;
// Collect C-state (CPU idle depth)
self.collect_cstate(agent_data).await?;
let duration = start.elapsed();
debug!("CPU collection completed in {:?}", duration);

View File

@@ -112,9 +112,12 @@ impl DiskCollector {
/// Get block devices and their mount points using lsblk
async fn get_mount_devices(&self) -> Result<HashMap<String, String>, CollectorError> {
let output = Command::new("lsblk")
.args(&["-rn", "-o", "NAME,MOUNTPOINT"])
.output()
use super::run_command_with_timeout;
let mut cmd = Command::new("lsblk");
cmd.args(&["-rn", "-o", "NAME,MOUNTPOINT"]);
let output = run_command_with_timeout(cmd, 2).await
.map_err(|e| CollectorError::SystemRead {
path: "block devices".to_string(),
error: e.to_string(),
@@ -186,8 +189,8 @@ impl DiskCollector {
/// Get filesystem info for a single mount point
fn get_filesystem_info(&self, mount_point: &str) -> Result<(u64, u64), CollectorError> {
let output = Command::new("df")
.args(&["--block-size=1", mount_point])
let output = std::process::Command::new("timeout")
.args(&["2", "df", "--block-size=1", mount_point])
.output()
.map_err(|e| CollectorError::SystemRead {
path: format!("df {}", mount_point),
@@ -413,7 +416,9 @@ impl DiskCollector {
/// Get SMART data for a single drive
async fn get_smart_data(&self, drive_name: &str) -> Result<SmartData, CollectorError> {
// Use direct smartctl (no sudo) - service has CAP_SYS_RAWIO capability
use super::run_command_with_timeout;
// Use direct smartctl (no sudo) - service has CAP_SYS_RAWIO and CAP_SYS_ADMIN capabilities
// For NVMe drives, specify device type explicitly
let mut cmd = Command::new("smartctl");
if drive_name.starts_with("nvme") {
@@ -422,7 +427,7 @@ impl DiskCollector {
cmd.args(&["-a", &format!("/dev/{}", drive_name)]);
}
let output = cmd.output()
let output = run_command_with_timeout(cmd, 3).await
.map_err(|e| CollectorError::SystemRead {
path: format!("SMART data for {}", drive_name),
error: e.to_string(),
@@ -757,9 +762,9 @@ impl DiskCollector {
/// Get drive information for a mount path
fn get_drive_info_for_path(&self, path: &str) -> anyhow::Result<PoolDrive> {
// Use lsblk to find the backing device
let output = Command::new("lsblk")
.args(&["-rn", "-o", "NAME,MOUNTPOINT"])
// Use lsblk to find the backing device with timeout
let output = Command::new("timeout")
.args(&["2", "lsblk", "-rn", "-o", "NAME,MOUNTPOINT"])
.output()
.map_err(|e| anyhow::anyhow!("Failed to run lsblk: {}", e))?;

View File

@@ -105,12 +105,12 @@ impl MemoryCollector {
return Ok(());
}
// Get usage data for all tmpfs mounts at once using df
let mut df_args = vec!["df", "--output=target,size,used", "--block-size=1"];
// Get usage data for all tmpfs mounts at once using df (with 2 second timeout)
let mut df_args = vec!["2", "df", "--output=target,size,used", "--block-size=1"];
df_args.extend(tmpfs_mounts.iter().map(|s| s.as_str()));
let df_output = std::process::Command::new(df_args[0])
.args(&df_args[1..])
let df_output = std::process::Command::new("timeout")
.args(&df_args[..])
.output()
.map_err(|e| CollectorError::SystemRead {
path: "tmpfs mounts".to_string(),

View File

@@ -1,6 +1,8 @@
use async_trait::async_trait;
use cm_dashboard_shared::{AgentData};
use std::process::{Command, Output};
use std::time::Duration;
use tokio::time::timeout;
pub mod backup;
pub mod cpu;
@@ -13,6 +15,20 @@ pub mod systemd;
pub use error::CollectorError;
/// Run a command with a timeout to prevent blocking
pub async fn run_command_with_timeout(mut cmd: Command, timeout_secs: u64) -> std::io::Result<Output> {
let timeout_duration = Duration::from_secs(timeout_secs);
match timeout(timeout_duration, tokio::task::spawn_blocking(move || cmd.output())).await {
Ok(Ok(result)) => result,
Ok(Err(e)) => Err(std::io::Error::new(std::io::ErrorKind::Other, e)),
Err(_) => Err(std::io::Error::new(
std::io::ErrorKind::TimedOut,
format!("Command timed out after {} seconds", timeout_secs)
)),
}
}
/// Base trait for all collectors with direct structured data output
#[async_trait]

View File

@@ -51,7 +51,7 @@ impl NetworkCollector {
/// Get the primary physical interface (the one with default route)
fn get_primary_physical_interface() -> Option<String> {
match Command::new("ip").args(["route", "show", "default"]).output() {
match Command::new("timeout").args(["2", "ip", "route", "show", "default"]).output() {
Ok(output) if output.status.success() => {
let output_str = String::from_utf8_lossy(&output.stdout);
// Parse: "default via 192.168.1.1 dev eno1 ..."
@@ -110,7 +110,7 @@ impl NetworkCollector {
// Parse VLAN configuration
let vlan_map = Self::parse_vlan_config();
match Command::new("ip").args(["-j", "addr"]).output() {
match Command::new("timeout").args(["2", "ip", "-j", "addr"]).output() {
Ok(output) if output.status.success() => {
let json_str = String::from_utf8_lossy(&output.stdout);

View File

@@ -43,8 +43,8 @@ impl NixOSCollector {
match fs::read_to_string("/etc/hostname") {
Ok(hostname) => Some(hostname.trim().to_string()),
Err(_) => {
// Fallback to hostname command
match Command::new("hostname").output() {
// Fallback to hostname command (with 2 second timeout)
match Command::new("timeout").args(["2", "hostname"]).output() {
Ok(output) => Some(String::from_utf8_lossy(&output.stdout).trim().to_string()),
Err(_) => None,
}

View File

@@ -43,9 +43,10 @@ struct ServiceCacheState {
/// Cached service status information from systemctl list-units
#[derive(Debug, Clone)]
struct ServiceStatusInfo {
load_state: String,
active_state: String,
sub_state: String,
memory_bytes: Option<u64>,
restart_count: Option<u32>,
start_timestamp: Option<u64>,
}
impl SystemdCollector {
@@ -86,14 +87,20 @@ impl SystemdCollector {
let mut complete_service_data = Vec::new();
for service_name in &monitored_services {
match self.get_service_status(service_name) {
Ok((active_status, _detailed_info)) => {
let memory_mb = self.get_service_memory_usage(service_name).await.unwrap_or(0.0);
let disk_gb = self.get_service_disk_usage(service_name).await.unwrap_or(0.0);
Ok(status_info) => {
let mut sub_services = Vec::new();
// Calculate uptime if we have start timestamp
let uptime_seconds = status_info.start_timestamp.and_then(|start| {
let now = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.ok()?
.as_secs();
Some(now.saturating_sub(start))
});
// Sub-service metrics for specific services (always include cached results)
if service_name.contains("nginx") && active_status == "active" {
if service_name.contains("nginx") && status_info.active_state == "active" {
let nginx_sites = self.get_nginx_site_metrics();
for (site_name, latency_ms) in nginx_sites {
let site_status = if latency_ms >= 0.0 && latency_ms < self.config.nginx_latency_critical_ms {
@@ -113,11 +120,12 @@ impl SystemdCollector {
name: site_name.clone(),
service_status: self.calculate_service_status(&site_name, &site_status),
metrics,
service_type: "nginx_site".to_string(),
});
}
}
if service_name.contains("docker") && active_status == "active" {
if service_name.contains("docker") && status_info.active_state == "active" {
let docker_containers = self.get_docker_containers();
for (container_name, container_status) in docker_containers {
// For now, docker containers have no additional metrics
@@ -128,23 +136,25 @@ impl SystemdCollector {
name: container_name.clone(),
service_status: self.calculate_service_status(&container_name, &container_status),
metrics,
service_type: "container".to_string(),
});
}
// Add Docker images
let docker_images = self.get_docker_images();
for (image_name, image_status, image_size) in docker_images {
for (image_name, image_status, image_size_mb) in docker_images {
let mut metrics = Vec::new();
metrics.push(SubServiceMetric {
label: "size".to_string(),
value: 0.0, // Size as string in name instead
unit: None,
value: image_size_mb,
unit: Some("MB".to_string()),
});
sub_services.push(SubServiceData {
name: format!("{} ({})", image_name, image_size),
name: image_name.to_string(),
service_status: self.calculate_service_status(&image_name, &image_status),
metrics,
service_type: "image".to_string(),
});
}
}
@@ -152,11 +162,12 @@ impl SystemdCollector {
// Create complete service data
let service_data = ServiceData {
name: service_name.clone(),
memory_mb,
disk_gb,
user_stopped: false, // TODO: Integrate with service tracker
service_status: self.calculate_service_status(service_name, &active_status),
service_status: self.calculate_service_status(service_name, &status_info.active_state),
sub_services,
memory_bytes: status_info.memory_bytes,
restart_count: status_info.restart_count,
uptime_seconds,
};
// Add to AgentData and cache
@@ -169,6 +180,10 @@ impl SystemdCollector {
}
}
// Sort services alphabetically by name
agent_data.services.sort_by(|a, b| a.name.cmp(&b.name));
complete_service_data.sort_by(|a, b| a.name.cmp(&b.name));
// Update cached state
{
let mut state = self.state.write().unwrap();
@@ -247,18 +262,18 @@ impl SystemdCollector {
/// Auto-discover interesting services to monitor
fn discover_services_internal(&self) -> Result<(Vec<String>, std::collections::HashMap<String, ServiceStatusInfo>)> {
// First: Get all service unit files
let unit_files_output = Command::new("systemctl")
.args(&["list-unit-files", "--type=service", "--no-pager", "--plain"])
// First: Get all service unit files (with 3 second timeout)
let unit_files_output = Command::new("timeout")
.args(&["3", "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
.output()?;
if !unit_files_output.status.success() {
return Err(anyhow::anyhow!("systemctl list-unit-files command failed"));
}
// Second: Get runtime status of all units
let units_status_output = Command::new("systemctl")
.args(&["list-units", "--type=service", "--all", "--no-pager", "--plain"])
// Second: Get runtime status of all units (with 3 second timeout)
let units_status_output = Command::new("timeout")
.args(&["3", "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
.output()?;
if !units_status_output.status.success() {
@@ -288,14 +303,13 @@ impl SystemdCollector {
let fields: Vec<&str> = line.split_whitespace().collect();
if fields.len() >= 4 && fields[0].ends_with(".service") {
let service_name = fields[0].trim_end_matches(".service");
let load_state = fields.get(1).unwrap_or(&"unknown").to_string();
let active_state = fields.get(2).unwrap_or(&"unknown").to_string();
let sub_state = fields.get(3).unwrap_or(&"unknown").to_string();
status_cache.insert(service_name.to_string(), ServiceStatusInfo {
load_state,
active_state,
sub_state,
memory_bytes: None,
restart_count: None,
start_timestamp: None,
});
}
}
@@ -304,9 +318,10 @@ impl SystemdCollector {
for service_name in &all_service_names {
if !status_cache.contains_key(service_name) {
status_cache.insert(service_name.to_string(), ServiceStatusInfo {
load_state: "not-loaded".to_string(),
active_state: "inactive".to_string(),
sub_state: "dead".to_string(),
memory_bytes: None,
restart_count: None,
start_timestamp: None,
});
}
}
@@ -338,36 +353,60 @@ impl SystemdCollector {
Ok((services, status_cache))
}
/// Get service status from cache (if available) or fallback to systemctl
fn get_service_status(&self, service: &str) -> Result<(String, String)> {
// Try to get status from cache first
if let Ok(state) = self.state.read() {
if let Some(cached_info) = state.service_status_cache.get(service) {
let active_status = cached_info.active_state.clone();
let detailed_info = format!(
"LoadState={}\nActiveState={}\nSubState={}",
cached_info.load_state,
cached_info.active_state,
cached_info.sub_state
);
return Ok((active_status, detailed_info));
/// Get service status with detailed metrics from systemctl
fn get_service_status(&self, service: &str) -> Result<ServiceStatusInfo> {
// Always fetch fresh data to get detailed metrics (memory, restarts, uptime)
// Note: Cache in service_status_cache only has basic active_state from discovery,
// with all detailed metrics set to None. We need fresh systemctl show data.
let output = Command::new("timeout")
.args(&[
"2",
"systemctl",
"show",
&format!("{}.service", service),
"--property=LoadState,ActiveState,SubState,MemoryCurrent,NRestarts,ExecMainStartTimestamp"
])
.output()?;
let output_str = String::from_utf8(output.stdout)?;
// Parse properties
let mut active_state = String::new();
let mut memory_bytes = None;
let mut restart_count = None;
let mut start_timestamp = None;
for line in output_str.lines() {
if let Some(value) = line.strip_prefix("ActiveState=") {
active_state = value.to_string();
} else if let Some(value) = line.strip_prefix("MemoryCurrent=") {
if value != "[not set]" {
memory_bytes = value.parse().ok();
}
} else if let Some(value) = line.strip_prefix("NRestarts=") {
restart_count = value.parse().ok();
} else if let Some(value) = line.strip_prefix("ExecMainStartTimestamp=") {
if value != "[not set]" && !value.is_empty() {
// Parse timestamp to seconds since epoch
if let Ok(output) = Command::new("date")
.args(&["+%s", "-d", value])
.output()
{
if let Ok(timestamp_str) = String::from_utf8(output.stdout) {
start_timestamp = timestamp_str.trim().parse().ok();
}
}
}
}
}
// Fallback to systemctl if not in cache
let output = Command::new("systemctl")
.args(&["is-active", &format!("{}.service", service)])
.output()?;
let active_status = String::from_utf8(output.stdout)?.trim().to_string();
// Get more detailed info
let output = Command::new("systemctl")
.args(&["show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
.output()?;
let detailed_info = String::from_utf8(output.stdout)?;
Ok((active_status, detailed_info))
Ok(ServiceStatusInfo {
active_state,
memory_bytes,
restart_count,
start_timestamp,
})
}
/// Check if service name matches pattern (supports wildcards like nginx*)
@@ -409,75 +448,6 @@ impl SystemdCollector {
true
}
/// Get disk usage for a specific service
async fn get_service_disk_usage(&self, service_name: &str) -> Result<f32, CollectorError> {
// Check if this service has configured directory paths
if let Some(dirs) = self.config.service_directories.get(service_name) {
// Service has configured paths - use the first accessible one
for dir in dirs {
if let Some(size) = self.get_directory_size(dir) {
return Ok(size);
}
}
// If configured paths failed, return 0
return Ok(0.0);
}
// No configured path - try to get WorkingDirectory from systemctl
let output = Command::new("systemctl")
.args(&["show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
.output()
.map_err(|e| CollectorError::SystemRead {
path: format!("WorkingDirectory for {}", service_name),
error: e.to_string(),
})?;
let output_str = String::from_utf8_lossy(&output.stdout);
for line in output_str.lines() {
if line.starts_with("WorkingDirectory=") && !line.contains("[not set]") {
let dir = line.strip_prefix("WorkingDirectory=").unwrap_or("");
if !dir.is_empty() && dir != "/" {
return Ok(self.get_directory_size(dir).unwrap_or(0.0));
}
}
}
Ok(0.0)
}
/// Get size of a directory in GB
fn get_directory_size(&self, path: &str) -> Option<f32> {
let output = Command::new("sudo")
.args(&["du", "-sb", path])
.output()
.ok()?;
if !output.status.success() {
// Log permission errors for debugging but don't spam logs
let stderr = String::from_utf8_lossy(&output.stderr);
if stderr.contains("Permission denied") {
debug!("Permission denied accessing directory: {}", path);
} else {
debug!("Failed to get size for directory {}: {}", path, stderr);
}
return None;
}
let output_str = String::from_utf8(output.stdout).ok()?;
let size_str = output_str.split_whitespace().next()?;
if let Ok(size_bytes) = size_str.parse::<u64>() {
let size_gb = size_bytes as f32 / (1024.0 * 1024.0 * 1024.0);
// Return size even if very small (minimum 0.001 GB = 1MB for visibility)
if size_gb > 0.0 {
Some(size_gb.max(0.001))
} else {
None
}
} else {
None
}
}
/// Calculate service status, taking user-stopped services into account
fn calculate_service_status(&self, service_name: &str, active_status: &str) -> Status {
match active_status.to_lowercase().as_str() {
@@ -495,33 +465,6 @@ impl SystemdCollector {
}
}
/// Get memory usage for a specific service
async fn get_service_memory_usage(&self, service_name: &str) -> Result<f32, CollectorError> {
let output = Command::new("systemctl")
.args(&["show", &format!("{}.service", service_name), "--property=MemoryCurrent"])
.output()
.map_err(|e| CollectorError::SystemRead {
path: format!("memory usage for {}", service_name),
error: e.to_string(),
})?;
let output_str = String::from_utf8_lossy(&output.stdout);
for line in output_str.lines() {
if line.starts_with("MemoryCurrent=") {
if let Some(mem_str) = line.strip_prefix("MemoryCurrent=") {
if mem_str != "[not set]" {
if let Ok(memory_bytes) = mem_str.parse::<u64>() {
return Ok(memory_bytes as f32 / (1024.0 * 1024.0)); // Convert to MB
}
}
}
}
}
Ok(0.0)
}
/// Check if service collection cache should be updated
fn should_update_cache(&self) -> bool {
let state = self.state.read().unwrap();
@@ -774,9 +717,9 @@ impl SystemdCollector {
let mut containers = Vec::new();
// Check if docker is available (cm-agent user is in docker group)
// Use -a to show ALL containers (running and stopped)
let output = Command::new("docker")
.args(&["ps", "-a", "--format", "{{.Names}},{{.Status}}"])
// Use -a to show ALL containers (running and stopped) with 3 second timeout
let output = Command::new("timeout")
.args(&["3", "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
.output();
let output = match output {
@@ -815,11 +758,11 @@ impl SystemdCollector {
}
/// Get docker images as sub-services
fn get_docker_images(&self) -> Vec<(String, String, String)> {
fn get_docker_images(&self) -> Vec<(String, String, f32)> {
let mut images = Vec::new();
// Check if docker is available (cm-agent user is in docker group)
let output = Command::new("docker")
.args(&["images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
// Check if docker is available (cm-agent user is in docker group) with 3 second timeout
let output = Command::new("timeout")
.args(&["3", "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
.output();
let output = match output {
@@ -845,23 +788,54 @@ impl SystemdCollector {
let parts: Vec<&str> = line.split(',').collect();
if parts.len() >= 2 {
let image_name = parts[0].trim();
let size = parts[1].trim();
let size_str = parts[1].trim();
// Skip <none>:<none> images (dangling images)
if image_name.contains("<none>") {
continue;
}
// Parse size to MB (sizes come as "142MB", "1.5GB", "512kB", etc.)
let size_mb = self.parse_docker_size(size_str);
images.push((
format!("image_{}", image_name),
"active".to_string(), // Images are always "active" (present)
size.to_string()
image_name.to_string(),
"inactive".to_string(), // Images are informational - use inactive for neutral display
size_mb
));
}
}
images
}
/// Parse Docker size string to MB
fn parse_docker_size(&self, size_str: &str) -> f32 {
let size_upper = size_str.to_uppercase();
// Extract numeric part and unit
let mut num_str = String::new();
let mut unit = String::new();
for ch in size_upper.chars() {
if ch.is_ascii_digit() || ch == '.' {
num_str.push(ch);
} else if ch.is_alphabetic() {
unit.push(ch);
}
}
let value: f32 = num_str.parse().unwrap_or(0.0);
// Convert to MB
match unit.as_str() {
"KB" | "K" => value / 1024.0,
"MB" | "M" => value,
"GB" | "G" => value * 1024.0,
"TB" | "T" => value * 1024.0 * 1024.0,
_ => value, // Assume bytes if no unit
}
}
}
#[async_trait]

View File

@@ -5,10 +5,9 @@ use zmq::{Context, Socket, SocketType};
use crate::config::ZmqConfig;
/// ZMQ communication handler for publishing metrics and receiving commands
/// ZMQ communication handler for publishing metrics
pub struct ZmqHandler {
publisher: Socket,
command_receiver: Socket,
}
impl ZmqHandler {
@@ -26,20 +25,8 @@ impl ZmqHandler {
publisher.set_sndhwm(1000)?; // High water mark for outbound messages
publisher.set_linger(1000)?; // Linger time on close
// Create command receiver socket (PULL socket to receive commands from dashboard)
let command_receiver = context.socket(SocketType::PULL)?;
let cmd_bind_address = format!("tcp://{}:{}", config.bind_address, config.command_port);
command_receiver.bind(&cmd_bind_address)?;
info!("ZMQ command receiver bound to {}", cmd_bind_address);
// Set non-blocking mode for command receiver
command_receiver.set_rcvtimeo(0)?; // Non-blocking receive
command_receiver.set_linger(1000)?;
Ok(Self {
publisher,
command_receiver,
})
}
@@ -65,36 +52,4 @@ impl ZmqHandler {
Ok(())
}
/// Try to receive a command (non-blocking)
pub fn try_receive_command(&self) -> Result<Option<AgentCommand>> {
match self.command_receiver.recv_bytes(zmq::DONTWAIT) {
Ok(bytes) => {
debug!("Received command message ({} bytes)", bytes.len());
let command: AgentCommand = serde_json::from_slice(&bytes)
.map_err(|e| anyhow::anyhow!("Failed to deserialize command: {}", e))?;
debug!("Parsed command: {:?}", command);
Ok(Some(command))
}
Err(zmq::Error::EAGAIN) => {
// No message available (non-blocking)
Ok(None)
}
Err(e) => Err(anyhow::anyhow!("ZMQ receive error: {}", e)),
}
}
}
/// Commands that can be sent to the agent
#[derive(Debug, Clone, serde::Deserialize, serde::Serialize)]
pub enum AgentCommand {
/// Request immediate metric collection
CollectNow,
/// Change collection interval
SetInterval { seconds: u64 },
/// Enable/disable a collector
ToggleCollector { name: String, enabled: bool },
/// Request status/health check
Ping,
}

View File

@@ -13,14 +13,12 @@ pub struct AgentConfig {
pub collectors: CollectorConfig,
pub cache: CacheConfig,
pub notifications: NotificationConfig,
pub collection_interval_seconds: u64,
}
/// ZMQ communication configuration
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ZmqConfig {
pub publisher_port: u16,
pub command_port: u16,
pub bind_address: String,
pub transmission_interval_seconds: u64,
/// Heartbeat transmission interval in seconds for host connectivity detection

View File

@@ -7,21 +7,13 @@ pub fn validate_config(config: &AgentConfig) -> Result<()> {
bail!("ZMQ publisher port cannot be 0");
}
if config.zmq.command_port == 0 {
bail!("ZMQ command port cannot be 0");
}
if config.zmq.publisher_port == config.zmq.command_port {
bail!("ZMQ publisher and command ports cannot be the same");
}
if config.zmq.bind_address.is_empty() {
bail!("ZMQ bind address cannot be empty");
}
// Validate collection interval
if config.collection_interval_seconds == 0 {
bail!("Collection interval cannot be 0");
// Validate ZMQ transmission interval
if config.zmq.transmission_interval_seconds == 0 {
bail!("ZMQ transmission interval cannot be 0");
}
// Validate CPU thresholds

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard"
version = "0.1.182"
version = "0.1.212"
edition = "2021"
[dependencies]

View File

@@ -215,7 +215,7 @@ impl Dashboard {
// Update TUI with new metrics (only if not headless)
if let Some(ref mut tui_app) = self.tui_app {
tui_app.update_metrics(&self.metric_store);
tui_app.update_metrics(&mut self.metric_store);
}
}

View File

@@ -5,6 +5,14 @@ use tracing::{debug, info, warn};
use super::MetricDataPoint;
/// ZMQ communication statistics per host
#[derive(Debug, Clone)]
pub struct ZmqStats {
pub packets_received: u64,
pub last_packet_time: Instant,
pub last_packet_age_secs: f64,
}
/// Central metric storage for the dashboard
pub struct MetricStore {
/// Current structured data: hostname -> AgentData
@@ -13,6 +21,8 @@ pub struct MetricStore {
historical_metrics: HashMap<String, Vec<MetricDataPoint>>,
/// Last heartbeat timestamp per host
last_heartbeat: HashMap<String, Instant>,
/// ZMQ communication statistics per host
zmq_stats: HashMap<String, ZmqStats>,
/// Configuration
max_metrics_per_host: usize,
history_retention: Duration,
@@ -24,6 +34,7 @@ impl MetricStore {
current_agent_data: HashMap::new(),
historical_metrics: HashMap::new(),
last_heartbeat: HashMap::new(),
zmq_stats: HashMap::new(),
max_metrics_per_host,
history_retention: Duration::from_secs(history_retention_hours * 3600),
}
@@ -44,6 +55,16 @@ impl MetricStore {
self.last_heartbeat.insert(hostname.clone(), now);
debug!("Updated heartbeat for host {}", hostname);
// Update ZMQ stats
let stats = self.zmq_stats.entry(hostname.clone()).or_insert(ZmqStats {
packets_received: 0,
last_packet_time: now,
last_packet_age_secs: 0.0,
});
stats.packets_received += 1;
stats.last_packet_time = now;
stats.last_packet_age_secs = 0.0; // Just received
// Add to history
let host_history = self
.historical_metrics
@@ -65,6 +86,15 @@ impl MetricStore {
self.current_agent_data.get(hostname)
}
/// Get ZMQ communication statistics for a host
pub fn get_zmq_stats(&mut self, hostname: &str) -> Option<ZmqStats> {
let now = Instant::now();
self.zmq_stats.get_mut(hostname).map(|stats| {
// Update packet age
stats.last_packet_age_secs = now.duration_since(stats.last_packet_time).as_secs_f64();
stats.clone()
})
}
/// Get connected hosts (hosts with recent heartbeats)
pub fn get_connected_hosts(&self, timeout: Duration) -> Vec<String> {

View File

@@ -100,7 +100,7 @@ impl TuiApp {
}
/// Update widgets with structured data from store (only for current host)
pub fn update_metrics(&mut self, metric_store: &MetricStore) {
pub fn update_metrics(&mut self, metric_store: &mut MetricStore) {
if let Some(hostname) = self.current_host.clone() {
// Get structured data for this host
if let Some(agent_data) = metric_store.get_agent_data(&hostname) {
@@ -110,6 +110,14 @@ impl TuiApp {
host_widgets.system_widget.update_from_agent_data(agent_data);
host_widgets.services_widget.update_from_agent_data(agent_data);
// Update ZMQ stats
if let Some(zmq_stats) = metric_store.get_zmq_stats(&hostname) {
host_widgets.system_widget.update_zmq_stats(
zmq_stats.packets_received,
zmq_stats.last_packet_age_secs
);
}
host_widgets.last_update = Some(Instant::now());
}
}

View File

@@ -28,10 +28,12 @@ pub struct ServicesWidget {
#[derive(Clone)]
struct ServiceInfo {
memory_mb: Option<f32>,
disk_gb: Option<f32>,
metrics: Vec<(String, f32, Option<String>)>, // (label, value, unit)
widget_status: Status,
service_type: String, // "nginx_site", "container", "image", or empty for parent services
memory_bytes: Option<u64>,
restart_count: Option<u32>,
uptime_seconds: Option<u64>,
}
impl ServicesWidget {
@@ -51,8 +53,6 @@ impl ServicesWidget {
if metric_name.starts_with("service_") {
if let Some(end_pos) = metric_name
.rfind("_status")
.or_else(|| metric_name.rfind("_memory_mb"))
.or_else(|| metric_name.rfind("_disk_gb"))
.or_else(|| metric_name.rfind("_latency_ms"))
{
let service_part = &metric_name[8..end_pos]; // Remove "service_" prefix
@@ -75,36 +75,8 @@ impl ServicesWidget {
None
}
/// Format disk size with appropriate units (kB/MB/GB)
fn format_disk_size(size_gb: f32) -> String {
let size_mb = size_gb * 1024.0; // Convert GB to MB
if size_mb >= 1024.0 {
// Show as GB
format!("{:.1}GB", size_gb)
} else if size_mb >= 1.0 {
// Show as MB
format!("{:.0}MB", size_mb)
} else if size_mb >= 0.001 {
// Convert to kB
let size_kb = size_mb * 1024.0;
format!("{:.0}kB", size_kb)
} else {
// Show very small sizes as bytes
let size_bytes = size_mb * 1024.0 * 1024.0;
format!("{:.0}B", size_bytes)
}
}
/// Format parent service line - returns text without icon for span formatting
fn format_parent_service_line(&self, name: &str, info: &ServiceInfo) -> String {
let memory_str = info
.memory_mb
.map_or("0M".to_string(), |m| format!("{:.0}M", m));
let disk_str = info
.disk_gb
.map_or("0".to_string(), |d| Self::format_disk_size(d));
// Truncate long service names to fit layout (account for icon space)
let short_name = if name.len() > 22 {
format!("{}...", &name[..19])
@@ -115,7 +87,7 @@ impl ServicesWidget {
// Convert Status enum to display text
let status_str = match info.widget_status {
Status::Ok => "active",
Status::Inactive => "inactive",
Status::Inactive => "inactive",
Status::Critical => "failed",
Status::Pending => "pending",
Status::Warning => "warning",
@@ -123,9 +95,43 @@ impl ServicesWidget {
Status::Offline => "offline",
};
// Format memory
let memory_str = info.memory_bytes.map_or("-".to_string(), |bytes| {
let mb = bytes as f64 / (1024.0 * 1024.0);
if mb >= 1000.0 {
format!("{:.1}G", mb / 1024.0)
} else {
format!("{:.0}M", mb)
}
});
// Format uptime
let uptime_str = info.uptime_seconds.map_or("-".to_string(), |secs| {
let days = secs / 86400;
let hours = (secs % 86400) / 3600;
let mins = (secs % 3600) / 60;
if days > 0 {
format!("{}d{}h", days, hours)
} else if hours > 0 {
format!("{}h{}m", hours, mins)
} else {
format!("{}m", mins)
}
});
// Format restarts (show "!" if > 0 to indicate instability)
let restart_str = info.restart_count.map_or("-".to_string(), |count| {
if count > 0 {
format!("!{}", count)
} else {
"0".to_string()
}
});
format!(
"{:<23} {:<10} {:<8} {:<8}",
short_name, status_str, memory_str, disk_str
"{:<23} {:<10} {:<8} {:<8} {:<5}",
short_name, status_str, memory_str, uptime_str, restart_str
)
}
@@ -169,7 +175,7 @@ impl ServicesWidget {
// Convert Status enum to display text for sub-services
match info.widget_status {
Status::Ok => "active",
Status::Inactive => "inactive",
Status::Inactive => "inactive",
Status::Critical => "failed",
Status::Pending => "pending",
Status::Warning => "warning",
@@ -179,32 +185,62 @@ impl ServicesWidget {
};
let tree_symbol = if is_last { "└─" } else { "├─" };
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Status icon
ratatui::text::Span::styled(
format!("{} ", icon),
Style::default().fg(status_color).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/latency text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
// Docker images use docker whale icon
if info.service_type == "image" {
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Docker icon (simple character for performance)
ratatui::text::Span::styled(
"D ".to_string(),
Style::default().fg(Theme::highlight()).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/metrics text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
} else {
vec![
// Indentation and tree prefix
ratatui::text::Span::styled(
format!(" {} ", tree_symbol),
Typography::tree(),
),
// Status icon
ratatui::text::Span::styled(
format!("{} ", icon),
Style::default().fg(status_color).bg(Theme::background()),
),
// Service name
ratatui::text::Span::styled(
format!("{:<18} ", short_name),
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
// Status/latency text
ratatui::text::Span::styled(
status_str,
Style::default()
.fg(Theme::secondary_text())
.bg(Theme::background()),
),
]
}
}
/// Move selection up
@@ -278,13 +314,15 @@ impl Widget for ServicesWidget {
for service in &agent_data.services {
// Store parent service
let parent_info = ServiceInfo {
memory_mb: Some(service.memory_mb),
disk_gb: Some(service.disk_gb),
metrics: Vec::new(), // Parent services don't have custom metrics
widget_status: service.service_status,
service_type: String::new(), // Parent services have no type
memory_bytes: service.memory_bytes,
restart_count: service.restart_count,
uptime_seconds: service.uptime_seconds,
};
self.parent_services.insert(service.name.clone(), parent_info);
// Process sub-services if any
if !service.sub_services.is_empty() {
let mut sub_list = Vec::new();
@@ -293,12 +331,14 @@ impl Widget for ServicesWidget {
let metrics: Vec<(String, f32, Option<String>)> = sub_service.metrics.iter()
.map(|m| (m.label.clone(), m.value, m.unit.clone()))
.collect();
let sub_info = ServiceInfo {
memory_mb: None, // Not used for sub-services
disk_gb: None, // Not used for sub-services
metrics,
widget_status: sub_service.service_status,
service_type: sub_service.service_type.clone(),
memory_bytes: None, // Sub-services don't have individual metrics yet
restart_count: None,
uptime_seconds: None,
};
sub_list.push((sub_service.name.clone(), sub_info));
}
@@ -338,22 +378,16 @@ impl ServicesWidget {
self.parent_services
.entry(parent_service)
.or_insert(ServiceInfo {
memory_mb: None,
disk_gb: None,
metrics: Vec::new(),
widget_status: Status::Unknown,
service_type: String::new(),
memory_bytes: None,
restart_count: None,
uptime_seconds: None,
});
if metric.name.ends_with("_status") {
service_info.widget_status = metric.status;
} else if metric.name.ends_with("_memory_mb") {
if let Some(memory) = metric.value.as_f32() {
service_info.memory_mb = Some(memory);
}
} else if metric.name.ends_with("_disk_gb") {
if let Some(disk) = metric.value.as_f32() {
service_info.disk_gb = Some(disk);
}
}
}
Some(sub_name) => {
@@ -373,10 +407,12 @@ impl ServicesWidget {
sub_service_list.push((
sub_name.clone(),
ServiceInfo {
memory_mb: None,
disk_gb: None,
metrics: Vec::new(),
widget_status: Status::Unknown,
service_type: String::new(), // Unknown type in legacy path
memory_bytes: None,
restart_count: None,
uptime_seconds: None,
},
));
&mut sub_service_list.last_mut().unwrap().1
@@ -384,14 +420,6 @@ impl ServicesWidget {
if metric.name.ends_with("_status") {
sub_service_info.widget_status = metric.status;
} else if metric.name.ends_with("_memory_mb") {
if let Some(memory) = metric.value.as_f32() {
sub_service_info.memory_mb = Some(memory);
}
} else if metric.name.ends_with("_disk_gb") {
if let Some(disk) = metric.value.as_f32() {
sub_service_info.disk_gb = Some(disk);
}
}
}
}
@@ -450,8 +478,8 @@ impl ServicesWidget {
// Header
let header = format!(
"{:<25} {:<10} {:<8} {:<8}",
"Service:", "Status:", "RAM:", "Disk:"
"{:<25} {:<10} {:<8} {:<8} {:<5}",
"Service:", "Status:", "RAM:", "Uptime:", ":"
);
let header_para = Paragraph::new(header).style(Typography::muted());
frame.render_widget(header_para, content_chunks[0]);

View File

@@ -15,6 +15,10 @@ pub struct SystemWidget {
nixos_build: Option<String>,
agent_hash: Option<String>,
// ZMQ communication stats
zmq_packets_received: Option<u64>,
zmq_last_packet_age: Option<f64>,
// Network interfaces
network_interfaces: Vec<cm_dashboard_shared::NetworkInterfaceData>,
@@ -22,7 +26,7 @@ pub struct SystemWidget {
cpu_load_1min: Option<f32>,
cpu_load_5min: Option<f32>,
cpu_load_15min: Option<f32>,
cpu_frequency: Option<f32>,
cpu_cstates: Vec<cm_dashboard_shared::CStateInfo>,
cpu_status: Status,
// Memory metrics
@@ -92,11 +96,13 @@ impl SystemWidget {
Self {
nixos_build: None,
agent_hash: None,
zmq_packets_received: None,
zmq_last_packet_age: None,
network_interfaces: Vec::new(),
cpu_load_1min: None,
cpu_load_5min: None,
cpu_load_15min: None,
cpu_frequency: None,
cpu_cstates: Vec::new(),
cpu_status: Status::Unknown,
memory_usage_percent: None,
memory_used_gb: None,
@@ -131,12 +137,19 @@ impl SystemWidget {
}
}
/// Format CPU frequency
fn format_cpu_frequency(&self) -> String {
match self.cpu_frequency {
Some(freq) => format!("{:.0} MHz", freq),
None => "— MHz".to_string(),
/// Format CPU C-states (idle depth) with percentages
fn format_cpu_cstate(&self) -> String {
if self.cpu_cstates.is_empty() {
return "".to_string();
}
// Format top 3 C-states with percentages: "C10:79% C8:10% C6:8%"
// Agent already sends clean names (C3, C10, etc.)
self.cpu_cstates
.iter()
.map(|cs| format!("{}:{:.0}%", cs.name, cs.percent))
.collect::<Vec<_>>()
.join(" ")
}
/// Format memory usage
@@ -154,6 +167,12 @@ impl SystemWidget {
pub fn _get_agent_hash(&self) -> Option<&String> {
self.agent_hash.as_ref()
}
/// Update ZMQ communication statistics
pub fn update_zmq_stats(&mut self, packets_received: u64, last_packet_age_secs: f64) {
self.zmq_packets_received = Some(packets_received);
self.zmq_last_packet_age = Some(last_packet_age_secs);
}
}
use super::Widget;
@@ -176,7 +195,7 @@ impl Widget for SystemWidget {
self.cpu_load_1min = Some(cpu.load_1min);
self.cpu_load_5min = Some(cpu.load_5min);
self.cpu_load_15min = Some(cpu.load_15min);
self.cpu_frequency = Some(cpu.frequency_mhz);
self.cpu_cstates = cpu.cstates.clone();
self.cpu_status = Status::Ok;
// Extract memory data directly
@@ -796,6 +815,18 @@ impl SystemWidget {
Span::styled(format!("Agent: {}", agent_version_text), Typography::secondary())
]));
// ZMQ communication stats
if let (Some(packets), Some(age)) = (self.zmq_packets_received, self.zmq_last_packet_age) {
let age_text = if age < 1.0 {
format!("{:.0}ms ago", age * 1000.0)
} else {
format!("{:.1}s ago", age)
};
lines.push(Line::from(vec![
Span::styled(format!("ZMQ: {} pkts, last {}", packets, age_text), Typography::secondary())
]));
}
// CPU section
lines.push(Line::from(vec![
Span::styled("CPU:", Typography::widget_title())
@@ -808,10 +839,10 @@ impl SystemWidget {
);
lines.push(Line::from(cpu_spans));
let freq_text = self.format_cpu_frequency();
let cstate_text = self.format_cpu_cstate();
lines.push(Line::from(vec![
Span::styled(" └─ ", Typography::tree()),
Span::styled(format!("Freq: {}", freq_text), Typography::secondary())
Span::styled(format!("C-state: {}", cstate_text), Typography::secondary())
]));
// RAM section

View File

@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-shared"
version = "0.1.182"
version = "0.1.212"
edition = "2021"
[dependencies]

View File

@@ -40,13 +40,20 @@ pub struct NetworkInterfaceData {
pub vlan_id: Option<u16>,
}
/// CPU C-state usage information
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CStateInfo {
pub name: String,
pub percent: f32,
}
/// CPU monitoring data
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CpuData {
pub load_1min: f32,
pub load_5min: f32,
pub load_15min: f32,
pub frequency_mhz: f32,
pub cstates: Vec<CStateInfo>, // C-state usage percentages (C1, C6, C10, etc.) - indicates CPU idle depth distribution
pub temperature_celsius: Option<f32>,
pub load_status: Status,
pub temperature_status: Status,
@@ -136,11 +143,15 @@ pub struct PoolDriveData {
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServiceData {
pub name: String,
pub memory_mb: f32,
pub disk_gb: f32,
pub user_stopped: bool,
pub service_status: Status,
pub sub_services: Vec<SubServiceData>,
/// Memory usage in bytes (from MemoryCurrent)
pub memory_bytes: Option<u64>,
/// Number of service restarts (from NRestarts)
pub restart_count: Option<u32>,
/// Uptime in seconds (calculated from ExecMainStartTimestamp)
pub uptime_seconds: Option<u64>,
}
/// Sub-service data (nginx sites, docker containers, etc.)
@@ -149,6 +160,9 @@ pub struct SubServiceData {
pub name: String,
pub service_status: Status,
pub metrics: Vec<SubServiceMetric>,
/// Type of sub-service: "nginx_site", "container", "image"
#[serde(default)]
pub service_type: String,
}
/// Individual metric for a sub-service
@@ -197,7 +211,7 @@ impl AgentData {
load_1min: 0.0,
load_5min: 0.0,
load_15min: 0.0,
frequency_mhz: 0.0,
cstates: Vec::new(),
temperature_celsius: None,
load_status: Status::Unknown,
temperature_status: Status::Unknown,