Compare commits
10 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 8bfd416327 | |
| | 85c6c624fb | |
| | eab3f17428 | |
| | 7ad149bbe4 | |
| | b444c88ea0 | |
| | 317cf76bd1 | |
| | 0db1a165b9 | |
| | 3c2955376d | |
| | f09ccabc7f | |
| | 43dd5a901a | |
CLAUDE.md (80 changed lines)
@@ -156,86 +156,6 @@ Complete migration from string-based metrics to structured JSON data. Eliminates

- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated

### Cached Collector Architecture (✅ IMPLEMENTED)

**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.

**Previous (Sequential Blocking):**

```
Every 1 second:
└─ collect_all_data() [BLOCKS for 2-10+ seconds]
   ├─ CPU (fast: 10ms)
   ├─ Memory (fast: 20ms)
   ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
   ├─ Service disk usage (slow: 2-8s per service)
   └─ Docker (medium: 500ms)
└─ send_via_zmq() [Only after ALL collection completes]

Result: If any collector takes >10s → "host offline" false alert
```

**New (Cached Independent Collectors):**

```
Shared Cache: Arc<RwLock<AgentData>>

Background Collectors (independent async tasks):
├─ Fast collectors (CPU, RAM, Network)
│  └─ Update cache every 1 second
├─ Medium collectors (Services, Docker)
│  └─ Update cache every 5 seconds
└─ Slow collectors (Disk usage, SMART data)
   └─ Update cache every 60 seconds

ZMQ Sender (separate async task):
Every 1 second:
└─ Read current cache
   └─ Send via ZMQ [Always instant, never blocked]
```

**Benefits:**
- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking

**Implementation Details:**
- **Shared cache**: `Arc<RwLock<AgentData>>` initialized at agent startup
- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
  - Recommended fast (1-10s): CPU, Memory, Network
  - Recommended medium (30-60s): Backup, NixOS
  - Recommended slow (60-300s): Disk, Systemd
- **Independent tasks**: Each collector spawned as a separate tokio task in `Agent::new()`
- **Cache updates**: Collectors acquire write lock → update → release immediately
- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
- **Notification check**: Runs every `notifications.check_interval_seconds`
- **Lock strategy**: Short-lived write locks prevent blocking; read locks for transmission
- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)
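To make the locking pattern above concrete, here is a minimal, self-contained sketch of the cached-collector idea: one background task per collector updates a shared `Arc<RwLock<...>>` on its own interval, and a separate sender loop reads a snapshot on a fixed schedule. The `AgentData` and `Collector` types below are simplified stand-ins, not the project's real definitions.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::interval;

// Simplified stand-ins for the real AgentData and Collector types.
#[derive(Default, Clone)]
struct AgentData {
    cpu_percent: f32,
}

trait Collector: Send + Sync {
    fn collect(&self, data: &mut AgentData);
}

struct CpuCollector;
impl Collector for CpuCollector {
    fn collect(&self, data: &mut AgentData) {
        data.cpu_percent = 42.0; // placeholder measurement
    }
}

// One background task per collector: tick, take a short write lock, update, release.
fn spawn_collector(cache: Arc<RwLock<AgentData>>, collector: Box<dyn Collector>, every: Duration) {
    tokio::spawn(async move {
        let mut tick = interval(every);
        loop {
            tick.tick().await;
            let mut data = cache.write().await; // short-lived write lock
            collector.collect(&mut data);
            // guard dropped at end of iteration, releasing the lock
        }
    });
}

#[tokio::main]
async fn main() {
    let cache = Arc::new(RwLock::new(AgentData::default()));
    spawn_collector(cache.clone(), Box::new(CpuCollector), Duration::from_secs(1));

    // Sender loop: read a snapshot on a fixed interval, independent of collector speed.
    let mut tick = interval(Duration::from_secs(2));
    loop {
        tick.tick().await;
        let snapshot = cache.read().await.clone();
        println!("would send over ZMQ: cpu={}%", snapshot.cpu_percent);
    }
}
```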
**Configuration (NixOS):**
All intervals and timeouts configurable in `services/cm-dashboard.nix`:

Collection Intervals:
- `collectors.cpu.interval_seconds` (default: 10s)
- `collectors.memory.interval_seconds` (default: 2s)
- `collectors.disk.interval_seconds` (default: 300s)
- `collectors.systemd.interval_seconds` (default: 10s)
- `collectors.backup.interval_seconds` (default: 60s)
- `collectors.network.interval_seconds` (default: 10s)
- `collectors.nixos.interval_seconds` (default: 60s)
- `notifications.check_interval_seconds` (default: 30s)
- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)

Command Timeouts (prevent resource leaks from hung commands):
- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr
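These `command_timeout_seconds` values are passed to a helper such as `run_command_with_timeout` (referenced in the collector code later in this diff). A minimal sketch of how such a helper could be written with tokio is shown below; the exact signature and error handling in the real crate may differ.

```rust
use std::process::Output;
use std::time::Duration;
use tokio::process::Command;
use tokio::time::timeout;

/// Run a command, returning an error if it does not finish within `timeout_secs`.
async fn run_command_with_timeout(mut cmd: Command, timeout_secs: u64) -> anyhow::Result<Output> {
    cmd.kill_on_drop(true); // make sure the child is reaped if we give up on it
    match timeout(Duration::from_secs(timeout_secs), cmd.output()).await {
        Ok(result) => Ok(result?), // command finished (possibly with a non-zero status)
        Err(_) => anyhow::bail!("command timed out after {timeout_secs}s"),
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let mut cmd = Command::new("lsblk");
    cmd.args(["-rn", "-o", "NAME,MOUNTPOINT"]);
    let output = run_command_with_timeout(cmd, 30).await?;
    println!("{}", String::from_utf8_lossy(&output.stdout));
    Ok(())
}
```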
**Code Locations:**
- agent/src/agent.rs:59-133 - Collector task spawning
- agent/src/agent.rs:151-179 - Independent collector task runner
- agent/src/agent.rs:199-207 - ZMQ sender in main loop

### Maintenance Mode

- Agent checks for `/tmp/cm-maintenance` file before sending notifications
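The maintenance-mode gate is a simple file-existence check made before notifications go out. A minimal sketch, assuming the path comes from the notification config (the default is `/tmp/cm-maintenance`, per `default_maintenance_mode_file` later in this diff); the function name here is illustrative only.

```rust
use std::path::Path;

/// Returns true when the host is in maintenance mode and notifications should be suppressed.
/// The path would normally come from the notification config.
fn maintenance_mode_active(maintenance_file: &str) -> bool {
    Path::new(maintenance_file).exists()
}

fn main() {
    if maintenance_mode_active("/tmp/cm-maintenance") {
        println!("maintenance mode: skipping notification");
    } else {
        println!("would send notification");
    }
}
```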
Cargo.lock (generated, 6 changed lines)
@@ -279,7 +279,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"

[[package]]
name = "cm-dashboard"
version = "0.1.194"
version = "0.1.191"
dependencies = [
 "anyhow",
 "chrono",

@@ -301,7 +301,7 @@ dependencies = [

[[package]]
name = "cm-dashboard-agent"
version = "0.1.194"
version = "0.1.191"
dependencies = [
 "anyhow",
 "async-trait",

@@ -324,7 +324,7 @@ dependencies = [

[[package]]
name = "cm-dashboard-shared"
version = "0.1.194"
version = "0.1.191"
dependencies = [
 "chrono",
 "serde",
@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-agent"
version = "0.1.195"
version = "0.1.192"
edition = "2021"

[dependencies]
@@ -1,14 +1,13 @@
use anyhow::Result;
use gethostname::gethostname;
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::interval;
use tracing::{debug, error, info};

use crate::communication::{AgentCommand, ZmqHandler};
use crate::config::AgentConfig;
use crate::collectors::{
    Collector,
    backup::BackupCollector,
    cpu::CpuCollector,
    disk::DiskCollector,

@@ -24,7 +23,7 @@ pub struct Agent {
    hostname: String,
    config: AgentConfig,
    zmq_handler: ZmqHandler,
    cache: Arc<RwLock<AgentData>>,
    collectors: Vec<Box<dyn Collector>>,
    notification_manager: NotificationManager,
    previous_status: Option<SystemStatus>,
}

@@ -56,94 +55,39 @@ impl Agent {
            config.zmq.publisher_port
        );

        // Initialize shared cache
        let cache = Arc::new(RwLock::new(AgentData::new(
            hostname.clone(),
            env!("CARGO_PKG_VERSION").to_string()
        )));
        info!("Initialized shared agent data cache");

        // Spawn independent collector tasks
        let mut collector_count = 0;

        // CPU collector
        // Initialize collectors
        let mut collectors: Vec<Box<dyn Collector>> = Vec::new();

        // Add enabled collectors
        if config.collectors.cpu.enabled {
            let cache_clone = cache.clone();
            let collector = CpuCollector::new(config.collectors.cpu.clone());
            let interval = config.collectors.cpu.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "CPU").await;
            });
            collector_count += 1;
            collectors.push(Box::new(CpuCollector::new(config.collectors.cpu.clone())));
        }

        // Memory collector
        if config.collectors.memory.enabled {
            let cache_clone = cache.clone();
            let collector = MemoryCollector::new(config.collectors.memory.clone());
            let interval = config.collectors.memory.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Memory").await;
            });
            collector_count += 1;
            collectors.push(Box::new(MemoryCollector::new(config.collectors.memory.clone())));
        }

        // Network collector
        if config.collectors.network.enabled {
            let cache_clone = cache.clone();
            let collector = NetworkCollector::new(config.collectors.network.clone());
            let interval = config.collectors.network.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Network").await;
            });
            collector_count += 1;
        }

        // Backup collector
        if config.collectors.backup.enabled {
            let cache_clone = cache.clone();
            let collector = BackupCollector::new();
            let interval = config.collectors.backup.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Backup").await;
            });
            collector_count += 1;
        }

        // NixOS collector
        if config.collectors.nixos.enabled {
            let cache_clone = cache.clone();
            let collector = NixOSCollector::new(config.collectors.nixos.clone());
            let interval = config.collectors.nixos.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "NixOS").await;
            });
            collector_count += 1;
        }

        // Disk collector
        if config.collectors.disk.enabled {
            let cache_clone = cache.clone();
            let collector = DiskCollector::new(config.collectors.disk.clone());
            let interval = config.collectors.disk.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Disk").await;
            });
            collector_count += 1;
            collectors.push(Box::new(DiskCollector::new(config.collectors.disk.clone())));
        }

        // Systemd collector
        if config.collectors.systemd.enabled {
            let cache_clone = cache.clone();
            let collector = SystemdCollector::new(config.collectors.systemd.clone());
            let interval = config.collectors.systemd.interval_seconds;
            tokio::spawn(async move {
                Self::run_collector_task(cache_clone, collector, Duration::from_secs(interval), "Systemd").await;
            });
            collector_count += 1;
            collectors.push(Box::new(SystemdCollector::new(config.collectors.systemd.clone())));
        }

        if config.collectors.backup.enabled {
            collectors.push(Box::new(BackupCollector::new()));
        }

        info!("Spawned {} independent collector tasks", collector_count);
        if config.collectors.network.enabled {
            collectors.push(Box::new(NetworkCollector::new(config.collectors.network.clone())));
        }

        if config.collectors.nixos.enabled {
            collectors.push(Box::new(NixOSCollector::new(config.collectors.nixos.clone())));
        }

        info!("Initialized {} collectors", collectors.len());

        // Initialize notification manager
        let notification_manager = NotificationManager::new(&config.notifications, &hostname)?;

@@ -153,121 +97,45 @@ impl Agent {
            hostname,
            config,
            zmq_handler,
            cache,
            collectors,
            notification_manager,
            previous_status: None,
        })
    }

    /// Independent collector task runner
    async fn run_collector_task<C>(
        cache: Arc<RwLock<AgentData>>,
        collector: C,
        interval_duration: Duration,
        name: &str,
    ) where
        C: crate::collectors::Collector + Send + 'static,
    {
        let mut interval_timer = interval(interval_duration);
        info!("{} collector task started (interval: {:?})", name, interval_duration);

        loop {
            interval_timer.tick().await;

            // Acquire write lock and update cache
            {
                let mut agent_data = cache.write().await;
                match collector.collect_structured(&mut *agent_data).await {
                    Ok(_) => {
                        debug!("{} collector updated cache", name);
                    }
                    Err(e) => {
                        error!("{} collector failed: {}", name, e);
                    }
                }
            } // Release lock immediately after collection
        }
    }

    /// Main agent loop with cached data architecture
    /// Main agent loop with structured data collection
    pub async fn run(&mut self, mut shutdown_rx: tokio::sync::oneshot::Receiver<()>) -> Result<()> {
        info!("Starting agent main loop with cached collector architecture");
        info!("Starting agent main loop");

        // Spawn independent ZMQ sender task
        // Create dedicated ZMQ publisher for the sender task
        let cache_clone = self.cache.clone();
        let publisher_config = self.config.zmq.clone();
        let transmission_interval_secs = self.config.collection_interval_seconds;
        // Initial collection
        if let Err(e) = self.collect_and_broadcast().await {
            error!("Initial metric collection failed: {}", e);
        }

        std::thread::spawn(move || {
            // Create ZMQ publisher in this thread (ZMQ sockets are not thread-safe)
            let context = zmq::Context::new();
            let publisher = context.socket(zmq::SocketType::PUB).unwrap();
            let bind_address = format!("tcp://{}:{}", publisher_config.bind_address, publisher_config.publisher_port);
            publisher.bind(&bind_address).unwrap();
            publisher.set_sndhwm(1000).unwrap();
            publisher.set_linger(1000).unwrap();
            info!("ZMQ sender task started on {} (interval: {}s)", bind_address, transmission_interval_secs);

            let mut last_sent_data: Option<AgentData> = None;
            let interval_duration = std::time::Duration::from_secs(transmission_interval_secs);
            let mut next_send = std::time::Instant::now() + interval_duration;

            loop {
                // Sleep until next send time
                std::thread::sleep(next_send.saturating_duration_since(std::time::Instant::now()));
                next_send = std::time::Instant::now() + interval_duration;

                // Try to read cache without blocking - if locked, send last known data
                let data_to_send = match cache_clone.try_read() {
                    Ok(agent_data) => {
                        let data_clone = agent_data.clone();
                        drop(agent_data); // Release lock immediately
                        last_sent_data = Some(data_clone.clone());
                        Some(data_clone)
                    }
                    Err(_) => {
                        // Lock is held by collector - use last sent data
                        debug!("Cache locked by collector, sending previous data");
                        last_sent_data.clone()
                    }
                };

                if let Some(data) = data_to_send {
                    // Publish via ZMQ
                    if let Ok(envelope) = cm_dashboard_shared::MessageEnvelope::agent_data(data) {
                        if let Ok(serialized) = serde_json::to_vec(&envelope) {
                            if let Err(e) = publisher.send(&serialized, 0) {
                                error!("Failed to send ZMQ message: {}", e);
                            } else {
                                debug!("Successfully broadcast agent data");
                            }
                        }
                    }
                }
            }
        });

        // Set up intervals for notifications and commands
        let mut notification_interval = interval(Duration::from_secs(
            self.config.notifications.check_interval_seconds,
        // Set up intervals
        let mut transmission_interval = interval(Duration::from_secs(
            self.config.collection_interval_seconds,
        ));
        let mut command_interval = interval(Duration::from_millis(100));
        let mut notification_interval = interval(Duration::from_secs(30)); // Check notifications every 30s

        // Skip initial ticks
        // Skip initial ticks to avoid immediate execution
        transmission_interval.tick().await;
        notification_interval.tick().await;
        command_interval.tick().await;

        loop {
            tokio::select! {
                _ = notification_interval.tick() => {
                    // Read cache and check for status changes
                    let agent_data = self.cache.read().await.clone();
                    if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
                        error!("Failed to check status changes: {}", e);
                _ = transmission_interval.tick() => {
                    if let Err(e) = self.collect_and_broadcast().await {
                        error!("Failed to collect and broadcast metrics: {}", e);
                    }
                }
                _ = command_interval.tick() => {
                _ = notification_interval.tick() => {
                    // Process any pending notifications
                    // NOTE: With structured data, we might need to implement status tracking differently
                    // For now, we skip this until status evaluation is migrated
                }
                // Handle incoming commands (check periodically)
                _ = tokio::time::sleep(Duration::from_millis(100)) => {
                    if let Err(e) = self.handle_commands().await {
                        error!("Error handling commands: {}", e);
                    }

@@ -283,6 +151,35 @@ impl Agent {
        Ok(())
    }

    /// Collect structured data from all collectors and broadcast via ZMQ
    async fn collect_and_broadcast(&mut self) -> Result<()> {
        debug!("Starting structured data collection");

        // Initialize empty AgentData
        let mut agent_data = AgentData::new(self.hostname.clone(), env!("CARGO_PKG_VERSION").to_string());

        // Collect data from all collectors
        for collector in &self.collectors {
            if let Err(e) = collector.collect_structured(&mut agent_data).await {
                error!("Collector failed: {}", e);
                // Continue with other collectors even if one fails
            }
        }

        // Check for status changes and send notifications
        if let Err(e) = self.check_status_changes_and_notify(&agent_data).await {
            error!("Failed to check status changes: {}", e);
        }

        // Broadcast the structured data via ZMQ
        if let Err(e) = self.zmq_handler.publish_agent_data(&agent_data).await {
            error!("Failed to broadcast agent data: {}", e);
        } else {
            debug!("Successfully broadcast structured agent data");
        }

        Ok(())
    }

    /// Check for status changes and send notifications
    async fn check_status_changes_and_notify(&mut self, agent_data: &AgentData) -> Result<()> {

@@ -370,10 +267,10 @@ impl Agent {

        match command {
            AgentCommand::CollectNow => {
                info!("Received immediate transmission request");
                // With cached architecture and dedicated ZMQ sender thread,
                // data is already being sent every interval
                // This command is acknowledged but not actionable in new architecture
                info!("Received immediate collection request");
                if let Err(e) = self.collect_and_broadcast().await {
                    error!("Failed to collect on demand: {}", e);
                }
            }
            AgentCommand::SetInterval { seconds } => {
                info!("Received interval change request: {}s", seconds);
@@ -117,7 +117,7 @@ impl DiskCollector {
        let mut cmd = Command::new("lsblk");
        cmd.args(&["-rn", "-o", "NAME,MOUNTPOINT"]);

        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await
        let output = run_command_with_timeout(cmd, 2).await
            .map_err(|e| CollectorError::SystemRead {
                path: "block devices".to_string(),
                error: e.to_string(),

@@ -530,9 +530,6 @@ impl DiskCollector {

    /// Populate drives data into AgentData
    fn populate_drives_data(&self, physical_drives: &[PhysicalDrive], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
        // Clear existing drives data to prevent duplicates in cached architecture
        agent_data.system.storage.drives.clear();

        for drive in physical_drives {
            let smart = smart_data.get(&drive.name);

@@ -570,9 +567,6 @@ impl DiskCollector {

    /// Populate pools data into AgentData
    fn populate_pools_data(&self, mergerfs_pools: &[MergerfsPool], smart_data: &HashMap<String, SmartData>, agent_data: &mut AgentData) -> Result<(), CollectorError> {
        // Clear existing pools data to prevent duplicates in cached architecture
        agent_data.system.storage.pools.clear();

        for pool in mergerfs_pools {
            // Calculate pool health and statuses based on member drive health
            let (pool_health, health_status, usage_status, data_drive_data, parity_drive_data) = self.calculate_pool_health(pool, smart_data);
@@ -97,12 +97,9 @@ impl MemoryCollector {

    /// Populate tmpfs data into AgentData
    async fn populate_tmpfs_data(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
        // Clear existing tmpfs data to prevent duplicates in cached architecture
        agent_data.system.memory.tmpfs.clear();

        // Discover all tmpfs mount points
        let tmpfs_mounts = self.discover_tmpfs_mounts()?;

        if tmpfs_mounts.is_empty() {
            debug!("No tmpfs mounts found to monitor");
            return Ok(());
@@ -8,12 +8,12 @@ use crate::config::NetworkConfig;

/// Network interface collector with physical/virtual classification and link status
pub struct NetworkCollector {
    config: NetworkConfig,
    _config: NetworkConfig,
}

impl NetworkCollector {
    pub fn new(config: NetworkConfig) -> Self {
        Self { config }
        Self { _config: config }
    }

    /// Check if interface is physical (not virtual)

@@ -50,9 +50,8 @@ impl NetworkCollector {
    }

    /// Get the primary physical interface (the one with default route)
    fn get_primary_physical_interface(&self) -> Option<String> {
        let timeout_str = self.config.command_timeout_seconds.to_string();
        match Command::new("timeout").args([&timeout_str, "ip", "route", "show", "default"]).output() {
    fn get_primary_physical_interface() -> Option<String> {
        match Command::new("timeout").args(["2", "ip", "route", "show", "default"]).output() {
            Ok(output) if output.status.success() => {
                let output_str = String::from_utf8_lossy(&output.stdout);
                // Parse: "default via 192.168.1.1 dev eno1 ..."

@@ -111,8 +110,7 @@ impl NetworkCollector {
        // Parse VLAN configuration
        let vlan_map = Self::parse_vlan_config();

        let timeout_str = self.config.command_timeout_seconds.to_string();
        match Command::new("timeout").args([&timeout_str, "ip", "-j", "addr"]).output() {
        match Command::new("timeout").args(["2", "ip", "-j", "addr"]).output() {
            Ok(output) if output.status.success() => {
                let json_str = String::from_utf8_lossy(&output.stdout);

@@ -197,7 +195,7 @@ impl NetworkCollector {
        }

        // Assign primary physical interface as parent to virtual interfaces without explicit parent
        let primary_interface = self.get_primary_physical_interface();
        let primary_interface = Self::get_primary_physical_interface();
        if let Some(primary) = primary_interface {
            for interface in interfaces.iter_mut() {
                // Only assign parent to virtual interfaces that don't already have one
@@ -254,19 +254,18 @@ impl SystemdCollector {

    /// Auto-discover interesting services to monitor
    fn discover_services_internal(&self) -> Result<(Vec<String>, std::collections::HashMap<String, ServiceStatusInfo>)> {
        // First: Get all service unit files
        let timeout_str = self.config.command_timeout_seconds.to_string();
        // First: Get all service unit files (with 3 second timeout)
        let unit_files_output = Command::new("timeout")
            .args(&[&timeout_str, "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
            .args(&["3", "systemctl", "list-unit-files", "--type=service", "--no-pager", "--plain"])
            .output()?;

        if !unit_files_output.status.success() {
            return Err(anyhow::anyhow!("systemctl list-unit-files command failed"));
        }

        // Second: Get runtime status of all units
        // Second: Get runtime status of all units (with 3 second timeout)
        let units_status_output = Command::new("timeout")
            .args(&[&timeout_str, "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
            .args(&["3", "systemctl", "list-units", "--type=service", "--all", "--no-pager", "--plain"])
            .output()?;

        if !units_status_output.status.success() {

@@ -362,17 +361,16 @@ impl SystemdCollector {
            }
        }

        // Fallback to systemctl if not in cache
        let timeout_str = self.config.command_timeout_seconds.to_string();
        // Fallback to systemctl if not in cache (with 2 second timeout)
        let output = Command::new("timeout")
            .args(&[&timeout_str, "systemctl", "is-active", &format!("{}.service", service)])
            .args(&["2", "systemctl", "is-active", &format!("{}.service", service)])
            .output()?;

        let active_status = String::from_utf8(output.stdout)?.trim().to_string();

        // Get more detailed info
        // Get more detailed info (with 2 second timeout)
        let output = Command::new("timeout")
            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
            .args(&["2", "systemctl", "show", &format!("{}.service", service), "--property=LoadState,ActiveState,SubState"])
            .output()?;

        let detailed_info = String::from_utf8(output.stdout)?;

@@ -432,10 +430,9 @@ impl SystemdCollector {
            return Ok(0.0);
        }

        // No configured path - try to get WorkingDirectory from systemctl
        let timeout_str = self.config.command_timeout_seconds.to_string();
        // No configured path - try to get WorkingDirectory from systemctl (with 2 second timeout)
        let output = Command::new("timeout")
            .args(&[&timeout_str, "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
            .args(&["2", "systemctl", "show", &format!("{}.service", service_name), "--property=WorkingDirectory"])
            .output()
            .map_err(|e| CollectorError::SystemRead {
                path: format!("WorkingDirectory for {}", service_name),

@@ -455,15 +452,15 @@ impl SystemdCollector {
        Ok(0.0)
    }

    /// Get size of a directory in GB
    /// Get size of a directory in GB (with 2 second timeout)
    async fn get_directory_size(&self, path: &str) -> Option<f32> {
        use super::run_command_with_timeout;

        // Use -s (summary) and --apparent-size for speed
        // Use -s (summary) and --apparent-size for speed, 2 second timeout
        let mut cmd = Command::new("sudo");
        cmd.args(&["du", "-s", "--apparent-size", "--block-size=1", path]);

        let output = run_command_with_timeout(cmd, self.config.command_timeout_seconds).await.ok()?;
        let output = run_command_with_timeout(cmd, 2).await.ok()?;

        if !output.status.success() {
            // Log permission errors for debugging but don't spam logs

@@ -789,10 +786,9 @@ impl SystemdCollector {
        let mut containers = Vec::new();

        // Check if docker is available (cm-agent user is in docker group)
        // Use -a to show ALL containers (running and stopped)
        let timeout_str = self.config.command_timeout_seconds.to_string();
        // Use -a to show ALL containers (running and stopped) with 3 second timeout
        let output = Command::new("timeout")
            .args(&[&timeout_str, "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
            .args(&["3", "docker", "ps", "-a", "--format", "{{.Names}},{{.Status}}"])
            .output();

        let output = match output {

@@ -833,10 +829,9 @@ impl SystemdCollector {
    /// Get docker images as sub-services
    fn get_docker_images(&self) -> Vec<(String, String, f32)> {
        let mut images = Vec::new();
        // Check if docker is available (cm-agent user is in docker group)
        let timeout_str = self.config.command_timeout_seconds.to_string();
        // Check if docker is available (cm-agent user is in docker group) with 3 second timeout
        let output = Command::new("timeout")
            .args(&[&timeout_str, "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
            .args(&["3", "docker", "images", "--format", "{{.Repository}}:{{.Tag}},{{.Size}}"])
            .output();

        let output = match output {

@@ -915,9 +910,6 @@ impl SystemdCollector {
#[async_trait]
impl Collector for SystemdCollector {
    async fn collect_structured(&self, agent_data: &mut AgentData) -> Result<(), CollectorError> {
        // Clear existing services data to prevent duplicates in cached architecture
        agent_data.services.clear();

        // Use cached complete data if available and fresh
        if let Some(cached_complete_services) = self.get_cached_complete_services() {
            for service_data in cached_complete_services {
@@ -1,12 +1,13 @@
use anyhow::Result;
use cm_dashboard_shared::{AgentData, MessageEnvelope};
use tracing::{debug, info};
use zmq::{Context, Socket, SocketType};

use crate::config::ZmqConfig;

/// ZMQ communication handler for receiving commands
/// NOTE: Publishing is handled by dedicated thread in Agent::run()
/// ZMQ communication handler for publishing metrics and receiving commands
pub struct ZmqHandler {
    publisher: Socket,
    command_receiver: Socket,
}

@@ -14,6 +15,17 @@ impl ZmqHandler {
    pub async fn new(config: &ZmqConfig) -> Result<Self> {
        let context = Context::new();

        // Create publisher socket for metrics
        let publisher = context.socket(SocketType::PUB)?;
        let pub_bind_address = format!("tcp://{}:{}", config.bind_address, config.publisher_port);
        publisher.bind(&pub_bind_address)?;

        info!("ZMQ publisher bound to {}", pub_bind_address);

        // Set socket options for efficiency
        publisher.set_sndhwm(1000)?; // High water mark for outbound messages
        publisher.set_linger(1000)?; // Linger time on close

        // Create command receiver socket (PULL socket to receive commands from dashboard)
        let command_receiver = context.socket(SocketType::PULL)?;
        let cmd_bind_address = format!("tcp://{}:{}", config.bind_address, config.command_port);

@@ -26,10 +38,33 @@
        command_receiver.set_linger(1000)?;

        Ok(Self {
            publisher,
            command_receiver,
        })
    }

    /// Publish agent data via ZMQ
    pub async fn publish_agent_data(&self, data: &AgentData) -> Result<()> {
        debug!(
            "Publishing agent data for host {}",
            data.hostname
        );

        // Create message envelope for agent data
        let envelope = MessageEnvelope::agent_data(data.clone())
            .map_err(|e| anyhow::anyhow!("Failed to create agent data envelope: {}", e))?;

        // Serialize envelope
        let serialized = serde_json::to_vec(&envelope)?;

        // Send via ZMQ
        self.publisher.send(&serialized, 0)?;

        debug!("Published agent data message ({} bytes)", serialized.len());
        Ok(())
    }

    /// Try to receive a command (non-blocking)
    pub fn try_receive_command(&self) -> Result<Option<AgentCommand>> {
        match self.command_receiver.recv_bytes(zmq::DONTWAIT) {
@@ -79,9 +79,6 @@ pub struct DiskConfig {
    pub temperature_critical_celsius: f32,
    pub wear_warning_percent: f32,
    pub wear_critical_percent: f32,
    /// Command timeout in seconds for lsblk, smartctl, etc.
    #[serde(default = "default_disk_command_timeout")]
    pub command_timeout_seconds: u64,
}

/// Filesystem configuration entry

@@ -111,9 +108,6 @@ pub struct SystemdConfig {
    pub http_timeout_seconds: u64,
    pub http_connect_timeout_seconds: u64,
    pub nginx_latency_critical_ms: f32,
    /// Command timeout in seconds for systemctl, docker, du commands
    #[serde(default = "default_systemd_command_timeout")]
    pub command_timeout_seconds: u64,
}

@@ -138,9 +132,6 @@ pub struct BackupConfig {
pub struct NetworkConfig {
    pub enabled: bool,
    pub interval_seconds: u64,
    /// Command timeout in seconds for ip route, ip addr commands
    #[serde(default = "default_network_command_timeout")]
    pub command_timeout_seconds: u64,
}

/// Notification configuration

@@ -154,9 +145,6 @@ pub struct NotificationConfig {
    pub rate_limit_minutes: u64,
    /// Email notification batching interval in seconds (default: 60)
    pub aggregation_interval_seconds: u64,
    /// Status check interval in seconds for detecting changes (default: 30)
    #[serde(default = "default_notification_check_interval")]
    pub check_interval_seconds: u64,
    /// List of metric names to exclude from email notifications
    #[serde(default)]
    pub exclude_email_metrics: Vec<String>,

@@ -170,26 +158,10 @@ fn default_heartbeat_interval_seconds() -> u64 {
    5
}

fn default_notification_check_interval() -> u64 {
    30
}

fn default_maintenance_mode_file() -> String {
    "/tmp/cm-maintenance".to_string()
}

fn default_disk_command_timeout() -> u64 {
    30
}

fn default_systemd_command_timeout() -> u64 {
    15
}

fn default_network_command_timeout() -> u64 {
    10
}

impl AgentConfig {
    pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
        loader::load_config(path)
@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard"
version = "0.1.195"
version = "0.1.192"
edition = "2021"

[dependencies]
@@ -1,6 +1,6 @@
[package]
name = "cm-dashboard-shared"
version = "0.1.195"
version = "0.1.192"
edition = "2021"

[dependencies]