Implement cached collector architecture with configurable timeouts
All checks were successful
Build and Release / build-and-release (push) Successful in 1m20s
Major architectural refactor to eliminate false "host offline" alerts:
- Replace sequential blocking collectors with independent async tasks
- Each collector runs at configurable interval and updates shared cache
- ZMQ sender reads cache every 1-2s regardless of collector speed
- Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s)

All intervals now configurable via NixOS config:
- collectors.*.interval_seconds (collection frequency per collector)
- collectors.*.command_timeout_seconds (timeout for shell commands)
- notifications.check_interval_seconds (status change detection rate)

Command timeouts increased from hardcoded 2-3s to configurable 10-30s:
- Disk collector: 30s (SMART operations, lsblk)
- Systemd collector: 15s (systemctl, docker, du commands)
- Network collector: 10s (ip route, ip addr)

Benefits:
- No false "offline" alerts when slow collectors take >10s
- Different update rates for different metric types
- Better resource management with longer timeouts
- Full NixOS configuration control

Bump version to v0.1.193
44
CLAUDE.md
@@ -156,7 +156,7 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
 - ✅ Backward compatibility via bridge conversion to existing UI widgets
 - ✅ All string parsing bugs eliminated
 
-### Cached Collector Architecture (🚧 PLANNED)
+### Cached Collector Architecture (✅ IMPLEMENTED)
 
 **Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.
 
@@ -199,12 +199,42 @@ Every 1 second:
 - ✅ System stays responsive even with slow operations
 - ✅ Slow collectors can use longer timeouts without blocking
 
-**Implementation:**
-- Shared `AgentData` cache wrapped in `Arc<RwLock<>>`
-- Each collector spawned as independent tokio task
-- Collectors update their section of cache at their own rate
-- ZMQ sender reads cache every 1s and transmits
-- Stale data acceptable for slow-changing metrics (disk usage, SMART)
+**Implementation Details:**
+- **Shared cache**: `Arc<RwLock<AgentData>>` initialized at agent startup
+- **Collector intervals**: Fully configurable via NixOS config (`interval_seconds` per collector)
+  - Recommended: Fast (1-10s): CPU, Memory, Network
+  - Recommended: Medium (30-60s): Backup, NixOS
+  - Recommended: Slow (60-300s): Disk, Systemd
+- **Independent tasks**: Each collector spawned as separate tokio task in `Agent::new()`
+- **Cache updates**: Collectors acquire write lock → update → release immediately
+- **ZMQ sender**: Main loop reads cache every `collection_interval_seconds` and broadcasts
+- **Notification check**: Runs every `notifications.check_interval_seconds`
+- **Lock strategy**: Short-lived write locks prevent blocking, read locks for transmission
+- **Stale data**: Acceptable for slow-changing metrics (SMART data, disk usage)
+
+**Configuration (NixOS):**
+All intervals and timeouts configurable in `services/cm-dashboard.nix`:
+
+Collection Intervals:
+- `collectors.cpu.interval_seconds` (default: 10s)
+- `collectors.memory.interval_seconds` (default: 2s)
+- `collectors.disk.interval_seconds` (default: 300s)
+- `collectors.systemd.interval_seconds` (default: 10s)
+- `collectors.backup.interval_seconds` (default: 60s)
+- `collectors.network.interval_seconds` (default: 10s)
+- `collectors.nixos.interval_seconds` (default: 60s)
+- `notifications.check_interval_seconds` (default: 30s)
+- `collection_interval_seconds` - ZMQ transmission rate (default: 2s)
+
+Command Timeouts (prevent resource leaks from hung commands):
+- `collectors.disk.command_timeout_seconds` (default: 30s) - lsblk, smartctl, etc.
+- `collectors.systemd.command_timeout_seconds` (default: 15s) - systemctl, docker, du
+- `collectors.network.command_timeout_seconds` (default: 10s) - ip route, ip addr
+
+**Code Locations:**
+- agent/src/agent.rs:59-133 - Collector task spawning
+- agent/src/agent.rs:151-179 - Independent collector task runner
+- agent/src/agent.rs:199-207 - ZMQ sender in main loop
 
 ### Maintenance Mode
 
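
The cached collector pattern described in the CLAUDE.md diff above (shared cache, independent collectors, short-lived write locks, a sender that only reads) can be sketched roughly as follows. This is not the code from `agent/src/agent.rs`: it uses `std::thread` instead of tokio tasks so that it compiles without external crates, the `AgentData` fields are invented for illustration, and `spawn_collector`, `snapshot`, and the `rounds` parameter (which makes the loop terminate) are hypothetical names.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the agent's `AgentData`; field names are assumptions.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct AgentData {
    pub cpu_load: Option<f32>,
    pub disk_used_percent: Option<f32>,
}

/// Shared cache, as named in the commit: `Arc<RwLock<AgentData>>`.
pub type Cache = Arc<RwLock<AgentData>>;

/// Spawn one collector that runs independently of all others.
/// `collect` does the (possibly slow) work with no lock held;
/// `update` writes the result into this collector's section of the cache.
/// A real agent would loop forever; `rounds` exists only so the sketch ends.
pub fn spawn_collector<C, U>(
    cache: Cache,
    interval: Duration,
    rounds: usize,
    collect: C,
    update: U,
) -> thread::JoinHandle<()>
where
    C: Fn() -> f32 + Send + 'static,
    U: Fn(&mut AgentData, f32) + Send + 'static,
{
    thread::spawn(move || {
        for _ in 0..rounds {
            // Slow work happens outside the lock, so readers are never blocked by it.
            let value = collect();
            {
                // Acquire write lock, update, release immediately.
                let mut data = cache.write().unwrap();
                update(&mut *data, value);
            }
            thread::sleep(interval);
        }
    })
}

/// What the sender would do on each tick: take a read lock just long
/// enough to clone the current state, then transmit the clone.
pub fn snapshot(cache: &Cache) -> AgentData {
    cache.read().unwrap().clone()
}
```

In this sketch a slow disk collector only delays its own field. A sender calling `snapshot` every second or two still gets the latest CPU and memory values plus the last known (possibly stale) disk values, which is the behaviour the commit message relies on to avoid false "host offline" alerts.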
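
The `command_timeout_seconds` settings in the diff exist so that a hung `smartctl` or `systemctl` call cannot linger indefinitely. The commit does not show how the agent enforces the timeout, so the following is only one possible approach, written with the standard library alone: poll the child with `try_wait` and kill it once the deadline passes. The function name `run_with_timeout` and the 10 ms polling step are assumptions.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Run `cmd` and wait at most `timeout` for it to finish.
/// Returns `Ok(Some(status))` if the command exited in time and
/// `Ok(None)` if it had to be killed. Output is discarded in this sketch.
pub fn run_with_timeout(
    cmd: &mut Command,
    timeout: Duration,
) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;

    loop {
        // Non-blocking check: has the child exited yet?
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore the kill result: the child may have exited in the meantime.
            let _ = child.kill();
            // Reap the process so it does not stay around as a zombie.
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

A disk collector could call this with `Duration::from_secs(30)` to match the documented default for `collectors.disk.command_timeout_seconds`, and treat `Ok(None)` as "keep the previously cached value".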