Document cached collector architecture plan
Add architectural plan for separating ZMQ sending from data collection to prevent false 'host offline' alerts caused by slow collectors.

Key concepts:
- Shared cache (Arc<RwLock<AgentData>>)
- Independent async collector tasks with different update rates
- ZMQ sender always sends every 1s from cache
- Fast collectors (1s), medium (5s), slow (60s)
- No blocking regardless of collector speed
parent 833010e270
commit 37f2650200
CLAUDE.md (50 lines changed)
@@ -156,6 +156,56 @@ Complete migration from string-based metrics to structured JSON data. Eliminates
- ✅ Backward compatibility via bridge conversion to existing UI widgets
- ✅ All string parsing bugs eliminated

### Cached Collector Architecture (🚧 PLANNED)

**Problem:** Blocking collectors prevent timely ZMQ transmission, causing false "host offline" alerts.

**Previous (Sequential Blocking):**
```
Every 1 second:
  └─ collect_all_data() [BLOCKS for 2-10+ seconds]
        ├─ CPU (fast: 10ms)
        ├─ Memory (fast: 20ms)
        ├─ Disk SMART (slow: 3s per drive × 4 drives = 12s)
        ├─ Service disk usage (slow: 2-8s per service)
        └─ Docker (medium: 500ms)
  └─ send_via_zmq() [Only after ALL collection completes]

Result: If any collector takes >10s → "host offline" false alert
```
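
The arithmetic behind the diagram above can be checked with a short sketch. It assumes a single service at the low end of the quoted 2-8s range; `sequential_pass` is a hypothetical helper, not a function from the agent.

```rust
use std::time::Duration;

/// With sequential blocking collectors, one pass takes the sum of all
/// collector durations, because nothing is sent until the last one finishes.
fn sequential_pass(durations: &[Duration]) -> Duration {
    durations.iter().sum()
}

fn main() {
    // Figures taken from the diagram; the service figure is an assumption
    // (one service, lower bound of the 2-8s range).
    let collectors = [
        Duration::from_millis(10),  // CPU
        Duration::from_millis(20),  // Memory
        Duration::from_secs(3) * 4, // Disk SMART: 3s per drive × 4 drives
        Duration::from_secs(2),     // Service disk usage, one service
        Duration::from_millis(500), // Docker
    ];
    // The ">10s" limit quoted in the diagram.
    let offline_threshold = Duration::from_secs(10);

    let pass = sequential_pass(&collectors);
    println!("one pass: {:?}, threshold: {:?}", pass, offline_threshold);
    assert!(pass > offline_threshold);
}
```

Even with the most favourable service figure, a single pass exceeds the threshold, so the SMART collector alone is enough to trigger the false alert.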

**New (Cached Independent Collectors):**

```
Shared Cache: Arc<RwLock<AgentData>>

Background Collectors (independent async tasks):
  ├─ Fast collectors (CPU, RAM, Network)
  │   └─ Update cache every 1 second
  ├─ Medium collectors (Services, Docker)
  │   └─ Update cache every 5 seconds
  └─ Slow collectors (Disk usage, SMART data)
      └─ Update cache every 60 seconds

ZMQ Sender (separate async task):
  Every 1 second:
    └─ Read current cache
    └─ Send via ZMQ [Always instant, never blocked]
```
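
A minimal sketch of the decoupling shown in the diagram, under several assumptions: standard-library threads stand in for the planned async tasks so the example runs without external crates, a closure stands in for the ZMQ socket, and the `AgentData` fields and the `send_snapshot` and `run_demo` helpers are hypothetical. A channel receive simulates the slow SMART query so the outcome is deterministic.

```rust
use std::sync::{mpsc, Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for the real `AgentData`; field names are assumptions.
#[derive(Clone, Debug, Default, PartialEq)]
struct AgentData {
    cpu_percent: Option<f32>,
    smart_status: Option<String>,
}

type Cache = Arc<RwLock<AgentData>>;

/// Sender step: clone the current cache contents and pass them to `send`.
/// The read lock is held only while cloning.
fn send_snapshot(cache: &Cache, send: &mut dyn FnMut(AgentData)) {
    let snapshot = cache.read().unwrap().clone();
    send(snapshot);
}

/// Sends `sends` snapshots while a slow collector is still busy, then one
/// more after it has finished. Returns everything that was sent.
fn run_demo(sends: usize) -> Vec<AgentData> {
    let cache: Cache = Arc::new(RwLock::new(AgentData::default()));

    // Fast collector: writes its section and finishes immediately.
    let fast = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            cache.write().unwrap().cpu_percent = Some(12.5);
        })
    };
    fast.join().unwrap();

    // Slow collector: the blocking receive simulates a long SMART query.
    // No lock is held while it waits.
    let (release, busy) = mpsc::channel::<()>();
    let slow = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            busy.recv().unwrap();
            cache.write().unwrap().smart_status = Some("PASSED".to_string());
        })
    };

    // The sender keeps transmitting while the slow collector is busy.
    let mut sent = Vec::new();
    for _ in 0..sends {
        send_snapshot(&cache, &mut |data| sent.push(data));
    }

    // Let the slow collector finish, then send once more.
    release.send(()).unwrap();
    slow.join().unwrap();
    send_snapshot(&cache, &mut |data| sent.push(data));
    sent
}

fn main() {
    for (i, data) in run_demo(3).iter().enumerate() {
        println!("send {}: {:?}", i, data);
    }
}
```

The first three snapshots carry the CPU value with no SMART data, and the last one includes both. The sender never waits for the slow collector, only for the short moments a writer holds the lock.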

**Benefits:**
- ✅ ZMQ sends every 1 second regardless of collector speed
- ✅ No false "host offline" alerts from slow collectors
- ✅ Different update rates for different metrics (CPU=1s, SMART=60s)
- ✅ System stays responsive even with slow operations
- ✅ Slow collectors can use longer timeouts without blocking

**Implementation:**
- Shared `AgentData` cache wrapped in `Arc<RwLock<AgentData>>`
- Each collector is spawned as an independent tokio task
- Collectors update their own section of the cache at their own rate
- ZMQ sender reads the cache every 1s and transmits
- Stale data is acceptable for slow-changing metrics (disk usage, SMART)
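
One way the per-collector update loop could look, as a sketch only. It assumes standard-library threads in place of tokio tasks, intervals scaled down to milliseconds so it finishes quickly, and hypothetical names (`run_collector`, `demo`, the `AgentData` fields). The point it illustrates is that the slow collection happens outside the lock, and the write lock is taken only to store the result in that collector's own section.

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the real `AgentData`; field names are assumptions.
#[derive(Clone, Debug, Default, PartialEq)]
struct AgentData {
    cpu_percent: Option<f32>,
    disk_used_gb: Option<u64>,
}

/// Runs `collect` `rounds` times, `interval` apart, and stores each result
/// in the cache via `store`.
fn run_collector<T>(
    cache: &RwLock<AgentData>,
    interval: Duration,
    rounds: usize,
    mut collect: impl FnMut() -> T,
    store: impl Fn(&mut AgentData, T),
) {
    for round in 0..rounds {
        // Potentially slow work: no lock is held here.
        let value = collect();
        {
            let mut data = cache.write().unwrap();
            store(&mut *data, value);
        } // write lock released before sleeping
        if round + 1 < rounds {
            thread::sleep(interval);
        }
    }
}

/// Runs one fast and one slow collector and returns the final cache contents.
fn demo() -> AgentData {
    let cache = Arc::new(RwLock::new(AgentData::default()));

    let fast = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            let mut reading = 0.0_f32;
            run_collector(
                &cache,
                Duration::from_millis(10), // stands in for the 1s rate
                3,
                || {
                    reading += 1.0;
                    reading
                },
                |data, value| data.cpu_percent = Some(value),
            );
        })
    };

    let slow = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            run_collector(
                &cache,
                Duration::from_millis(50), // stands in for the 60s rate
                1,
                || 120_u64,
                |data, value| data.disk_used_gb = Some(value),
            );
        })
    };

    fast.join().unwrap();
    slow.join().unwrap();
    let result = cache.read().unwrap().clone();
    result
}

fn main() {
    println!("{:?}", demo());
}
```

Each collector touches only its own field, so a slow collector's old value simply stays in the cache until its next run, which matches the stale-data trade-off listed above.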

### Maintenance Mode

- Agent checks for `/tmp/cm-maintenance` file before sending notifications
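
The check described above could be as simple as a file-existence test. This is a sketch, not the agent's actual code; `maintenance_active` is a hypothetical name, and only the marker path comes from the text.

```rust
use std::path::Path;

/// Returns true when the maintenance marker file exists, in which case
/// notifications should be suppressed.
fn maintenance_active(marker: &Path) -> bool {
    marker.exists()
}

fn main() {
    let marker = Path::new("/tmp/cm-maintenance");
    if maintenance_active(marker) {
        println!("maintenance mode: notifications suppressed");
    } else {
        println!("notifications enabled");
    }
}
```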