This commit is contained in:
2025-10-12 22:31:46 +02:00
parent 4d8bacef50
commit 9e344fb66d
17 changed files with 283 additions and 297 deletions

106
CLAUDE.md
View File

@@ -184,25 +184,103 @@ Keys: [←→] hosts [r]efresh [q]uit
Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
```
## Development Status
## Architecture Principles - CRITICAL
### Immediate TODOs
### Agent-Dashboard Separation of Concerns
- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**
- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
- Agent sends status to dashboard via ZMQ
- Dashboard NEVER calculates status - only displays what agent provides
- Investigate why the backup metrics agent is not publishing data to the dashboard
- Resize the services widget so it can display more services without truncation
- Remove the dedicated status widget and redistribute the layout space
- Add responsive scaling within each widget so columns and content adapt dynamically
**Data Flow Architecture:**
```
Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
```
### Phase 3: Advanced Features 🚧 IN PROGRESS
**Status Handling Rules:**
- Agent provides status → Dashboard uses agent status
- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
- Dashboard widgets NEVER contain hardcoded thresholds
- TableBuilder converts status to colors for display
- [x] ZMQ gossip network implementation
- [x] Comprehensive error handling
- [x] Performance optimizations
- [ ] Predictive analytics for wear levels
- [ ] Custom alert rules engine
- [ ] Historical data export capabilities
### Current Agent Thresholds (as of 2025-10-12)
**CPU Load (service.rs:392-400):**
- Warning: ≥ 2.0 (testing value, was 5.0)
- Critical: ≥ 4.0 (testing value, was 8.0)
**CPU Temperature (service.rs:412-420):**
- Warning: ≥ 70.0°C
- Critical: ≥ 80.0°C
**Memory Usage (service.rs:402-410):**
- Warning: ≥ 80%
- Critical: ≥ 95%
### Email Notifications
**System Configuration:**
- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
- To: `cm@cmtec.se`
- SMTP: localhost:25 (postfix)
- Timezone: Europe/Stockholm (not UTC)
**Notification Triggers:**
- Status degradation: any → "warning" or "critical"
- Recovery: "warning"/"critical" → "ok"
- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
**Monitored Components:**
- system.cpu (load status)
- system.cpu_temp (temperature status)
- system.memory (usage status)
- system.services (service health status)
- storage.smart (drive health)
- backup.overall (backup status)
### Pure Auto-Discovery Implementation
**Agent Configuration:**
- No config files required
- Auto-detects storage devices, services, backup systems
- Runtime discovery of system capabilities
- CLI: `cm-dashboard-agent [-v]` (only verbose flag)
**Service Discovery:**
- Scans running systemd services
- Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
- No host-specific hardcoded service lists
### Current Implementation Status
**Completed:**
- [x] Pure auto-discovery agent (no config files)
- [x] Agent-side status calculations with defined thresholds
- [x] Dashboard displays agent status (no dashboard calculations)
- [x] Email notifications with Stockholm timezone
- [x] CPU temperature monitoring and notifications
- [x] ZMQ message format standardization
- [x] Removed all hardcoded dashboard thresholds
**Testing Configuration (REVERT FOR PRODUCTION):**
- CPU thresholds lowered to 2.0/4.0 for easy testing
- Email rate limiting disabled (0 minutes)
### Development Guidelines
**When Adding New Metrics:**
1. Agent calculates status with thresholds
2. Agent adds `{metric}_status` field to JSON output
3. Dashboard data structure adds `{metric}_status: Option<String>`
4. Dashboard uses `status_level_from_agent_status()` for display
5. Agent adds notification monitoring for status changes
**NEVER:**
- Add hardcoded thresholds to dashboard widgets
- Calculate status in dashboard with different thresholds than agent
- Use "ok" as default when agent status is missing (use "unknown")
- Calculate colors in widgets (TableBuilder's responsibility)
# Important Communication Guidelines