Testing

2025-10-12 22:31:46 +02:00
parent 4d8bacef50
commit 9e344fb66d
17 changed files with 283 additions and 297 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -184,25 +184,103 @@ Keys: [←→] hosts [r]efresh [q]uit
 Keys: [Enter] details [r]efresh [s]ort [f]ilter [q]uit
 ```

-## Development Status
+## Architecture Principles - CRITICAL

-### Immediate TODOs
+### Agent-Dashboard Separation of Concerns

- Refactor all dashboard widgets to use a shared table/layout helper so icons, padding, and titles remain consistent across panels
+**AGENT IS SINGLE SOURCE OF TRUTH FOR ALL STATUS CALCULATIONS**
+- Agent calculates status ("ok"/"warning"/"critical"/"unknown") using defined thresholds
+- Agent sends status to dashboard via ZMQ
+- Dashboard NEVER calculates status - only displays what agent provides

- Investigate why the backup metrics agent is not publishing data to the dashboard
- Resize the services widget so it can display more services without truncation
- Remove the dedicated status widget and redistribute the layout space
- Add responsive scaling within each widget so columns and content adapt dynamically
+**Data Flow Architecture:**
+```
+Agent (calculations + thresholds) → Status → Dashboard (display only) → TableBuilder (colors)
+```

-### Phase 3: Advanced Features 🚧 IN PROGRESS
+**Status Handling Rules:**
+- Agent provides status → Dashboard uses agent status
+- Agent doesn't provide status → Dashboard shows "unknown" (NOT "ok")
+- Dashboard widgets NEVER contain hardcoded thresholds
+- TableBuilder converts status to colors for display

- [x] ZMQ gossip network implementation
- [x] Comprehensive error handling
- [x] Performance optimizations
- [ ] Predictive analytics for wear levels
- [ ] Custom alert rules engine
- [ ] Historical data export capabilities
+### Current Agent Thresholds (as of 2025-10-12)
+
+**CPU Load (service.rs:392-400):**
+- Warning: ≥ 2.0 (testing value, was 5.0)
+- Critical: ≥ 4.0 (testing value, was 8.0)
+
+**CPU Temperature (service.rs:412-420):**
+- Warning: ≥ 70.0°C
+- Critical: ≥ 80.0°C
+
+**Memory Usage (service.rs:402-410):**
+- Warning: ≥ 80%
+- Critical: ≥ 95%
+
+### Email Notifications
+
+**System Configuration:**
+- From: `{hostname}@cmtec.se` (e.g., cmbox@cmtec.se)
+- To: `cm@cmtec.se`
+- SMTP: localhost:25 (postfix)
+- Timezone: Europe/Stockholm (not UTC)
+
+**Notification Triggers:**
+- Status degradation: any → "warning" or "critical"
+- Recovery: "warning"/"critical" → "ok"
+- Rate limiting: configurable (set to 0 for testing, 30 minutes for production)
+
+**Monitored Components:**
+- system.cpu (load status)
+- system.cpu_temp (temperature status)
+- system.memory (usage status)
+- system.services (service health status)
+- storage.smart (drive health)
+- backup.overall (backup status)
+
+### Pure Auto-Discovery Implementation
+
+**Agent Configuration:**
+- No config files required
+- Auto-detects storage devices, services, backup systems
+- Runtime discovery of system capabilities
+- CLI: `cm-dashboard-agent [-v]` (only verbose flag)
+
+**Service Discovery:**
+- Scans running systemd services
+- Filters by predefined interesting patterns (gitea, nginx, docker, etc.)
+- No host-specific hardcoded service lists
+
+### Current Implementation Status
+
+**Completed:**
+- [x] Pure auto-discovery agent (no config files)
+- [x] Agent-side status calculations with defined thresholds
+- [x] Dashboard displays agent status (no dashboard calculations)
+- [x] Email notifications with Stockholm timezone
+- [x] CPU temperature monitoring and notifications
+- [x] ZMQ message format standardization
+- [x] Removed all hardcoded dashboard thresholds
+
+**Testing Configuration (REVERT FOR PRODUCTION):**
+- CPU thresholds lowered to 2.0/4.0 for easy testing
+- Email rate limiting disabled (0 minutes)
+
+### Development Guidelines
+
+**When Adding New Metrics:**
+1. Agent calculates status with thresholds
+2. Agent adds `{metric}_status` field to JSON output  
+3. Dashboard data structure adds `{metric}_status: Option<String>`
+4. Dashboard uses `status_level_from_agent_status()` for display
+5. Agent adds notification monitoring for status changes
+
+**NEVER:**
+- Add hardcoded thresholds to dashboard widgets
+- Calculate status in dashboard with different thresholds than agent
+- Use "ok" as default when agent status is missing (use "unknown")
+- Calculate colors in widgets (TableBuilder's responsibility)

 # Important Communication Guidelines