# CM Dashboard - Infrastructure Monitoring TUI

## Overview

A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.

## Implementation Strategy

### Next Phase: Systemd Collector Optimization (Based on TODO.md)

**Current Status**: Reverted to working baseline (commit 245e546) after optimization broke service discovery.

**Planned Implementation Steps** (step-by-step to avoid breaking functionality):

**Phase 1: Exact Name Filtering**
- Replace `contains()` matching with exact name matching for service filters
- Change `service_name.contains(pattern) || pattern.contains(service_name)` to `service_name == pattern`
- Test: Ensure cmbox remains visible with exact service names in config
- Commit and test after each change

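The difference between the two matching rules in Phase 1 can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function names `matches_loose`, `matches_exact`, and `is_monitored` are hypothetical, and the filter list is assumed to be a plain slice of strings.

```rust
/// Current behaviour: substring match in either direction.
/// "nginx" therefore also matches "nginx-config-reload".
fn matches_loose(service_name: &str, pattern: &str) -> bool {
    service_name.contains(pattern) || pattern.contains(service_name)
}

/// Phase 1 behaviour: the configured filter must equal the service name.
fn matches_exact(service_name: &str, pattern: &str) -> bool {
    service_name == pattern
}

/// Returns true if any configured filter names this service exactly.
/// (Hypothetical helper; the real filter type may differ.)
fn is_monitored(service_name: &str, filters: &[String]) -> bool {
    filters.iter().any(|f| matches_exact(service_name, f))
}
```

With exact matching, every service that must stay visible has to appear in the config under its full name, which is why the phase calls for checking the dashboard after the change.
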
**Phase 2: Remove User Service Collection**
- Remove all `sudo -u` systemctl commands for user services
- Remove user_unit_files_output and user_units_output logic
- Keep only system service discovery via `systemctl list-units --type=service`
- Test: Verify system services still discovered correctly

**Phase 3: Add Wildcard Support**
- Implement glob pattern matching for service filters
- Support patterns like "nginx*" to match "nginx", "nginx-config-reload", etc.
- Use fnmatch or similar for wildcard expansion
- Test: Verify patterns work as expected

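One possible shape for the Phase 3 matcher is sketched below. It is a hand-rolled stand-in for "fnmatch or similar", written against the standard library only, and it assumes `*` is the only wildcard needed; the name `glob_match` is hypothetical. A glob crate could be used instead if richer patterns are required.

```rust
/// Matches `name` against `pattern`, where `*` matches any run of
/// characters (including none). No other metacharacters are supported.
fn glob_match(pattern: &str, name: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let n: Vec<char> = name.chars().collect();
    let (mut pi, mut ni) = (0usize, 0usize);
    // Pattern index of the last `*` seen and the name index it was tried at.
    let mut star: Option<(usize, usize)> = None;

    while ni < n.len() {
        if pi < p.len() && p[pi] == '*' {
            // Remember the star and first try matching it against nothing.
            star = Some((pi, ni));
            pi += 1;
        } else if pi < p.len() && p[pi] == n[ni] {
            pi += 1;
            ni += 1;
        } else if let Some((sp, sn)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ni = sn + 1;
            star = Some((sp, sn + 1));
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be `*`.
    p[pi..].iter().all(|&c| c == '*')
}
```

A pattern without `*` degrades to exact matching, so filters written for Phase 1 keep their meaning.
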
**Phase 4: Optimize systemctl Calls**
- Cache service status information during discovery
- Eliminate redundant `systemctl is-active` and `systemctl show` calls per service
- Parse status from `systemctl list-units` output directly
- Test: Ensure performance improvement without functionality loss

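Parsing status straight from the discovery output, as Phase 4 describes, might look like the sketch below. It assumes the collector runs `systemctl list-units --type=service --all --plain --no-legend`, so that each line carries the columns UNIT, LOAD, ACTIVE, SUB, and DESCRIPTION with no header or status bullet. The `UnitState` type and `parse_list_units` function are hypothetical names for illustration.

```rust
use std::collections::HashMap;

/// Load/active/sub state for one unit, as printed by `systemctl list-units`.
#[derive(Debug, Clone, PartialEq)]
struct UnitState {
    load: String,
    active: String,
    sub: String,
}

/// Parses plain, legend-free `systemctl list-units` output into a cache
/// keyed by service name. Lines with fewer than four columns are skipped.
fn parse_list_units(output: &str) -> HashMap<String, UnitState> {
    let mut units = HashMap::new();
    for line in output.lines() {
        let mut cols = line.split_whitespace();
        if let (Some(unit), Some(load), Some(active), Some(sub)) =
            (cols.next(), cols.next(), cols.next(), cols.next())
        {
            // Drop the ".service" suffix so keys match configured names.
            let name = unit.strip_suffix(".service").unwrap_or(unit);
            units.insert(
                name.to_string(),
                UnitState {
                    load: load.to_string(),
                    active: active.to_string(),
                    sub: sub.to_string(),
                },
            );
        }
    }
    units
}
```

One `systemctl` invocation then fills the cache for every service, and later lookups replace the per-service `is-active` and `show` calls.
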
**Phase 5: Include-Only Discovery**
- Remove auto-discovery of all services
- Only check services explicitly listed in service_name_filters
- Skip systemctl discovery entirely, use configured list directly
- Test: Verify only configured services are monitored

**Critical Requirements:**
- Each phase must be tested independently
- cmbox must remain visible in dashboard after each change
- No functionality regressions allowed
- Commit each phase separately with descriptive messages

**Rollback Strategy:**
- If any phase breaks functionality, immediately revert that specific commit
- Do not attempt to "fix forward" - revert and redesign the problematic step
- Each phase should be atomic and independently revertible

## Core Architecture Principles - CRITICAL

### Individual Metrics Philosophy

**NEW ARCHITECTURE**: The agent collects individual metrics; the dashboard composes widgets from those metrics.

### Maintenance Mode

**Purpose:**

- Suppress email notifications during planned maintenance or backups
- Prevent false alerts when services are intentionally stopped

**Implementation:**

- Agent checks for `/tmp/cm-maintenance` file before sending notifications
- File presence suppresses all email notifications while monitoring continues
- Dashboard continues to show real status; only notifications are blocked

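The notification gate described above could be sketched like this. Only the marker path `/tmp/cm-maintenance` comes from this document; the function names and the `status_changed` flag are assumptions made for illustration, and the real agent may decide when to notify differently.

```rust
use std::path::Path;

/// Marker file named in this section; its presence suppresses email.
const MAINTENANCE_FILE: &str = "/tmp/cm-maintenance";

/// Returns true while the maintenance marker exists at `marker`.
fn maintenance_active(marker: &Path) -> bool {
    marker.exists()
}

/// Decides whether an email should go out. Metric collection and the
/// dashboard are unaffected; only the notification is gated.
/// (`status_changed` is a hypothetical trigger condition.)
fn should_notify(status_changed: bool, marker: &Path) -> bool {
    status_changed && !maintenance_active(marker)
}
```

In the agent this would be called as `should_notify(changed, Path::new(MAINTENANCE_FILE))` immediately before sending, so the file is re-checked for every notification rather than cached.
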
**Usage:**

```bash
# Enable maintenance mode
touch /tmp/cm-maintenance

# Run maintenance tasks (backups, service restarts, etc.)
systemctl stop service
# ... maintenance work ...
systemctl start service

# Disable maintenance mode
rm /tmp/cm-maintenance
```

**NixOS Integration:**

- Borgbackup script automatically creates/removes maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick
- All configuration shall be done from the NixOS config

**ARCHITECTURE ENFORCEMENT**:

- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
- **Individual metrics only** - NO grouped metric structures
- **Reference-only legacy** - Study old functionality, implement new architecture
- **Clean slate mindset** - Build as if legacy codebase never existed

**Implementation Rules**:

1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status

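The four rules above can be illustrated with a small sketch. Every name here is hypothetical (`Status`, `Metric`, `status_for`, `widget_status`, and the metric names in `CPU_WIDGET_METRICS`), and the "worst status wins" aggregation is an assumption; ARCHITECT.md remains the authority on the actual types and aggregation rule.

```rust
/// Status levels, ordered so that a larger value is worse.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Status {
    Ok,
    Warning,
    Critical,
}

/// One individually collected and transmitted metric.
#[derive(Debug, Clone)]
struct Metric {
    name: String,
    value: f64,
    status: Status,
}

/// Agent side: derive the status of a single value from its thresholds.
fn status_for(value: f64, warn: f64, crit: f64) -> Status {
    if value >= crit {
        Status::Critical
    } else if value >= warn {
        Status::Warning
    } else {
        Status::Ok
    }
}

/// Metric names a CPU widget subscribes to, kept in one const array
/// instead of being scattered through the widget code (names are made up).
const CPU_WIDGET_METRICS: [&str; 2] = ["cpu_load_1min", "cpu_temperature_celsius"];

/// Dashboard side: the widget status is the worst status among the
/// subscribed metrics, or `None` if none of them has been received.
/// The dashboard never recomputes a status from `value`.
fn widget_status(metrics: &[Metric], subscribed: &[&str]) -> Option<Status> {
    metrics
        .iter()
        .filter(|m| subscribed.contains(&m.name.as_str()))
        .map(|m| m.status)
        .max()
}
```

Keeping the threshold logic in `status_for` on the agent means a widget only ever reads `status`, which matches the rule against calculating status in dashboard widgets.
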
**Testing & Building**:

- **Workspace builds**: `cargo build --workspace` for all testing
- **Clean compilation**: Remove `target/` between architecture changes
- **ZMQ testing**: Test agent-dashboard communication independently
- **Widget testing**: Verify UI layout matches legacy appearance exactly

**NEVER in New Implementation**:

- Copy/paste ANY code from legacy backup
- Calculate status in dashboard widgets
- Hardcode metric names inline in widgets (define them in const arrays instead)

# Important Communication Guidelines

NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.

NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.

## Commit Message Guidelines

**NEVER mention:**

- Claude or any AI assistant names
- Automation or AI-generated content
- Any reference to automated code generation

**ALWAYS:**

- Focus purely on technical changes and their purpose
- Use standard software development commit message format
- Describe what was changed and why, not how it was created
- Write from the perspective of a human developer

**Examples:**

- ❌ "Generated with Claude Code"
- ❌ "AI-assisted implementation"
- ❌ "Automated refactoring"
- ✅ "Implement maintenance mode for backup operations"
- ✅ "Restructure storage widget with improved layout"
- ✅ "Update CPU thresholds to production values"

## NixOS Configuration Updates

When code changes are made to cm-dashboard, the NixOS configuration at `~/nixosbox` must be updated to deploy the changes.

### Update Process

1. **Get Latest Commit Hash**

```bash
git log -1 --format="%H"
```

2. **Update NixOS Configuration**
Edit `~/nixosbox/hosts/common/cm-dashboard.nix`:

```nix
src = pkgs.fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH_HERE";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="; # Placeholder
};
```

3. **Get Correct Source Hash**
Build with placeholder hash to get the actual hash:

```bash
cd ~/nixosbox
nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchgit {
  url = "https://gitea.cmtec.se/cm/cm-dashboard.git";
  rev = "NEW_COMMIT_HASH";
  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
}' 2>&1 | grep "got:"
```

Example output:

```
error: hash mismatch in fixed-output derivation '/nix/store/...':
specified: sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
got: sha256-x8crxNusOUYRrkP9mYEOG+Ga3JCPIdJLkEAc5P1ZxdQ=
```

4. **Update Configuration with Correct Hash**
Replace the placeholder with the hash from the error message (the "got:" line).

5. **Commit NixOS Configuration**

```bash
cd ~/nixosbox
git add hosts/common/cm-dashboard.nix
git commit -m "Update cm-dashboard to latest version (SHORT_HASH)"
git push
```

6. **Rebuild System**
The user handles the system rebuild step - this cannot be automated.