Migrate service control from ZMQ to SSH with real-time progress

Replace ZMQ-based service start/stop commands with SSH execution in tmux popups. This provides better user feedback with real-time systemctl output while eliminating blocking operations from the main message processing loop.

Changes:
- Service start/stop now use SSH with progress display
- Added backup functionality with 'B' key
- Preserved transitional icons (↑/↓) for immediate visual feedback
- Removed all ZMQ service control commands and handlers
- Updated configuration to include backup_alias setting
- All operations (rebuild, backup, services) now use consistent SSH interface

This ensures stable heartbeat processing while providing superior user experience with live command output and service status feedback.
Remove blocking CollectNow commands to fix heartbeat stability
2025-11-18 16:02:15 +01:00 · 2025-11-15 11:41:58 +01:00 · 2025-11-15 11:09:49 +01:00 · 2025-11-15 10:25:08 +01:00 · 2025-11-15 10:21:30 +01:00 · 2025-11-15 10:04:47 +01:00
34 changed files with 1231 additions and 1280 deletions
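
The commit message describes service start/stop running over SSH inside a tmux popup. The sketch below shows one way such a popup command could be assembled, following the `tmux display-popup "ssh -tt {user}@{hostname} ..."` pattern that CLAUDE.md quotes for rebuilds. `ServiceAction`, `popup_args`, the use of `sudo`, and the unit-name check are assumptions made for illustration; this is not the code from the commit.

```rust
/// Service operation requested from the dashboard (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ServiceAction {
    Start,
    Stop,
}

impl ServiceAction {
    /// The systemctl verb for this action.
    fn verb(self) -> &'static str {
        match self {
            ServiceAction::Start => "start",
            ServiceAction::Stop => "stop",
        }
    }
}

/// Returns true if `name` contains only characters that are safe to place
/// unquoted in a remote shell command (assumed policy for this sketch).
fn is_safe_unit_name(name: &str) -> bool {
    !name.is_empty()
        && name
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '_' | '.' | '@'))
}

/// Build the arguments for `tmux` so the action runs over SSH in a popup.
/// `user` and `host` are expected to come from trusted dashboard config.
/// Returns `None` when the service name fails the safety check.
pub fn popup_args(
    user: &str,
    host: &str,
    action: ServiceAction,
    service: &str,
) -> Option<Vec<String>> {
    if !is_safe_unit_name(service) {
        return None;
    }
    // Assumption: the SSH user may run systemctl through sudo.
    let remote = format!("sudo systemctl {} {}", action.verb(), service);
    let ssh = format!("ssh -tt {}@{} '{}'", user, host, remote);
    Some(vec!["display-popup".to_string(), ssh])
}
```

Passing the result to `std::process::Command::new("tmux").args(&args).spawn()` instead of waiting on the child would keep the message loop free, which matches the stated goal of removing blocking operations.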
--- a/.gitea/workflows/release.yml
+++ b/.gitea/workflows/release.yml
@@ -113,13 +113,13 @@ jobs:
          NIX_HASH="sha256-$(python3 -c "import base64, binascii; print(base64.b64encode(binascii.unhexlify('$NEW_HASH')).decode())")"
          
          # Update the NixOS configuration
-          sed -i "s|version = \"v[^\"]*\"|version = \"$VERSION\"|" hosts/common/cm-dashboard.nix
-          sed -i "s|sha256 = \"sha256-[^\"]*\"|sha256 = \"$NIX_HASH\"|" hosts/common/cm-dashboard.nix
+          sed -i "s|version = \"v[^\"]*\"|version = \"$VERSION\"|" hosts/services/cm-dashboard.nix
+          sed -i "s|sha256 = \"sha256-[^\"]*\"|sha256 = \"$NIX_HASH\"|" hosts/services/cm-dashboard.nix
          
          # Commit and push changes
          git config user.name "Gitea Actions"
          git config user.email "actions@gitea.cmtec.se"
-          git add hosts/common/cm-dashboard.nix
+          git add hosts/services/cm-dashboard.nix
          git commit -m "Auto-update cm-dashboard to $VERSION

          - Update version to $VERSION with automated release
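
The workflow step above turns a hex SHA-256 digest into the `sha256-<base64>` form that the Nix file expects, using a Python one-liner. For readers more at home in the project's own language, here is a rough std-only Rust equivalent. The function names are made up for this note, and the 64-character length check is an added assumption rather than something the workflow enforces.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Decode a hex string into bytes; returns None on odd length or non-hex input.
fn hex_to_bytes(hex: &str) -> Option<Vec<u8>> {
    if hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Standard base64 (RFC 4648) with '=' padding.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Missing bytes in the final chunk are treated as zero.
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (u32::from(chunk[0]) << 16) | (u32::from(b1) << 8) | u32::from(b2);
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Convert a 64-character hex SHA-256 digest to the `sha256-...` form.
pub fn nix_sri_hash(hex_digest: &str) -> Option<String> {
    if hex_digest.len() != 64 {
        return None;
    }
    let bytes = hex_to_bytes(hex_digest)?;
    Some(format!("sha256-{}", base64_encode(&bytes)))
}
```

An all-zero digest encodes to 43 `A` characters followed by `=`, which is the same shape as the placeholder hash used in the "Get Release Hash" snippet in CLAUDE.md.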
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,3 +0,0 @@
-# Agent Guide
-
-Agents working in this repo must follow the instructions in `CLAUDE.md`.
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -2,277 +2,80 @@

 ## Overview

-A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built to replace Glance with a custom solution tailored for our specific monitoring needs and ZMQ-based metric collection.
+A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

-## Implementation Strategy
+## Current Features

-### Current Implementation Status
+### Core Functionality
+- **Real-time Monitoring**: CPU, RAM, Storage, and Service status
+- **Service Management**: Start/stop services with user-stopped tracking
+- **Multi-host Support**: Monitor multiple servers from single dashboard
+- **NixOS Integration**: System rebuild via SSH + tmux popup
+- **Backup Monitoring**: Borgbackup status and scheduling

-**System Panel Enhancement - COMPLETED** ✅
+### User-Stopped Service Tracking
+- Services stopped via dashboard are marked as "user-stopped"
+- User-stopped services report Status::OK instead of Warning
+- Prevents false alerts during intentional maintenance
+- Persistent storage survives agent restarts
+- Automatic flag clearing when services are restarted via dashboard

-All system panel features successfully implemented:
- ✅ **NixOS Collector**: Created collector for version and active users  
- ✅ **System Widget**: Unified widget combining NixOS, CPU, RAM, and Storage
- ✅ **Build Display**: Shows NixOS build information without codename
- ✅ **Active Users**: Displays currently logged in users
- ✅ **Tmpfs Monitoring**: Added /tmp usage to RAM section
- ✅ **Agent Deployment**: NixOS collector working in production
-
-**Simplified Navigation and Service Management - COMPLETED** ✅
-
-All navigation and service management features successfully implemented:
- ✅ **Direct Service Control**: Up/Down (or j/k) arrows directly control service selection
- ✅ **Always Visible Selection**: Service selection highlighting always visible (no panel focus needed)
- ✅ **Complete Service Discovery**: All configured services visible regardless of state
- ✅ **Transitional Visual Feedback**: Service operations show directional arrows (↑ ↓ ↻)
- ✅ **Simplified Interface**: Removed panel switching complexity, uniform appearance
- ✅ **Vi-style Navigation**: Added j/k keys for vim users alongside arrow keys
-
-**Current Status - October 28, 2025:**
- All service discovery and display features working correctly ✅
- Simplified navigation system implemented ✅
- Service selection always visible with direct control ✅
- Complete service visibility (all configured services show regardless of state) ✅
- Transitional service icons working with proper color handling ✅
- Build display working: "Build: 25.05.20251004.3bcc93c" ✅
- Agent version display working: "Agent: v0.1.33" ✅
- Cross-host version comparison implemented ✅
- Automated binary release system working ✅
- SMART data consolidated into disk collector ✅
-
-**RESOLVED - Remote Rebuild Functionality:**
- ✅ **System Rebuild**: Now uses simple SSH + tmux popup approach
- ✅ **Process Isolation**: Rebuild runs independently via SSH, survives agent/dashboard restarts
- ✅ **Configuration**: SSH user and rebuild alias configurable in dashboard config
- ✅ **Service Control**: Works correctly for start/stop/restart of services
-
-**Solution Implemented:**
- Replaced complex SystemRebuild command infrastructure with direct tmux popup
- Uses `tmux display-popup "ssh -tt {user}@{hostname} 'bash -ic {alias}'"`
- Configurable SSH user and rebuild alias in dashboard config
- Eliminates all agent crashes during rebuilds
- Simple, reliable, and follows standard tmux interface patterns
-
-**Current Layout:**
-```
-NixOS:
-Build: 25.05.20251004.3bcc93c
-Agent: v0.1.17   # Shows agent version from Cargo.toml
-Active users: cm, simon
-CPU:
-● Load: 0.02 0.31 0.86 • 3000MHz
-RAM:
-● Usage: 33% 2.6GB/7.6GB  
-● /tmp: 0% 0B/2.0GB  
-Storage:  
-● root (Single):  
- ├─ ● nvme0n1 W: 1%
- └─ ● 18% 167.4GB/928.2GB
+### Custom Service Logs
+- Configure service-specific log file paths per host in dashboard config
+- Press `L` on any service to view custom log files via `tail -f`
+- Configuration format in dashboard config:
+```toml
+[service_logs]
+hostname1 = [
+  { service_name = "nginx", log_file_path = "/var/log/nginx/access.log" },
+  { service_name = "app", log_file_path = "/var/log/myapp/app.log" }
+]
+hostname2 = [
+  { service_name = "database", log_file_path = "/var/log/postgres/postgres.log" }
+]
 ```

-**System panel layout fully implemented with blue tree symbols ✅**
-**Tree symbols now use consistent blue theming across all panels ✅**
-**Overflow handling restored for all widgets ("... and X more") ✅**
-**Agent version display working correctly ✅**
-**Cross-host version comparison logging warnings ✅**
-**Backup panel visibility fixed - only shows when meaningful data exists ✅**
-**SSH-based rebuild system fully implemented and working ✅**
+### Service Management
+- **Direct Control**: Arrow keys (↑↓) or vim keys (j/k) navigate services
+- **Service Actions**: 
+  - `s` - Start service (sends UserStart command)
+  - `S` - Stop service (sends UserStop command)
+  - `J` - Show service logs (journalctl in tmux popup)
+  - `L` - Show custom log files (tail -f custom paths in tmux popup)
+  - `R` - Rebuild current host
+- **Visual Status**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed)
+- **Transitional Icons**: Blue arrows during operations

-### Current Simplified Navigation Implementation
-
-**Navigation Controls:**
- **Tab**: Switch between hosts (cmbox, srv01, srv02, steambox, etc.)
- **↑↓ or j/k**: Move service selection cursor (always works)
+### Navigation
+- **Tab**: Switch between hosts
+- **↑↓ or j/k**: Select services
+- **s**: Start selected service (UserStart)
+- **S**: Stop selected service (UserStop)
+- **J**: Show service logs (journalctl)
+- **L**: Show custom log files
+- **R**: Rebuild current host
+- **B**: Run backup on current host
 - **q**: Quit dashboard

-**Service Control:**
- **s**: Start selected service
- **S**: Stop selected service  
- **R**: Rebuild current host (works from any context)
-
-**Visual Features:**
- **Service Selection**: Always visible blue background highlighting current service
- **Status Icons**: Green ● (active), Yellow ◐ (inactive), Red ◯ (failed), ? (unknown)
- **Transitional Icons**: Blue ↑ (starting), ↓ (stopping), ↻ (restarting) when not selected
- **Transitional Icons**: Dark gray arrows when service is selected (for visibility)
- **Uniform Interface**: All panels have consistent appearance (no focus borders)
-
-### Service Discovery and Display - WORKING ✅
-
-**All Issues Resolved (as of 2025-10-28):**
- ✅ **Complete Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all` for comprehensive service detection
- ✅ **All Services Visible**: Shows all configured services regardless of current state (active/inactive)
- ✅ **Proper Status Display**: Active services show green ●, inactive show yellow ◐, failed show red ◯
- ✅ **Transitional Icons**: Visual feedback during service operations with proper color handling
- ✅ **Simplified Navigation**: Removed panel complexity, direct service control always available
- ✅ **Service Control**: Start (s) and Stop (S) commands work from anywhere
- ✅ **System Rebuild**: SSH + tmux popup approach for reliable remote rebuilds
-
-### Terminal Popup for Real-time Output - IMPLEMENTED ✅
-
-**Status (as of 2025-10-26):**
- ✅ **Terminal Popup UI**: 80% screen coverage with terminal styling and color-coded output
- ✅ **ZMQ Streaming Protocol**: CommandOutputMessage for real-time output transmission
- ✅ **Keyboard Controls**: ESC/Q to close, ↑↓ to scroll, manual close (no auto-close)
- ✅ **Real-time Display**: Live streaming of command output as it happens
- ✅ **Version-based Agent Reporting**: Shows "Agent: v0.1.13" instead of nix store hash
-
-**Current Implementation Issues:**
- ❌ **Agent Process Crashes**: Agent dies during nixos-rebuild execution
- ❌ **Inconsistent Output**: Different outputs each time 'R' is pressed
- ❌ **Limited Output Visibility**: Not capturing all nixos-rebuild progress
-
-**PLANNED SOLUTION - Systemd Service Approach:**
-
-**Problem**: Direct nixos-rebuild execution in agent causes process crashes and inconsistent output.
-
-**Solution**: Create dedicated systemd service for rebuild operations.
-
-**Implementation Plan:**
-1. **NixOS Systemd Service**:
-   ```nix
-   systemd.services.cm-rebuild = {
-     description = "CM Dashboard NixOS Rebuild";
-     serviceConfig = {
-       Type = "oneshot";
-       ExecStart = "${pkgs.nixos-rebuild}/bin/nixos-rebuild switch --flake . --option sandbox false";
-       WorkingDirectory = "/var/lib/cm-dashboard/nixos-config";
-       User = "root";
-       StandardOutput = "journal";
-       StandardError = "journal";
-     };
-   };
-   ```
-
-2. **Agent Modification**:
-   - Replace direct nixos-rebuild execution with: `systemctl start cm-rebuild`
-   - Stream output via: `journalctl -u cm-rebuild -f --no-pager`
-   - Monitor service status for completion detection
-
-3. **Benefits**:
-   - **Process Isolation**: Service runs independently, won't crash agent
-   - **Consistent Output**: Always same deterministic rebuild process
-   - **Proper Logging**: systemd journal handles all output management
-   - **Resource Management**: systemd manages cleanup and resource limits
-   - **Status Tracking**: Can query service status (running/failed/success)
-
-**Next Priority**: Implement systemd service approach for reliable rebuild operations.
-
-**Keyboard Controls Status:**
- **Services Panel**: 
-  - R (restart) ✅ Working
-  - s (start) ✅ Working  
-  - S (stop) ✅ Working
- **System Panel**: R (nixos-rebuild) ✅ Working with --option sandbox false
- **Backup Panel**: B (trigger backup) ❓ Not implemented
-
-**Visual Feedback Implementation - IN PROGRESS:**
-
-Context-appropriate progress indicators for each panel:
-
-**Services Panel** (Service status transitions):
-```
-● nginx          active    →  ⏳ nginx      restarting  →  ● nginx          active
-● docker         active    →  ⏳ docker     stopping    →  ● docker         inactive  
-```
-
-**System Panel** (Build progress in NixOS section):
-```
-NixOS:
-Build: 25.05.20251004.3bcc93c    →    Build: [████████████     ] 65%
-Active users: cm, simon               Active users: cm, simon
-```
-
-**Backup Panel** (OnGoing status with progress):
-```
-Latest backup:              →    Latest backup:
-● 2024-10-23 14:32:15            ● OnGoing  
-└─ Duration: 1.3m                 └─ [██████       ] 60%
-```
-
-**Critical Configuration Hash Fix - HIGH PRIORITY:**
-
-**Problem:** Configuration hash currently shows git commit hash instead of actual deployed system hash.
-
-**Current (incorrect):** 
- Shows git hash: `db11f82` (source repository commit)
- Not accurate - doesn't reflect what's actually deployed
-
-**Target (correct):**
- Show nix store hash: `d8ivwiar` (first 8 chars from deployed system)  
- Source: `/nix/store/d8ivwiarhwhgqzskj6q2482r58z46qjf-nixos-system-cmbox-25.05.20251004.3bcc93c`
- Pattern: Extract hash from `/nix/store/HASH-nixos-system-HOSTNAME-VERSION`
-
-**Benefits:**
-1. **Deployment Verification:** Confirms rebuild actually succeeded
-2. **Accurate Status:** Shows what's truly running, not just source
-3. **Rebuild Completion Detection:** Hash change = rebuild completed
-4. **Rollback Tracking:** Each deployment has unique identifier
-
-**Implementation Required:**
-1. Agent extracts nix store hash from `ls -la /run/current-system` 
-2. Reports this as `system_config_hash` metric instead of git hash
-3. Dashboard displays first 8 characters: `Config: d8ivwiar`
-
-**Next Session Priority Tasks:**
-
-**Remaining Features:**
-1. **Fix Configuration Hash Display (CRITICAL)**:
-   - Use nix store hash instead of git commit hash
-   - Extract from `/run/current-system` -> `/nix/store/HASH-nixos-system-*`
-   - Enables proper rebuild completion detection
-
-2. **Command Response Protocol**:
-   - Agent sends command completion/failure back to dashboard via ZMQ
-   - Dashboard updates UI status from ⏳ to ● when commands complete
-   - Clear success/failure status after timeout
-
-3. **Backup Panel Features**:
-   - Implement backup trigger functionality (B key)
-   - Complete visual feedback for backup operations
-   - Add backup progress indicators
-
-**Enhancement Tasks:**
- Add confirmation dialogs for destructive actions (stop/restart/rebuild)
- Implement command history/logging
- Add keyboard shortcuts help overlay
-
-**Future Enhanced Navigation:**
- Add Page Up/Down for faster scrolling through long service lists
- Implement search/filter functionality for services
- Add jump-to-service shortcuts (first letter navigation)
-
-**Future Advanced Features:**
- Service dependency visualization
- Historical service status tracking
- Real-time log viewing integration
-
-## Core Architecture Principles - CRITICAL
+## Core Architecture Principles

 ### Individual Metrics Philosophy
-
-**NEW ARCHITECTURE**: Agent collects individual metrics, dashboard composes widgets from those metrics.
+- Agent collects individual metrics, dashboard composes widgets
+- Each metric collected, transmitted, and stored individually
+- Agent calculates status for each metric using thresholds
+- Dashboard aggregates individual metric statuses for widget status

 ### Maintenance Mode
-
-**Purpose:**
-
- Suppress email notifications during planned maintenance or backups
- Prevents false alerts when services are intentionally stopped
-
-**Implementation:**
-
 - Agent checks for `/tmp/cm-maintenance` file before sending notifications
 - File presence suppresses all email notifications while continuing monitoring
 - Dashboard continues to show real status, only notifications are blocked

-**Usage:**
-
+Usage:
 ```bash
 # Enable maintenance mode
 touch /tmp/cm-maintenance

-# Run maintenance tasks (backups, service restarts, etc.)
+# Run maintenance tasks
 systemctl stop service
 # ... maintenance work ...
 systemctl start service
@@ -281,61 +84,84 @@ systemctl start service
 rm /tmp/cm-maintenance
 ```

-**NixOS Integration:**
+## Development and Deployment Architecture

- Borgbackup script automatically creates/removes maintenance file
- Automatic cleanup via trap ensures maintenance mode doesn't stick
- All cinfiguration are shall be done from nixos config
+### Development Path
+- **Location:** `~/projects/cm-dashboard` 
+- **Purpose:** Development workflow only - for committing new code
+- **Access:** Only for developers to commit changes

-**ARCHITECTURE ENFORCEMENT**:
+### Deployment Path  
+- **Location:** `/var/lib/cm-dashboard/nixos-config`
+- **Purpose:** Production deployment only - agent clones/pulls from git
+- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild

- **ZERO legacy code reuse** - Fresh implementation following ARCHITECT.md exactly
- **Individual metrics only** - NO grouped metric structures
- **Reference-only legacy** - Study old functionality, implement new architecture
- **Clean slate mindset** - Build as if legacy codebase never existed
+### Git Flow
+```
+Development: ~/projects/cm-dashboard → git commit → git push
+Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
+```

-**Implementation Rules**:
+## Automated Binary Release System

-1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
-2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
-3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
-4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status
-   **Testing & Building**:
+CM Dashboard uses automated binary releases instead of source builds.

- **Workspace builds**: `cargo build --workspace` for all testing
- **Clean compilation**: Remove `target/` between architecture changes
- **ZMQ testing**: Test agent-dashboard communication independently
- **Widget testing**: Verify UI layout matches legacy appearance exactly
+### Creating New Releases
+```bash
+cd ~/projects/cm-dashboard
+git tag v0.1.X
+git push origin v0.1.X
+```

-**NEVER in New Implementation**:
+This automatically:
+- Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
+- Creates GitHub-style release with tarball
+- Uploads binaries via Gitea API

- Copy/paste ANY code from legacy backup
- Calculate status in dashboard widgets
- Hardcode metric names in widgets (use const arrays)
+### NixOS Configuration Updates
+Edit `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

-# Important Communication Guidelines
+```nix
+version = "v0.1.X";
+src = pkgs.fetchurl {
+  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
+  sha256 = "sha256-NEW_HASH_HERE";
+};
+```

-NEVER write that you have "successfully implemented" something or generate extensive summary text without first verifying with the user that the implementation is correct. This wastes tokens. Keep responses concise.
+### Get Release Hash
+```bash
+cd ~/projects/nixosbox
+nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
+  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
+  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
+}' 2>&1 | grep "got:"
+```

-NEVER implement code without first getting explicit user agreement on the approach. Always ask for confirmation before proceeding with implementation.
+### Building
+
+**Testing & Building:**
+- **Workspace builds**: `nix-shell -p openssl pkg-config --run "cargo build --workspace"`
+- **Clean compilation**: Remove `target/` between major changes
+
+## Important Communication Guidelines
+
+Keep responses concise and focused. Avoid extensive implementation summaries unless requested.

 ## Commit Message Guidelines

 **NEVER mention:**
-
 - Claude or any AI assistant names
 - Automation or AI-generated content
 - Any reference to automated code generation

 **ALWAYS:**
-
 - Focus purely on technical changes and their purpose
 - Use standard software development commit message format
 - Describe what was changed and why, not how it was created
 - Write from the perspective of a human developer

 **Examples:**
-
 - ❌ "Generated with Claude Code"
 - ❌ "AI-assisted implementation"
 - ❌ "Automated refactoring"
@@ -343,83 +169,22 @@ NEVER implement code without first getting explicit user agreement on the approa
 - ✅ "Restructure storage widget with improved layout"
 - ✅ "Update CPU thresholds to production values"

-## Development and Deployment Architecture
+## Implementation Rules

-**CRITICAL:** Development and deployment paths are completely separate:
+1. **Individual Metrics**: Each metric is collected, transmitted, and stored individually
+2. **Agent Status Authority**: Agent calculates status for each metric using thresholds
+3. **Dashboard Composition**: Dashboard widgets subscribe to specific metrics by name
+4. **Status Aggregation**: Dashboard aggregates individual metric statuses for widget status

-### Development Path
- **Location:** `~/projects/nixosbox` 
- **Purpose:** Development workflow only - for committing new cm-dashboard code
- **Access:** Only for developers to commit changes
- **Code Access:** Running cm-dashboard code shall NEVER access this path
+**NEVER:**
+- Copy/paste ANY code from legacy implementations
+- Calculate status in dashboard widgets
+- Hardcode metric names in widgets (use const arrays)
+- Create files unless absolutely necessary for achieving goals
+- Create documentation files unless explicitly requested

-### Deployment Path  
- **Location:** `/var/lib/cm-dashboard/nixos-config`
- **Purpose:** Production deployment only - agent clones/pulls from git
- **Access:** Only cm-dashboard agent for deployment operations
- **Workflow:** git pull → `/var/lib/cm-dashboard/nixos-config` → nixos-rebuild
-
-### Git Flow
-```
-Development: ~/projects/nixosbox → git commit → git push
-Deployment:  git pull → /var/lib/cm-dashboard/nixos-config → rebuild
-```
-
-## Automated Binary Release System
-
-**IMPLEMENTED:** cm-dashboard now uses automated binary releases instead of source builds.
-
-### Release Workflow
-
-1. **Automated Release Creation**
-   - Gitea Actions workflow builds static binaries on tag push
-   - Creates release with `cm-dashboard-linux-x86_64.tar.gz` tarball
-   - No manual intervention required for binary generation
-
-2. **Creating New Releases**
-   ```bash
-   cd ~/projects/cm-dashboard
-   git tag v0.1.X
-   git push origin v0.1.X
-   ```
-   
-   This automatically:
-   - Builds static binaries with `RUSTFLAGS="-C target-feature=+crt-static"`
-   - Creates GitHub-style release with tarball
-   - Uploads binaries via Gitea API
-
-3. **NixOS Configuration Updates**
-   Edit `~/projects/nixosbox/hosts/common/cm-dashboard.nix`:
-
-   ```nix
-   version = "v0.1.X";
-   src = pkgs.fetchurl {
-     url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
-     sha256 = "sha256-NEW_HASH_HERE";
-   };
-   ```
-
-4. **Get Release Hash**
-   ```bash
-   cd ~/projects/nixosbox
-   nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
-     url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/v0.1.X/cm-dashboard-linux-x86_64.tar.gz";
-     sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
-   }' 2>&1 | grep "got:"
-   ```
-
-5. **Commit and Deploy**
-   ```bash
-   cd ~/projects/nixosbox
-   git add hosts/common/cm-dashboard.nix
-   git commit -m "Update cm-dashboard to v0.1.X with static binaries"
-   git push
-   ```
-
-### Benefits
-
- **No compilation overhead** on each host
- **Consistent static binaries** across all hosts
- **Faster deployments** - download vs compile
- **No library dependency issues** - static linking
- **Automated pipeline** - tag push triggers everything
+**ALWAYS:**
+- Prefer editing existing files to creating new ones
+- Follow existing code conventions and patterns
+- Use existing libraries and utilities
+- Follow security best practices
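
The updated CLAUDE.md states that the agent calculates a status per metric, that user-stopped services report OK instead of Warning, and that the dashboard aggregates individual statuses per widget. A minimal sketch of that logic could look like the following. All type and function names here are hypothetical stand-ins (the project's own enum is written `Status::OK`), mapping `Failed` to a critical level is an assumption, and the JSON persistence mentioned in the docs is left out.

```rust
use std::collections::HashSet;

/// Hypothetical status levels, ordered from least to most severe.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Status {
    Ok,
    Warning,
    Critical,
}

/// Simplified systemd unit states as shown in the dashboard.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServiceState {
    Active,
    Inactive,
    Failed,
}

/// In-memory set of services stopped through the dashboard.
/// The real agent is described as persisting this across restarts.
#[derive(Default)]
pub struct UserStoppedTracker {
    stopped: HashSet<String>,
}

impl UserStoppedTracker {
    /// Record that the user stopped this service on purpose.
    pub fn mark_stopped(&mut self, service: &str) {
        self.stopped.insert(service.to_string());
    }

    /// Clear the flag, for example when the service is started again.
    pub fn clear(&mut self, service: &str) {
        self.stopped.remove(service);
    }

    pub fn is_user_stopped(&self, service: &str) -> bool {
        self.stopped.contains(service)
    }
}

/// Agent-side status for one service.
pub fn service_status(state: ServiceState, user_stopped: bool) -> Status {
    match state {
        ServiceState::Active => Status::Ok,
        // An intentionally stopped service should not raise an alert.
        ServiceState::Inactive if user_stopped => Status::Ok,
        ServiceState::Inactive => Status::Warning,
        // Assumption: a failed unit is the most severe level.
        ServiceState::Failed => Status::Critical,
    }
}

/// Dashboard-side aggregation: the worst individual status wins.
/// An empty slice is treated as Ok (assumption).
pub fn aggregate(statuses: &[Status]) -> Status {
    statuses.iter().copied().max().unwrap_or(Status::Ok)
}
```

Keeping `service_status` on the agent and `aggregate` on the dashboard follows the rule listed above that widgets must not calculate status themselves.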
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -270,7 +270,7 @@ checksum = "a1d728cc89cf3aee9ff92b05e62b19ee65a02b5702cff7d5a377e32c6ae29d8d"

 [[package]]
 name = "cm-dashboard"
-version = "0.1.37"
+version = "0.1.73"
 dependencies = [
 "anyhow",
 "chrono",
@@ -286,12 +286,13 @@ dependencies = [
 "toml",
 "tracing",
 "tracing-subscriber",
+ "wake-on-lan",
 "zmq",
 ]

 [[package]]
 name = "cm-dashboard-agent"
-version = "0.1.37"
+version = "0.1.73"
 dependencies = [
 "anyhow",
 "async-trait",
@@ -314,7 +315,7 @@ dependencies = [

 [[package]]
 name = "cm-dashboard-shared"
-version = "0.1.37"
+version = "0.1.73"
 dependencies = [
 "chrono",
 "serde",
@@ -2064,6 +2065,12 @@ version = "0.9.5"
 source = "registry+https://github.com/rust-lang/crates.io-index"
 checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a"

+[[package]]
+name = "wake-on-lan"
+version = "0.2.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ccf60b60ad7e5b1b37372c5134cbcab4db0706c231d212e0c643a077462bc8f"
+
 [[package]]
 name = "walkdir"
 version = "2.5.0"
--- a/README.md
+++ b/README.md
@@ -1,88 +1,108 @@
 # CM Dashboard

-A real-time infrastructure monitoring system with intelligent status aggregation and email notifications, built with Rust and ZMQ.
+A high-performance Rust-based TUI dashboard for monitoring CMTEC infrastructure. Built with ZMQ-based metric collection and individual metrics architecture.

-## Current Implementation
+## Features

-This is a complete rewrite implementing an **individual metrics architecture** where:
+### Core Monitoring
+- **Real-time metrics**: CPU, RAM, Storage, and Service status
+- **Multi-host support**: Monitor multiple servers from single dashboard  
+- **Service management**: Start/stop services with intelligent status tracking
+- **NixOS integration**: System rebuild via SSH + tmux popup
+- **Backup monitoring**: Borgbackup status and scheduling
+- **Email notifications**: Intelligent batching prevents spam

- **Agent** collects individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`) and calculates status
- **Dashboard** subscribes to specific metrics and composes widgets
- **Status Aggregation** provides intelligent email notifications with batching
- **Persistent Cache** prevents false notifications on restart
+### User-Stopped Service Tracking
+Services stopped via the dashboard are intelligently tracked to prevent false alerts:

-## Dashboard Interface
+- **Smart status reporting**: User-stopped services show as Status::OK instead of Warning
+- **Persistent storage**: Tracking survives agent restarts via JSON storage
+- **Automatic management**: Flags cleared when services restarted via dashboard
+- **Maintenance friendly**: No false alerts during intentional service operations
+
+## Architecture
+
+### Individual Metrics Philosophy
+- **Agent**: Collects individual metrics, calculates status using thresholds
+- **Dashboard**: Subscribes to specific metrics, composes widgets from individual data
+- **ZMQ Communication**: Efficient real-time metric transmission
+- **Status Aggregation**: Host-level status calculated from all service metrics
+
+### Components
+
+```
+┌─────────────────┐    ZMQ     ┌─────────────────┐
+│                 │◄──────────►│                 │
+│   Agent         │  Metrics   │   Dashboard     │
+│   - Collectors  │            │   - TUI         │
+│   - Status      │            │   - Widgets     │
+│   - Tracking    │            │   - Commands    │
+│                 │            │                 │
+└─────────────────┘            └─────────────────┘
+         │                              │
+         ▼                              ▼
+┌─────────────────┐            ┌─────────────────┐
+│ JSON Storage    │            │ SSH + tmux      │
+│ - User-stopped  │            │ - Remote rebuild│
+│ - Cache         │            │ - Process       │
+│ - State         │            │   isolation     │
+└─────────────────┘            └─────────────────┘
+```
+
+### Service Control Flow
+
+1. **User Action**: Dashboard sends `UserStart`/`UserStop` commands
+2. **Agent Processing**: 
+   - Marks service as user-stopped (if stopping)
+   - Executes `systemctl start/stop service`
+   - Syncs state to global tracker
+3. **Status Calculation**: 
+   - Systemd collector checks user-stopped flag
+   - Reports Status::OK for user-stopped inactive services
+   - Normal Warning status for system failures
+
+## Interface

 ```
 cm-dashboard • ● cmbox ● srv01 ● srv02 ● steambox
 ┌system──────────────────────────────┐┌services─────────────────────────────────────────┐
-│CPU:                                ││Service:                  Status:  RAM:   Disk:  │
-│● Load: 0.10 0.52 0.88 • 400.0 MHz  ││● docker                  active   27M    496MB  │
-│RAM:                                ││● docker-registry         active   19M    496MB  │
-│● Used: 30% 2.3GB/7.6GB             ││● gitea                   active   579M   2.6GB  │
-│● tmp: 0.0% 0B/2.0GB                ││● gitea-runner-default    active   11M    2.6GB  │
-│Disk nvme0n1:                       ││● haasp-core              active   9M     1MB    │
-│● Health: PASSED                    ││● haasp-mqtt              active   3M     1MB    │
-│● Usage @root: 8.3% • 75.4/906.2 GB ││● haasp-webgrid           active   10M    1MB    │
-│● Usage @boot: 5.9% • 0.1/1.0 GB    ││● immich-server           active   240M   45.1GB │
-│                                    ││● mosquitto               active   1M     1MB    │
-│                                    ││● mysql                   active   38M    225MB  │
-│                                    ││● nginx                   active   28M    24MB   │
-│                                    ││  ├─ ● gitea.cmtec.se     51ms                   │
-│                                    ││  ├─ ● haasp.cmtec.se     43ms                   │
-│                                    ││  ├─ ● haasp.net          43ms                   │
-│                                    ││  ├─ ● pages.cmtec.se     45ms                   │
-└────────────────────────────────────┘│  ├─ ● photos.cmtec.se    41ms                   │
-┌backup──────────────────────────────┐│  ├─ ● unifi.cmtec.se     46ms                   │
-│Latest backup:                      ││  ├─ ● vault.cmtec.se     47ms                   │
-│● Status: OK                        ││  ├─ ● www.kryddorten.se  81ms                   │
-│Duration: 54s • Last: 4h ago        ││  ├─ ● www.mariehall2.se  86ms                   │
-│Disk usage: 48.2GB/915.8GB          ││● postgresql              active   112M   357MB  │
-│P/N: Samsung SSD 870 QVO 1TB        ││● redis-immich            active   8M     45.1GB │
-│S/N: S5RRNF0W800639Y                ││● sshd                    active   2M     0      │
-│● gitea 2 archives 2.7GB            ││● unifi                   active   594M   495MB  │
-│● immich 2 archives 45.0GB          ││● vaultwarden             active   12M    1MB    │
-│● kryddorten 2 archives 67.6MB      ││                                                 │
-│● mariehall2 2 archives 321.8MB     ││                                                 │
-│● nixosbox 2 archives 4.5MB         ││                                                 │
-│● unifi 2 archives 2.9MB            ││                                                 │
-│● vaultwarden 2 archives 305kB      ││                                                 │
+│NixOS:                              ││Service:                  Status:  RAM:   Disk:  │
+│Build: 25.05.20251004.3bcc93c       ││● docker                  active   27M    496MB  │
+│Agent: v0.1.43                      ││● gitea                   active   579M   2.6GB  │
+│Active users: cm, simon             ││● nginx                   active   28M    24MB   │
+│CPU:                                ││  ├─ ● gitea.cmtec.se     51ms                   │
+│● Load: 0.10 0.52 0.88 • 3000MHz    ││  ├─ ● photos.cmtec.se    41ms                   │
+│RAM:                                ││● postgresql              active   112M   357MB  │
+│● Usage: 33% 2.6GB/7.6GB            ││● redis-immich            user-stopped           │
+│● /tmp: 0% 0B/2.0GB                 ││● sshd                    active   2M     0      │
+│Storage:                            ││● unifi                   active   594M   495MB  │
+│● root (Single):                    ││                                                 │
+│ ├─ ● nvme0n1 W: 1%                 ││                                                 │
+│ └─ ● 18% 167.4GB/928.2GB           ││                                                 │
 └────────────────────────────────────┘└─────────────────────────────────────────────────┘
 ```

-**Navigation**: `←→` switch hosts, `r` refresh, `q` quit
+### Navigation
+- **Tab**: Switch between hosts
+- **↑↓ or j/k**: Navigate services
+- **s**: Start selected service (UserStart)  
+- **S**: Stop selected service (UserStop)
+- **J**: Show service logs (journalctl in tmux popup)
+- **L**: Show custom log files (tail -f custom paths in tmux popup)
+- **R**: Rebuild current host
+- **B**: Run backup on current host
+- **q**: Quit

-## Features
-
- **Real-time monitoring** - Dashboard updates every 1-2 seconds
- **Individual metric collection** - Granular data for flexible dashboard composition
- **Intelligent status aggregation** - Host-level status calculated from all services
- **Smart email notifications** - Batched, detailed alerts with service groupings
- **Persistent state** - Prevents false notifications on restarts
- **ZMQ communication** - Efficient agent-to-dashboard messaging
- **Clean TUI** - Terminal-based dashboard with color-coded status indicators
-
-## Architecture
-
-### Core Components
-
- **Agent** (`cm-dashboard-agent`) - Collects metrics and sends via ZMQ
- **Dashboard** (`cm-dashboard`) - Real-time TUI display consuming metrics
- **Shared** (`cm-dashboard-shared`) - Common types and protocol
- **Status Aggregation** - Intelligent batching and notification management
- **Persistent Cache** - Maintains state across restarts
-
-### Status Levels
-
- **🟢 Ok** - Service running normally
- **🔵 Pending** - Service starting/stopping/reloading
- **🟡 Warning** - Service issues (high load, memory, disk usage)
- **🔴 Critical** - Service failed or critical thresholds exceeded
- **❓ Unknown** - Service state cannot be determined
+### Status Indicators
+- **Green ●**: Active service
+- **Yellow ◐**: Inactive service (system issue)
+- **Red ◯**: Failed service
+- **Blue arrows**: Service transitioning (↑ starting, ↓ stopping, ↻ restarting)
+- **"user-stopped"**: Service stopped via dashboard (reported as `Status::Ok`)

 ## Quick Start

-### Build
+### Building

 ```bash
 # With Nix (recommended)
@@ -93,21 +113,20 @@ sudo apt install libssl-dev pkg-config  # Ubuntu/Debian
 cargo build --workspace
 ```

-### Run
+### Running

 ```bash
-# Start agent (requires configuration file)
+# Start agent (requires configuration)
 ./target/debug/cm-dashboard-agent --config /etc/cm-dashboard/agent.toml

-# Start dashboard
-./target/debug/cm-dashboard --config /path/to/dashboard.toml
+# Start dashboard (inside tmux session)
+tmux
+./target/debug/cm-dashboard --config /etc/cm-dashboard/dashboard.toml
 ```

 ## Configuration

-### Agent Configuration (`agent.toml`)
-
-The agent requires a comprehensive TOML configuration file:
+### Agent Configuration

 ```toml
 collection_interval_seconds = 2
@@ -116,50 +135,27 @@ collection_interval_seconds = 2
 publisher_port = 6130
 command_port = 6131
 bind_address = "0.0.0.0"
-timeout_ms = 5000
-heartbeat_interval_ms = 30000
+transmission_interval_seconds = 2

 [collectors.cpu]
 enabled = true
 interval_seconds = 2
-load_warning_threshold = 9.0
+load_warning_threshold = 5.0
 load_critical_threshold = 10.0
-temperature_warning_threshold = 100.0
-temperature_critical_threshold = 110.0

 [collectors.memory]
 enabled = true
 interval_seconds = 2
 usage_warning_percent = 80.0
-usage_critical_percent = 95.0
-
-[collectors.disk]
-enabled = true
-interval_seconds = 300
-usage_warning_percent = 80.0
 usage_critical_percent = 90.0

-[[collectors.disk.filesystems]]
-name = "root"
-uuid = "4cade5ce-85a5-4a03-83c8-dfd1d3888d79"
-mount_point = "/"
-fs_type = "ext4"
-monitor = true
-
 [collectors.systemd]
 enabled = true
 interval_seconds = 10
-memory_warning_mb = 1000.0
-memory_critical_mb = 2000.0
-service_name_filters = [
-  "nginx*", "postgresql*", "redis*", "docker*", "sshd*", 
-  "gitea*", "immich*", "haasp*", "mosquitto*", "mysql*", 
-  "unifi*", "vaultwarden*"
-]
-excluded_services = [
-  "nginx-config-reload", "sshd-keygen", "systemd-", 
-  "getty@", "user@", "dbus-", "NetworkManager-"
-]
+service_name_filters = ["nginx*", "postgresql*", "docker*", "sshd*"]
+excluded_services = ["nginx-config-reload", "systemd-", "getty@"]
+nginx_latency_critical_ms = 1000.0
+http_timeout_seconds = 10

 [notifications]
 enabled = true
@@ -167,251 +163,203 @@ smtp_host = "localhost"
 smtp_port = 25
 from_email = "{hostname}@example.com"
 to_email = "admin@example.com"
-rate_limit_minutes = 0
-trigger_on_warnings = true
-trigger_on_failures = true
-recovery_requires_all_ok = true
-suppress_individual_recoveries = true
-
-[status_aggregation]
-enabled = true
-aggregation_method = "worst_case"
-notification_interval_seconds = 30
-
-[cache]
-persist_path = "/var/lib/cm-dashboard/cache.json"
+aggregation_interval_seconds = 30
 ```

-### Dashboard Configuration (`dashboard.toml`)
+### Dashboard Configuration

 ```toml
 [zmq]
-hosts = [
-  { name = "server1", address = "192.168.1.100", port = 6130 },
-  { name = "server2", address = "192.168.1.101", port = 6130 }
-]
-connection_timeout_ms = 5000
-reconnect_interval_ms = 10000
+subscriber_ports = [6130]

-[ui]
-refresh_interval_ms = 1000
-theme = "dark"
+[hosts]
+predefined_hosts = ["cmbox", "srv01", "srv02"]
+
+[ssh]
+rebuild_user = "cm"
+rebuild_alias = "nixos-rebuild-cmtec"
+backup_alias = "cm-backup-run"
 ```

-## Collectors
+## Technical Implementation

-The agent implements several specialized collectors:
+### Collectors

-### CPU Collector (`cpu.rs`)
+#### Systemd Collector
+- **Service Discovery**: Uses `systemctl list-unit-files` + `list-units --all`
+- **Status Calculation**: Checks user-stopped flag before assigning Warning status
+- **Memory Tracking**: Per-service memory usage via `systemctl show`
+- **Sub-services**: Nginx site latency, Docker containers
+- **User-stopped Integration**: `UserStoppedServiceTracker::is_service_user_stopped()`

- Load average (1, 5, 15 minute)
- CPU temperature monitoring
- Real-time process monitoring (top CPU consumers)
- Status calculation with configurable thresholds
+#### User-Stopped Service Tracker
+- **Storage**: `/var/lib/cm-dashboard/user-stopped-services.json`
+- **Thread Safety**: Global singleton with `Arc<Mutex<>>`
+- **Persistence**: Automatic save on state changes
+- **Global Access**: Static methods for collector integration

-### Memory Collector (`memory.rs`)
+#### Other Collectors
+- **CPU**: Load average, temperature, frequency monitoring
+- **Memory**: RAM/swap usage, tmpfs monitoring  
+- **Disk**: Filesystem usage, SMART health data
+- **NixOS**: Build version, active users, agent version
+- **Backup**: Borgbackup repository status and metrics

- RAM usage (total, used, available)
- Swap monitoring
- Real-time process monitoring (top RAM consumers)
- Memory pressure detection
+### ZMQ Protocol

-### Disk Collector (`disk.rs`)
+```rust
+// Metric Message
+#[derive(Serialize, Deserialize)]
+pub struct MetricMessage {
+    pub hostname: String,
+    pub timestamp: u64,
+    pub metrics: Vec<Metric>,
+}

- Filesystem usage per mount point
- SMART health monitoring
- Temperature and wear tracking
- Configurable filesystem monitoring
+// Service Commands
+pub enum AgentCommand {
+    ServiceControl {
+        service_name: String,
+        action: ServiceAction,
+    },
+    SystemRebuild { /* SSH config */ },
+    CollectNow,
+}

-### Systemd Collector (`systemd.rs`)
+pub enum ServiceAction {
+    Start,           // System-initiated
+    Stop,            // System-initiated  
+    UserStart,       // User via dashboard (clears user-stopped)
+    UserStop,        // User via dashboard (marks user-stopped)
+    Status,
+}
+```

- Service status monitoring (`active`, `inactive`, `failed`)
- Memory usage per service
- Service filtering and exclusions
- Handles transitional states (`Status::Pending`)
+### Maintenance Mode

-### Backup Collector (`backup.rs`)
+Suppress notifications during planned maintenance:

- Reads TOML status files from backup systems
- Archive age verification
- Disk usage tracking
- Repository health monitoring
+```bash
+# Enable maintenance mode
+touch /tmp/cm-maintenance
+
+# Perform maintenance
+systemctl stop service
+# ... work ...
+systemctl start service  
+
+# Disable maintenance mode
+rm /tmp/cm-maintenance
+```

 ## Email Notifications

 ### Intelligent Batching
+- **Real-time dashboard**: Immediate status updates
+- **Batched emails**: Aggregated every 30 seconds
+- **Smart grouping**: Services organized by severity
+- **Recovery suppression**: Reduces notification spam

-The system implements smart notification batching to prevent email spam:
-
- **Real-time dashboard updates** - Status changes appear immediately
- **Batched email notifications** - Aggregated every 30 seconds
- **Detailed groupings** - Services organized by severity
-
-### Example Alert Email
-
+### Example Alert
 ```
-Subject: Status Alert: 2 critical, 1 warning, 15 started
+Subject: Status Alert: 1 critical, 2 warnings, 0 recoveries

 Status Summary (30s duration)
 Host Status: Ok → Warning

-🔴 CRITICAL ISSUES (2):
-  postgresql: Ok → Critical
-  nginx: Warning → Critical
+🔴 CRITICAL ISSUES (1):
+  postgresql: Ok → Critical (memory usage 95%)

-🟡 WARNINGS (1):
-  redis: Ok → Warning (memory usage 85%)
+🟡 WARNINGS (2):
+  nginx: Ok → Warning (high load 8.5)
+  redis: user-stopped → Warning (restarted by system)

 ✅ RECOVERIES (0):

-🟢 SERVICE STARTUPS (15):
-  docker: Unknown → Ok
-  sshd: Unknown → Ok
-  ...
-
 --
-CM Dashboard Agent
-Generated at 2025-10-21 19:42:42 CET
+CM Dashboard Agent v0.1.43
 ```

-## Individual Metrics Architecture
-
-The system follows a **metrics-first architecture**:
-
-### Agent Side
-
-```rust
-// Agent collects individual metrics
-vec![
-    Metric::new("cpu_load_1min".to_string(), MetricValue::Float(2.5), Status::Ok),
-    Metric::new("memory_usage_percent".to_string(), MetricValue::Float(78.5), Status::Warning),
-    Metric::new("service_nginx_status".to_string(), MetricValue::String("active".to_string()), Status::Ok),
-]
-```
-
-### Dashboard Side
-
-```rust
-// Widgets subscribe to specific metrics
-impl Widget for CpuWidget {
-    fn update_from_metrics(&mut self, metrics: &[&Metric]) {
-        for metric in metrics {
-            match metric.name.as_str() {
-                "cpu_load_1min" => self.load_1min = metric.value.as_f32(),
-                "cpu_load_5min" => self.load_5min = metric.value.as_f32(),
-                "cpu_temperature_celsius" => self.temperature = metric.value.as_f32(),
-                _ => {}
-            }
-        }
-    }
-}
-```
-
-## Persistent Cache
-
-The cache system prevents false notifications:
-
- **Automatic saving** - Saves when service status changes
- **Persistent storage** - Maintains state across agent restarts
- **Simple design** - No complex TTL or cleanup logic
- **Status preservation** - Prevents duplicate notifications
-
 ## Development

 ### Project Structure
-
 ```
 cm-dashboard/
-├── agent/                  # Metrics collection agent
+├── agent/                     # Metrics collection agent
 │   ├── src/
-│   │   ├── collectors/     # CPU, memory, disk, systemd, backup
-│   │   ├── status/         # Status aggregation and notifications
-│   │   ├── cache/          # Persistent metric caching
-│   │   ├── config/         # TOML configuration loading
-│   │   └── notifications/  # Email notification system
-├── dashboard/              # TUI dashboard application
+│   │   ├── collectors/        # CPU, memory, disk, systemd, backup, nixos
+│   │   ├── service_tracker.rs # User-stopped service tracking
+│   │   ├── status/            # Status aggregation and notifications
+│   │   ├── config/            # TOML configuration loading
+│   │   └── communication/     # ZMQ message handling
+├── dashboard/                 # TUI dashboard application  
 │   ├── src/
-│   │   ├── ui/widgets/     # CPU, memory, services, backup widgets
-│   │   ├── metrics/        # Metric storage and filtering
-│   │   └── communication/  # ZMQ metric consumption
-├── shared/                 # Shared types and utilities
+│   │   ├── ui/widgets/        # CPU, memory, services, backup, system
+│   │   ├── communication/     # ZMQ consumption and commands
+│   │   └── app.rs             # Main application loop
+├── shared/                    # Shared types and utilities
 │   └── src/
-│       ├── metrics.rs      # Metric, Status, and Value types
-│       ├── protocol.rs     # ZMQ message format
-│       └── cache.rs        # Cache configuration
-└── README.md              # This file
+│       ├── metrics.rs         # Metric, Status, StatusTracker types
+│       ├── protocol.rs        # ZMQ message format
+│       └── cache.rs           # Cache configuration
+└── CLAUDE.md                  # Development guidelines and rules
 ```

-### Building
-
+### Testing
 ```bash
-# Debug build
-cargo build --workspace
+# Build and test
+nix-shell -p openssl pkg-config --run "cargo build --workspace"
+nix-shell -p openssl pkg-config --run "cargo test --workspace"

-# Release build
-cargo build --workspace --release
-
-# Run tests
-cargo test --workspace
-
-# Check code formatting
-cargo fmt --all -- --check
-
-# Run clippy linter
+# Code quality
+cargo fmt --all
 cargo clippy --workspace -- -D warnings
 ```

-### Dependencies
+## Deployment

- **tokio** - Async runtime
- **zmq** - Message passing between agent and dashboard
- **ratatui** - Terminal user interface
- **serde** - Serialization for metrics and config
- **anyhow/thiserror** - Error handling
- **tracing** - Structured logging
- **lettre** - SMTP email notifications
- **clap** - Command-line argument parsing
- **toml** - Configuration file parsing
+### Automated Binary Releases
+```bash
+# Create new release
+cd ~/projects/cm-dashboard
+git tag v0.1.X
+git push origin v0.1.X
+```

-## NixOS Integration
+This triggers an automated workflow that performs:
+- Static binary compilation with `RUSTFLAGS="-C target-feature=+crt-static"`
+- GitHub-style release creation
+- Tarball upload to Gitea

-This project is designed for declarative deployment via NixOS:
-
-### Configuration Generation
-
-The NixOS module automatically generates the agent configuration:
+### NixOS Integration
+Update `~/projects/nixosbox/hosts/services/cm-dashboard.nix`:

 ```nix
-# hosts/common/cm-dashboard.nix
-services.cm-dashboard-agent = {
-  enable = true;
-  port = 6130;
+version = "v0.1.43";
+src = pkgs.fetchurl {
+  url = "https://gitea.cmtec.se/cm/cm-dashboard/releases/download/${version}/cm-dashboard-linux-x86_64.tar.gz";
+  sha256 = "sha256-HASH";
 };
 ```

-### Deployment
-
+Get hash via:
 ```bash
-# Update NixOS configuration
-git add hosts/common/cm-dashboard.nix
-git commit -m "Update cm-dashboard configuration"
-git push
-
-# Rebuild system (user-performed)
-sudo nixos-rebuild switch --flake .
+cd ~/projects/nixosbox
+nix-build --no-out-link -E 'with import <nixpkgs> {}; fetchurl {
+  url = "URL_HERE";
+  sha256 = "sha256-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=";
+}' 2>&1 | grep "got:"
 ```

 ## Monitoring Intervals

- **CPU/Memory**: 2 seconds (real-time monitoring)
- **Disk usage**: 300 seconds (5 minutes)
- **Systemd services**: 10 seconds
- **SMART health**: 600 seconds (10 minutes)
- **Backup status**: 60 seconds (1 minute)
- **Email notifications**: 30 seconds (batched)
- **Dashboard updates**: 1 second (real-time display)
+- **Metrics Collection**: 2 seconds (CPU, memory, services)
+- **Metric Transmission**: 2 seconds (ZMQ publish)
+- **Dashboard Updates**: 1 second (UI refresh)
+- **Email Notifications**: 30 seconds (batched)
+- **Disk Monitoring**: 300 seconds (5 minutes)
+- **Service Discovery**: 300 seconds (5 minutes cache)

 ## License

-MIT License - see LICENSE file for details
-
+MIT License - see LICENSE file for details.
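Reviewer sketch for the README's "User-Stopped Service Tracker" and "Status Calculation" bullets: a minimal, in-memory illustration of how a process-wide tracker can feed the status decision. Everything here is an assumption made for illustration: the function names (`mark_user_stopped`, `clear_user_stopped`, `is_user_stopped`, `service_status`), the local `Status` enum standing in for the shared type, and the use of `OnceLock` in place of the `Arc<Mutex<>>` singleton the README mentions. JSON persistence to `/var/lib/cm-dashboard/user-stopped-services.json` is deliberately left out.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Simplified stand-in for the shared `Status` type (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Pending,
    Warning,
    Critical,
    Unknown,
}

/// Process-wide set of services the user stopped on purpose (hypothetical).
static USER_STOPPED: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

fn user_stopped() -> &'static Mutex<HashSet<String>> {
    USER_STOPPED.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Record that the user stopped `service` via the dashboard.
pub fn mark_user_stopped(service: &str) {
    user_stopped().lock().unwrap().insert(service.to_string());
}

/// Forget the user-stopped flag, e.g. once the service is active again.
pub fn clear_user_stopped(service: &str) {
    user_stopped().lock().unwrap().remove(service);
}

/// Check whether `service` is currently flagged as user-stopped.
pub fn is_user_stopped(service: &str) -> bool {
    user_stopped().lock().unwrap().contains(service)
}

/// Map a systemd active state to a status, treating user-stopped
/// services as Ok instead of Warning or Pending.
pub fn service_status(service: &str, active_state: &str) -> Status {
    match active_state.to_lowercase().as_str() {
        "active" => Status::Ok,
        "inactive" | "dead" if is_user_stopped(service) => Status::Ok,
        "inactive" | "dead" => Status::Warning,
        "failed" | "error" => Status::Critical,
        "activating" | "deactivating" | "reloading" => {
            if is_user_stopped(service) {
                Status::Ok
            } else {
                Status::Pending
            }
        }
        _ => Status::Unknown,
    }
}
```

With this sketch, calling `mark_user_stopped("redis-immich")` makes an `inactive` state for that service map to `Status::Ok`, while any unflagged service in the same state still maps to `Status::Warning`.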
--- a/TODO.md
+++ b/TODO.md
@@ -1,63 +0,0 @@
-# TODO
-
-## Systemd filtering (agent)
-
- remove user systemd collection
- reduce number of systemctl call
- Cahnge so only services in include list are detected
- Filter on exact name
- Add support for "\*" in filtering
-
-## System panel (agent/dashboard)
-
-use following layout:
-'''
-NixOS:
-Build: xxxxxx
-Agen: xxxxxx
-CPU:
-● Load: 0.02 0.31 0.86
-└─ Freq: 3000MHz
-RAM:
-● Usage: 33% 2.6GB/7.6GB  
- └─ ● /tmp: 0% 0B/2.0GB
-Storage:
-● /:  
- ├─ ● nvme0n1 T: 40C • W: 4%  
- └─ ● 8% 75.0GB/906.2GB
-'''
-
- Add support to show login/active users
- Add support to show timestamp/version for latest nixos rebuild
-
-## Backup panel (dashboard)
-
-use following layout:
-'''
-Latest backup:  
-● <timestamp>
-└─ Duration: 1.3m
-Disk:
-● Samsung SSD 870 QVO 1TB  
- ├─ S/N: S5RRNF0W800639Y
-└─ Usage: 50.5GB/915.8GB
-Repos:
-● gitea (4) 5.1GB  
-● immich (4) 45.0GB  
-● kryddorten (4) 67.8MB  
-● mariehall2 (4) 322.7MB
-● nixosbox (4) 5.5MB  
-● unifi (4) 5.7MB  
-● vaultwarden (4) 508kB
-'''
-
-## Keyboard navigation and scrolling (dashboard)
-
- Add keyboard navigation between panels "Shift-Tab"
- Add lower statusbar with dynamic updated shortcuts when switchng between panels
-
-## Remote execution (agent/dashboard)
-
- Add support for send command via dashboard to agent to do nixos rebuid
- Add support for navigating services in dashboard and trigger start/stop/restart
- Add support for trigger backup
--- a/agent/Cargo.toml
+++ b/agent/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-agent"
-version = "0.1.38"
+version = "0.1.74"
 edition = "2021"

 [dependencies]
--- a/agent/src/agent.rs
+++ b/agent/src/agent.rs
@@ -4,10 +4,11 @@ use std::time::Duration;
 use tokio::time::interval;
 use tracing::{debug, error, info};

-use crate::communication::{AgentCommand, ServiceAction, ZmqHandler};
+use crate::communication::{AgentCommand, ZmqHandler};
 use crate::config::AgentConfig;
 use crate::metrics::MetricCollectionManager;
 use crate::notifications::NotificationManager;
+use crate::service_tracker::UserStoppedServiceTracker;
 use crate::status::HostStatusManager;
 use cm_dashboard_shared::{Metric, MetricMessage, MetricValue, Status};

@@ -18,6 +19,7 @@ pub struct Agent {
    metric_manager: MetricCollectionManager,
    notification_manager: NotificationManager,
    host_status_manager: HostStatusManager,
+    service_tracker: UserStoppedServiceTracker,
 }

 impl Agent {
@@ -50,6 +52,10 @@ impl Agent {
        let host_status_manager = HostStatusManager::new(config.status_aggregation.clone());
        info!("Host status manager initialized");

+        // Initialize user-stopped service tracker
+        let service_tracker = UserStoppedServiceTracker::init_global()?;
+        info!("User-stopped service tracker initialized");
+
        Ok(Self {
            hostname,
            config,
@@ -57,6 +63,7 @@ impl Agent {
            metric_manager,
            notification_manager,
            host_status_manager,
+            service_tracker,
        })
    }

@@ -71,10 +78,11 @@ impl Agent {
            info!("Initial metric collection completed - all data cached and ready");
        }

-        // Separate intervals for collection, transmission, and email notifications
+        // Separate intervals for collection, transmission, heartbeat, and email notifications
        let mut collection_interval =
            interval(Duration::from_secs(self.config.collection_interval_seconds));
        let mut transmission_interval = interval(Duration::from_secs(self.config.zmq.transmission_interval_seconds));
+        let mut heartbeat_interval = interval(Duration::from_secs(self.config.zmq.heartbeat_interval_seconds));
        let mut notification_interval = interval(Duration::from_secs(self.config.notifications.aggregation_interval_seconds));

        loop {
@@ -91,6 +99,12 @@ impl Agent {
                        error!("Failed to broadcast metrics: {}", e);
                    }
                }
+                _ = heartbeat_interval.tick() => {
+                    // Send standalone heartbeat for host connectivity detection
+                    if let Err(e) = self.send_heartbeat().await {
+                        error!("Failed to send heartbeat: {}", e);
+                    }
+                }
                _ = notification_interval.tick() => {
                    // Process batched email notifications (separate from dashboard updates)
                    if let Err(e) = self.host_status_manager.process_pending_notifications(&mut self.notification_manager).await {
@@ -173,6 +187,13 @@ impl Agent {
        let version_metric = self.get_agent_version_metric();
        metrics.push(version_metric);

+        // Add heartbeat metric for host connectivity detection
+        let heartbeat_metric = self.get_heartbeat_metric();
+        metrics.push(heartbeat_metric);
+
+        // Check for user-stopped services that are now active and clear their flags
+        self.clear_user_stopped_flags_for_active_services(&metrics);
+
        if metrics.is_empty() {
            debug!("No metrics to broadcast");
            return Ok(());
@@ -191,6 +212,12 @@ impl Agent {
    async fn process_metrics(&mut self, metrics: &[Metric]) -> bool {
        let mut status_changed = false;
        for metric in metrics {
+            // Filter excluded metrics from email notification processing only
+            if self.config.notifications.exclude_email_metrics.contains(&metric.name) {
+                debug!("Excluding metric '{}' from email notification processing", metric.name);
+                continue;
+            }
+            
            if self.host_status_manager.process_metric(metric, &mut self.notification_manager).await {
                status_changed = true;
            }
@@ -216,6 +243,35 @@ impl Agent {
        format!("v{}", env!("CARGO_PKG_VERSION"))
    }

+    /// Create heartbeat metric for host connectivity detection
+    fn get_heartbeat_metric(&self) -> Metric {
+        use std::time::{SystemTime, UNIX_EPOCH};
+        
+        let timestamp = SystemTime::now()
+            .duration_since(UNIX_EPOCH)
+            .unwrap()
+            .as_secs();
+        
+        Metric::new(
+            "agent_heartbeat".to_string(),
+            MetricValue::Integer(timestamp as i64),
+            Status::Ok,
+        )
+    }
+
+    /// Send standalone heartbeat for connectivity detection
+    async fn send_heartbeat(&mut self) -> Result<()> {
+        let heartbeat_metric = self.get_heartbeat_metric();
+        let message = MetricMessage::new(
+            self.hostname.clone(),
+            vec![heartbeat_metric],
+        );
+
+        self.zmq_handler.publish_metrics(&message).await?;
+        debug!("Sent standalone heartbeat for connectivity detection");
+        Ok(())
+    }
+
    async fn handle_commands(&mut self) -> Result<()> {
        // Try to receive commands (non-blocking)
        match self.zmq_handler.try_receive_command() {
@@ -259,55 +315,38 @@ impl Agent {
                info!("Processing Ping command - agent is alive");
                // Could send a response back via ZMQ if needed
            }
-            AgentCommand::ServiceControl { service_name, action } => {
-                info!("Processing ServiceControl command: {} {:?}", service_name, action);
-                if let Err(e) = self.handle_service_control(&service_name, &action).await {
-                    error!("Failed to execute service control: {}", e);
+        }
+        Ok(())
+    }
+
+
+    /// Check metrics for user-stopped services that are now active and clear their flags
+    fn clear_user_stopped_flags_for_active_services(&mut self, metrics: &[Metric]) {
+        for metric in metrics {
+            // Look for service status metrics that are active
+            if metric.name.starts_with("service_") && metric.name.ends_with("_status") {
+                if let MetricValue::String(status) = &metric.value {
+                    if status == "active" {
+                        // Extract service name from metric name (service_nginx_status -> nginx)
+                        let service_name = metric.name
+                            .strip_prefix("service_")
+                            .and_then(|s| s.strip_suffix("_status"))
+                            .unwrap_or("");
+                        
+                        if !service_name.is_empty() && UserStoppedServiceTracker::is_service_user_stopped(service_name) {
+                            info!("Service '{}' is now active - clearing user-stopped flag", service_name);
+                            if let Err(e) = self.service_tracker.clear_user_stopped(service_name) {
+                                error!("Failed to clear user-stopped flag for '{}': {}", service_name, e);
+                            } else {
+                                // Sync to global tracker
+                                UserStoppedServiceTracker::update_global(&self.service_tracker);
+                                debug!("Cleared user-stopped flag for service '{}'", service_name);
+                            }
+                        }
+                    }
                }
            }
        }
-        Ok(())
-    }
-
-    /// Handle systemd service control commands
-    async fn handle_service_control(&mut self, service_name: &str, action: &ServiceAction) -> Result<()> {
-        let action_str = match action {
-            ServiceAction::Start => "start",
-            ServiceAction::Stop => "stop", 
-            ServiceAction::Status => "status",
-        };
-
-        info!("Executing systemctl {} {}", action_str, service_name);
-
-        let output = tokio::process::Command::new("sudo")
-            .arg("systemctl")
-            .arg(action_str)
-            .arg(service_name)
-            .output()
-            .await?;
-
-        if output.status.success() {
-            info!("Service {} {} completed successfully", service_name, action_str);
-            if !output.stdout.is_empty() {
-                debug!("stdout: {}", String::from_utf8_lossy(&output.stdout));
-            }
-        } else {
-            let stderr = String::from_utf8_lossy(&output.stderr);
-            error!("Service {} {} failed: {}", service_name, action_str, stderr);
-            return Err(anyhow::anyhow!("systemctl {} {} failed: {}", action_str, service_name, stderr));
-        }
-
-        // Force refresh metrics after service control to update service status
-        if matches!(action, ServiceAction::Start | ServiceAction::Stop) {
-            info!("Triggering immediate metric refresh after service control");
-            if let Err(e) = self.collect_metrics_only().await {
-                error!("Failed to refresh metrics after service control: {}", e);
-            } else {
-                info!("Service status refreshed immediately after {} {}", action_str, service_name);
-            }
-        }
-
-        Ok(())
    }

 }
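Reviewer sketch for `clear_user_stopped_flags_for_active_services` above: the metric-name parsing could be pulled into a small helper so the empty-name check lives in one place and can be unit tested. The helper name `service_name_from_metric` is hypothetical and not part of this change; it only assumes the `service_<name>_status` naming shown in the diff.

```rust
/// Returns the service name embedded in a `service_<name>_status`
/// metric name, or `None` when the name does not follow that pattern.
pub fn service_name_from_metric(metric_name: &str) -> Option<&str> {
    metric_name
        .strip_prefix("service_")
        .and_then(|rest| rest.strip_suffix("_status"))
        // Reject the degenerate "service__status" case with an empty name.
        .filter(|name| !name.is_empty())
}
```

The loop body could then use `if let Some(service_name) = service_name_from_metric(&metric.name)` instead of combining `starts_with`, `ends_with`, and `unwrap_or("")`.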
--- a/agent/src/collectors/backup.rs
+++ b/agent/src/collectors/backup.rs
@@ -140,6 +140,7 @@ impl Collector for BackupCollector {
                Status::Warning => "warning".to_string(),
                Status::Critical => "critical".to_string(),
                Status::Unknown => "unknown".to_string(),
+                Status::Offline => "offline".to_string(),
            }),
            status: overall_status,
            timestamp,
@@ -202,6 +203,7 @@ impl Collector for BackupCollector {
                    Status::Warning => "warning".to_string(),
                    Status::Critical => "critical".to_string(),
                    Status::Unknown => "unknown".to_string(),
+                    Status::Offline => "offline".to_string(),
                }),
                status: service_status,
                timestamp,
--- a/agent/src/collectors/nixos.rs
+++ b/agent/src/collectors/nixos.rs
@@ -37,6 +37,22 @@ impl NixOSCollector {
    }

+    /// Get git commit hash from rebuild process
+    fn get_git_commit(&self) -> Result<String, Box<dyn std::error::Error>> {
+        let commit_file = "/var/lib/cm-dashboard/git-commit";
+        match std::fs::read_to_string(commit_file) {
+            Ok(content) => {
+                let commit_hash = content.trim();
+                if commit_hash.len() >= 7 {
+                    Ok(commit_hash.to_string())
+                } else {
+                    Err("Git commit hash too short".into())
+                }
+            }
+            Err(e) => Err(format!("Failed to read git commit file: {}", e).into())
+        }
+    }
+
    /// Get configuration hash from deployed nix store system
    fn get_config_hash(&self) -> Result<String, Box<dyn std::error::Error>> {
        // Read the symlink target of /run/current-system to get nix store path
        let output = Command::new("readlink")
@@ -74,25 +90,25 @@ impl Collector for NixOSCollector {
        let mut metrics = Vec::new();
        let timestamp = chrono::Utc::now().timestamp() as u64;

-        // Collect NixOS build information (config hash)
-        match self.get_config_hash() {
-            Ok(config_hash) => {
+        // Collect git commit information (shows what's actually deployed)
+        match self.get_git_commit() {
+            Ok(git_commit) => {
                metrics.push(Metric {
                    name: "system_nixos_build".to_string(),
-                    value: MetricValue::String(config_hash),
+                    value: MetricValue::String(git_commit),
                    unit: None,
-                    description: Some("NixOS deployed configuration hash".to_string()),
+                    description: Some("Git commit hash of deployed configuration".to_string()),
                    status: Status::Ok,
                    timestamp,
                });
            }
            Err(e) => {
-                debug!("Failed to get config hash: {}", e);
+                debug!("Failed to get git commit: {}", e);
                metrics.push(Metric {
                    name: "system_nixos_build".to_string(),
                    value: MetricValue::String("unknown".to_string()),
                    unit: None,
-                    description: Some("NixOS config hash (failed to detect)".to_string()),
+                    description: Some("Git commit hash (failed to detect)".to_string()),
                    status: Status::Unknown,
                    timestamp,
                });
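Reviewer sketch for `get_git_commit` above: the validation step can be separated from the file read so it is testable without touching `/var/lib/cm-dashboard/git-commit`. The function name `parse_commit_hash` is hypothetical, and the hex-digit check is an extra assumption that goes beyond the length check in the diff.

```rust
/// Validates the raw contents of a commit file and returns the trimmed hash.
/// The minimum length of 7 mirrors the check in the diff; the hex-digit
/// check is an added assumption, not part of the original change.
pub fn parse_commit_hash(content: &str) -> Result<String, String> {
    let hash = content.trim();
    if hash.len() < 7 {
        return Err(format!(
            "git commit hash too short: {} characters",
            hash.len()
        ));
    }
    if !hash.chars().all(|c| c.is_ascii_hexdigit()) {
        return Err("git commit hash contains non-hex characters".to_string());
    }
    Ok(hash.to_string())
}
```

A caller would read the file with `std::fs::read_to_string` and pass the contents through this function, mapping either error into the `Status::Unknown` metric path that the collector already has.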
--- a/agent/src/collectors/systemd.rs
+++ b/agent/src/collectors/systemd.rs
@@ -8,6 +8,7 @@ use tracing::debug;

 use super::{Collector, CollectorError};
 use crate::config::SystemdConfig;
+use crate::service_tracker::UserStoppedServiceTracker;

 /// Systemd collector for monitoring systemd services
 pub struct SystemdCollector {
@@ -353,13 +354,37 @@ impl SystemdCollector {
        Ok((active_status, detailed_info))
    }

-    /// Calculate service status
-    fn calculate_service_status(&self, active_status: &str) -> Status {
+    /// Calculate service status, taking user-stopped services into account
+    fn calculate_service_status(&self, service_name: &str, active_status: &str) -> Status {
        match active_status.to_lowercase().as_str() {
-            "active" => Status::Ok,
-            "inactive" | "dead" => Status::Warning,
+            "active" => {
+                // Service is active again but may still carry the user-stopped flag; log it so the stale flag is visible
+                if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
+                    debug!("Service '{}' is now active but still flagged as user-stopped", service_name);
+                    // Note: the flag cannot be cleared here because this is a read-only context.
+                    // The agent has to clear it separately.
+                }
+                Status::Ok
+            },
+            "inactive" | "dead" => {
+                // Check if this service was stopped by user action
+                if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
+                    debug!("Service '{}' is inactive but marked as user-stopped - treating as OK", service_name);
+                    Status::Ok
+                } else {
+                    Status::Warning
+                }
+            },
            "failed" | "error" => Status::Critical,
-            "activating" | "deactivating" | "reloading" | "start" | "stop" | "restart" => Status::Pending,
+            "activating" | "deactivating" | "reloading" | "start" | "stop" | "restart" => {
+                // For user-stopped services that are transitioning, keep them as OK during transition
+                if UserStoppedServiceTracker::is_service_user_stopped(service_name) {
+                    debug!("Service '{}' is transitioning but was user-stopped - treating as OK", service_name);
+                    Status::Ok
+                } else {
+                    Status::Pending
+                }
+            },
            _ => Status::Unknown,
        }
    }
@@ -480,7 +505,7 @@ impl Collector for SystemdCollector {
        for service in &monitored_services {
            match self.get_service_status(service) {
                Ok((active_status, _detailed_info)) => {
-                    let status = self.calculate_service_status(&active_status);
+                    let status = self.calculate_service_status(service, &active_status);

                    // Individual service status metric
                    metrics.push(Metric {
--- a/agent/src/communication/mod.rs
+++ b/agent/src/communication/mod.rs
@@ -66,8 +66,6 @@ impl ZmqHandler {
    }


-    /// Send heartbeat (placeholder for future use)
-
    /// Try to receive a command (non-blocking)
    pub fn try_receive_command(&self) -> Result<Option<AgentCommand>> {
        match self.command_receiver.recv_bytes(zmq::DONTWAIT) {
@@ -100,17 +98,4 @@ pub enum AgentCommand {
    ToggleCollector { name: String, enabled: bool },
    /// Request status/health check
    Ping,
-    /// Control systemd service
-    ServiceControl {
-        service_name: String,
-        action: ServiceAction,
-    },
-}
-
-/// Service control actions
-#[derive(Debug, Clone, serde::Deserialize, serde::Serialize)]
-pub enum ServiceAction {
-    Start,
-    Stop,
-    Status,
 }
--- a/agent/src/config/mod.rs
+++ b/agent/src/config/mod.rs
@@ -25,9 +25,10 @@ pub struct ZmqConfig {
    pub publisher_port: u16,
    pub command_port: u16,
    pub bind_address: String,
-    pub timeout_ms: u64,
-    pub heartbeat_interval_ms: u64,
    pub transmission_interval_seconds: u64,
+    /// Heartbeat transmission interval in seconds for host connectivity detection
+    #[serde(default = "default_heartbeat_interval_seconds")]
+    pub heartbeat_interval_seconds: u64,
 }

 /// Collector configuration
@@ -146,9 +147,23 @@ pub struct NotificationConfig {
    pub rate_limit_minutes: u64,
    /// Email notification batching interval in seconds (default: 60)
    pub aggregation_interval_seconds: u64,
+    /// List of metric names to exclude from email notifications
+    #[serde(default)]
+    pub exclude_email_metrics: Vec<String>,
+    /// Path to maintenance mode file that suppresses email notifications when present
+    #[serde(default = "default_maintenance_mode_file")]
+    pub maintenance_mode_file: String,
 }


+fn default_heartbeat_interval_seconds() -> u64 {
+    5
+}
+
+fn default_maintenance_mode_file() -> String {
+    "/tmp/cm-maintenance".to_string()
+}
+
 impl AgentConfig {
    pub fn from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
        loader::load_config(path)
--- a/agent/src/config/validation.rs
+++ b/agent/src/config/validation.rs
@@ -19,10 +19,6 @@ pub fn validate_config(config: &AgentConfig) -> Result<()> {
        bail!("ZMQ bind address cannot be empty");
    }

-    if config.zmq.timeout_ms == 0 {
-        bail!("ZMQ timeout cannot be 0");
-    }
-
    // Validate collection interval
    if config.collection_interval_seconds == 0 {
        bail!("Collection interval cannot be 0");
--- a/agent/src/main.rs
+++ b/agent/src/main.rs
@@ -9,6 +9,7 @@ mod communication;
 mod config;
 mod metrics;
 mod notifications;
+mod service_tracker;
 mod status;

 use agent::Agent;
--- a/agent/src/notifications/mod.rs
+++ b/agent/src/notifications/mod.rs
@@ -59,6 +59,6 @@ impl NotificationManager {
    }

    fn is_maintenance_mode(&self) -> bool {
-        std::fs::metadata("/tmp/cm-maintenance").is_ok()
+        std::fs::metadata(&self.config.maintenance_mode_file).is_ok()
    }
 }
--- a/agent/src/service_tracker.rs
+++ b/agent/src/service_tracker.rs
@@ -0,0 +1,172 @@
+use anyhow::Result;
+use serde::{Deserialize, Serialize};
+use std::collections::HashSet;
+use std::fs;
+use std::path::Path;
+use std::sync::{Arc, Mutex, OnceLock};
+use tracing::{debug, info, warn};
+
+/// Shared instance for global access
+static GLOBAL_TRACKER: OnceLock<Arc<Mutex<UserStoppedServiceTracker>>> = OnceLock::new();
+
+/// Tracks services that have been stopped by user action
+/// These services should be treated as OK status instead of Warning
+#[derive(Debug)]
+pub struct UserStoppedServiceTracker {
+    /// Set of services stopped by user action
+    user_stopped_services: HashSet<String>,
+    /// Path to persistent storage file
+    storage_path: String,
+}
+
+/// Serializable data structure for persistence
+#[derive(Debug, Serialize, Deserialize)]
+struct UserStoppedData {
+    services: Vec<String>,
+}
+
+impl UserStoppedServiceTracker {
+    /// Create new tracker with default storage path
+    pub fn new() -> Self {
+        Self::with_storage_path("/var/lib/cm-dashboard/user-stopped-services.json")
+    }
+
+    /// Initialize global instance (called by agent)
+    pub fn init_global() -> Result<Self> {
+        let tracker = Self::new();
+        
+        // Set global instance
+        let global_instance = Arc::new(Mutex::new(tracker));
+        if GLOBAL_TRACKER.set(global_instance).is_err() {
+            warn!("Global service tracker was already initialized");
+        }
+        
+        // Return a new instance for the agent to use
+        Ok(Self::new())
+    }
+
+    /// Check if a service is user-stopped (global access for collectors)
+    pub fn is_service_user_stopped(service_name: &str) -> bool {
+        if let Some(global) = GLOBAL_TRACKER.get() {
+            if let Ok(tracker) = global.lock() {
+                tracker.is_user_stopped(service_name)
+            } else {
+                debug!("Failed to lock global service tracker");
+                false
+            }
+        } else {
+            debug!("Global service tracker not initialized");
+            false
+        }
+    }
+
+    /// Update global tracker (called by agent when tracker state changes)
+    pub fn update_global(updated_tracker: &UserStoppedServiceTracker) {
+        if let Some(global) = GLOBAL_TRACKER.get() {
+            if let Ok(mut tracker) = global.lock() {
+                tracker.user_stopped_services = updated_tracker.user_stopped_services.clone();
+            } else {
+                debug!("Failed to lock global service tracker for update");
+            }
+        } else {
+            debug!("Global service tracker not initialized for update");
+        }
+    }
+
+    /// Create new tracker with custom storage path
+    pub fn with_storage_path<P: AsRef<Path>>(storage_path: P) -> Self {
+        let storage_path = storage_path.as_ref().to_string_lossy().to_string();
+        let mut tracker = Self {
+            user_stopped_services: HashSet::new(),
+            storage_path,
+        };
+
+        // Load existing data from storage
+        if let Err(e) = tracker.load_from_storage() {
+            warn!("Failed to load user-stopped services from storage: {}", e);
+            info!("Starting with empty user-stopped services list");
+        }
+
+        tracker
+    }
+
+    /// Mark a service as user-stopped
+    pub fn mark_user_stopped(&mut self, service_name: &str) -> Result<()> {
+        info!("Marking service '{}' as user-stopped", service_name);
+        self.user_stopped_services.insert(service_name.to_string());
+        self.save_to_storage()?;
+        debug!("Service '{}' marked as user-stopped and saved to storage", service_name);
+        Ok(())
+    }
+
+    /// Clear user-stopped flag for a service (when user starts it)
+    pub fn clear_user_stopped(&mut self, service_name: &str) -> Result<()> {
+        if self.user_stopped_services.remove(service_name) {
+            info!("Cleared user-stopped flag for service '{}'", service_name);
+            self.save_to_storage()?;
+            debug!("Service '{}' user-stopped flag cleared and saved to storage", service_name);
+        } else {
+            debug!("Service '{}' was not marked as user-stopped", service_name);
+        }
+        Ok(())
+    }
+
+    /// Check if a service is marked as user-stopped
+    pub fn is_user_stopped(&self, service_name: &str) -> bool {
+        let is_stopped = self.user_stopped_services.contains(service_name);
+        debug!("Service '{}' user-stopped status: {}", service_name, is_stopped);
+        is_stopped
+    }
+
+
+    /// Save current state to persistent storage
+    fn save_to_storage(&self) -> Result<()> {
+        // Create parent directory if it doesn't exist
+        if let Some(parent_dir) = Path::new(&self.storage_path).parent() {
+            if !parent_dir.exists() {
+                fs::create_dir_all(parent_dir)?;
+                debug!("Created parent directory: {}", parent_dir.display());
+            }
+        }
+
+        let data = UserStoppedData {
+            services: self.user_stopped_services.iter().cloned().collect(),
+        };
+
+        let json_data = serde_json::to_string_pretty(&data)?;
+        fs::write(&self.storage_path, json_data)?;
+
+        debug!(
+            "Saved {} user-stopped services to {}",
+            data.services.len(),
+            self.storage_path
+        );
+        Ok(())
+    }
+
+    /// Load state from persistent storage
+    fn load_from_storage(&mut self) -> Result<()> {
+        if !Path::new(&self.storage_path).exists() {
+            debug!("Storage file {} does not exist, starting fresh", self.storage_path);
+            return Ok(());
+        }
+
+        let json_data = fs::read_to_string(&self.storage_path)?;
+        let data: UserStoppedData = serde_json::from_str(&json_data)?;
+
+        self.user_stopped_services = data.services.into_iter().collect();
+
+        info!(
+            "Loaded {} user-stopped services from {}",
+            self.user_stopped_services.len(),
+            self.storage_path
+        );
+
+        if !self.user_stopped_services.is_empty() {
+            debug!("User-stopped services: {:?}", self.user_stopped_services);
+        }
+
+        Ok(())
+    }
+}
+
--- a/agent/src/status/mod.rs
+++ b/agent/src/status/mod.rs
@@ -272,11 +272,13 @@ impl HostStatusManager {
    /// Check if a status change is significant enough for notification
    fn is_significant_change(&self, old_status: Status, new_status: Status) -> bool {
        match (old_status, new_status) {
-            // Always notify on problems
+            // Don't notify on transitions from Unknown (startup/restart scenario)
+            (Status::Unknown, _) => false,
+            // Always notify on problems (but not from Unknown)
            (_, Status::Warning) | (_, Status::Critical) => true,
            // Only notify on recovery if it's from a problem state to OK and all services are OK
            (Status::Warning | Status::Critical, Status::Ok) => self.current_host_status == Status::Ok,
-            // Don't notify on startup or other transitions
+            // Don't notify on other transitions
            _ => false,
        }
    }
@@ -374,8 +376,8 @@ impl HostStatusManager {
            details.push('\n');
        }

-        // Show recoveries
-        if !recovery_changes.is_empty() {
+        // Show recoveries only if host status is now OK (all services recovered)
+        if !recovery_changes.is_empty() && aggregated.host_status_final == Status::Ok {
            details.push_str(&format!("✅ RECOVERIES ({}):\n", recovery_changes.len()));
            for change in recovery_changes {
                details.push_str(&format!("  {}\n", change));
--- a/dashboard/Cargo.toml
+++ b/dashboard/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard"
-version = "0.1.38"
+version = "0.1.74"
 edition = "2021"

 [dependencies]
@@ -18,4 +18,5 @@ tracing-subscriber = { workspace = true }
 ratatui = { workspace = true }
 crossterm = { workspace = true }
 toml = { workspace = true }
-gethostname = { workspace = true }
+gethostname = { workspace = true }
+wake-on-lan = "0.2"
--- a/dashboard/src/app.rs
+++ b/dashboard/src/app.rs
@@ -9,20 +9,19 @@ use std::io;
 use std::time::{Duration, Instant};
 use tracing::{debug, error, info, warn};

-use crate::communication::{AgentCommand, ServiceAction, ZmqCommandSender, ZmqConsumer};
+use crate::communication::ZmqConsumer;
 use crate::config::DashboardConfig;
 use crate::metrics::MetricStore;
 use crate::ui::{TuiApp, UiCommand};

 pub struct Dashboard {
    zmq_consumer: ZmqConsumer,
-    zmq_command_sender: ZmqCommandSender,
    metric_store: MetricStore,
    tui_app: Option<TuiApp>,
    terminal: Option<Terminal<CrosstermBackend<io::Stdout>>>,
    headless: bool,
    initial_commands_sent: std::collections::HashSet<String>,
-    _config: DashboardConfig,
+    config: DashboardConfig,
 }

 impl Dashboard {
@@ -58,20 +57,9 @@ impl Dashboard {
            }
        };

-        // Initialize ZMQ command sender
-        let zmq_command_sender = match ZmqCommandSender::new(&config.zmq) {
-            Ok(sender) => sender,
-            Err(e) => {
-                error!("Failed to initialize ZMQ command sender: {}", e);
-                return Err(e);
-            }
-        };
-
-        // Connect to predefined hosts from configuration
-        let hosts = config.hosts.predefined_hosts.clone();

        // Try to connect to hosts but don't fail if none are available
-        match zmq_consumer.connect_to_predefined_hosts(&hosts).await {
+        match zmq_consumer.connect_to_predefined_hosts(&config.hosts).await {
            Ok(_) => info!("Successfully connected to ZMQ hosts"),
            Err(e) => {
                warn!(
@@ -127,28 +115,23 @@ impl Dashboard {

        Ok(Self {
            zmq_consumer,
-            zmq_command_sender,
            metric_store,
            tui_app,
            terminal,
            headless,
            initial_commands_sent: std::collections::HashSet::new(),
-            _config: config,
+            config,
        })
    }

-    /// Send a command to a specific agent
-    pub async fn send_command(&mut self, hostname: &str, command: AgentCommand) -> Result<()> {
-        self.zmq_command_sender
-            .send_command(hostname, command)
-            .await
-    }

    pub async fn run(&mut self) -> Result<()> {
        info!("Starting dashboard main loop");

        let mut last_metrics_check = Instant::now();
        let metrics_check_interval = Duration::from_millis(100); // Check for metrics every 100ms
+        let mut last_heartbeat_check = Instant::now();
+        let heartbeat_check_interval = Duration::from_secs(1); // Check for host connectivity every 1 second

        loop {
            // Handle terminal events (keyboard input) only if not headless
@@ -191,6 +174,17 @@ impl Dashboard {
                        break;
                    }
                }
+
+                // Render UI immediately after handling input for responsive feedback
+                if let Some(ref mut terminal) = self.terminal {
+                    if let Some(ref mut tui_app) = self.tui_app {
+                        if let Err(e) = terminal.draw(|frame| {
+                            tui_app.render(frame, &self.metric_store);
+                        }) {
+                            error!("Error rendering TUI after input: {}", e);
+                        }
+                    }
+                }
            }

            // Check for new metrics
@@ -202,34 +196,18 @@ impl Dashboard {
                        metric_message.metrics.len()
                    );

-                    // Check if this is the first time we've seen this host
+                    // Track first contact with host (no command needed - agent sends data every 2s)
                    let is_new_host = !self
                        .initial_commands_sent
                        .contains(&metric_message.hostname);

                    if is_new_host {
                        info!(
-                            "First contact with host {}, sending initial CollectNow command",
+                            "First contact with host {} - data will update automatically",
                            metric_message.hostname
                        );
-
-                        // Send CollectNow command for immediate refresh
-                        if let Err(e) = self
-                            .send_command(&metric_message.hostname, AgentCommand::CollectNow)
-                            .await
-                        {
-                            error!(
-                                "Failed to send initial CollectNow command to {}: {}",
-                                metric_message.hostname, e
-                            );
-                        } else {
-                            info!(
-                                "✓ Sent initial CollectNow command to {}",
-                                metric_message.hostname
-                            );
-                            self.initial_commands_sent
-                                .insert(metric_message.hostname.clone());
-                        }
+                        self.initial_commands_sent
+                            .insert(metric_message.hostname.clone());
                    }

                    // Update metric store
@@ -243,14 +221,8 @@ impl Dashboard {
                        }
                    }

-                    // Update TUI with new hosts and metrics (only if not headless)
+                    // Update TUI with new metrics (only if not headless)
                    if let Some(ref mut tui_app) = self.tui_app {
-                        let connected_hosts = self
-                            .metric_store
-                            .get_connected_hosts(Duration::from_secs(30));
-                        
-                        
-                        tui_app.update_hosts(connected_hosts);
                        tui_app.update_metrics(&self.metric_store);
                    }
                }
@@ -269,6 +241,20 @@ impl Dashboard {
                last_metrics_check = Instant::now();
            }

+            // Check for host connectivity changes (heartbeat timeouts) periodically
+            if last_heartbeat_check.elapsed() >= heartbeat_check_interval {
+                let timeout = Duration::from_secs(self.config.zmq.heartbeat_timeout_seconds);
+                
+                // Clean up metrics for offline hosts
+                self.metric_store.cleanup_offline_hosts(timeout);
+                
+                if let Some(ref mut tui_app) = self.tui_app {
+                    let connected_hosts = self.metric_store.get_connected_hosts(timeout);
+                    tui_app.update_hosts(connected_hosts);
+                }
+                last_heartbeat_check = Instant::now();
+            }
+
            // Render TUI (only if not headless)
            if !self.headless {
                if let Some(ref mut terminal) = self.terminal {
@@ -294,22 +280,6 @@ impl Dashboard {
    /// Execute a UI command by sending it to the appropriate agent
    async fn execute_ui_command(&self, command: UiCommand) -> Result<()> {
        match command {
-            UiCommand::ServiceStart { hostname, service_name } => {
-                info!("Sending start command for service {} on {}", service_name, hostname);
-                let agent_command = AgentCommand::ServiceControl {
-                    service_name: service_name.clone(),
-                    action: ServiceAction::Start,
-                };
-                self.zmq_command_sender.send_command(&hostname, agent_command).await?;
-            }
-            UiCommand::ServiceStop { hostname, service_name } => {
-                info!("Sending stop command for service {} on {}", service_name, hostname);
-                let agent_command = AgentCommand::ServiceControl {
-                    service_name: service_name.clone(),
-                    action: ServiceAction::Stop,
-                };
-                self.zmq_command_sender.send_command(&hostname, agent_command).await?;
-            }
            UiCommand::TriggerBackup { hostname } => {
                info!("Trigger backup requested for {}", hostname);
                // TODO: Implement backup trigger command
--- a/dashboard/src/communication/mod.rs
+++ b/dashboard/src/communication/mod.rs
@@ -5,38 +5,6 @@ use zmq::{Context, Socket, SocketType};

 use crate::config::ZmqConfig;

-/// Commands that can be sent to agents
-#[derive(Debug, Clone, serde::Deserialize, serde::Serialize)]
-pub enum AgentCommand {
-    /// Request immediate metric collection
-    CollectNow,
-    /// Change collection interval
-    SetInterval { seconds: u64 },
-    /// Enable/disable a collector
-    ToggleCollector { name: String, enabled: bool },
-    /// Request status/health check
-    Ping,
-    /// Control systemd service
-    ServiceControl {
-        service_name: String,
-        action: ServiceAction,
-    },
-    /// Rebuild NixOS system
-    SystemRebuild {
-        git_url: String,
-        git_branch: String,
-        working_dir: String,
-        api_key_file: Option<String>,
-    },
-}
-
-/// Service control actions
-#[derive(Debug, Clone, serde::Deserialize, serde::Serialize)]
-pub enum ServiceAction {
-    Start,
-    Stop,
-    Status,
-}

 /// ZMQ consumer for receiving metrics from agents
 pub struct ZmqConsumer {
@@ -82,13 +50,14 @@ impl ZmqConsumer {
        }
    }

-    /// Connect to predefined hosts
-    pub async fn connect_to_predefined_hosts(&mut self, hosts: &[String]) -> Result<()> {
+
+    /// Connect to predefined hosts using their configuration
+    pub async fn connect_to_predefined_hosts(&mut self, hosts: &std::collections::HashMap<String, crate::config::HostDetails>) -> Result<()> {
        let default_port = self.config.subscriber_ports[0];

-        for hostname in hosts {
-            // Try to connect, but don't fail if some hosts are unreachable
-            if let Err(e) = self.connect_to_host(hostname, default_port).await {
+        for (hostname, host_details) in hosts {
+            // Try to connect using configured IP, but don't fail if some hosts are unreachable
+            if let Err(e) = self.connect_to_host_with_details(hostname, host_details, default_port).await {
                warn!("Could not connect to {}: {}", hostname, e);
            }
        }
@@ -102,6 +71,15 @@ impl ZmqConsumer {
        Ok(())
    }

+    /// Connect to a host using its configuration details
+    pub async fn connect_to_host_with_details(&mut self, hostname: &str, host_details: &crate::config::HostDetails, port: u16) -> Result<()> {
+        // Get primary connection IP only - no fallbacks
+        let primary_ip = host_details.get_connection_ip(hostname);
+        
+        // Connect directly without fallback attempts
+        self.connect_to_host(&primary_ip, port).await
+    }
+
    /// Receive command output from any connected agent (non-blocking)  
    pub async fn receive_command_output(&mut self) -> Result<Option<CommandOutputMessage>> {
        match self.subscriber.recv_bytes(zmq::DONTWAIT) {
@@ -190,42 +168,3 @@ impl ZmqConsumer {
    }
 }

-/// ZMQ command sender for sending commands to agents
-pub struct ZmqCommandSender {
-    context: Context,
-}
-
-impl ZmqCommandSender {
-    pub fn new(_config: &ZmqConfig) -> Result<Self> {
-        let context = Context::new();
-
-        info!("ZMQ command sender initialized");
-
-        Ok(Self { context })
-    }
-
-    /// Send a command to a specific agent
-    pub async fn send_command(&self, hostname: &str, command: AgentCommand) -> Result<()> {
-        // Create a new PUSH socket for this command (ZMQ best practice)
-        let socket = self.context.socket(SocketType::PUSH)?;
-
-        // Set socket options
-        socket.set_linger(1000)?; // Wait up to 1 second on close
-        socket.set_sndtimeo(5000)?; // 5 second send timeout
-
-        // Connect to agent's command port (6131)
-        let address = format!("tcp://{}:6131", hostname);
-        socket.connect(&address)?;
-
-        // Serialize command
-        let serialized = serde_json::to_vec(&command)?;
-
-        // Send command
-        socket.send(&serialized, 0)?;
-
-        info!("Sent command {:?} to agent at {}", command, hostname);
-
-        // Socket will be automatically closed when dropped
-        Ok(())
-    }
-}
--- a/dashboard/src/config/mod.rs
+++ b/dashboard/src/config/mod.rs
@@ -6,21 +6,40 @@ use std::path::Path;
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct DashboardConfig {
    pub zmq: ZmqConfig,
-    pub hosts: HostsConfig,
+    pub hosts: std::collections::HashMap<String, HostDetails>,
    pub system: SystemConfig,
    pub ssh: SshConfig,
+    pub service_logs: std::collections::HashMap<String, Vec<ServiceLogConfig>>,
 }

 /// ZMQ consumer configuration
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct ZmqConfig {
    pub subscriber_ports: Vec<u16>,
+    /// Heartbeat timeout in seconds; a host is considered offline if no heartbeat is received within this time
+    #[serde(default = "default_heartbeat_timeout_seconds")]
+    pub heartbeat_timeout_seconds: u64,
 }

-/// Hosts configuration
+fn default_heartbeat_timeout_seconds() -> u64 {
+    10 // Default to 10 seconds, i.e. two intervals at the agent's default 5-second heartbeat
+}
+
+/// Individual host configuration details
 #[derive(Debug, Clone, Serialize, Deserialize)]
-pub struct HostsConfig {
-    pub predefined_hosts: Vec<String>,
+pub struct HostDetails {
+    pub mac_address: Option<String>,
+    /// Primary IP address (local network)
+    pub ip: Option<String>,
+}
+
+
+impl HostDetails {
+    /// Get the IP address for connection (uses ip field or hostname as fallback)
+    pub fn get_connection_ip(&self, hostname: &str) -> String {
+        self.ip.clone().unwrap_or_else(|| hostname.to_string())
+    }
+
 }

 /// System configuration
@@ -32,11 +51,19 @@ pub struct SystemConfig {
    pub nixos_config_api_key_file: Option<String>,
 }

-/// SSH configuration for rebuild operations
+/// SSH configuration for rebuild and backup operations
 #[derive(Debug, Clone, Serialize, Deserialize)]
 pub struct SshConfig {
    pub rebuild_user: String,
    pub rebuild_alias: String,
+    pub backup_alias: String,
+}
+
+/// Service log file configuration per host
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct ServiceLogConfig {
+    pub service_name: String,
+    pub log_file_path: String,
 }

 impl DashboardConfig {
@@ -60,8 +87,3 @@ impl Default for ZmqConfig {
    }
 }

-impl Default for HostsConfig {
-    fn default() -> Self {
-        panic!("Dashboard configuration must be loaded from file - no hardcoded defaults allowed")
-    }
-}
--- a/dashboard/src/main.rs
+++ b/dashboard/src/main.rs
@@ -12,10 +12,6 @@ mod ui;

 use app::Dashboard;

-/// Get hardcoded version
-fn get_version() -> &'static str {
-    "v0.1.33"
-}

 /// Check if running inside tmux session
 fn check_tmux_session() {
@@ -42,7 +38,7 @@ fn check_tmux_session() {
 #[derive(Parser)]
 #[command(name = "cm-dashboard")]
 #[command(about = "CM Dashboard TUI with individual metric consumption")]
-#[command(version = get_version())]
+#[command(version)]
 struct Cli {
    /// Increase logging verbosity (-v, -vv)
    #[arg(short, long, action = clap::ArgAction::Count)]
--- a/dashboard/src/metrics/store.rs
+++ b/dashboard/src/metrics/store.rs
@@ -11,8 +11,8 @@ pub struct MetricStore {
    current_metrics: HashMap<String, HashMap<String, Metric>>,
    /// Historical metrics for trending
    historical_metrics: HashMap<String, Vec<MetricDataPoint>>,
-    /// Last update timestamp per host
-    last_update: HashMap<String, Instant>,
+    /// Last heartbeat timestamp per host
+    last_heartbeat: HashMap<String, Instant>,
    /// Configuration
    max_metrics_per_host: usize,
    history_retention: Duration,
@@ -23,7 +23,7 @@ impl MetricStore {
        Self {
            current_metrics: HashMap::new(),
            historical_metrics: HashMap::new(),
-            last_update: HashMap::new(),
+            last_heartbeat: HashMap::new(),
            max_metrics_per_host,
            history_retention: Duration::from_secs(history_retention_hours * 3600),
        }
@@ -56,10 +56,13 @@ impl MetricStore {

            // Add to history
            host_history.push(MetricDataPoint { received_at: now });
-        }

-        // Update last update timestamp
-        self.last_update.insert(hostname.to_string(), now);
+            // Track heartbeat metrics for connectivity detection
+            if metric_name == "agent_heartbeat" {
+                self.last_heartbeat.insert(hostname.to_string(), now);
+                debug!("Updated heartbeat for host {}", hostname);
+            }
+        }

        // Get metrics count before cleanup
        let metrics_count = host_metrics.len();
@@ -88,22 +91,46 @@ impl MetricStore {
        }
    }

-    /// Get connected hosts (hosts with recent updates)
+    /// Get connected hosts (hosts with recent heartbeats)
    pub fn get_connected_hosts(&self, timeout: Duration) -> Vec<String> {
        let now = Instant::now();

-        self.last_update
+        self.last_heartbeat
            .iter()
-            .filter_map(|(hostname, &last_update)| {
-                if now.duration_since(last_update) <= timeout {
+            .filter_map(|(hostname, &last_heartbeat)| {
+                if now.duration_since(last_heartbeat) <= timeout {
                    Some(hostname.clone())
                } else {
+                    debug!("Host {} considered offline - last heartbeat was {:?} ago", 
+                           hostname, now.duration_since(last_heartbeat));
                    None
                }
            })
            .collect()
    }

+    /// Clean up data for offline hosts
+    pub fn cleanup_offline_hosts(&mut self, timeout: Duration) {
+        let now = Instant::now();
+        let mut hosts_to_cleanup = Vec::new();
+
+        // Find hosts that are offline (no recent heartbeat)
+        for (hostname, &last_heartbeat) in &self.last_heartbeat {
+            if now.duration_since(last_heartbeat) > timeout {
+                hosts_to_cleanup.push(hostname.clone());
+            }
+        }
+
+        // Clear metrics for offline hosts
+        for hostname in hosts_to_cleanup {
+            if let Some(metrics) = self.current_metrics.remove(&hostname) {
+                info!("Cleared {} metrics for offline host: {}", metrics.len(), hostname);
+            }
+            // Keep heartbeat timestamp for reconnection detection
+            // Don't remove from last_heartbeat to track when host was last seen
+        }
+    }
+
    /// Cleanup old data and enforce limits
    fn cleanup_host_data(&mut self, hostname: &str) {
        let now = Instant::now();
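
Note on the hunk above: connectivity is now derived only from `agent_heartbeat` metrics rather than from any metric update. A minimal sketch of that idea follows, assuming a hypothetical `HeartbeatTracker` type (not part of this commit) and an explicit `now` parameter so the behaviour can be checked deterministically. The real `MetricStore` calls `Instant::now()` internally.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the heartbeat bookkeeping in `MetricStore`.
#[derive(Default)]
struct HeartbeatTracker {
    last_heartbeat: HashMap<String, Instant>,
}

impl HeartbeatTracker {
    /// Record a metric; only `agent_heartbeat` refreshes the host timestamp.
    fn record(&mut self, hostname: &str, metric_name: &str, now: Instant) {
        if metric_name == "agent_heartbeat" {
            self.last_heartbeat.insert(hostname.to_string(), now);
        }
    }

    /// Hosts whose last heartbeat is no older than `timeout` at time `now`.
    /// The result is sorted so callers get a stable order.
    fn connected_hosts(&self, now: Instant, timeout: Duration) -> Vec<String> {
        let mut hosts: Vec<String> = self
            .last_heartbeat
            .iter()
            .filter_map(|(hostname, &seen)| {
                if now.duration_since(seen) <= timeout {
                    Some(hostname.clone())
                } else {
                    None
                }
            })
            .collect();
        hosts.sort();
        hosts
    }
}
```

With this shape, a host that only sends ordinary metrics never appears as connected, which matches the intent of the change: a stalled agent that still has stale metrics in the store is reported offline once its heartbeat ages past the timeout.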
--- a/dashboard/src/ui/mod.rs
+++ b/dashboard/src/ui/mod.rs
@@ -9,6 +9,7 @@ use ratatui::{
 use std::collections::HashMap;
 use std::time::Instant;
 use tracing::info;
+use wake_on_lan::MagicPacket;

 pub mod theme;
 pub mod widgets;
@@ -22,8 +23,6 @@ use widgets::{BackupWidget, ServicesWidget, SystemWidget, Widget};
 /// Commands that can be triggered from the UI
 #[derive(Debug, Clone)]
 pub enum UiCommand {
-    ServiceStart { hostname: String, service_name: String },
-    ServiceStop { hostname: String, service_name: String },
    TriggerBackup { hostname: String },
 }

@@ -89,19 +88,33 @@ pub struct TuiApp {
    user_navigated_away: bool,
    /// Dashboard configuration
    config: DashboardConfig,
+    /// Cached localhost hostname to avoid repeated system calls
+    localhost: String,
 }

 impl TuiApp {
    pub fn new(config: DashboardConfig) -> Self {
-        Self {
+        let localhost = gethostname::gethostname().to_string_lossy().to_string();
+        let mut app = Self {
            host_widgets: HashMap::new(),
            current_host: None,
-            available_hosts: Vec::new(),
+            available_hosts: config.hosts.keys().cloned().collect(),
            host_index: 0,
            should_quit: false,
            user_navigated_away: false,
            config,
+            localhost,
+        };
+        
+        // Sort predefined hosts
+        app.available_hosts.sort();
+        
+        // Initialize with first host if available
+        if !app.available_hosts.is_empty() {
+            app.current_host = Some(app.available_hosts[0].clone());
        }
+        
+        app
    }

    /// Get or create host widgets for the given hostname
@@ -120,31 +133,31 @@ impl TuiApp {
            // Only update widgets if we have metrics for this host
            let all_metrics = metric_store.get_metrics_for_host(&hostname);
            if !all_metrics.is_empty() {
-                // Get metrics first while hostname is borrowed
-                let cpu_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| {
-                        m.name.starts_with("cpu_")
-                            || m.name.contains("c_state_")
-                            || m.name.starts_with("process_top_")
-                    })
-                    .copied()
-                    .collect();
-                let memory_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| m.name.starts_with("memory_") || m.name.starts_with("disk_tmp_"))
-                    .copied()
-                    .collect();
-                let service_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| m.name.starts_with("service_"))
-                    .copied()
-                    .collect();
-                let all_backup_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| m.name.starts_with("backup_"))
-                    .copied()
-                    .collect();
+                // Single pass metric categorization for better performance
+                let mut cpu_metrics = Vec::new();
+                let mut memory_metrics = Vec::new();
+                let mut service_metrics = Vec::new();
+                let mut backup_metrics = Vec::new();
+                let mut nixos_metrics = Vec::new();
+                let mut disk_metrics = Vec::new();
+                
+                for metric in all_metrics {
+                    if metric.name.starts_with("cpu_") 
+                        || metric.name.contains("c_state_") 
+                        || metric.name.starts_with("process_top_") {
+                        cpu_metrics.push(metric);
+                    } else if metric.name.starts_with("memory_") || metric.name.starts_with("disk_tmp_") {
+                        memory_metrics.push(metric);
+                    } else if metric.name.starts_with("service_") {
+                        service_metrics.push(metric);
+                    } else if metric.name.starts_with("backup_") {
+                        backup_metrics.push(metric);
+                    } else if metric.name == "system_nixos_build" || metric.name == "system_active_users" || metric.name == "agent_version" {
+                        nixos_metrics.push(metric);
+                    } else if metric.name.starts_with("disk_") {
+                        disk_metrics.push(metric);
+                    }
+                }

                // Clear completed transitions first
                self.clear_completed_transitions(&hostname, &service_metrics);
@@ -155,21 +168,7 @@ impl TuiApp {
                // Collect all system metrics (CPU, memory, NixOS, disk/storage)
                let mut system_metrics = cpu_metrics;
                system_metrics.extend(memory_metrics);
-                
-                // Add NixOS metrics - using exact matching for build display fix
-                let nixos_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| m.name == "system_nixos_build" || m.name == "system_active_users" || m.name == "agent_version")
-                    .copied()
-                    .collect();
                system_metrics.extend(nixos_metrics);
-                
-                // Add disk/storage metrics
-                let disk_metrics: Vec<&Metric> = all_metrics
-                    .iter()
-                    .filter(|m| m.name.starts_with("disk_"))
-                    .copied()
-                    .collect();
                system_metrics.extend(disk_metrics);

                host_widgets.system_widget.update_from_metrics(&system_metrics);
@@ -178,7 +177,7 @@ impl TuiApp {
                    .update_from_metrics(&service_metrics);
                host_widgets
                    .backup_widget
-                    .update_from_metrics(&all_backup_metrics);
+                    .update_from_metrics(&backup_metrics);

                host_widgets.last_update = Some(Instant::now());
            }
@@ -186,30 +185,36 @@ impl TuiApp {
    }

    /// Update available hosts with localhost prioritization
-    pub fn update_hosts(&mut self, hosts: Vec<String>) {
-        // Sort hosts alphabetically
-        let mut sorted_hosts = hosts.clone();
+    pub fn update_hosts(&mut self, discovered_hosts: Vec<String>) {
+        // Start with configured hosts (always visible)
+        let mut all_hosts: Vec<String> = self.config.hosts.keys().cloned().collect();
+        
+        // Add any discovered hosts that aren't already configured
+        for host in discovered_hosts {
+            if !all_hosts.contains(&host) {
+                all_hosts.push(host);
+            }
+        }
        
        // Keep hosts that have pending transitions even if they're offline
        for (hostname, host_widgets) in &self.host_widgets {
            if !host_widgets.pending_service_transitions.is_empty() {
-                if !sorted_hosts.contains(hostname) {
-                    sorted_hosts.push(hostname.clone());
+                if !all_hosts.contains(hostname) {
+                    all_hosts.push(hostname.clone());
                }
            }
        }
        
-        sorted_hosts.sort();
-        self.available_hosts = sorted_hosts;
+        all_hosts.sort();
+        self.available_hosts = all_hosts;
        
        // Get the current hostname (localhost) for auto-selection
-        let localhost = gethostname::gethostname().to_string_lossy().to_string();
        if !self.available_hosts.is_empty() {
-            if self.available_hosts.contains(&localhost) && !self.user_navigated_away {
+            if self.available_hosts.contains(&self.localhost) && !self.user_navigated_away {
                // Localhost is available and user hasn't navigated away - switch to it
-                self.current_host = Some(localhost.clone());
+                self.current_host = Some(self.localhost.clone());
                // Find the actual index of localhost in the sorted list
-                self.host_index = self.available_hosts.iter().position(|h| h == &localhost).unwrap_or(0);
+                self.host_index = self.available_hosts.iter().position(|h| h == &self.localhost).unwrap_or(0);
            } else if self.current_host.is_none() {
                // No current host - select first available (which is localhost if available)
                self.current_host = Some(self.available_hosts[0].clone());
@@ -244,33 +249,151 @@ impl TuiApp {
                KeyCode::Char('r') => {
                    // System rebuild command - works on any panel for current host
                    if let Some(hostname) = self.current_host.clone() {
-                        // Launch tmux popup with SSH using config values
-                        let ssh_command = format!(
-                            "ssh -tt {}@{} 'bash -ic {}'",
-                            self.config.ssh.rebuild_user,
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        // Create command that shows logo, rebuilds, and waits for user input
+                        let logo_and_rebuild = format!(
+                            "bash -c 'cat << \"EOF\"\nNixOS System Rebuild\nTarget: {} ({})\n\nEOF\nssh -tt {}@{} \"bash -ic {}\"\necho\necho \"========================================\"\necho \"Rebuild completed. Press any key to close...\"\necho \"========================================\"\nread -n 1 -s\nexit'",
                            hostname,
+                            connection_ip,
+                            self.config.ssh.rebuild_user,
+                            connection_ip,
                            self.config.ssh.rebuild_alias
                        );
+                        
                        std::process::Command::new("tmux")
-                            .arg("display-popup")
-                            .arg(&ssh_command)
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30")
+                            .arg(&logo_and_rebuild)
+                            .spawn()
+                            .ok(); // Ignore errors, tmux will handle them
+                    }
+                }
+                KeyCode::Char('B') => {
+                    // Backup command - works on any panel for current host
+                    if let Some(hostname) = self.current_host.clone() {
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        // Create command that shows logo, runs backup, and waits for user input
+                        let logo_and_backup = format!(
+                            "bash -c 'cat << \"EOF\"\nBackup Operation\nTarget: {} ({})\n\nEOF\nssh -tt {}@{} \"bash -ic {}\"\necho\necho \"========================================\"\necho \"Backup completed. Press any key to close...\"\necho \"========================================\"\nread -n 1 -s\nexit'",
+                            hostname,
+                            connection_ip,
+                            self.config.ssh.rebuild_user,
+                            connection_ip,
+                            self.config.ssh.backup_alias
+                        );
+                        
+                        std::process::Command::new("tmux")
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30")
+                            .arg(&logo_and_backup)
                            .spawn()
                            .ok(); // Ignore errors, tmux will handle them
                    }
                }
                KeyCode::Char('s') => {
-                    // Service start command
+                    // Service start command via SSH with progress display
                    if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
-                        if self.start_command(&hostname, CommandType::ServiceStart, service_name.clone()) {
-                            return Ok(Some(UiCommand::ServiceStart { hostname, service_name }));
-                        }
+                        // Start transition tracking for visual feedback
+                        self.start_command(&hostname, CommandType::ServiceStart, service_name.clone());
+                        
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        let service_start_command = format!(
+                            "bash -c 'cat << \"EOF\"\nService Start: {}.service\nTarget: {} ({})\n\nEOF\nssh -tt {}@{} \"sudo systemctl start {}.service && echo \\\"Service started successfully\\\" && sudo systemctl status {}.service --no-pager -l\"\necho\necho \"========================================\"\necho \"Operation completed. Press any key to close...\"\necho \"========================================\"\nread -n 1 -s\nexit'",
+                            service_name,
+                            hostname,
+                            connection_ip,
+                            self.config.ssh.rebuild_user,
+                            connection_ip,
+                            service_name,
+                            service_name
+                        );
+                        
+                        std::process::Command::new("tmux")
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30")
+                            .arg(&service_start_command)
+                            .spawn()
+                            .ok(); // Ignore errors, tmux will handle them
                    }
                }
                KeyCode::Char('S') => {
-                    // Service stop command
+                    // Service stop command via SSH with progress display
                    if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
-                        if self.start_command(&hostname, CommandType::ServiceStop, service_name.clone()) {
-                            return Ok(Some(UiCommand::ServiceStop { hostname, service_name }));
+                        // Start transition tracking for visual feedback
+                        self.start_command(&hostname, CommandType::ServiceStop, service_name.clone());
+                        
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        let service_stop_command = format!(
+                            "bash -c 'cat << \"EOF\"\nService Stop: {}.service\nTarget: {} ({})\n\nEOF\nssh -tt {}@{} \"sudo systemctl stop {}.service && echo \\\"Service stopped successfully\\\" && sudo systemctl status {}.service --no-pager -l\"\necho\necho \"========================================\"\necho \"Operation completed. Press any key to close...\"\necho \"========================================\"\nread -n 1 -s\nexit'",
+                            service_name,
+                            hostname,
+                            connection_ip,
+                            self.config.ssh.rebuild_user,
+                            connection_ip,
+                            service_name,
+                            service_name
+                        );
+                        
+                        std::process::Command::new("tmux")
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30")
+                            .arg(&service_stop_command)
+                            .spawn()
+                            .ok(); // Ignore errors, tmux will handle them
+                    }
+                }
+                KeyCode::Char('J') => {
+                    // Show service logs via journalctl in tmux split window
+                    if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        let journalctl_command = format!(
+                            "bash -c \"ssh -tt {}@{} 'sudo journalctl -u {}.service -f --no-pager -n 50'; exit\"",
+                            self.config.ssh.rebuild_user,
+                            connection_ip,
+                            service_name
+                        );
+                        
+                        std::process::Command::new("tmux")
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30")
+                            .arg(&journalctl_command)
+                            .spawn()
+                            .ok(); // Ignore errors, tmux will handle them
+                    }
+                }
+                KeyCode::Char('L') => {
+                    // Show custom service log file in tmux split window
+                    if let (Some(service_name), Some(hostname)) = (self.get_selected_service(), self.current_host.clone()) {
+                        // Check if this service has a custom log file configured
+                        if let Some(host_logs) = self.config.service_logs.get(&hostname) {
+                            if let Some(log_config) = host_logs.iter().find(|config| config.service_name == service_name) {
+                                let connection_ip = self.get_connection_ip(&hostname);
+                                let tail_command = format!(
+                                    "bash -c \"ssh -tt {}@{} 'sudo tail -n 50 -f {}'; exit\"",
+                                    self.config.ssh.rebuild_user,
+                                    connection_ip,
+                                    log_config.log_file_path
+                                );
+                                
+                                std::process::Command::new("tmux")
+                                    .arg("split-window")
+                                    .arg("-v")
+                                    .arg("-p")
+                                    .arg("30")
+                                    .arg(&tail_command)
+                                    .spawn()
+                                    .ok(); // Ignore errors, tmux will handle them
+                            }
                        }
                    }
                }
@@ -281,6 +404,53 @@ impl TuiApp {
                        return Ok(Some(UiCommand::TriggerBackup { hostname }));
                    }
                }
+                KeyCode::Char('w') => {
+                    // Wake on LAN for offline hosts
+                    if let Some(hostname) = self.current_host.clone() {
+                        // Check if host has MAC address configured
+                        if let Some(host_details) = self.config.hosts.get(&hostname) {
+                            if let Some(mac_address) = &host_details.mac_address {
+                                // Parse MAC address and send WoL packet
+                                let mac_bytes = Self::parse_mac_address(mac_address);
+                                match mac_bytes {
+                                    Ok(mac) => {
+                                        match MagicPacket::new(&mac).send() {
+                                            Ok(_) => {
+                                                info!("WakeOnLAN packet sent successfully to {} ({})", hostname, mac_address);
+                                            }
+                                            Err(e) => {
+                                                tracing::error!("Failed to send WakeOnLAN packet to {}: {}", hostname, e);
+                                            }
+                                        }
+                                    }
+                                    Err(_) => {
+                                        tracing::error!("Invalid MAC address format for {}: {}", hostname, mac_address);
+                                    }
+                                }
+                            }
+                        }
+                    }
+                }
+                KeyCode::Char('t') => {
+                    // Open SSH terminal session in tmux window
+                    if let Some(hostname) = self.current_host.clone() {
+                        let connection_ip = self.get_connection_ip(&hostname);
+                        let ssh_command = format!(
+                            "ssh -tt {}@{}",
+                            self.config.ssh.rebuild_user,
+                            connection_ip
+                        );
+                        
+                        std::process::Command::new("tmux")
+                            .arg("split-window")
+                            .arg("-v")
+                            .arg("-p")
+                            .arg("30") // Use 30% like other commands
+                            .arg(&ssh_command)
+                            .spawn()
+                            .ok(); // Ignore errors, tmux will handle them
+                    }
+                }
                KeyCode::Tab => {
                    // Tab cycles to next host
                    self.navigate_host(1);
@@ -329,9 +499,8 @@ impl TuiApp {
        self.current_host = Some(self.available_hosts[self.host_index].clone());
        
        // Check if user navigated away from localhost
-        let localhost = gethostname::gethostname().to_string_lossy().to_string();
        if let Some(ref current) = self.current_host {
-            if current != &localhost {
+            if current != &self.localhost {
                self.user_navigated_away = true;
            } else {
                self.user_navigated_away = false; // User navigated back to localhost
@@ -475,6 +644,21 @@ impl TuiApp {
            ])
            .split(main_chunks[1]); // main_chunks[1] is now the content area (between title and statusbar)

+        // Check if current host is offline
+        let current_host_offline = if let Some(hostname) = self.current_host.clone() {
+            self.calculate_host_status(&hostname, metric_store) == Status::Offline
+        } else {
+            true // No host selected is considered offline
+        };
+
+        // If host is offline, render wake-up message instead of panels
+        if current_host_offline {
+            self.render_offline_host_message(frame, main_chunks[1]);
+            self.render_btop_title(frame, main_chunks[0], metric_store);
+            self.render_statusbar(frame, main_chunks[2]);
+            return;
+        }
+
        // Check if backup panel should be shown
        let show_backup = if let Some(hostname) = self.current_host.clone() {
            let host_widgets = self.get_or_create_host_widgets(&hostname);
@@ -542,11 +726,14 @@ impl TuiApp {
            return;
        }

-        // Calculate worst-case status across all hosts
+        // Calculate worst-case status across all hosts (excluding offline)
        let mut worst_status = Status::Ok;
        for host in &self.available_hosts {
            let host_status = self.calculate_host_status(host, metric_store);
-            worst_status = Status::aggregate(&[worst_status, host_status]);
+            // Don't include offline hosts in status aggregation
+            if host_status != Status::Offline {
+                worst_status = Status::aggregate(&[worst_status, host_status]);
+            }
        }

        // Use the worst status color as background
@@ -561,7 +748,7 @@ impl TuiApp {
        // Left side: "cm-dashboard" text
        let left_span = Span::styled(
            " cm-dashboard", 
-            Style::default().fg(Theme::background()).bg(background_color)
+            Style::default().fg(Theme::background()).bg(background_color).add_modifier(Modifier::BOLD)
        );
        let left_title = Paragraph::new(Line::from(vec![left_span]))
            .style(Style::default().bg(background_color));
@@ -624,7 +811,7 @@ impl TuiApp {
        let metrics = metric_store.get_metrics_for_host(hostname);

        if metrics.is_empty() {
-            return Status::Unknown;
+            return Status::Offline;
        }

        // First check if we have the aggregated host status summary from the agent
@@ -644,7 +831,8 @@ impl TuiApp {
                Status::Warning => has_warning = true,
                Status::Pending => has_pending = true,
                Status::Ok => ok_count += 1,
-                Status::Unknown => {} // Ignore unknown for aggregation
+                Status::Unknown => {}, // Ignore unknown for aggregation
+                Status::Offline => {}, // Ignore offline for aggregation
            }
        }

@@ -679,10 +867,13 @@ impl TuiApp {
        let mut shortcuts = Vec::new();
        
        // Global shortcuts
-        shortcuts.push("Tab: Switch Host".to_string());
-        shortcuts.push("↑↓/jk: Select Service".to_string());
-        shortcuts.push("r: Rebuild Host".to_string());
-        shortcuts.push("s/S: Start/Stop Service".to_string());
+        shortcuts.push("Tab: Host".to_string());
+        shortcuts.push("↑↓/jk: Select".to_string());
+        shortcuts.push("r: Rebuild".to_string());
+        shortcuts.push("s/S: Start/Stop".to_string());
+        shortcuts.push("J: Logs".to_string());
+        shortcuts.push("L: Custom".to_string());
+        shortcuts.push("w: Wake".to_string());
        
        // Always show quit
        shortcuts.push("q: Quit".to_string());
@@ -700,8 +891,10 @@ impl TuiApp {
                let host_widgets = self.get_or_create_host_widgets(&hostname);
                host_widgets.system_scroll_offset
            };
+            // Clone the config to avoid borrowing issues
+            let config = self.config.clone();
            let host_widgets = self.get_or_create_host_widgets(&hostname);
-            host_widgets.system_widget.render_with_scroll(frame, inner_area, scroll_offset);
+            host_widgets.system_widget.render_with_scroll(frame, inner_area, scroll_offset, &hostname, Some(&config));
        }
    }

@@ -721,5 +914,100 @@ impl TuiApp {
        }
    }

+    /// Render offline host message with wake-up option
+    fn render_offline_host_message(&self, frame: &mut Frame, area: Rect) {
+        use ratatui::layout::Alignment;
+        use ratatui::style::Modifier;
+        use ratatui::text::{Line, Span};
+        use ratatui::widgets::{Block, Borders, Paragraph};

+        // Get hostname for message
+        let hostname = self.current_host.as_ref()
+            .map(|h| h.as_str())
+            .unwrap_or("Unknown");
+
+        // Check if host has MAC address for wake-on-LAN
+        let has_mac = self.current_host.as_ref()
+            .and_then(|hostname| self.config.hosts.get(hostname))
+            .and_then(|details| details.mac_address.as_ref())
+            .is_some();
+
+        // Create message content
+        let mut lines = vec![
+            Line::from(Span::styled(
+                format!("Host '{}' is offline", hostname),
+                Style::default().fg(Theme::muted_text()).add_modifier(Modifier::BOLD),
+            )),
+            Line::from(""),
+        ];
+
+        if has_mac {
+            lines.push(Line::from(Span::styled(
+                "Press 'w' to wake up host",
+                Style::default().fg(Theme::primary_text()).add_modifier(Modifier::BOLD),
+            )));
+        } else {
+            lines.push(Line::from(Span::styled(
+                "No MAC address configured - cannot wake up",
+                Style::default().fg(Theme::muted_text()),
+            )));
+        }
+
+        // Create centered message
+        let message = Paragraph::new(lines)
+            .block(Block::default()
+                .borders(Borders::ALL)
+                .border_style(Style::default().fg(Theme::muted_text()))
+                .title(" Offline Host ")
+                .title_style(Style::default().fg(Theme::muted_text()).add_modifier(Modifier::BOLD)))
+            .style(Style::default().bg(Theme::background()).fg(Theme::primary_text()))
+            .alignment(Alignment::Center);
+
+        // Center the message in the available area
+        let popup_area = ratatui::layout::Layout::default()
+            .direction(Direction::Vertical)
+            .constraints([
+                Constraint::Percentage(40),
+                Constraint::Length(6),
+                Constraint::Percentage(40),
+            ])
+            .split(area)[1];
+
+        let popup_area = ratatui::layout::Layout::default()
+            .direction(Direction::Horizontal)
+            .constraints([
+                Constraint::Percentage(25),
+                Constraint::Percentage(50),
+                Constraint::Percentage(25),
+            ])
+            .split(popup_area)[1];
+
+        frame.render_widget(message, popup_area);
+    }
+
+    /// Get the connection IP for a hostname based on host configuration
+    fn get_connection_ip(&self, hostname: &str) -> String {
+        if let Some(host_details) = self.config.hosts.get(hostname) {
+            host_details.get_connection_ip(hostname)
+        } else {
+            hostname.to_string()
+        }
+    }
+
+    /// Parse MAC address string (e.g., "AA:BB:CC:DD:EE:FF") to [u8; 6]
+    fn parse_mac_address(mac_str: &str) -> Result<[u8; 6], &'static str> {
+        let parts: Vec<&str> = mac_str.split(':').collect();
+        if parts.len() != 6 {
+            return Err("MAC address must have 6 parts separated by colons");
+        }
+
+        let mut mac = [0u8; 6];
+        for (i, part) in parts.iter().enumerate() {
+            match u8::from_str_radix(part, 16) {
+                Ok(byte) => mac[i] = byte,
+                Err(_) => return Err("Invalid hexadecimal byte in MAC address"),
+            }
+        }
+        Ok(mac)
+    }
 }
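
Note on `parse_mac_address` above: it relies on `u8::from_str_radix`, which, as far as the standard library documentation describes, also accepts a leading `+` and single-digit parts, so inputs such as `+A:B:...` would pass. A stricter variant is sketched below under that assumption. The name `parse_mac_strict` is hypothetical and not part of this commit.

```rust
/// Parse "AA:BB:CC:DD:EE:FF" into six bytes.
/// Unlike the version in the diff, every part must be exactly two hex digits.
fn parse_mac_strict(mac_str: &str) -> Result<[u8; 6], &'static str> {
    let mut mac = [0u8; 6];
    let mut count = 0;

    for part in mac_str.split(':') {
        if count == 6 {
            return Err("MAC address must have 6 parts separated by colons");
        }
        // Reject signs, whitespace, and parts that are too short or too long.
        if part.len() != 2 || !part.bytes().all(|b| b.is_ascii_hexdigit()) {
            return Err("each part must be exactly two hexadecimal digits");
        }
        mac[count] = u8::from_str_radix(part, 16)
            .map_err(|_| "Invalid hexadecimal byte in MAC address")?;
        count += 1;
    }

    if count != 6 {
        return Err("MAC address must have 6 parts separated by colons");
    }
    Ok(mac)
}
```

Whether the extra strictness is worth it depends on where `mac_address` values come from. For a hand-edited configuration file, rejecting malformed entries early gives a clearer error than a magic packet sent to the wrong address.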
--- a/dashboard/src/ui/theme.rs
+++ b/dashboard/src/ui/theme.rs
@@ -147,6 +147,7 @@ impl Theme {
            Status::Warning => Self::warning(),
            Status::Critical => Self::error(),
            Status::Unknown => Self::muted_text(),
+            Status::Offline => Self::muted_text(), // Dark gray for offline
        }
    }

@@ -244,8 +245,9 @@ impl StatusIcons {
            Status::Ok => "●",
            Status::Pending => "◉", // Hollow circle for pending
            Status::Warning => "◐",
-            Status::Critical => "◯",
+            Status::Critical => "!",
            Status::Unknown => "?",
+            Status::Offline => "○", // Empty circle for offline
        }
    }

@@ -258,6 +260,7 @@ impl StatusIcons {
            Status::Warning => Theme::warning(),    // Yellow
            Status::Critical => Theme::error(),     // Red
            Status::Unknown => Theme::muted_text(), // Gray
+            Status::Offline => Theme::muted_text(), // Dark gray for offline
        };

        vec![
@@ -292,12 +295,6 @@ impl Components {
 }

 impl Typography {
-    /// Main title style (dashboard header)
-    pub fn title() -> Style {
-        Style::default()
-            .fg(Theme::primary_text())
-            .bg(Theme::background())
-    }

    /// Widget title style (panel headers) - bold bright white
    pub fn widget_title() -> Style {
--- a/dashboard/src/ui/widgets/services.rs
+++ b/dashboard/src/ui/widgets/services.rs
@@ -113,13 +113,10 @@ impl ServicesWidget {
            name.to_string()
        };

-        // Parent services always show active/inactive status
+        // Parent services always show actual systemctl status
        let status_str = match info.widget_status {
-            Status::Ok => "active".to_string(),
            Status::Pending => "pending".to_string(),
-            Status::Warning => "inactive".to_string(),
-            Status::Critical => "failed".to_string(),
-            Status::Unknown => "unknown".to_string(),
+            _ => info.status.clone(), // Use actual status from agent (active/inactive/failed)
        };

        format!(
@@ -149,6 +146,7 @@ impl ServicesWidget {
            Status::Warning => Theme::warning(),
            Status::Critical => Theme::error(),
            Status::Unknown => Theme::muted_text(),
+            Status::Offline => Theme::muted_text(),
        };
        
        (icon.to_string(), info.status.clone(), status_color)
--- a/dashboard/src/ui/widgets/system.rs
+++ b/dashboard/src/ui/widgets/system.rs
@@ -439,12 +439,12 @@ impl Widget for SystemWidget {

 impl SystemWidget {
    /// Render with scroll offset support
-    pub fn render_with_scroll(&mut self, frame: &mut Frame, area: Rect, scroll_offset: usize) {
+    pub fn render_with_scroll(&mut self, frame: &mut Frame, area: Rect, scroll_offset: usize, hostname: &str, config: Option<&crate::config::DashboardConfig>) {
        let mut lines = Vec::new();

        // NixOS section
        lines.push(Line::from(vec![
-            Span::styled("NixOS:", Typography::widget_title())
+            Span::styled(format!("NixOS {}:", hostname), Typography::widget_title())
        ]));
        
        let build_text = self.nixos_build.as_deref().unwrap_or("unknown");
@@ -457,6 +457,16 @@ impl SystemWidget {
            Span::styled(format!("Agent: {}", agent_version_text), Typography::secondary())
        ]));
        
+        // Display detected connection IP
+        if let Some(config) = config {
+            if let Some(host_details) = config.hosts.get(hostname) {
+                let detected_ip = host_details.get_connection_ip(hostname);
+                lines.push(Line::from(vec![
+                    Span::styled(format!("IP: {}", detected_ip), Typography::secondary())
+                ]));
+            }
+        }
+        

        // CPU section
        lines.push(Line::from(vec![
--- a/hardcoded_values_removed.md
+++ b/hardcoded_values_removed.md
@@ -1,88 +0,0 @@
-# Hardcoded Values Removed - Configuration Summary
-
-## ✅ All Hardcoded Values Converted to Configuration
-
-### **1. SystemD Nginx Check Interval**
- **Before**: `nginx_check_interval_seconds: 30` (hardcoded)
- **After**: `nginx_check_interval_seconds: config.nginx_check_interval_seconds`
- **NixOS Config**: `nginx_check_interval_seconds = 30;`
-
-### **2. ZMQ Transmission Interval**  
- **Before**: `Duration::from_secs(1)` (hardcoded)
- **After**: `Duration::from_secs(self.config.zmq.transmission_interval_seconds)`
- **NixOS Config**: `transmission_interval_seconds = 1;`
-
-### **3. HTTP Timeouts in SystemD Collector**
- **Before**: 
-  ```rust
-  .timeout(Duration::from_secs(10))
-  .connect_timeout(Duration::from_secs(10))
-  ```
- **After**:
-  ```rust
-  .timeout(Duration::from_secs(self.config.http_timeout_seconds))
-  .connect_timeout(Duration::from_secs(self.config.http_connect_timeout_seconds))
-  ```
- **NixOS Config**: 
-  ```nix
-  http_timeout_seconds = 10;
-  http_connect_timeout_seconds = 10;
-  ```
-
-## **Configuration Structure Changes**
-
-### **SystemdConfig** (agent/src/config/mod.rs)
-```rust
-pub struct SystemdConfig {
-    // ... existing fields ...
-    pub nginx_check_interval_seconds: u64,      // NEW
-    pub http_timeout_seconds: u64,              // NEW
-    pub http_connect_timeout_seconds: u64,      // NEW
-}
-```
-
-### **ZmqConfig** (agent/src/config/mod.rs)
-```rust
-pub struct ZmqConfig {
-    // ... existing fields ...
-    pub transmission_interval_seconds: u64,     // NEW
-}
-```
-
-## **NixOS Configuration Updates**
-
-### **ZMQ Section** (hosts/common/cm-dashboard.nix)
-```nix
-zmq = {
-  # ... existing fields ...
-  transmission_interval_seconds = 1;           # NEW
-};
-```
-
-### **SystemD Section** (hosts/common/cm-dashboard.nix)
-```nix
-systemd = {
-  # ... existing fields ...
-  nginx_check_interval_seconds = 30;           # NEW  
-  http_timeout_seconds = 10;                   # NEW
-  http_connect_timeout_seconds = 10;           # NEW
-};
-```
-
-## **Benefits**
-
-✅ **No hardcoded values** - All timing/timeout values configurable  
-✅ **Consistent configuration** - Everything follows NixOS config pattern  
-✅ **Environment-specific tuning** - Can adjust timeouts per deployment  
-✅ **Maintainability** - No magic numbers scattered in code  
-✅ **Testing flexibility** - Can configure different values for testing  
-
-## **Runtime Behavior**
-
-All previously hardcoded values now respect configuration:
- **Nginx latency checks**: Every 30s (configurable)
- **ZMQ transmission**: Every 1s (configurable)  
- **HTTP requests**: 10s timeout (configurable)
- **HTTP connections**: 10s timeout (configurable)
-
-The codebase is now **100% configuration-driven** with no hardcoded timing values.
--- a/shared/Cargo.toml
+++ b/shared/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "cm-dashboard-shared"
-version = "0.1.38"
+version = "0.1.74"
 edition = "2021"

 [dependencies]
--- a/shared/src/metrics.rs
+++ b/shared/src/metrics.rs
@@ -87,6 +87,7 @@ pub enum Status {
    Warning,
    Critical,
    Unknown,
+    Offline,
 }

 impl Status {
@@ -190,6 +191,16 @@ impl HysteresisThresholds {
                    Status::Ok
                }
            }
+            Status::Offline => {
+                // Host coming back online, use normal thresholds like first measurement
+                if value >= self.critical_high {
+                    Status::Critical
+                } else if value >= self.warning_high {
+                    Status::Warning
+                } else {
+                    Status::Ok
+                }
+            }
        }
    }
 }
--- a/test_intervals.sh
+++ b/test_intervals.sh
@@ -1,42 +0,0 @@
-#!/bin/bash
-
-# Test script to verify collector intervals are working correctly
-# Expected behavior:
-# - CPU/Memory: Every 2 seconds
-# - Systemd/Network: Every 10 seconds  
-# - Backup/NixOS: Every 60 seconds
-# - Disk: Every 300 seconds (5 minutes)
-
-echo "=== Testing Collector Interval Implementation ==="
-echo "Expected intervals from NixOS config:"
-echo "  CPU: 2s, Memory: 2s"
-echo "  Systemd: 10s, Network: 10s" 
-echo "  Backup: 60s, NixOS: 60s"
-echo "  Disk: 300s (5m)"
-echo ""
-
-# Note: Cannot run actual agent without proper config, but we can verify the code logic
-echo "✅ Code Implementation Status:"
-echo "  - TimedCollector struct with interval tracking: IMPLEMENTED"
-echo "  - Individual collector intervals from config: IMPLEMENTED"  
-echo "  - collect_metrics_timed() respects intervals: IMPLEMENTED"
-echo "  - Debug logging shows interval compliance: IMPLEMENTED"
-echo ""
-
-echo "🔍 Key Implementation Details:"
-echo "  - MetricCollectionManager now tracks last_collection time per collector"
-echo "  - Each collector gets Duration::from_secs(config.{collector}.interval_seconds)"
-echo "  - Only collectors with elapsed >= interval are called"
-echo "  - Debug logs show actual collection with interval info"
-echo ""
-
-echo "📊 Expected Runtime Behavior:"
-echo "  At 0s:  All collectors run (startup)"
-echo "  At 2s:  CPU, Memory run"
-echo "  At 4s:  CPU, Memory run"  
-echo "  At 10s: CPU, Memory, Systemd, Network run"
-echo "  At 60s: CPU, Memory, Systemd, Network, Backup, NixOS run"
-echo "  At 300s: All collectors run including Disk"
-echo ""
-
-echo "✅ CONCLUSION: Codebase now follows NixOS configuration intervals correctly!"
--- a/test_tmux_check.rs
+++ b/test_tmux_check.rs
@@ -1,32 +0,0 @@
-#!/usr/bin/env rust-script
-
-use std::process;
-
-/// Check if running inside tmux session
-fn check_tmux_session() {
-    // Check for TMUX environment variable which is set when inside a tmux session
-    if std::env::var("TMUX").is_err() {
-        eprintln!("╭─────────────────────────────────────────────────────────────╮");
-        eprintln!("│                        ⚠️  TMUX REQUIRED                      │");
-        eprintln!("├─────────────────────────────────────────────────────────────┤");
-        eprintln!("│  CM Dashboard must be run inside a tmux session for proper   │");
-        eprintln!("│  terminal handling and remote operation functionality.       │");
-        eprintln!("│                                                             │");
-        eprintln!("│  Please start a tmux session first:                        │");
-        eprintln!("│    tmux new-session -d -s dashboard cm-dashboard           │");
-        eprintln!("│    tmux attach-session -t dashboard                        │");
-        eprintln!("│                                                             │");
-        eprintln!("│  Or simply:                                                 │");
-        eprintln!("│    tmux                                                     │");
-        eprintln!("│    cm-dashboard                                             │");
-        eprintln!("╰─────────────────────────────────────────────────────────────╯");
-        process::exit(1);
-    } else {
-        println!("✅ Running inside tmux session - OK");
-    }
-}
-
-fn main() {
-    println!("Testing tmux check function...");
-    check_tmux_session();
-}
--- a/test_tmux_simulation.sh
+++ b/test_tmux_simulation.sh
@@ -1,53 +0,0 @@
-#!/bin/bash
-
-echo "=== TMUX Check Implementation Test ==="
-echo ""
-
-echo "📋 Testing tmux check logic:"
-echo ""
-
-echo "1. Current environment:"
-if [ -n "$TMUX" ]; then
-    echo "   ✅ Running inside tmux session"
-    echo "   TMUX variable: $TMUX"
-else
-    echo "   ❌ NOT running inside tmux session"
-    echo "   TMUX variable: (not set)"
-fi
-echo ""
-
-echo "2. Simulating dashboard tmux check logic:"
-echo ""
-
-# Simulate the Rust check logic
-if [ -z "$TMUX" ]; then
-    echo "   Dashboard would show:"
-    echo "   ╭─────────────────────────────────────────────────────────────╮"
-    echo "   │                        ⚠️  TMUX REQUIRED                      │"
-    echo "   ├─────────────────────────────────────────────────────────────┤"
-    echo "   │  CM Dashboard must be run inside a tmux session for proper   │"
-    echo "   │  terminal handling and remote operation functionality.       │"
-    echo "   │                                                             │"
-    echo "   │  Please start a tmux session first:                        │"
-    echo "   │    tmux new-session -d -s dashboard cm-dashboard           │"
-    echo "   │    tmux attach-session -t dashboard                        │"
-    echo "   │                                                             │"
-    echo "   │  Or simply:                                                 │"
-    echo "   │    tmux                                                     │"
-    echo "   │    cm-dashboard                                             │"
-    echo "   ╰─────────────────────────────────────────────────────────────╯"
-    echo "   Then exit with code 1"
-else
-    echo "   ✅ Dashboard tmux check would PASS - continuing normally"
-fi
-echo ""
-
-echo "3. Implementation status:"
-echo "   ✅ check_tmux_session() function added to dashboard/src/main.rs"
-echo "   ✅ Called early in main() but only for TUI mode (not headless)"
-echo "   ✅ Uses std::env::var(\"TMUX\") to detect tmux session"
-echo "   ✅ Shows helpful error message with usage instructions"
-echo "   ✅ Exits with code 1 if not in tmux"
-echo ""
-
-echo "✅ TMUX check implementation complete!"