# CM Dashboard Agent Architecture

## Overview

This document defines the architecture for the CM Dashboard Agent. The agent collects individual metrics and sends them to the dashboard via ZMQ. The dashboard decides which metrics to use in which widgets.

## Core Philosophy

**Individual Metrics Approach**: The agent collects and transmits individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`, `backup_last_run`) rather than grouped metric structures. This provides maximum flexibility for dashboard widget composition.

## Folder Structure

```
cm-dashboard/
├── agent/                              # Agent application
│   ├── Cargo.toml
│   ├── src/
│   │   ├── main.rs                     # Entry point with CLI parsing
│   │   ├── agent.rs                    # Main Agent orchestrator
│   │   ├── config/
│   │   │   ├── mod.rs                  # Configuration module exports
│   │   │   ├── loader.rs               # TOML configuration loading
│   │   │   ├── defaults.rs             # Default configuration values
│   │   │   └── validation.rs           # Configuration validation
│   │   ├── communication/
│   │   │   ├── mod.rs                  # Communication module exports
│   │   │   ├── zmq_config.rs           # ZMQ configuration structures
│   │   │   ├── zmq_handler.rs          # ZMQ socket management
│   │   │   ├── protocol.rs             # Message format definitions
│   │   │   └── error.rs                # Communication errors
│   │   ├── metrics/
│   │   │   ├── mod.rs                  # Metrics module exports
│   │   │   ├── registry.rs             # Metric name registry and types
│   │   │   ├── value.rs                # Metric value types and status
│   │   │   ├── cache.rs                # Individual metric caching
│   │   │   └── collection.rs           # Metric collection storage
│   │   ├── collectors/
│   │   │   ├── mod.rs                  # Collector trait definition
│   │   │   ├── cpu.rs                  # CPU-related metrics
│   │   │   ├── memory.rs               # Memory-related metrics
│   │   │   ├── disk.rs                 # Disk usage metrics
│   │   │   ├── processes.rs            # Process-related metrics
│   │   │   ├── systemd.rs              # Systemd service metrics
│   │   │   ├── smart.rs                # Storage SMART metrics
│   │   │   ├── backup.rs               # Backup status metrics
│   │   │   ├── network.rs              # Network metrics
│   │   │   └── error.rs                # Collector errors
│   │   ├── notifications/
│   │   │   ├── mod.rs                  # Notification exports
│   │   │   ├── manager.rs              # Status change detection
│   │   │   ├── email.rs                # Email notification backend
│   │   │   └── status_tracker.rs       # Individual metric status tracking
│   │   └── utils/
│   │       ├── mod.rs                  # Utility exports
│   │       ├── system.rs               # System command utilities
│   │       ├── time.rs                 # Timestamp utilities
│   │       └── discovery.rs            # Auto-discovery functions
│   ├── config/
│   │   ├── agent.example.toml          # Example configuration
│   │   └── production.toml             # Production template
│   └── tests/
│       ├── integration/                # Integration tests
│       ├── unit/                       # Unit tests by module
│       └── fixtures/                   # Test data and mocks
├── dashboard/                          # Dashboard application
│   ├── Cargo.toml
│   ├── src/
│   │   ├── main.rs                     # Entry point with CLI parsing
│   │   ├── app.rs                      # Main Dashboard application state
│   │   ├── config/
│   │   │   ├── mod.rs                  # Configuration module exports
│   │   │   ├── loader.rs               # TOML configuration loading
│   │   │   └── defaults.rs             # Default configuration values
│   │   ├── communication/
│   │   │   ├── mod.rs                  # Communication module exports
│   │   │   ├── zmq_consumer.rs         # ZMQ metric consumer
│   │   │   ├── protocol.rs             # Shared message protocol
│   │   │   └── error.rs                # Communication errors
│   │   ├── metrics/
│   │   │   ├── mod.rs                  # Metrics module exports
│   │   │   ├── store.rs                # Metric storage and retrieval
│   │   │   ├── filter.rs               # Metric filtering and selection
│   │   │   ├── history.rs              # Historical metric storage
│   │   │   └── subscription.rs         # Metric subscription management
│   │   ├── ui/
│   │   │   ├── mod.rs                  # UI module exports
│   │   │   ├── app.rs                  # Main UI application loop
│   │   │   ├── layout.rs               # Layout management
│   │   │   ├── widgets/
│   │   │   │   ├── mod.rs              # Widget exports
│   │   │   │   ├── base.rs             # Base widget trait
│   │   │   │   ├── cpu.rs              # CPU metrics widget
│   │   │   │   ├── memory.rs           # Memory metrics widget
│   │   │   │   ├── storage.rs          # Storage metrics widget
│   │   │   │   ├── services.rs         # Services metrics widget
│   │   │   │   ├── backup.rs           # Backup metrics widget
│   │   │   │   ├── hosts.rs            # Host selection widget
│   │   │   │   └── alerts.rs           # Alerts/status widget
│   │   │   ├── theme.rs                # UI theming and colors
│   │   │   └── input.rs                # Input handling
│   │   ├── hosts/
│   │   │   ├── mod.rs                  # Host management exports
│   │   │   ├── manager.rs              # Host connection management
│   │   │   ├── discovery.rs            # Host auto-discovery
│   │   │   └── connection.rs           # Individual host connections
│   │   └── utils/
│   │       ├── mod.rs                  # Utility exports
│   │       ├── formatting.rs           # Data formatting utilities
│   │       └── time.rs                 # Time formatting utilities
│   ├── config/
│   │   ├── dashboard.example.toml      # Example configuration
│   │   └── hosts.example.toml          # Example host configuration
│   └── tests/
│       ├── integration/                # Integration tests
│       ├── unit/                       # Unit tests by module
│       └── fixtures/                   # Test data and mocks
├── shared/                             # Shared types and utilities
│   ├── Cargo.toml
│   ├── src/
│   │   ├── lib.rs                      # Shared library exports
│   │   ├── protocol.rs                 # Shared message protocol
│   │   ├── metrics.rs                  # Shared metric types
│   │   └── error.rs                    # Shared error types
└── tests/                              # End-to-end tests
    ├── e2e/                            # End-to-end test scenarios
    └── fixtures/                       # Shared test data
```

## Architecture Principles

### 1. Individual Metrics Philosophy

**No Grouped Structures**: Instead of `SystemMetrics` or `BackupMetrics`, we collect individual metrics:

```rust
// Good - Individual metrics
"cpu_load_1min" -> 2.5
"cpu_load_5min" -> 2.8
"cpu_temperature_celsius" -> 45.0
"memory_usage_percent" -> 78.5
"memory_total_gb" -> 32.0
"disk_root_usage_percent" -> 15.2
"service_ssh_status" -> "active"
"backup_last_run_timestamp" -> 1697123456

// Bad - Grouped structures
SystemMetrics { cpu: {...}, memory: {...} }
```

**Dashboard Flexibility**: The dashboard consumes individual metrics and decides which ones to display in each widget.

### 2. Metric Definition

Each metric has:
- **Name**: Unique identifier (e.g., `cpu_load_1min`)
- **Value**: Typed value (f32, i64, String, bool)
- **Status**: Health status (ok, warning, critical, unknown)
- **Timestamp**: When the metric was collected
- **Metadata**: Optional description, units, etc.

### 3. Module Responsibilities

- **Communication**: ZMQ protocol and message handling
- **Metrics**: Value types, caching, and storage
- **Collectors**: Gather specific metrics from system
- **Notifications**: Track status changes across all metrics
- **Config**: Configuration loading and validation

### 4. Data Flow

```
Collectors → Individual Metrics → Cache → ZMQ → Dashboard
     ↓               ↓               ↓
Status Calc → Status Tracker → Notifications
```

## Metric Design Rules

### 1. Naming Convention

Metrics follow hierarchical naming:

```
{category}_{subcategory}_{property}_{unit}

Examples:
cpu_load_1min
cpu_temperature_celsius
memory_usage_percent
memory_total_gb
disk_root_usage_percent
disk_nvme0_temperature_celsius
service_ssh_status
service_ssh_memory_mb
backup_last_run_timestamp
backup_status
network_eth0_rx_bytes
```

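The naming pattern above can be sketched as a small helper that joins whichever segments are present. This is only an illustration: `metric_name` and `is_valid_metric_name` are hypothetical names, and the validation rules (lowercase ASCII, digits, single underscores) are assumptions inferred from the examples, not something the registry is known to enforce.

```rust
/// Joins the non-empty segments of a metric name with underscores.
/// Hypothetical helper; not part of the agent's registry API.
pub fn metric_name(segments: &[&str]) -> String {
    segments
        .iter()
        .map(|s| s.trim().to_ascii_lowercase())
        .filter(|s| !s.is_empty())
        .collect::<Vec<_>>()
        .join("_")
}

/// Returns true if `name` uses only lowercase ASCII letters, digits and
/// single underscores, and neither starts nor ends with an underscore.
/// These rules are an assumption based on the example names.
pub fn is_valid_metric_name(name: &str) -> bool {
    !name.is_empty()
        && !name.starts_with('_')
        && !name.ends_with('_')
        && !name.contains("__")
        && name
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_')
}
```

For example, `metric_name(&["disk", "nvme0", "temperature", "celsius"])` yields `disk_nvme0_temperature_celsius`, and segments that do not apply (such as a missing subcategory) can be passed as empty strings.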
### 2. Value Types

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MetricValue {
    Float(f32),
    Integer(i64),
    String(String),
    Boolean(bool),
}

// Copy and the ordering traits are needed by widgets that aggregate
// statuses with `.max().copied()`. The derived ordering follows declaration
// order: Ok < Warning < Critical < Unknown.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}
```

### 3. Collector Interface

Each collector provides individual metrics:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait Collector {
    fn name(&self) -> &str;
    async fn collect(&self) -> Result<Vec<Metric>>;
}

// Example CPU collector output:
vec![
    Metric { name: "cpu_load_1min", value: Float(2.5), status: Ok, ... },
    Metric { name: "cpu_load_5min", value: Float(2.8), status: Ok, ... },
    Metric { name: "cpu_temperature_celsius", value: Float(45.0), status: Ok, ... },
]
```

## Communication Protocol

### ZMQ Message Format

```rust
#[derive(Debug, Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}
```

### ZMQ Configuration

```rust
#[derive(Debug, Deserialize)]
pub struct ZmqConfig {
    pub publisher_port: u16,        // Default: 6130
    pub command_port: u16,          // Default: 6131
    pub bind_address: String,       // Default: "0.0.0.0"
    pub timeout_ms: u64,            // Default: 5000
    pub heartbeat_interval: u64,    // Default: 30000
}
```

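The defaults noted in the comments above could be expressed through a `Default` implementation. The sketch below is one possible shape, not the agent's actual code: it drops the `Deserialize` derive so it compiles with the standard library alone, and `publisher_endpoint` is a hypothetical convenience method that formats the usual ZeroMQ `tcp://host:port` endpoint string.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ZmqConfig {
    pub publisher_port: u16,
    pub command_port: u16,
    pub bind_address: String,
    pub timeout_ms: u64,
    pub heartbeat_interval: u64,
}

impl Default for ZmqConfig {
    /// Mirrors the default values listed in the struct comments.
    fn default() -> Self {
        Self {
            publisher_port: 6130,
            command_port: 6131,
            bind_address: "0.0.0.0".to_string(),
            timeout_ms: 5000,
            heartbeat_interval: 30000,
        }
    }
}

impl ZmqConfig {
    /// Hypothetical helper: endpoint the publisher socket would bind to.
    pub fn publisher_endpoint(&self) -> String {
        format!("tcp://{}:{}", self.bind_address, self.publisher_port)
    }
}
```

With a `Default` in place, a configuration loader can start from `ZmqConfig::default()` and override only the fields present in the TOML file.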
## Caching Strategy

### Configuration-Based Individual Metric Cache

```rust
pub struct MetricCache {
    cache: HashMap<String, CachedMetric>,
    config: CacheConfig,
}

struct CachedMetric {
    metric: Metric,
    collected_at: Instant,
    access_count: u64,
    cache_tier: CacheTier,
}

#[derive(Debug, Deserialize)]
pub struct CacheConfig {
    pub enabled: bool,
    pub default_ttl_seconds: u64,
    pub max_entries: usize,
    pub metric_tiers: HashMap<String, CacheTier>,
}

#[derive(Debug, Deserialize, Clone)]
pub struct CacheTier {
    pub interval_seconds: u64,
    pub description: String,
}
```

**Configuration-Based Caching Rules**:
- Each metric type has configurable cache intervals via config files
- Cache tiers defined in configuration, not hardcoded
- Individual metrics cached by name with tier-specific TTL
- Cache miss triggers single metric collection
- No grouped cache invalidation
- Performance target: <2% CPU usage through intelligent caching

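The caching rules above (per-name entries, tier-specific TTL, a miss triggering collection of a single metric) can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `TtlCache` and `Entry` are hypothetical stand-ins for `MetricCache` and `CachedMetric`, values are bare `f32`s instead of full `Metric` structs, and the current time is passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Simplified stand-in for `CachedMetric`: a bare value plus freshness data.
struct Entry {
    value: f32,
    collected_at: Instant,
    ttl: Duration,
}

/// Hypothetical per-metric cache keyed by metric name.
pub struct TtlCache {
    entries: HashMap<String, Entry>,
    default_ttl: Duration,
}

impl TtlCache {
    pub fn new(default_ttl: Duration) -> Self {
        Self { entries: HashMap::new(), default_ttl }
    }

    /// Returns the cached value if it is still fresh at `now`.
    pub fn get(&self, name: &str, now: Instant) -> Option<f32> {
        let entry = self.entries.get(name)?;
        if now.duration_since(entry.collected_at) < entry.ttl {
            Some(entry.value)
        } else {
            None
        }
    }

    /// On a miss or an expired entry, collects only this one metric and
    /// stores it with the tier TTL (or the default TTL if none is given).
    pub fn get_or_collect<F>(
        &mut self,
        name: &str,
        tier_ttl: Option<Duration>,
        now: Instant,
        collect: F,
    ) -> f32
    where
        F: FnOnce() -> f32,
    {
        if let Some(value) = self.get(name, now) {
            return value;
        }
        let value = collect();
        let ttl = tier_ttl.unwrap_or(self.default_ttl);
        self.entries
            .insert(name.to_string(), Entry { value, collected_at: now, ttl });
        value
    }
}
```

Because each entry carries its own TTL, refreshing `cpu_load_1min` never invalidates a slow metric such as a SMART reading, which is the "no grouped cache invalidation" rule in practice.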
## Configuration System

### Configuration Structure

```toml
[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000

[cache]
enabled = true
default_ttl_seconds = 30
max_entries = 10000

# Cache tiers for different metric types
[cache.tiers.realtime]
interval_seconds = 5
description = "High-frequency metrics (CPU load, memory usage)"

[cache.tiers.fast]
interval_seconds = 30
description = "Medium-frequency metrics (network stats, process lists)"

[cache.tiers.medium]
interval_seconds = 300
description = "Low-frequency metrics (service status, disk usage)"

[cache.tiers.slow]
interval_seconds = 900
description = "Very low-frequency metrics (SMART data, backup status)"

[cache.tiers.static]
interval_seconds = 3600
description = "Rarely changing metrics (hardware info, system capabilities)"

# Metric type to tier mapping
[cache.metric_assignments]
"cpu_load_*" = "realtime"
"memory_usage_*" = "realtime"
"service_*_cpu_percent" = "realtime"
"service_*_memory_mb" = "realtime"
"service_*_status" = "medium"
"service_*_disk_gb" = "medium"
"disk_*_temperature" = "slow"
"disk_*_wear_percent" = "slow"
"backup_*" = "slow"
"network_*" = "fast"

[collectors.cpu]
enabled = true
interval_seconds = 5
temperature_warning = 70.0
temperature_critical = 80.0
load_warning = 5.0
load_critical = 8.0

[collectors.memory]
enabled = true
interval_seconds = 5
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.systemd]
enabled = true
interval_seconds = 30
services = ["ssh", "nginx", "docker", "gitea"]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{{hostname}}@cmtec.se"
to_email = "cm@cmtec.se"
rate_limit_minutes = 30
```

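The `[cache.metric_assignments]` table maps wildcard patterns such as `service_*_status` to tier names. One way to resolve a metric name to its tier is sketched below. The function names `wildcard_match` and `tier_for` are hypothetical, and two behaviours are assumptions the configuration does not spell out: `*` matches any run of characters (including none), and the first matching pattern wins, which is why the assignments are passed as an ordered slice rather than a `HashMap`.

```rust
/// Matches `name` against `pattern`, where `*` matches any run of
/// characters, including an empty one. No other wildcards are supported.
pub fn wildcard_match(pattern: &str, name: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        // No wildcard at all: require an exact match.
        return pattern == name;
    }
    let first = parts[0];
    let last = parts[parts.len() - 1];
    if !name.starts_with(first) {
        return false;
    }
    let mut rest = &name[first.len()..];
    // Each middle literal must appear, in order, in what remains.
    for part in &parts[1..parts.len() - 1] {
        match rest.find(part) {
            Some(idx) => rest = &rest[idx + part.len()..],
            None => return false,
        }
    }
    rest.ends_with(last)
}

/// Returns the tier of the first pattern that matches `name`.
/// First-match-wins is an assumption, not documented behaviour.
pub fn tier_for<'a>(assignments: &[(&'a str, &'a str)], name: &str) -> Option<&'a str> {
    assignments
        .iter()
        .find(|(pattern, _)| wildcard_match(pattern, name))
        .map(|(_, tier)| *tier)
}
```

A metric that matches no pattern returns `None`, at which point a cache would presumably fall back to `default_ttl_seconds`.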
## Implementation Guidelines

### 1. Adding New Metrics

```rust
// 1. Define metric names in registry
pub const NETWORK_ETH0_RX_BYTES: &str = "network_eth0_rx_bytes";
pub const NETWORK_ETH0_TX_BYTES: &str = "network_eth0_tx_bytes";

// 2. Implement collector
pub struct NetworkCollector {
    config: NetworkConfig,
}

#[async_trait]
impl Collector for NetworkCollector {
    fn name(&self) -> &str {
        "network"
    }

    async fn collect(&self) -> Result<Vec<Metric>> {
        // `read_rx_bytes` is a hypothetical helper that reads the interface counter.
        let rx_bytes: i64 = self.read_rx_bytes("eth0")?;

        Ok(vec![
            Metric {
                name: NETWORK_ETH0_RX_BYTES.to_string(),
                value: MetricValue::Integer(rx_bytes),
                status: Status::Ok,
                timestamp: now(),
                description: None,
                unit: Some("bytes".to_string()),
            },
            // ... more metrics
        ])
    }
}

// 3. Register in agent
agent.register_collector(Box::new(NetworkCollector::new(config.network)));
```

### 2. Status Calculation

Each collector calculates status for its metrics:

```rust
impl CpuCollector {
    // Thresholds come from `[collectors.cpu]` (temperature_warning / temperature_critical).
    fn calculate_temperature_status(&self, temp: f32) -> Status {
        if temp >= self.config.temperature_critical {
            Status::Critical
        } else if temp >= self.config.temperature_warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}
```

### 3. Dashboard Usage

Dashboard widgets subscribe to specific metrics:

```rust
// Dashboard CPU widget
let cpu_metrics = [
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius",
];

// Dashboard memory widget
let memory_metrics = [
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb",
];
```

# Dashboard Architecture

## Dashboard Principles

### 1. UI Layout Preservation

**Current UI Layout Maintained**: The existing dashboard UI layout is preserved and enhanced with the new metric-centric architecture. All current widgets keep their established positions and functionality.

**Widget Enhancement, Not Replacement**: Widgets are enhanced to consume individual metrics rather than grouped structures, but they keep their visual appearance and user interaction patterns.

### 2. Metric-to-Widget Mapping

Each widget subscribes to specific individual metrics and composes them for display:

```rust
// CPU Widget Metrics
const CPU_WIDGET_METRICS: &[&str] = &[
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius",
    "cpu_frequency_mhz",
    "cpu_usage_percent",
];

// Memory Widget Metrics
const MEMORY_WIDGET_METRICS: &[&str] = &[
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb",
    "memory_used_gb",
    "memory_swap_total_gb",
    "memory_swap_used_gb",
];

// Storage Widget Metrics
const STORAGE_WIDGET_METRICS: &[&str] = &[
    "disk_nvme0_temperature_celsius",
    "disk_nvme0_wear_percent",
    "disk_nvme0_spare_percent",
    "disk_nvme0_hours",
    "disk_nvme0_capacity_gb",
    "disk_nvme0_usage_gb",
    "disk_nvme0_usage_percent",
];

// Services Widget Metrics
const SERVICES_WIDGET_METRICS: &[&str] = &[
    "service_ssh_status",
    "service_ssh_memory_mb",
    "service_ssh_cpu_percent",
    "service_nginx_status",
    "service_nginx_memory_mb",
    "service_docker_status",
    // ... per discovered service
];

// Backup Widget Metrics
const BACKUP_WIDGET_METRICS: &[&str] = &[
    "backup_last_run_timestamp",
    "backup_status",
    "backup_size_gb",
    "backup_duration_minutes",
    "backup_next_scheduled_timestamp",
];
```

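The services list is marked "per discovered service", so its metric names cannot all be written out as a constant. A minimal sketch of deriving them at runtime follows. `service_metric_names` is a hypothetical helper, and which properties to request per service is left to the caller as an assumption.

```rust
/// Builds `service_{name}_{property}` metric names for every combination of
/// discovered service and requested property. Hypothetical helper.
pub fn service_metric_names(services: &[&str], properties: &[&str]) -> Vec<String> {
    let mut names = Vec::with_capacity(services.len() * properties.len());
    for service in services {
        for property in properties {
            names.push(format!("service_{service}_{property}"));
        }
    }
    names
}
```

Calling `service_metric_names(&["ssh", "nginx"], &["status", "memory_mb"])` produces the four names a services widget would subscribe to for those two units.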
## Dashboard Communication

### ZMQ Consumer Architecture

```rust
// dashboard/src/communication/zmq_consumer.rs
pub struct ZmqConsumer {
    subscriber: Socket,
    config: ZmqConfig,
    metric_filter: MetricFilter,
}

// API outline: method bodies are elided with todo!().
impl ZmqConsumer {
    pub async fn subscribe_to_host(&mut self, hostname: &str) -> Result<()> { todo!() }
    pub async fn receive_metrics(&mut self) -> Result<Vec<Metric>> { todo!() }
    pub fn set_metric_filter(&mut self, filter: MetricFilter) { todo!() }
    pub async fn request_metrics(&self, metric_names: &[String]) -> Result<()> { todo!() }
}

#[derive(Debug, Clone)]
pub struct MetricFilter {
    pub include_patterns: Vec<String>,
    pub exclude_patterns: Vec<String>,
    pub hosts: Vec<String>,
}
```

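How `MetricFilter` decides whether a metric passes is not specified above. The sketch below shows one plausible set of semantics, all of which are assumptions: an empty `hosts` or `include_patterns` list accepts everything, exclusions take precedence over inclusions, and a pattern ending in `*` is treated as a prefix match. The `accepts` method and `pattern_matches` function are hypothetical names.

```rust
#[derive(Debug, Clone, Default)]
pub struct MetricFilter {
    pub include_patterns: Vec<String>,
    pub exclude_patterns: Vec<String>,
    pub hosts: Vec<String>,
}

/// A trailing `*` makes the pattern a prefix match; otherwise it must
/// equal the name exactly. This pattern syntax is an assumption.
fn pattern_matches(pattern: &str, name: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => name.starts_with(prefix),
        None => pattern == name,
    }
}

impl MetricFilter {
    /// Hypothetical check applied to each incoming metric.
    pub fn accepts(&self, hostname: &str, metric_name: &str) -> bool {
        let host_ok = self.hosts.is_empty() || self.hosts.iter().any(|h| h == hostname);
        let included = self.include_patterns.is_empty()
            || self
                .include_patterns
                .iter()
                .any(|p| pattern_matches(p, metric_name));
        let excluded = self
            .exclude_patterns
            .iter()
            .any(|p| pattern_matches(p, metric_name));
        host_ok && included && !excluded
    }
}
```

Letting exclusions win keeps a broad include such as `cpu_*` usable while still suppressing one noisy metric family.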
### Protocol Compatibility

The dashboard uses the same protocol as defined in the agent:

```rust
// shared/src/protocol.rs (shared between agent and dashboard)
#[derive(Debug, Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}
```

## Dashboard Metric Management

### Metric Store

```rust
// dashboard/src/metrics/store.rs
pub struct MetricStore {
    current_metrics: HashMap<String, HashMap<String, Metric>>, // host -> metric_name -> metric
    historical_metrics: HistoricalStore,
    subscriptions: SubscriptionManager,
}

// API outline: method bodies are elided with todo!().
impl MetricStore {
    pub fn update_metrics(&mut self, hostname: &str, metrics: Vec<Metric>) { todo!() }
    pub fn get_metric(&self, hostname: &str, metric_name: &str) -> Option<&Metric> { todo!() }
    pub fn get_metrics_for_widget(&self, hostname: &str, widget: WidgetType) -> Vec<&Metric> { todo!() }
    pub fn get_hosts(&self) -> Vec<String> { todo!() }
    pub fn get_latest_timestamp(&self, hostname: &str) -> Option<u64> { todo!() }
}
```

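The `host -> metric_name -> metric` layout can be illustrated with a reduced store built on nested `HashMap`s. This is a sketch under simplifying assumptions, not the dashboard's implementation: `Metric` is cut down to three fields, history and subscriptions are left out, and `get_hosts` sorts its result only to make the output deterministic.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the shared `Metric` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: String,
    pub value: f32,
    pub timestamp: u64,
}

#[derive(Default)]
pub struct MetricStore {
    // host -> metric_name -> metric
    current_metrics: HashMap<String, HashMap<String, Metric>>,
}

impl MetricStore {
    /// Inserts or overwrites each metric under its own name for this host.
    pub fn update_metrics(&mut self, hostname: &str, metrics: Vec<Metric>) {
        let host = self.current_metrics.entry(hostname.to_string()).or_default();
        for metric in metrics {
            host.insert(metric.name.clone(), metric);
        }
    }

    pub fn get_metric(&self, hostname: &str, metric_name: &str) -> Option<&Metric> {
        self.current_metrics.get(hostname)?.get(metric_name)
    }

    /// Known hosts, sorted for stable output.
    pub fn get_hosts(&self) -> Vec<String> {
        let mut hosts: Vec<String> = self.current_metrics.keys().cloned().collect();
        hosts.sort();
        hosts
    }

    /// Newest timestamp among the host's current metrics.
    pub fn get_latest_timestamp(&self, hostname: &str) -> Option<u64> {
        self.current_metrics
            .get(hostname)?
            .values()
            .map(|m| m.timestamp)
            .max()
    }
}
```

Because updates are keyed by metric name, a message carrying only a few metrics leaves the host's other entries untouched.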
### Metric Subscription Management

```rust
// dashboard/src/metrics/subscription.rs
pub struct SubscriptionManager {
    widget_subscriptions: HashMap<WidgetType, Vec<String>>,
    active_hosts: HashSet<String>,
    metric_filters: HashMap<String, MetricFilter>,
}

// API outline: method bodies are elided with todo!().
impl SubscriptionManager {
    pub fn subscribe_widget(&mut self, widget: WidgetType, metrics: &[String]) { todo!() }
    pub fn get_required_metrics(&self) -> Vec<String> { todo!() }
    pub fn add_host(&mut self, hostname: String) { todo!() }
    pub fn remove_host(&mut self, hostname: &str) { todo!() }
    pub fn is_metric_needed(&self, metric_name: &str) -> bool { todo!() }
}
```

## Widget Architecture

### Base Widget Trait

```rust
// dashboard/src/ui/widgets/base.rs
pub trait Widget {
    fn widget_type(&self) -> WidgetType;
    fn required_metrics(&self) -> &[&str];
    fn update_metrics(&mut self, metrics: &HashMap<String, Metric>);
    fn render(&self, frame: &mut Frame, area: Rect);
    fn handle_input(&mut self, event: &Event) -> bool;
    fn get_status(&self) -> Status;
}

#[derive(Debug, Clone, Copy, Hash, Eq, PartialEq)]
pub enum WidgetType {
    Cpu,
    Memory,
    Storage,
    Services,
    Backup,
    Hosts,
    Alerts,
}
```

### Enhanced Widget Implementation

```rust
// dashboard/src/ui/widgets/cpu.rs
pub struct CpuWidget {
    metrics: HashMap<String, Metric>,
    config: CpuWidgetConfig,
}

// widget_type() and handle_input() are omitted here for brevity.
impl Widget for CpuWidget {
    fn required_metrics(&self) -> &[&str] {
        CPU_WIDGET_METRICS
    }

    fn update_metrics(&mut self, metrics: &HashMap<String, Metric>) {
        // Update only the metrics this widget cares about.
        // Iterate the constant directly: looping over self.required_metrics()
        // would hold a borrow of `self` while self.metrics is mutated.
        for &metric_name in CPU_WIDGET_METRICS {
            if let Some(metric) = metrics.get(metric_name) {
                self.metrics.insert(metric_name.to_string(), metric.clone());
            }
        }
    }

    fn render(&self, frame: &mut Frame, area: Rect) {
        // Extract specific metric values for display
        // (get_metric_value is a helper on CpuWidget that is not shown here)
        let load_1min = self.get_metric_value("cpu_load_1min").unwrap_or(0.0);
        let load_5min = self.get_metric_value("cpu_load_5min").unwrap_or(0.0);
        let temperature = self.get_metric_value("cpu_temperature_celsius");

        // Maintain existing UI layout and styling
        // ... render implementation preserving current appearance
    }

    fn get_status(&self) -> Status {
        // Aggregate status from individual metric statuses.
        // Requires Status to implement Ord (for max) and Copy (for copied).
        self.metrics.values()
            .map(|m| &m.status)
            .max()
            .copied()
            .unwrap_or(Status::Unknown)
    }
}
```

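Widget-level status aggregation depends on how the four statuses are ranked against each other. The sketch below makes that ranking explicit instead of relying on a derived ordering. It is an illustration only: `severity` and `aggregate_status` are hypothetical names, and placing `Unknown` between `Ok` and `Warning` is an assumption (with a derived `Ord` in declaration order, `Unknown` would outrank `Critical` instead).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

impl Status {
    /// Explicit ranking used for aggregation. Treating Unknown as worse
    /// than Ok but milder than Warning is an assumption of this sketch.
    fn severity(self) -> u8 {
        match self {
            Status::Ok => 0,
            Status::Unknown => 1,
            Status::Warning => 2,
            Status::Critical => 3,
        }
    }
}

/// Returns the most severe status, or Unknown when there are no metrics.
pub fn aggregate_status<I>(statuses: I) -> Status
where
    I: IntoIterator<Item = Status>,
{
    statuses
        .into_iter()
        .max_by_key(|s| s.severity())
        .unwrap_or(Status::Unknown)
}
```

A widget could call this as `aggregate_status(self.metrics.values().map(|m| m.status))`, which keeps the ranking policy in one place if it ever needs to change.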
## Host Management

### Multi-Host Connection Management

```rust
// dashboard/src/hosts/manager.rs
pub struct HostManager {
    connections: HashMap<String, HostConnection>,
    discovery: HostDiscovery,
    active_host: Option<String>,
    metric_store: Arc<Mutex<MetricStore>>,
}

// API outline: method bodies are elided with todo!().
impl HostManager {
    pub async fn discover_hosts(&mut self) -> Result<Vec<String>> { todo!() }
    pub async fn connect_to_host(&mut self, hostname: &str) -> Result<()> { todo!() }
    pub fn disconnect_from_host(&mut self, hostname: &str) { todo!() }
    pub fn set_active_host(&mut self, hostname: String) { todo!() }
    pub fn get_active_host(&self) -> Option<&str> { todo!() }
    pub fn get_connected_hosts(&self) -> Vec<&str> { todo!() }
    pub async fn refresh_all_hosts(&mut self) -> Result<()> { todo!() }
}

// dashboard/src/hosts/connection.rs
pub struct HostConnection {
    hostname: String,
    zmq_consumer: ZmqConsumer,
    last_seen: Instant,
    connection_status: ConnectionStatus,
    metric_buffer: VecDeque<Metric>,
}

#[derive(Debug, Clone)]
pub enum ConnectionStatus {
    Connected,
    Connecting,
    Disconnected,
    Error(String),
}
```

## Configuration Integration

### Dashboard Configuration

```toml
# dashboard/config/dashboard.toml
[zmq]
subscriber_ports = [6130]  # Ports to listen on for metrics
connection_timeout_ms = 15000
reconnect_interval_ms = 5000

[ui]
refresh_rate_ms = 100
theme = "default"
preserve_layout = true

[hosts]
auto_discovery = true
predefined_hosts = ["cmbox", "labbox", "simonbox", "steambox", "srv01"]
default_host = "cmbox"

[metrics]
history_retention_hours = 24
max_metrics_per_host = 10000

[widgets.cpu]
enabled = true
metrics = [
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius"
]

[widgets.memory]
enabled = true
metrics = [
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb"
]

[widgets.storage]
enabled = true
metrics = [
    "disk_nvme0_temperature_celsius",
    "disk_nvme0_wear_percent",
    "disk_nvme0_usage_percent"
]
```

## UI Layout Preservation Rules

### 1. Maintain Current Widget Positions

- **CPU widget**: Top-left position preserved
- **Memory widget**: Top-right position preserved
- **Storage widget**: Left-center position preserved
- **Services widget**: Right-center position preserved
- **Backup widget**: Bottom-right position preserved
- **Host navigation**: Bottom status bar preserved

### 2. Preserve Visual Styling

- **Colors**: Existing status colors (green, yellow, red) maintained
- **Borders**: Current border styles and characters preserved
- **Text formatting**: Font styles, alignment, and spacing preserved
- **Progress bars**: Current progress bar implementations maintained

### 3. Maintain User Interactions

- **Navigation keys**: `←→` for host switching preserved
- **Refresh key**: `r` for manual refresh preserved
- **Quit key**: `q` for exit preserved
- **Additional keys**: All current keyboard shortcuts maintained

### 4. Status Display Consistency

- **Status aggregation**: Widget-level status calculated from individual metric statuses
- **Color mapping**: Status enum maps to existing color scheme
- **Status indicators**: Current status display format preserved

## Implementation Migration Strategy

### Phase 1: Shared Types
1. Create `shared/` crate with common protocol and metric types
2. Update both agent and dashboard to use shared types

### Phase 2: Agent Migration
1. Implement new agent architecture with individual metrics
2. Maintain backward compatibility during transition

### Phase 3: Dashboard Migration
1. Update dashboard to consume individual metrics
2. Preserve all existing UI layouts and interactions
3. Enhance widgets with new metric subscription system

### Phase 4: Integration Testing
1. End-to-end testing with real multi-host scenarios
2. Performance validation and optimization
3. UI/UX validation to ensure no regressions

## Benefits of This Architecture

1. **Maximum Flexibility**: Dashboard can compose any widget from any metrics
2. **Easy Extension**: Adding new metrics doesn't affect existing code
3. **Granular Caching**: Cache individual metrics based on collection cost
4. **Simple Testing**: Test individual metric collection in isolation
5. **Clear Separation**: Agent collects, dashboard consumes and displays
6. **Efficient Updates**: Only send changed metrics to dashboard

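The "Efficient Updates" point implies comparing the latest values with what was last sent. A minimal sketch of that comparison is shown below, assuming metrics are reduced to name and `f32` value pairs. `changed_metrics` is a hypothetical helper, and a real agent would presumably compare full `Metric` values, including status.

```rust
use std::collections::HashMap;

/// Returns the metrics whose value is new or differs from the previously
/// sent value, sorted by name for stable output. Hypothetical helper.
pub fn changed_metrics(
    previous: &HashMap<String, f32>,
    current: &HashMap<String, f32>,
) -> Vec<(String, f32)> {
    let mut changed: Vec<(String, f32)> = current
        .iter()
        .filter(|(name, value)| previous.get(*name) != Some(*value))
        .map(|(name, value)| (name.clone(), *value))
        .collect();
    changed.sort_by(|a, b| a.0.cmp(&b.0));
    changed
}
```

Only the returned pairs would need to go into the next `MetricMessage`; unchanged metrics stay as they are in the dashboard's store.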
## Future Extensions

- **Metric Filtering**: Dashboard requests only needed metrics
- **Historical Storage**: Store metric history for trending
- **Metric Aggregation**: Calculate derived metrics from base metrics
- **Dynamic Discovery**: Auto-discover new metric sources
- **Metric Validation**: Validate metric values and ranges