# CM Dashboard Agent Architecture

## Overview

This document defines the architecture for the CM Dashboard Agent. The agent collects individual metrics and sends them to the dashboard via ZMQ. The dashboard decides which metrics to use in which widgets.

## Core Philosophy

**Individual Metrics Approach**: The agent collects and transmits individual metrics (e.g., `cpu_load_1min`, `memory_usage_percent`, `backup_last_run_timestamp`) rather than grouped metric structures. This gives the dashboard full freedom in how it composes widgets.

## Folder Structure

```
cm-dashboard/
├── agent/                              # Agent application
│   ├── Cargo.toml
│   ├── src/
│   │   ├── main.rs                     # Entry point with CLI parsing
│   │   ├── agent.rs                    # Main Agent orchestrator
│   │   ├── config/
│   │   │   ├── mod.rs                  # Configuration module exports
│   │   │   ├── loader.rs               # TOML configuration loading
│   │   │   ├── defaults.rs             # Default configuration values
│   │   │   └── validation.rs           # Configuration validation
│   │   ├── communication/
│   │   │   ├── mod.rs                  # Communication module exports
│   │   │   ├── zmq_config.rs           # ZMQ configuration structures
│   │   │   ├── zmq_handler.rs          # ZMQ socket management
│   │   │   ├── protocol.rs             # Message format definitions
│   │   │   └── error.rs                # Communication errors
│   │   ├── metrics/
│   │   │   ├── mod.rs                  # Metrics module exports
│   │   │   ├── registry.rs             # Metric name registry and types
│   │   │   ├── value.rs                # Metric value types and status
│   │   │   ├── cache.rs                # Individual metric caching
│   │   │   └── collection.rs           # Metric collection storage
│   │   ├── collectors/
│   │   │   ├── mod.rs                  # Collector trait definition
│   │   │   ├── cpu.rs                  # CPU-related metrics
│   │   │   ├── memory.rs               # Memory-related metrics
│   │   │   ├── disk.rs                 # Disk usage metrics
│   │   │   ├── processes.rs            # Process-related metrics
│   │   │   ├── systemd.rs              # Systemd service metrics
│   │   │   ├── smart.rs                # Storage SMART metrics
│   │   │   ├── backup.rs               # Backup status metrics
│   │   │   ├── network.rs              # Network metrics
│   │   │   └── error.rs                # Collector errors
│   │   ├── notifications/
│   │   │   ├── mod.rs                  # Notification exports
│   │   │   ├── manager.rs              # Status change detection
│   │   │   ├── email.rs                # Email notification backend
│   │   │   └── status_tracker.rs       # Individual metric status tracking
│   │   └── utils/
│   │       ├── mod.rs                  # Utility exports
│   │       ├── system.rs               # System command utilities
│   │       ├── time.rs                 # Timestamp utilities
│   │       └── discovery.rs            # Auto-discovery functions
│   ├── config/
│   │   ├── agent.example.toml          # Example configuration
│   │   └── production.toml             # Production template
│   └── tests/
│       ├── integration/                # Integration tests
│       ├── unit/                       # Unit tests by module
│       └── fixtures/                   # Test data and mocks
├── dashboard/                          # Dashboard application
│   ├── Cargo.toml
│   ├── src/
│   │   ├── main.rs                     # Entry point with CLI parsing
│   │   ├── app.rs                      # Main Dashboard application state
│   │   ├── config/
│   │   │   ├── mod.rs                  # Configuration module exports
│   │   │   ├── loader.rs               # TOML configuration loading
│   │   │   └── defaults.rs             # Default configuration values
│   │   ├── communication/
│   │   │   ├── mod.rs                  # Communication module exports
│   │   │   ├── zmq_consumer.rs         # ZMQ metric consumer
│   │   │   ├── protocol.rs             # Shared message protocol
│   │   │   └── error.rs                # Communication errors
│   │   ├── metrics/
│   │   │   ├── mod.rs                  # Metrics module exports
│   │   │   ├── store.rs                # Metric storage and retrieval
│   │   │   ├── filter.rs               # Metric filtering and selection
│   │   │   ├── history.rs              # Historical metric storage
│   │   │   └── subscription.rs         # Metric subscription management
│   │   ├── ui/
│   │   │   ├── mod.rs                  # UI module exports
│   │   │   ├── app.rs                  # Main UI application loop
│   │   │   ├── layout.rs               # Layout management
│   │   │   ├── widgets/
│   │   │   │   ├── mod.rs              # Widget exports
│   │   │   │   ├── base.rs             # Base widget trait
│   │   │   │   ├── cpu.rs              # CPU metrics widget
│   │   │   │   ├── memory.rs           # Memory metrics widget
│   │   │   │   ├── storage.rs          # Storage metrics widget
│   │   │   │   ├── services.rs         # Services metrics widget
│   │   │   │   ├── backup.rs           # Backup metrics widget
│   │   │   │   ├── hosts.rs            # Host selection widget
│   │   │   │   └── alerts.rs           # Alerts/status widget
│   │   │   ├── theme.rs                # UI theming and colors
│   │   │   └── input.rs                # Input handling
│   │   ├── hosts/
│   │   │   ├── mod.rs                  # Host management exports
│   │   │   ├── manager.rs              # Host connection management
│   │   │   ├── discovery.rs            # Host auto-discovery
│   │   │   └── connection.rs           # Individual host connections
│   │   └── utils/
│   │       ├── mod.rs                  # Utility exports
│   │       ├── formatting.rs           # Data formatting utilities
│   │       └── time.rs                 # Time formatting utilities
│   ├── config/
│   │   ├── dashboard.example.toml      # Example configuration
│   │   └── hosts.example.toml          # Example host configuration
│   └── tests/
│       ├── integration/                # Integration tests
│       ├── unit/                       # Unit tests by module
│       └── fixtures/                   # Test data and mocks
├── shared/                             # Shared types and utilities
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs                      # Shared library exports
│       ├── protocol.rs                 # Shared message protocol
│       ├── metrics.rs                  # Shared metric types
│       └── error.rs                    # Shared error types
└── tests/                              # End-to-end tests
    ├── e2e/                            # End-to-end test scenarios
    └── fixtures/                       # Shared test data
```

## Architecture Principles

### 1. Individual Metrics Philosophy

**No Grouped Structures**: Instead of `SystemMetrics` or `BackupMetrics`, we collect individual metrics:

```text
// Good - Individual metrics
"cpu_load_1min"             -> 2.5
"cpu_load_5min"             -> 2.8
"cpu_temperature_celsius"   -> 45.0
"memory_usage_percent"      -> 78.5
"memory_total_gb"           -> 32.0
"disk_root_usage_percent"   -> 15.2
"service_ssh_status"        -> "active"
"backup_last_run_timestamp" -> 1697123456

// Bad - Grouped structures
SystemMetrics { cpu: {...}, memory: {...} }
```

**Dashboard Flexibility**: The dashboard consumes individual metrics and decides which ones to display in each widget.

### 2. Metric Definition

Each metric has:

- **Name**: Unique identifier (e.g., `cpu_load_1min`)
- **Value**: Typed value (f32, i64, String, bool)
- **Status**: Health status (ok, warning, critical, unknown)
- **Timestamp**: When the metric was collected
- **Metadata**: Optional description, units, etc.

### 3. Module Responsibilities

- **Communication**: ZMQ protocol and message handling
- **Metrics**: Value types, caching, and storage
- **Collectors**: Gather specific metrics from the system
- **Notifications**: Track status changes across all metrics
- **Config**: Configuration loading and validation

### 4. Data Flow

```
Collectors → Individual Metrics → Cache → ZMQ → Dashboard
    ↓                ↓              ↓
Status Calc →  Status Tracker → Notifications
```

## Metric Design Rules

### 1. Naming Convention

Metrics follow hierarchical naming. Segments that do not apply to a metric (for example, a unit on a status string) are omitted:

```
{category}_{subcategory}_{property}_{unit}

Examples:
cpu_load_1min
cpu_temperature_celsius
memory_usage_percent
memory_total_gb
disk_root_usage_percent
disk_nvme0_temperature_celsius
service_ssh_status
service_ssh_memory_mb
backup_last_run_timestamp
backup_status
network_eth0_rx_bytes
```

### 2. Value Types

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum MetricValue {
    Float(f32),
    Integer(i64),
    String(String),
    Boolean(bool),
}

// `Copy` and `Ord` are needed for widget-level status aggregation.
// The derived ordering follows declaration order: Ok < Warning < Critical < Unknown.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Serialize, Deserialize)]
pub enum Status {
    Ok,
    Warning,
    Critical,
    Unknown,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}
```

### 3. Collector Interface

Each collector provides individual metrics:

```rust
use async_trait::async_trait;

#[async_trait]
pub trait Collector {
    fn name(&self) -> &str;
    async fn collect(&self) -> Result<Vec<Metric>>;
}

// Example CPU collector output (abbreviated):
vec![
    Metric { name: "cpu_load_1min", value: Float(2.5), status: Ok, ... },
    Metric { name: "cpu_load_5min", value: Float(2.8), status: Ok, ... },
    Metric { name: "cpu_temperature_celsius", value: Float(45.0), status: Ok, ... },
]
```

## Communication Protocol

### ZMQ Message Format

```rust
#[derive(Debug, Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}
```

### ZMQ Configuration

```rust
#[derive(Debug, Deserialize)]
pub struct ZmqConfig {
    pub publisher_port: u16,     // Default: 6130
    pub command_port: u16,       // Default: 6131
    pub bind_address: String,    // Default: "0.0.0.0"
    pub timeout_ms: u64,         // Default: 5000
    pub heartbeat_interval: u64, // Default: 30000
}
```

## Caching Strategy

### Configuration-Based Individual Metric Cache

```rust
use std::collections::HashMap;
use std::time::Instant;

use serde::Deserialize;

pub struct MetricCache {
    cache: HashMap<String, CachedMetric>, // metric name -> cached entry
    config: CacheConfig,
}

struct CachedMetric {
    metric: Metric,
    collected_at: Instant,
    access_count: u64,
    cache_tier: CacheTier,
}

#[derive(Debug, Deserialize)]
pub struct CacheConfig {
    pub enabled: bool,
    pub default_ttl_seconds: u64,
    pub max_entries: usize,
    pub tiers: HashMap<String, CacheTier>,           // [cache.tiers.<name>]
    pub metric_assignments: HashMap<String, String>, // name pattern -> tier name
}

#[derive(Debug, Deserialize, Clone)]
pub struct CacheTier {
    pub interval_seconds: u64,
    pub description: String,
}
```

**Configuration-Based Caching Rules**:

- Each metric type has a configurable cache interval set in the config file
- Cache tiers are defined in configuration, not hardcoded
- Individual metrics are cached by name with a tier-specific TTL
- A cache miss triggers collection of that single metric
- No grouped cache invalidation
- Performance target: under 2% agent CPU usage, achieved by serving cached values between tier intervals

## Configuration System

### Configuration Structure

```toml
[zmq]
publisher_port = 6130
command_port = 6131
bind_address = "0.0.0.0"
timeout_ms = 5000

[cache]
enabled = true
default_ttl_seconds = 30
max_entries = 10000

# Cache tiers for different metric types
[cache.tiers.realtime]
interval_seconds = 5
description = "High-frequency metrics (CPU load, memory usage)"

[cache.tiers.fast]
interval_seconds = 30
description = "Medium-frequency metrics (network stats, process lists)"

[cache.tiers.medium]
interval_seconds = 300
description = "Low-frequency metrics (service status, disk usage)"

[cache.tiers.slow]
interval_seconds = 900
description = "Very low-frequency metrics (SMART data, backup status)"

[cache.tiers.static]
interval_seconds = 3600
description = "Rarely changing metrics (hardware info, system capabilities)"

# Metric type to tier mapping
[cache.metric_assignments]
"cpu_load_*" = "realtime"
"memory_usage_*" = "realtime"
"service_*_cpu_percent" = "realtime"
"service_*_memory_mb" = "realtime"
"service_*_status" = "medium"
"service_*_disk_gb" = "medium"
"disk_*_temperature_celsius" = "slow"
"disk_*_wear_percent" = "slow"
"backup_*" = "slow"
"network_*" = "fast"

[collectors.cpu]
enabled = true
interval_seconds = 5
temperature_warning = 70.0
temperature_critical = 80.0
load_warning = 5.0
load_critical = 8.0

[collectors.memory]
enabled = true
interval_seconds = 5
usage_warning_percent = 80.0
usage_critical_percent = 95.0

[collectors.systemd]
enabled = true
interval_seconds = 30
services = ["ssh", "nginx", "docker", "gitea"]

[notifications]
enabled = true
smtp_host = "localhost"
smtp_port = 25
from_email = "{{hostname}}@cmtec.se"
to_email = "cm@cmtec.se"
rate_limit_minutes = 30
```

## Implementation Guidelines

### 1. Adding New Metrics

```rust
// 1. Define metric names in the registry
pub const NETWORK_ETH0_RX_BYTES: &str = "network_eth0_rx_bytes";
pub const NETWORK_ETH0_TX_BYTES: &str = "network_eth0_tx_bytes";

// 2. Implement the collector
pub struct NetworkCollector {
    config: NetworkConfig,
}

#[async_trait]
impl Collector for NetworkCollector {
    fn name(&self) -> &str {
        "network"
    }

    async fn collect(&self) -> Result<Vec<Metric>> {
        // `read_rx_bytes` is a hypothetical helper; `now()` is assumed to live in utils/time.rs.
        let rx_bytes: i64 = read_rx_bytes("eth0")?;
        Ok(vec![
            Metric {
                name: NETWORK_ETH0_RX_BYTES.to_string(),
                value: MetricValue::Integer(rx_bytes),
                status: Status::Ok,
                timestamp: now(),
                description: None,
                unit: Some("bytes".to_string()),
            },
            // ... more metrics
        ])
    }
}

// 3. Register in the agent
agent.register_collector(Box::new(NetworkCollector::new(config.network)));
```

### 2. Status Calculation

Each collector calculates the status of its own metrics from the thresholds in its config section:

```rust
impl CpuCollector {
    // Thresholds come from `temperature_warning` / `temperature_critical` in [collectors.cpu].
    fn calculate_temperature_status(&self, temp: f32) -> Status {
        if temp >= self.config.temperature_critical {
            Status::Critical
        } else if temp >= self.config.temperature_warning {
            Status::Warning
        } else {
            Status::Ok
        }
    }
}
```

### 3. Dashboard Usage

Dashboard widgets subscribe to specific metrics:

```rust
// Dashboard CPU widget
let cpu_metrics = [
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius",
];

// Dashboard memory widget
let memory_metrics = [
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb",
];
```

# Dashboard Architecture

## Dashboard Principles

### 1. UI Layout Preservation

**Current UI Layout Maintained**: The existing dashboard UI layout is preserved and enhanced with the new metric-centric architecture. All current widgets keep their established positions and functionality.

**Widget Enhancement, Not Replacement**: Widgets are changed to consume individual metrics rather than grouped structures, but keep their visual appearance and user interaction patterns.

### 2. Metric-to-Widget Mapping

Each widget subscribes to specific individual metrics and composes them for display:

```rust
// CPU Widget Metrics
const CPU_WIDGET_METRICS: &[&str] = &[
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius",
    "cpu_frequency_mhz",
    "cpu_usage_percent",
];

// Memory Widget Metrics
const MEMORY_WIDGET_METRICS: &[&str] = &[
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb",
    "memory_used_gb",
    "memory_swap_total_gb",
    "memory_swap_used_gb",
];

// Storage Widget Metrics
const STORAGE_WIDGET_METRICS: &[&str] = &[
    "disk_nvme0_temperature_celsius",
    "disk_nvme0_wear_percent",
    "disk_nvme0_spare_percent",
    "disk_nvme0_hours",
    "disk_nvme0_capacity_gb",
    "disk_nvme0_usage_gb",
    "disk_nvme0_usage_percent",
];

// Services Widget Metrics
const SERVICES_WIDGET_METRICS: &[&str] = &[
    "service_ssh_status",
    "service_ssh_memory_mb",
    "service_ssh_cpu_percent",
    "service_nginx_status",
    "service_nginx_memory_mb",
    "service_docker_status",
    // ... per discovered service
];

// Backup Widget Metrics
const BACKUP_WIDGET_METRICS: &[&str] = &[
    "backup_last_run_timestamp",
    "backup_status",
    "backup_size_gb",
    "backup_duration_minutes",
    "backup_next_scheduled_timestamp",
];
```

## Dashboard Communication

### ZMQ Consumer Architecture

```rust
// dashboard/src/communication/zmq_consumer.rs
pub struct ZmqConsumer {
    subscriber: Socket,
    config: ZmqConfig,
    metric_filter: MetricFilter,
}

// API outline: method bodies are left as `todo!()`.
impl ZmqConsumer {
    pub async fn subscribe_to_host(&mut self, hostname: &str) -> Result<()> { todo!() }
    pub async fn receive_metrics(&mut self) -> Result<Vec<Metric>> { todo!() }
    pub fn set_metric_filter(&mut self, filter: MetricFilter) { todo!() }
    pub async fn request_metrics(&self, metric_names: &[String]) -> Result<()> { todo!() }
}

#[derive(Debug, Clone)]
pub struct MetricFilter {
    pub include_patterns: Vec<String>,
    pub exclude_patterns: Vec<String>,
    pub hosts: Vec<String>,
}
```

### Protocol Compatibility

The dashboard uses the same protocol as the agent. Both import it from the shared crate:

```rust
// shared/src/protocol.rs (shared between agent and dashboard)
#[derive(Debug, Serialize, Deserialize)]
pub struct MetricMessage {
    pub hostname: String,
    pub timestamp: u64,
    pub metrics: Vec<Metric>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Metric {
    pub name: String,
    pub value: MetricValue,
    pub status: Status,
    pub timestamp: u64,
    pub description: Option<String>,
    pub unit: Option<String>,
}
```

## Dashboard Metric Management

### Metric Store

```rust
// dashboard/src/metrics/store.rs
pub struct MetricStore {
    current_metrics: HashMap<String, HashMap<String, Metric>>, // host -> metric_name -> metric
    historical_metrics: HistoricalStore,
    subscriptions: SubscriptionManager,
}

// API outline: method bodies are left as `todo!()`.
impl MetricStore {
    pub fn update_metrics(&mut self, hostname: &str, metrics: Vec<Metric>) { todo!() }
    pub fn get_metric(&self, hostname: &str, metric_name: &str) -> Option<&Metric> { todo!() }
    pub fn get_metrics_for_widget(&self, hostname: &str, widget: WidgetType) -> Vec<&Metric> { todo!() }
    pub fn get_hosts(&self) -> Vec<String> { todo!() }
    pub fn get_latest_timestamp(&self, hostname: &str) -> Option<u64> { todo!() }
}
```

### Metric Subscription Management

```rust
// dashboard/src/metrics/subscription.rs
pub struct SubscriptionManager {
    widget_subscriptions: HashMap<WidgetType, Vec<String>>, // widget -> metric names
    active_hosts: HashSet<String>,
    metric_filters: HashMap<String, MetricFilter>, // key assumed to be the hostname
}

// API outline: method bodies are left as `todo!()`.
impl SubscriptionManager {
    pub fn subscribe_widget(&mut self, widget: WidgetType, metrics: &[String]) { todo!() }
    pub fn get_required_metrics(&self) -> Vec<String> { todo!() }
    pub fn add_host(&mut self, hostname: String) { todo!() }
    pub fn remove_host(&mut self, hostname: &str) { todo!() }
    pub fn is_metric_needed(&self, metric_name: &str) -> bool { todo!() }
}
```

## Widget Architecture

### Base Widget Trait

```rust
// dashboard/src/ui/widgets/base.rs
pub trait Widget {
    fn widget_type(&self) -> WidgetType;
    fn required_metrics(&self) -> &[&str];
    fn update_metrics(&mut self, metrics: &HashMap<String, Metric>);
    fn render(&self, frame: &mut Frame, area: Rect);
    fn handle_input(&mut self, event: &Event) -> bool;
    fn get_status(&self) -> Status;
}

#[derive(Debug, Clone, Copy, Hash, Eq, PartialEq)]
pub enum WidgetType {
    Cpu,
    Memory,
    Storage,
    Services,
    Backup,
    Hosts,
    Alerts,
}
```

### Enhanced Widget Implementation

```rust
// dashboard/src/ui/widgets/cpu.rs
pub struct CpuWidget {
    metrics: HashMap<String, Metric>,
    config: CpuWidgetConfig,
}

impl Widget for CpuWidget {
    // widget_type() and handle_input() are omitted here for brevity.

    fn required_metrics(&self) -> &[&str] {
        CPU_WIDGET_METRICS
    }

    fn update_metrics(&mut self, metrics: &HashMap<String, Metric>) {
        // Update only the metrics this widget cares about
        for &metric_name in self.required_metrics() {
            if let Some(metric) = metrics.get(metric_name) {
                self.metrics.insert(metric_name.to_string(), metric.clone());
            }
        }
    }

    fn render(&self, frame: &mut Frame, area: Rect) {
        // Extract specific metric values for display
        let load_1min = self.get_metric_value("cpu_load_1min").unwrap_or(0.0);
        let load_5min = self.get_metric_value("cpu_load_5min").unwrap_or(0.0);
        let temperature = self.get_metric_value("cpu_temperature_celsius");

        // Maintain existing UI layout and styling
        // ... render implementation preserving current appearance
    }

    fn get_status(&self) -> Status {
        // Aggregate: the highest-ranked status among this widget's metrics wins.
        // Relies on `Status` deriving `Copy` and `Ord`.
        self.metrics
            .values()
            .map(|m| m.status)
            .max()
            .unwrap_or(Status::Unknown)
    }
}
```

## Host Management

### Multi-Host Connection Management

```rust
// dashboard/src/hosts/manager.rs
pub struct HostManager {
    connections: HashMap<String, HostConnection>, // hostname -> connection
    discovery: HostDiscovery,
    active_host: Option<String>,
    metric_store: Arc<Mutex<MetricStore>>, // lock type assumed
}

// API outline: method bodies are left as `todo!()`.
impl HostManager {
    pub async fn discover_hosts(&mut self) -> Result<Vec<String>> { todo!() }
    pub async fn connect_to_host(&mut self, hostname: &str) -> Result<()> { todo!() }
    pub fn disconnect_from_host(&mut self, hostname: &str) { todo!() }
    pub fn set_active_host(&mut self, hostname: String) { todo!() }
    pub fn get_active_host(&self) -> Option<&str> { todo!() }
    pub fn get_connected_hosts(&self) -> Vec<&str> { todo!() }
    pub async fn refresh_all_hosts(&mut self) -> Result<()> { todo!() }
}

// dashboard/src/hosts/connection.rs
pub struct HostConnection {
    hostname: String,
    zmq_consumer: ZmqConsumer,
    last_seen: Instant,
    connection_status: ConnectionStatus,
    metric_buffer: VecDeque<Metric>, // element type assumed
}

#[derive(Debug, Clone)]
pub enum ConnectionStatus {
    Connected,
    Connecting,
    Disconnected,
    Error(String),
}
```

## Configuration Integration

### Dashboard Configuration

```toml
# dashboard/config/dashboard.toml
[zmq]
subscriber_ports = [6130] # Agent publisher ports to subscribe to
connection_timeout_ms = 15000
reconnect_interval_ms = 5000

[ui]
refresh_rate_ms = 100
theme = "default"
preserve_layout = true

[hosts]
auto_discovery = true
predefined_hosts = ["cmbox", "labbox", "simonbox", "steambox", "srv01"]
default_host = "cmbox"

[metrics]
history_retention_hours = 24
max_metrics_per_host = 10000

[widgets.cpu]
enabled = true
metrics = [
    "cpu_load_1min",
    "cpu_load_5min",
    "cpu_load_15min",
    "cpu_temperature_celsius"
]

[widgets.memory]
enabled = true
metrics = [
    "memory_usage_percent",
    "memory_total_gb",
    "memory_available_gb"
]

[widgets.storage]
enabled = true
metrics = [
    "disk_nvme0_temperature_celsius",
    "disk_nvme0_wear_percent",
    "disk_nvme0_usage_percent"
]
```

## UI Layout Preservation Rules

### 1. Maintain Current Widget Positions

- **CPU widget**: Top-left position preserved
- **Memory widget**: Top-right position preserved
- **Storage widget**: Left-center position preserved
- **Services widget**: Right-center position preserved
- **Backup widget**: Bottom-right position preserved
- **Host navigation**: Bottom status bar preserved

### 2. Preserve Visual Styling

- **Colors**: Existing status colors (green, yellow, red) maintained
- **Borders**: Current border styles and characters preserved
- **Text formatting**: Font styles, alignment, and spacing preserved
- **Progress bars**: Current progress bar implementations maintained

### 3. Maintain User Interactions

- **Navigation keys**: `←→` for host switching preserved
- **Refresh key**: `r` for manual refresh preserved
- **Quit key**: `q` for exit preserved
- **Additional keys**: All current keyboard shortcuts maintained

### 4. Status Display Consistency

- **Status aggregation**: Widget-level status calculated from individual metric statuses
- **Color mapping**: Status enum maps to existing color scheme
- **Status indicators**: Current status display format preserved

## Implementation Migration Strategy

### Phase 1: Shared Types

1. Create `shared/` crate with common protocol and metric types
2. Update both agent and dashboard to use shared types

### Phase 2: Agent Migration

1. Implement new agent architecture with individual metrics
2. Maintain backward compatibility during transition

### Phase 3: Dashboard Migration

1. Update dashboard to consume individual metrics
2. Preserve all existing UI layouts and interactions
3. Enhance widgets with new metric subscription system

### Phase 4: Integration Testing

1. End-to-end testing with real multi-host scenarios
2. Performance validation and optimization
3. UI/UX validation to ensure no regressions

## Benefits of This Architecture

1. **Maximum Flexibility**: Dashboard can compose any widget from any metrics
2. **Easy Extension**: Adding new metrics doesn't affect existing code
3. **Granular Caching**: Cache individual metrics based on collection cost
4. **Simple Testing**: Test individual metric collection in isolation
5. **Clear Separation**: Agent collects, dashboard consumes and displays
6. **Efficient Updates**: Only send changed metrics to dashboard

## Future Extensions

- **Metric Filtering**: Dashboard requests only needed metrics
- **Historical Storage**: Store metric history for trending
- **Metric Aggregation**: Calculate derived metrics from base metrics
- **Dynamic Discovery**: Auto-discover new metric sources
- **Metric Validation**: Validate metric values and ranges
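
Metric filtering, like the wildcard keys under `[cache.metric_assignments]` and the `include_patterns` / `exclude_patterns` fields of `MetricFilter`, needs a way to match a metric name such as `cpu_load_1min` against a pattern such as `cpu_load_*`. The following is a minimal sketch of one way to do that, not the project's implementation. It assumes `*` is the only wildcard and matches any run of characters (including none), and that the first matching assignment wins. `matches_pattern` and `tier_for` are hypothetical helper names.

```rust
/// Returns true if `name` matches `pattern`, where `*` matches any
/// (possibly empty) run of characters. No other wildcards are supported.
pub fn matches_pattern(pattern: &str, name: &str) -> bool {
    let parts: Vec<&str> = pattern.split('*').collect();
    if parts.len() == 1 {
        // No wildcard at all: require an exact match.
        return pattern == name;
    }

    let first = parts[0];
    let last = parts[parts.len() - 1];

    // The literal prefix and suffix must both be present without overlapping.
    if name.len() < first.len() + last.len()
        || !name.starts_with(first)
        || !name.ends_with(last)
    {
        return false;
    }

    // Literal segments between wildcards must appear in order in the middle.
    let mut rest = &name[first.len()..name.len() - last.len()];
    for &part in &parts[1..parts.len() - 1] {
        match rest.find(part) {
            Some(idx) => rest = &rest[idx + part.len()..],
            None => return false,
        }
    }
    true
}

/// Looks up the cache tier for `name` in an ordered list of
/// (pattern, tier name) pairs. The first matching pattern wins (assumption).
pub fn tier_for<'a>(assignments: &[(&str, &'a str)], name: &str) -> Option<&'a str> {
    assignments
        .iter()
        .find(|(pattern, _)| matches_pattern(pattern, name))
        .map(|(_, tier)| *tier)
}
```

With the assignments from the agent configuration, `tier_for(&[("cpu_load_*", "realtime"), ("service_*_status", "medium")], "service_nginx_status")` yields `Some("medium")`, and a name that matches no pattern yields `None`, in which case the caller would fall back to `default_ttl_seconds`. An ordered slice is used instead of a map so that overlapping patterns resolve deterministically.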