Root cause: sda's temperature exceeded threshold in the past, causing
smartctl to return exit code 32 (warning: "Attributes have been <= threshold
in the past"). The agent checked output.status.success() and rejected the
entire output as failed, even though the data (serial, temperature, health)
was perfectly valid.
Smartctl exit codes are bit flags for informational warnings:
- Exit 0: No warnings
- Exit 32 (bit 5): Attributes were at/below threshold in the past
- Exit 64 (bit 6): Error log has entries
- etc.
The output data is valid regardless of these warning flags.
Solution: Parse the output as long as it is non-empty and ignore the exit code.
Only return UNKNOWN if the output is actually empty (the command truly failed).
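A minimal sketch of the check, assuming a tokio-based helper (names approximate, not the exact agent code):

```rust
use std::process::Stdio;
use tokio::process::Command;

// Treat non-empty output as valid: smartctl's non-zero exit codes are
// informational bit flags, not failures.
async fn smart_output(device: &str) -> Option<String> {
    let output = Command::new("smartctl")
        .args(["-a", device])
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .output()
        .await
        .ok()?;
    // Deliberately NOT checking output.status.success() here.
    let text = String::from_utf8_lossy(&output.stdout).into_owned();
    if text.trim().is_empty() { None } else { Some(text) }
}
```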
Result: Data_3 will now show "ZDZ4VE0B T: 31°C" instead of "? Data_3: sda"
Bump version to v0.1.225
Root cause: SMART data was collected TWICE:
1. Sequential collection during pool detection in get_drive_info_for_path()
using problematic tokio::task::block_in_place() nesting
2. Parallel collection in get_smart_data_for_drives() (v0.1.223)
The sequential collection happened FIRST during pool detection, causing
sda (Data_3) to time out due to:
- Bad async nesting: block_in_place() wrapping block_on()
- Sequential execution causing runtime issues
- sda being third in the sequence, by which point the runtime had degraded
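For reference, the removed anti-pattern looked roughly like this (SmartData and query_smart are hypothetical stand-ins for the real collector types):

```rust
// block_in_place() wrapping block_on() re-enters the runtime from inside
// an async task; later drives in the sequence inherit a degraded runtime.
// SmartData and query_smart are illustrative placeholders.
fn blocking_smart_query(device: String) -> SmartData {
    tokio::task::block_in_place(|| {
        tokio::runtime::Handle::current().block_on(query_smart(device))
    })
}
```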
Solution: Remove SMART collection from get_drive_info_for_path().
Pool drive temperatures are populated later from the parallel SMART
collection which properly uses futures::join_all.
Benefits:
- Eliminates problematic async nesting
- All SMART queries happen once in parallel only
- sda/Data_3 should now show serial (ZDZ4VE0B) and temperature
Bump version to v0.1.224
Root cause: SMART data was collected sequentially, one drive at a time.
With 5 drives taking ~500ms each, total collection time was 2.5+ seconds.
With the disk collector running every 1 second, this caused overlapping
collections and resource contention. The last drive (sda/Data_3) would
time out because the previous collection was still accessing it.
Solution: Query all drives in parallel using futures::join_all. Now all
drives get their SMART data collected simultaneously with independent
3-second timeouts, eliminating contention and reducing total collection
time from 2.5+ seconds to ~500ms (the slowest single drive).
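A sketch of the parallel path (SmartData and query_smart are hypothetical stand-ins for the agent's helpers):

```rust
use std::time::Duration;
use futures::future::join_all;
use tokio::time::timeout;

// Each drive gets an independent 3-second timeout; total wall time is
// bounded by the slowest single drive instead of the sum of all drives.
async fn collect_all(drives: Vec<String>) -> Vec<Option<SmartData>> {
    let queries = drives.into_iter().map(|dev| async move {
        timeout(Duration::from_secs(3), query_smart(&dev)).await.ok()
    });
    join_all(queries).await
}
```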
Benefits:
- All drives complete in ~500ms instead of 2.5+ seconds
- No overlapping collections causing resource contention
- Each drive gets full 3-second timeout window
- sda/Data_3 should now show temperature and serial number
Bump version to v0.1.223
Root cause: run_command_with_timeout() was calling cmd.spawn() without
configuring stdout/stderr pipes. This caused command output to go to
journald instead of being captured by wait_with_output(). The disk
collector received empty output and failed silently.
Solution: Configure stdout(Stdio::piped()) and stderr(Stdio::piped())
before spawning commands. This ensures wait_with_output() can properly
capture command output.
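The fix, sketched (simplified: the real helper also applies a timeout):

```rust
use std::process::Stdio;
use tokio::process::Command;

// Without piped stdio the child inherits the service's descriptors, so
// output lands in journald and wait_with_output() captures nothing.
async fn run_piped(program: &str, args: &[&str]) -> std::io::Result<std::process::Output> {
    let child = Command::new(program)
        .args(args)
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    child.wait_with_output().await
}
```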
Fixes: Empty Storage section, lsblk output appearing in journald
Bump version to v0.1.222
v0.1.220 broke disk collector by changing the import from
std::process::Command to tokio::process::Command, but lines 193 and
767 explicitly used std::process::Command::new() which silently failed.
Solution: Import both under aliases (TokioCommand/StdCommand) and use the
appropriate type for each operation: async commands use TokioCommand
with run_command_with_timeout, sync commands use StdCommand with the
system timeout wrapper.
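The aliasing, sketched (the lsblk arguments are illustrative):

```rust
use std::process::Command as StdCommand;
use tokio::process::Command as TokioCommand; // async path, via run_command_with_timeout

// Sync path: StdCommand wrapped in the system `timeout` binary.
fn lsblk_json() -> std::io::Result<std::process::Output> {
    StdCommand::new("timeout").args(["10", "lsblk", "-J"]).output()
}
```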
Fixes: Empty Storage section after v0.1.220 deployment
Bump version to v0.1.221
- Changed disk collector to use tokio::process::Command instead of std::process::Command
- Updated run_command_with_timeout to properly kill processes on timeout
- Fixes issue where smartctl hangs on problematic drives (/dev/sda), freezing the entire agent
- Timeout now force-kills hung processes using kill -9, preventing orphaned smartctl processes
This resolves the issue where Data_3 showed unknown status because smartctl was hanging
indefinitely trying to read from a problematic drive, blocking the entire collector.
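A sketch of the timeout handling; using tokio's kill_on_drop is one way to get the SIGKILL (`kill -9`) behavior described above, though the real helper may kill explicitly:

```rust
use std::time::Duration;
use tokio::process::Command;

async fn run_with_timeout(mut cmd: Command, secs: u64) -> Option<std::process::Output> {
    cmd.kill_on_drop(true); // hung child is force-killed when dropped
    let child = cmd.spawn().ok()?;
    tokio::time::timeout(Duration::from_secs(secs), child.wait_with_output())
        .await
        .ok()? // Err(Elapsed): future dropped -> child killed, no orphan
        .ok()
}
```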
Bump version to v0.1.220
Co-Authored-By: Claude <noreply@anthropic.com>
- Sort repositories alphabetically before rendering
- Sort backup disks by serial number
- Prevents display jumping between different orderings on updates
- Consistent display order across refreshes
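Sketched (field names approximate):

```rust
struct Repo { name: String }
struct BackupDisk { serial: String }

// Deterministic ordering before rendering, so refreshes don't reshuffle rows.
fn stable_order(repos: &mut Vec<Repo>, disks: &mut Vec<BackupDisk>) {
    repos.sort_by(|a, b| a.name.cmp(&b.name));
    disks.sort_by(|a, b| a.serial.cmp(&b.serial));
}
```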
Bump version to v0.1.214
Co-Authored-By: Claude <noreply@anthropic.com>
- Update BackupData structure to support multiple backup disks
- Scan /var/lib/backup/status/ directory for all status files
- Calculate status icons for backup and disk usage
- Aggregate repository status from all disks
- Update dashboard to display all backup disks with per-disk status
- Display repository list with count and aggregated status
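A sketch of the directory scan (path from this commit; the per-file parsing is elided):

```rust
use std::fs;
use std::path::Path;

// One status file per backup disk under /var/lib/backup/status/.
fn read_disk_statuses() -> Vec<String> {
    let mut statuses = Vec::new();
    if let Ok(entries) = fs::read_dir(Path::new("/var/lib/backup/status/")) {
        for entry in entries.flatten() {
            if let Ok(contents) = fs::read_to_string(entry.path()) {
                statuses.push(contents); // parse into per-disk status here
            }
        }
    }
    statuses
}
```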
- Agent now extracts "C" + digits pattern (C3, C10) using char parsing
- Removes suffixes like "_ACPI", "_MWAIT" at source
- Reduces JSON payload size over ZMQ
- No regex dependency - uses fast char iteration (~1μs overhead)
- Robust fallback to original name if pattern not found
- Dashboard simplified to use clean names directly
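The extraction, sketched to match the description above:

```rust
// "C3_ACPI" -> "C3", "C10_MWAIT" -> "C10"; anything else falls back to
// the original name. Plain char iteration, no regex.
fn clean_cstate_name(raw: &str) -> String {
    let mut chars = raw.chars();
    if chars.next() == Some('C') {
        let digits: String = chars.take_while(|c| c.is_ascii_digit()).collect();
        if !digits.is_empty() {
            return format!("C{digits}");
        }
    }
    raw.to_string() // robust fallback
}
```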
Bump version to v0.1.212
Co-Authored-By: Claude <noreply@anthropic.com>
- Strip suffixes like "_ACPI" from C-state names
- Display changes from "C3_ACPI:51%" to "C3:51%"
- Cleaner, more concise presentation
Bump version to v0.1.211
Co-Authored-By: Claude <noreply@anthropic.com>
- Changed code to use zmq.transmission_interval_seconds instead of top-level collection_interval_seconds
- Removed collection_interval_seconds from AgentConfig
- Updated validation to check zmq.transmission_interval_seconds
- Improves config organization by grouping all ZMQ settings together
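The regrouped shape, roughly (other fields elided):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct AgentConfig {
    zmq: ZmqConfig,
    // collection_interval_seconds no longer lives at the top level
}

#[derive(Deserialize)]
struct ZmqConfig {
    transmission_interval_seconds: u64,
    // other ZMQ settings grouped here
}
```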
Bump version to v0.1.210
Co-Authored-By: Claude <noreply@anthropic.com>
- Changed CpuData.cstate from String to Vec<CStateInfo>
- Added CStateInfo struct with name and percent fields
- Collector calculates percentage for each C-state based on accumulated time
- Sorts and returns top 3 C-states by usage
- Dashboard displays: "C10:79% C8:10% C6:8%"
Provides better visibility into CPU idle state distribution.
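The selection logic, sketched (percentage math simplified):

```rust
struct CStateInfo {
    name: String,
    percent: u8, // share of accumulated idle time
}

// Sort descending by usage and keep the top 3, e.g. "C10:79% C8:10% C6:8%".
fn top_cstates(mut states: Vec<CStateInfo>) -> Vec<CStateInfo> {
    states.sort_by(|a, b| b.percent.cmp(&a.percent));
    states.truncate(3);
    states
}
```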
Bump version to v0.1.209
Co-Authored-By: Claude <noreply@anthropic.com>
- Changed CpuData.frequency_mhz to CpuData.cstate (String)
- Implemented collect_cstate() to read CPU idle depth from sysfs
- Finds deepest C-state with most accumulated time (C0-C10)
- Updated dashboard to display C-state instead of frequency
- More accurate indicator of CPU activity vs power management
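A sketch of the sysfs walk (cpu0 assumed representative; `time` is accumulated microseconds per state):

```rust
use std::fs;

// Pick the C-state with the most accumulated time across state0..state10.
fn collect_cstate() -> Option<String> {
    let mut best: Option<(String, u64)> = None;
    for n in 0..=10 {
        let base = format!("/sys/devices/system/cpu/cpu0/cpuidle/state{n}");
        let Ok(name) = fs::read_to_string(format!("{base}/name")) else { continue };
        let Ok(raw) = fs::read_to_string(format!("{base}/time")) else { continue };
        let Ok(time) = raw.trim().parse::<u64>() else { continue };
        if best.as_ref().map_or(true, |(_, t)| time > *t) {
            best = Some((name.trim().to_string(), time));
        }
    }
    best.map(|(name, _)| name)
}
```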
Bump version to v0.1.208
Co-Authored-By: Claude <noreply@anthropic.com>
The service_status_cache populated during discovery only contains
active_state, with all detailed metrics set to None. During collection,
get_service_status() was returning this cached data instead of fetching
fresh systemctl show data. Now always fetch fresh data to populate
memory_bytes, restart_count, and uptime_seconds properly.
Shared:
- Add memory_bytes, restart_count, uptime_seconds to ServiceData
Agent:
- Add new fields to ServiceStatusInfo struct
- Fetch MemoryCurrent, NRestarts, ExecMainStartTimestamp from systemctl show
- Calculate uptime from start timestamp
- Parse and populate new fields in ServiceData
- Remove unused load_state and sub_state fields
Dashboard:
- Add memory_bytes, restart_count, uptime_seconds to ServiceInfo
- Update header: Service, Status, RAM, Uptime, ↻ (restarts)
- Format memory as MB/GB
- Format uptime as Xd Xh, Xh Xm, or Xm
- Show restart count with ! prefix if > 0 to indicate instability
All metrics are obtained from a single systemctl show call - zero overhead.
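The single query, sketched (parsing elided; property names as listed above):

```rust
use tokio::process::Command;

// One systemctl show call returns `Key=Value` lines for all fields:
// MemoryCurrent -> memory_bytes, NRestarts -> restart_count, and
// uptime_seconds derived from ExecMainStartTimestamp vs. now.
async fn service_props(unit: &str) -> Option<String> {
    let output = Command::new("systemctl")
        .args(["show", unit, "--property=MemoryCurrent,NRestarts,ExecMainStartTimestamp"])
        .output()
        .await
        .ok()?;
    Some(String::from_utf8_lossy(&output.stdout).into_owned())
}
```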
Changed header from 4 columns to 2 columns:
- Before: Service, Status, RAM, Disk
- After: Service, Status
Matches the removal of memory_mb and disk_gb fields.
Complete removal of service resource metrics:
Agent:
- Remove memory_mb and disk_gb fields from ServiceData struct
- Remove get_service_memory_usage() method
- Remove get_service_disk_usage() method
- Remove get_directory_size() method
- Remove unused warn import
Dashboard:
- Remove memory_mb and disk_gb from ServiceInfo struct
- Remove memory/disk display from format_parent_service_line
- Remove memory/disk parsing in legacy metric path
- Remove unused format_disk_size() function
Service resource metrics were slow, unreliable, and never worked
properly since the structured data migration. They will be handled
differently in the future.
Service control migrated to SSH, command receiver no longer needed.
- Remove command_receiver Socket from ZmqHandler
- Remove try_receive_command method
- Remove AgentCommand enum
- Remove command_port from ZmqConfig
- Remove zbus dependency from agent
- Replace D-Bus Connection calls with systemctl show commands
- Fix agent hang by eliminating blocking D-Bus operations
- get_unit_property now uses systemctl show with property flags
- Memory, disk usage, and nginx config queries use systemctl
- Simpler, more reliable service monitoring
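A sketch of the systemctl-backed lookup (function name from this commit, error handling simplified):

```rust
use tokio::process::Command;

// `systemctl show <unit> --property=<name> --value` prints just the value.
async fn get_unit_property(unit: &str, property: &str) -> Option<String> {
    let output = Command::new("systemctl")
        .arg("show")
        .arg(unit)
        .arg(format!("--property={property}"))
        .arg("--value")
        .output()
        .await
        .ok()?;
    let value = String::from_utf8_lossy(&output.stdout).trim().to_string();
    if value.is_empty() { None } else { Some(value) }
}
```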
The D-Bus ListUnits call in discover_services_internal() was causing
the agent to hang on startup.
**Root cause:**
- D-Bus ListUnits call with complex tuple destructuring hung indefinitely
- Agent never completed first collection cycle
- No collector output in logs
**Fix:**
- Revert discover_services_internal() to use systemctl list-units/list-unit-files
- Keep D-Bus-based property queries (WorkingDirectory, MemoryCurrent, ExecStart)
- Hybrid approach: systemctl for discovery, D-Bus for individual queries
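The reverted discovery, sketched (flags are standard systemctl; parsing simplified):

```rust
use tokio::process::Command;

// `--no-legend --plain` strips headers and decoration; the first
// whitespace-separated column of each line is the unit name.
async fn discover_service_names() -> Vec<String> {
    match Command::new("systemctl")
        .args(["list-units", "--type=service", "--no-legend", "--plain"])
        .output()
        .await
    {
        Ok(out) => String::from_utf8_lossy(&out.stdout)
            .lines()
            .filter_map(|line| line.split_whitespace().next().map(str::to_string))
            .collect(),
        Err(_) => Vec::new(),
    }
}
```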
**External commands still used:**
- systemctl list-units, list-unit-files (service discovery)
- smartctl (SMART data)
- sudo du (directory sizes)
- nginx -T (config fallback)
Version bump: 0.1.198 → 0.1.199
Complete migration from systemctl subprocess calls to native D-Bus communication:
**Removed systemctl commands:**
- systemctl is-active (fallback) - use D-Bus cache from ListUnits
- systemctl show --property=LoadState,ActiveState,SubState - use D-Bus cache
- systemctl show --property=WorkingDirectory - use D-Bus Properties.Get
- systemctl show --property=MemoryCurrent - use D-Bus Properties.Get
- systemctl show nginx --property=ExecStart - use D-Bus Properties.Get
**Implementation details:**
- Added get_unit_property() helper for D-Bus property access
- Made get_nginx_site_metrics() async to support D-Bus calls
- Made get_nginx_sites_internal() async
- Made discover_nginx_sites() async
- Made get_nginx_config_from_systemd() async
- Fixed RwLock guard Send issues by using scoped locks
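The scoped-lock fix, sketched: a std RwLock guard held across an .await makes the future !Send, so the guard is confined to a block (the cached type here is illustrative):

```rust
use std::sync::RwLock;

async fn use_cache(cache: &RwLock<Vec<String>>) {
    let snapshot = {
        let guard = cache.read().unwrap();
        guard.clone()
    }; // guard dropped here, before any await point
    tokio::task::yield_now().await; // future remains Send
    println!("{} cached entries", snapshot.len());
}
```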
**Remaining external commands:**
- smartctl (disk.rs) - No Rust alternative for SMART data
- sudo du (systemd.rs) - Directory size measurement
- nginx -T (systemd.rs) - Nginx config fallback
- timeout hostname (nixos.rs) - Rare fallback only
Version bump: 0.1.197 → 0.1.198
CRITICAL FIX: The previous cached collector architecture still had ZMQ sending
in the main event loop, where it could block waiting for the RwLock while
collectors were writing. This caused the observed 3-8 second delays.
Changes:
- Move ZMQ publisher to dedicated std::thread (ZMQ sockets aren't thread-safe)
- Use try_read() instead of read() to avoid blocking on write locks
- Send previous data if cache is locked by collector
- ZMQ now sends every 2s regardless of collector timing
- Remove publisher from ZmqHandler (now only handles commands)
Architecture:
- Collectors: Independent tokio tasks updating shared cache
- ZMQ Sender: Dedicated OS thread with its own publisher socket
- Main Loop: Only handles commands and notifications
This ensures ZMQ transmission is NEVER blocked by slow collectors.
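The sender thread, sketched (socket setup elided; cache type illustrative):

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn spawn_zmq_sender(cache: Arc<RwLock<String>>) {
    thread::spawn(move || {
        let mut last_payload = String::new();
        loop {
            // try_read() never blocks on a collector's write lock.
            if let Ok(guard) = cache.try_read() {
                last_payload = guard.clone();
            } // else: cache busy, re-send the previous snapshot
            // publisher.send(last_payload.as_bytes(), 0); // socket owned by this thread
            thread::sleep(Duration::from_secs(2));
        }
    });
}
```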
Bump version to v0.1.195
Critical bug fix: Collectors were appending to Vecs instead of replacing them,
causing duplicate entries with each collection cycle.
Fixed by adding .clear() calls before populating:
- Memory collector: tmpfs Vec (was showing 11+ duplicates)
- Disk collector: drives and pools Vecs
- Systemd collector: services Vec
- Network collector: Already correct (assigns new Vec)
This prevents the exponential growth of duplicate entries in the dashboard UI.
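The fix pattern, sketched:

```rust
// Clear the cached Vec before repopulating so each cycle replaces
// entries instead of appending duplicates.
fn refresh(cached: &mut Vec<String>, fresh: impl IntoIterator<Item = String>) {
    cached.clear();
    cached.extend(fresh);
}
```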
Major architectural refactor to eliminate false "host offline" alerts:
- Replace sequential blocking collectors with independent async tasks
- Each collector runs at configurable interval and updates shared cache
- ZMQ sender reads cache every 1-2s regardless of collector speed
- Collector intervals: CPU/Memory (1-10s), Backup/NixOS (30-60s), Disk/Systemd (60-300s)
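One collector task, sketched (the collected value is a placeholder):

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;

// Each collector loops at its own interval and only updates the shared
// cache; the ZMQ sender reads the cache on its own schedule.
fn spawn_collector(cache: Arc<RwLock<u64>>, interval_secs: u64) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(interval_secs));
        loop {
            ticker.tick().await;
            let value = 42; // real collector gathers metrics here
            *cache.write().await = value;
        }
    });
}
```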
All intervals now configurable via NixOS config:
- collectors.*.interval_seconds (collection frequency per collector)
- collectors.*.command_timeout_seconds (timeout for shell commands)
- notifications.check_interval_seconds (status change detection rate)
Command timeouts increased from hardcoded 2-3s to configurable 10-30s:
- Disk collector: 30s (SMART operations, lsblk)
- Systemd collector: 15s (systemctl, docker, du commands)
- Network collector: 10s (ip route, ip addr)
Benefits:
- No false "offline" alerts when slow collectors take >10s
- Different update rates for different metric types
- Better resource management with longer timeouts
- Full NixOS configuration control
Bump version to v0.1.193