symon/docker-compose/METRICS.md
2025-11-07 19:13:23 +01:00

Bottom OpenTelemetry Metrics Reference

This document lists all metrics exported by Bottom when running with the opentelemetry feature enabled.

System Metrics

CPU

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_cpu_usage_percent | Gauge | cpu_id | CPU usage percentage per core |

Example:

# Average CPU across all cores
avg(system_cpu_usage_percent)

# CPU usage for core 0
system_cpu_usage_percent{cpu_id="0"}

Memory

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_memory_usage_bytes | Gauge | - | RAM memory currently in use |
| system_memory_total_bytes | Gauge | - | Total RAM memory available |
| system_swap_usage_bytes | Gauge | - | Swap memory currently in use |
| system_swap_total_bytes | Gauge | - | Total swap memory available |

Example:

# Memory usage percentage
(system_memory_usage_bytes / system_memory_total_bytes) * 100

# Available memory
system_memory_total_bytes - system_memory_usage_bytes

Network

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_network_rx_bytes_rate | Gauge | interface | Network receive rate in bytes/sec |
| system_network_tx_bytes_rate | Gauge | interface | Network transmit rate in bytes/sec |

Example:

# Total network throughput
sum(system_network_rx_bytes_rate) + sum(system_network_tx_bytes_rate)

# RX rate for specific interface
system_network_rx_bytes_rate{interface="eth0"}

Disk

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_disk_usage_bytes | Gauge | device, mount | Disk space currently in use |
| system_disk_total_bytes | Gauge | device, mount | Total disk space available |

Example:

# Disk usage percentage
(system_disk_usage_bytes / system_disk_total_bytes) * 100

# Free disk space
system_disk_total_bytes - system_disk_usage_bytes

Temperature

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_temperature_celsius | Gauge | sensor | Temperature readings in Celsius |

Example:

# Average temperature across all sensors
avg(system_temperature_celsius)

# Maximum temperature
max(system_temperature_celsius)

Process Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| system_process_cpu_usage_percent | Gauge | name, pid | CPU usage percentage per process |
| system_process_memory_usage_bytes | Gauge | name, pid | Memory usage in bytes per process |
| system_process_count | Gauge | - | Total number of processes |

Example:

# Top 10 processes by CPU
topk(10, system_process_cpu_usage_percent)

# Top 10 processes by memory
topk(10, system_process_memory_usage_bytes)

# Total memory used by all Chrome processes
sum(system_process_memory_usage_bytes{name=~".*chrome.*"})

Recording Rules

The following recording rules are pre-configured in Prometheus (see rules/bottom_rules.yml):

| Rule Name | Expression | Description |
|---|---|---|
| system_process_cpu_usage_percent:recent | Recent process CPU metrics | Filters out stale process data (>2 min old) |
| system_process_memory_usage_bytes:recent | Recent process memory metrics | Filters out stale process data (>2 min old) |

Example:

# Query only recent process data
topk(10, system_process_cpu_usage_percent:recent)
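
The contents of rules/bottom_rules.yml are not reproduced in this document. As a rough illustration of how a ":recent" rule could drop samples older than two minutes, here is a minimal sketch. The group name, the evaluation interval, and the expressions themselves are assumptions, so the shipped rules may be written differently.

```yaml
# Hypothetical sketch; the real rules/bottom_rules.yml may differ.
groups:
  - name: bottom_recent_process_metrics   # assumed group name
    interval: 30s                         # assumed evaluation interval
    rules:
      - record: system_process_cpu_usage_percent:recent
        # Keep only series whose latest sample is less than 120 s old.
        expr: |
          system_process_cpu_usage_percent
            and
          (time() - timestamp(system_process_cpu_usage_percent)) < 120

      - record: system_process_memory_usage_bytes:recent
        expr: |
          system_process_memory_usage_bytes
            and
          (time() - timestamp(system_process_memory_usage_bytes)) < 120
```

The `and` operator keeps the values and labels of the left-hand series, so the recorded series carry the same name and pid labels as the raw metrics.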

Common Queries

System Health

# Overall system CPU usage
avg(system_cpu_usage_percent)

# Memory pressure (>80% is high)
(system_memory_usage_bytes / system_memory_total_bytes) * 100

# Disk pressure (>90% is critical)
(system_disk_usage_bytes / system_disk_total_bytes) * 100

Resource Hogs

# Top CPU consumers
topk(5, system_process_cpu_usage_percent)

# Top memory consumers
topk(5, system_process_memory_usage_bytes)

# Processes using >1GB memory
system_process_memory_usage_bytes > 1073741824

Network Analysis

# Total network traffic (RX + TX)
sum(system_network_rx_bytes_rate) + sum(system_network_tx_bytes_rate)

# Network traffic by interface
sum by (interface) (system_network_rx_bytes_rate + system_network_tx_bytes_rate)

# Interfaces with high RX rate (>10 MiB/s, i.e. 10485760 bytes/sec)
system_network_rx_bytes_rate > 10485760

Alerting Examples

Sample Prometheus Alert Rules

groups:
  - name: bottom_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: avg(system_cpu_usage_percent) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Average CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (system_memory_usage_bytes / system_memory_total_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskAlmostFull
        expr: (system_disk_usage_bytes / system_disk_total_bytes) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk {{ $labels.mount }} almost full"
          description: "Disk usage is {{ $value }}% on {{ $labels.mount }}"
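
The per-process queries under Resource Hogs can be turned into alerts in the same way. The following is only a sketch: the group name, alert name, threshold, and `for` duration are illustrative assumptions. It queries the system_process_memory_usage_bytes:recent recording rule so that processes that have already exited are less likely to keep the alert firing.

```yaml
groups:
  - name: bottom_process_alerts        # assumed group name
    rules:
      - alert: ProcessHighMemory       # assumed alert name
        # 1073741824 bytes = 1 GiB, the same threshold as the Resource Hogs query.
        expr: system_process_memory_usage_bytes:recent > 1073741824
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Process {{ $labels.name }} is using a lot of memory"
          description: "PID {{ $labels.pid }} uses {{ $value | humanize1024 }}B"
```

Because the expression is not aggregated, each matching name and pid pair produces its own alert instance.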

Label Reference

| Label | Used In | Description |
|---|---|---|
| cpu_id | CPU metrics | CPU core identifier (0, 1, 2, ...) |
| interface | Network metrics | Network interface name (eth0, wlan0, ...) |
| device | Disk metrics | Device name (/dev/sda1, ...) |
| mount | Disk metrics | Mount point (/, /home, ...) |
| sensor | Temperature | Temperature sensor name |
| name | Process metrics | Process name |
| pid | Process metrics | Process ID |
| exported_job | All | Always "bottom-system-monitor" |
| otel_scope_name | All | Always "bottom-system-monitor" |

Data Retention

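By default, Prometheus stores metrics for 15 days. Retention is set with the --storage.tsdb.retention.time startup flag on the Prometheus process; global.retention_time is not a recognized prometheus.yml setting:

# Prometheus startup flag: keep data for 30 days
prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=30d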
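
In a Docker Compose deployment, Prometheus's --storage.tsdb.retention.time flag is usually passed through the service's command. Here is a minimal sketch, assuming a service named prometheus that runs the official prom/prometheus image with its default paths. The service name, image, and volume mount are assumptions and are not taken from this repository's compose file.

```yaml
# Hypothetical compose fragment; adapt names and paths to the actual compose file.
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      # Setting command replaces the image's default arguments, so restate them.
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Keep data for 30 days instead of the 15-day default.
      - --storage.tsdb.retention.time=30d
```

The container must be recreated for a changed flag to take effect, since command-line flags are read only at startup.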

For long-term storage, consider using:

  • TimescaleDB (see docker-compose-timescale.yml.ko)
  • Thanos for multi-cluster metrics
  • Cortex for horizontally scalable storage