init
docker-compose/METRICS.md (new file, 227 lines)
@@ -0,0 +1,227 @@
# Bottom OpenTelemetry Metrics Reference

This document lists all metrics exported by Bottom when running with the `opentelemetry` feature enabled.

## System Metrics

### CPU

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_cpu_usage_percent` | Gauge | `cpu_id` | CPU usage percentage per core |

**Example:**

```promql
# Average CPU across all cores
avg(system_cpu_usage_percent)

# CPU usage for core 0
system_cpu_usage_percent{cpu_id="0"}
```
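
The same queries can be issued outside the web UI through the Prometheus HTTP API. A minimal sketch, assuming the Prometheus instance from this directory's `docker-compose.yml` is reachable on `localhost:9090`:

```bash
# Instant query via the Prometheus HTTP API (/api/v1/query).
# Returns a JSON document with one sample per matching series.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=avg(system_cpu_usage_percent)'
```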

### Memory

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_memory_usage_bytes` | Gauge | - | RAM currently in use |
| `system_memory_total_bytes` | Gauge | - | Total RAM available |
| `system_swap_usage_bytes` | Gauge | - | Swap memory currently in use |
| `system_swap_total_bytes` | Gauge | - | Total swap memory available |

**Example:**

```promql
# Memory usage percentage
(system_memory_usage_bytes / system_memory_total_bytes) * 100

# Available memory
system_memory_total_bytes - system_memory_usage_bytes
```

### Network

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_network_rx_bytes_rate` | Gauge | `interface` | Network receive rate in bytes/sec |
| `system_network_tx_bytes_rate` | Gauge | `interface` | Network transmit rate in bytes/sec |

**Example:**

```promql
# Total network throughput
sum(system_network_rx_bytes_rate) + sum(system_network_tx_bytes_rate)

# RX rate for specific interface
system_network_rx_bytes_rate{interface="eth0"}
```

### Disk

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_disk_usage_bytes` | Gauge | `device`, `mount` | Disk space currently in use |
| `system_disk_total_bytes` | Gauge | `device`, `mount` | Total disk space available |

**Example:**

```promql
# Disk usage percentage
(system_disk_usage_bytes / system_disk_total_bytes) * 100

# Free disk space
system_disk_total_bytes - system_disk_usage_bytes
```

### Temperature

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_temperature_celsius` | Gauge | `sensor` | Temperature readings in Celsius |

**Example:**

```promql
# Average temperature across all sensors
avg(system_temperature_celsius)

# Maximum temperature
max(system_temperature_celsius)
```

## Process Metrics

| Metric Name | Type | Labels | Description |
|------------|------|--------|-------------|
| `system_process_cpu_usage_percent` | Gauge | `name`, `pid` | CPU usage percentage per process |
| `system_process_memory_usage_bytes` | Gauge | `name`, `pid` | Memory usage in bytes per process |
| `system_process_count` | Gauge | - | Total number of processes |

**Example:**

```promql
# Top 10 processes by CPU
topk(10, system_process_cpu_usage_percent)

# Top 10 processes by memory
topk(10, system_process_memory_usage_bytes)

# Total memory used by all Chrome processes
sum(system_process_memory_usage_bytes{name=~".*chrome.*"})
```
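
The process series can also be pulled programmatically; a minimal sketch that prints the top-10-by-CPU result as plain text, assuming `jq` is installed and Prometheus is reachable on `localhost:9090`:

```bash
# Print the top-10-by-CPU result as "name pid value" lines.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, system_process_cpu_usage_percent)' \
  | jq -r '.data.result[] | "\(.metric.name) \(.metric.pid) \(.value[1])"'
```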

## Recording Rules

The following recording rules are pre-configured in Prometheus (see `rules/bottom_rules.yml`):

| Rule Name | Source Metric | Description |
|-----------|---------------|-------------|
| `system_process_cpu_usage_percent:recent` | `system_process_cpu_usage_percent` | Filters out stale process data (>2 min old) |
| `system_process_memory_usage_bytes:recent` | `system_process_memory_usage_bytes` | Filters out stale process data (>2 min old) |

**Example:**

```promql
# Query only recent process data
topk(10, system_process_cpu_usage_percent:recent)
```
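
To confirm the recording rules were actually loaded (rather than silently skipped), the rules API can be queried; a sketch, assuming `jq` and the default `localhost:9090` address:

```bash
# List the rule groups Prometheus has loaded, to confirm that
# rules/bottom_rules.yml was picked up.
curl -s http://localhost:9090/api/v1/rules \
  | jq -r '.data.groups[] | .name'
```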

## Common Queries

### System Health

```promql
# Overall system CPU usage
avg(system_cpu_usage_percent)

# Memory pressure (>80% is high)
(system_memory_usage_bytes / system_memory_total_bytes) * 100

# Disk pressure (>90% is critical)
(system_disk_usage_bytes / system_disk_total_bytes) * 100
```

### Resource Hogs

```promql
# Top CPU consumers
topk(5, system_process_cpu_usage_percent)

# Top memory consumers
topk(5, system_process_memory_usage_bytes)

# Processes using >1GB memory
system_process_memory_usage_bytes > 1073741824
```

### Network Analysis

```promql
# Total network traffic (RX + TX)
sum(system_network_rx_bytes_rate) + sum(system_network_tx_bytes_rate)

# Network traffic by interface
sum by (interface) (system_network_rx_bytes_rate + system_network_tx_bytes_rate)

# Interfaces with high RX rate (>10MB/s)
system_network_rx_bytes_rate > 10485760
```
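
The byte thresholds above are raw constants; a quick shell check of where they come from (1 GiB for the memory filter, 10 MiB/s for the RX filter):

```bash
# 1 GiB and 10 MiB expressed in bytes, matching the literals used above.
echo $((1024 * 1024 * 1024))   # 1073741824
echo $((10 * 1024 * 1024))     # 10485760
```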

## Alerting Examples

### Sample Prometheus Alert Rules

```yaml
groups:
  - name: bottom_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: avg(system_cpu_usage_percent) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "Average CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (system_memory_usage_bytes / system_memory_total_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskAlmostFull
        expr: (system_disk_usage_bytes / system_disk_total_bytes) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Disk {{ $labels.mount }} almost full"
          description: "Disk usage is {{ $value }}% on {{ $labels.mount }}"
```
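
Rule files like this can be validated before Prometheus loads them using `promtool`, which ships in the `prom/prometheus` image; a sketch, assuming the alert rules are saved under the mounted `rules/` directory as `bottom_alerts.yml` (a hypothetical file name):

```bash
# Validate alerting rule syntax with promtool from the running Prometheus container.
# The file name bottom_alerts.yml is only an example.
docker-compose exec prometheus promtool check rules /etc/prometheus/rules/bottom_alerts.yml
```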

## Label Reference

| Label | Used In | Description |
|-------|---------|-------------|
| `cpu_id` | CPU metrics | CPU core identifier (0, 1, 2, ...) |
| `interface` | Network metrics | Network interface name (eth0, wlan0, ...) |
| `device` | Disk metrics | Device name (/dev/sda1, ...) |
| `mount` | Disk metrics | Mount point (/, /home, ...) |
| `sensor` | Temperature | Temperature sensor name |
| `name` | Process metrics | Process name |
| `pid` | Process metrics | Process ID |
| `exported_job` | All | Always "bottom-system-monitor" |
| `otel_scope_name` | All | Always "bottom-system-monitor" |

## Data Retention

By default, Prometheus stores metrics for 15 days. Retention is controlled by the `--storage.tsdb.retention.time` flag rather than by a setting in `prometheus.yml`, so adjust it under the `command` of the `prometheus` service:

```yaml
# In docker-compose.yml, under the prometheus service
command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.retention.time=30d'  # Keep data for 30 days
```
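
To confirm which retention value the running Prometheus actually uses, the flags API can be inspected; a sketch, assuming `jq` is installed:

```bash
# Show the retention flag Prometheus is currently running with.
curl -s http://localhost:9090/api/v1/status/flags \
  | jq -r '.data["storage.tsdb.retention.time"]'
```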

For long-term storage, consider using:
- **TimescaleDB** (see `docker-compose-timescale.yml.ko`)
- **Thanos** for multi-cluster metrics
- **Cortex** for horizontally scalable storage

docker-compose/README.md (new file, 195 lines)
@@ -0,0 +1,195 @@
# Bottom OpenTelemetry Docker Compose Setup

This directory contains a Docker Compose setup for running an observability stack to monitor Bottom with OpenTelemetry.

## Architecture

The stack includes:

1. **OpenTelemetry Collector** - Receives metrics from Bottom via the OTLP protocol
2. **Prometheus** - Scrapes and stores metrics from the OTEL Collector
3. **Grafana** - Visualizes metrics from Prometheus

```
Bottom (with --headless flag)
  ↓ (OTLP/gRPC on port 4317)
OpenTelemetry Collector
  ↓ (Prometheus scrape on port 8889)
Prometheus
  ↓ (Query on port 9090)
Grafana (accessible on port 3000)
```

## Quick Start

### 1. Start the observability stack

```bash
cd docker-compose
docker-compose up -d
```

This will start:
- OpenTelemetry Collector on ports 4317 (gRPC), 4318 (HTTP), 8889 (metrics)
- Prometheus on port 9090
- Grafana on port 3000
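
To confirm the containers came up, a quick check (using Docker Compose and the bundled test script):

```bash
# List the containers defined in this directory and their state
docker-compose ps

# Or run the bundled end-to-end check
./test-stack.sh
```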

### 2. Build Bottom with OpenTelemetry support

```bash
cd ..
cargo build --release --features opentelemetry
```

### 3. Create a configuration file

Create a `bottom-config.toml` file:

```toml
[opentelemetry]
enabled = true
endpoint = "http://localhost:4317"
service_name = "bottom-system-monitor"
export_interval_ms = 5000

[opentelemetry.metrics]
cpu = true
memory = true
network = true
disk = true
processes = true
temperature = true
gpu = true
```

### 4. Run Bottom in headless mode

```bash
./target/release/btm --config bottom-config.toml --headless
```

Or, without a config file:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
./target/release/btm --headless
```
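
Once Bottom is exporting, the Collector should republish the metrics on its Prometheus endpoint (port 8889); a quick check, assuming the default ports from `docker-compose.yml`:

```bash
# The system_* series should appear here within one export interval.
curl -s http://localhost:8889/metrics | grep '^system_cpu_usage_percent' | head
```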

### 5. Access the dashboards

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3000 (username: `admin`, password: `admin`)

## Configuration Files

### otel-collector-config.yml

Configures the OpenTelemetry Collector to:
- Receive OTLP data on ports 4317 (gRPC) and 4318 (HTTP)
- Export metrics in Prometheus format on port 8889
- Log all received data via the debug exporter

### prometheus.yml

Configures Prometheus to:
- Scrape metrics from the OTEL Collector every 10 seconds
- Load recording rules from `rules/bottom_rules.yml`

### rules/bottom_rules.yml

Contains Prometheus recording rules for Bottom metrics, including:
- Recent process CPU usage metrics
- Recent process memory usage metrics

## Viewing Metrics in Prometheus

1. Go to http://localhost:9090
2. Click on "Graph"
3. Try these example queries:

```promql
# CPU usage by core
system_cpu_usage_percent

# Memory usage
system_memory_usage_bytes

# Network RX/TX
system_network_rx_bytes_rate
system_network_tx_bytes_rate

# Disk usage
system_disk_usage_bytes

# Top processes by CPU
topk(10, system_process_cpu_usage_percent)

# Top processes by memory
topk(10, system_process_memory_usage_bytes)
```
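
The same queries can be run from a terminal with `promtool query instant`, executed here inside the Prometheus container (a sketch assuming the `prometheus` service name from `docker-compose.yml`):

```bash
# Run an instant query without opening the web UI.
docker-compose exec prometheus promtool query instant \
  http://localhost:9090 'topk(10, system_process_cpu_usage_percent)'
```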

## Grafana Configuration

Grafana is automatically configured with:
- **Prometheus data source** (http://prometheus:9090) - pre-configured
- **Bottom System Overview dashboard** - pre-loaded

To access:
1. Go to http://localhost:3000 (username: `admin`, password: `admin`)
2. Navigate to Dashboards → Browse → "Bottom System Overview"

The dashboard includes:
- CPU usage by core
- Memory usage (RAM/Swap)
- Network traffic
- Disk usage
- Top 10 processes by CPU
- Top 10 processes by Memory

## Stopping the Stack

```bash
docker-compose down
```

To also remove volumes:

```bash
docker-compose down -v
```

## Troubleshooting

### Bottom not sending metrics

Check the OTEL Collector logs:

```bash
docker-compose logs -f otel-collector
```

You should see messages about receiving metrics.

### Prometheus not scraping

1. Check Prometheus targets at http://localhost:9090/targets
2. The `otel-collector` target should be UP
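
The same check can be scripted against the targets API (this is essentially what `test-stack.sh` does); a minimal sketch, assuming `jq` is installed:

```bash
# Show each scrape target and its health ("up" means scraping is working).
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.health)"'
```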

### No data in Grafana

1. Verify the Prometheus data source is configured correctly
2. Check that Prometheus has data by querying it directly
3. Ensure your time range in Grafana includes when Bottom was running

## Advanced Configuration

### Using with TimescaleDB (optional)

A TimescaleDB configuration file is available as `docker-compose-timescale.yml.ko` for long-term storage of metrics. Rename it to include it in your stack.

### Custom Prometheus Rules

Edit `rules/bottom_rules.yml` to add custom recording or alerting rules.

### OTEL Collector Sampling

Edit `otel-collector-config.yml` to adjust the batch processor settings for different performance characteristics.

docker-compose/docker-compose-timescale.yml.ko (new file, 61 lines)
@@ -0,0 +1,61 @@
services:
  timescaledb:
    image: timescale/timescaledb-ha:pg15
    environment:
      POSTGRES_PASSWORD: password
      POSTGRES_DB: promscale
      POSTGRES_USER: postgres
    ports:
      - "5432:5432"
    volumes:
      - timescale_data:/var/lib/postgresql/data

  promscale:
    image: timescale/promscale:latest
    ports:
      - "9201:9201"
    depends_on:
      - timescaledb
    environment:
      PROMSCALE_DB_URI: postgres://postgres:password@timescaledb:5432/promscale?sslmode=disable
      PROMSCALE_STARTUP_INSTALL_EXTENSIONS: "true"
    restart: on-failure

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317"

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
    ports:
      - "9090:9090" # Prometheus web UI
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    depends_on:
      - otel-collector

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_SECURITY_ADMIN_USER=admin
    volumes:
      - grafana-storage:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  grafana-storage:
  timescale_data:

docker-compose/docker-compose.yml (new file, 52 lines)
@@ -0,0 +1,52 @@
services:

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yml"]
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    ports:
      - "4317:4317" # gRPC
      - "4318:4318" # HTTP
      - "8889:8889" # Prometheus metrics endpoint
    networks:
      - observ-net

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
    ports:
      - "9090:9090" # Prometheus web UI
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    depends_on:
      - otel-collector
    networks:
      - observ-net

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_SECURITY_ADMIN_USER=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - observ-net

volumes:
  grafana-storage:

networks:
  observ-net:
    driver: bridge
@@ -0,0 +1,278 @@
{
  "title": "Bottom System Overview",
  "uid": "bottom-overview",
  "timezone": "browser",
  "schemaVersion": 16,
  "refresh": "5s",
  "editable": true,
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage by Core",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
      "targets": [
        {"expr": "system_cpu_usage_percent", "legendFormat": "Core {{cpu_id}}", "refId": "CPU"}
      ],
      "fieldConfig": {"defaults": {"unit": "percent", "min": 0, "max": 100}}
    },
    {
      "id": 2,
      "title": "Memory Usage",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
      "targets": [
        {"expr": "system_memory_usage_bytes", "legendFormat": "RAM Used", "refId": "RAM"},
        {"expr": "system_memory_total_bytes", "legendFormat": "RAM Total", "refId": "RAM_Total"},
        {"expr": "system_swap_usage_bytes", "legendFormat": "Swap Used", "refId": "Swap"},
        {"expr": "system_swap_total_bytes", "legendFormat": "Swap Total", "refId": "Swap_Total"}
      ],
      "fieldConfig": {"defaults": {"unit": "bytes"}}
    },
    {
      "id": 3,
      "title": "Network Traffic",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [
        {"expr": "system_network_rx_bytes_rate", "legendFormat": "RX - {{interface}}", "refId": "RX"},
        {"expr": "system_network_tx_bytes_rate", "legendFormat": "TX - {{interface}}", "refId": "TX"}
      ],
      "fieldConfig": {"defaults": {"unit": "Bps"}}
    },
    {
      "id": 4,
      "title": "Disk Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [
        {"expr": "(system_disk_usage_bytes / system_disk_total_bytes) * 100", "legendFormat": "{{mount}} ({{device}})", "refId": "Disk"}
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 70, "color": "yellow"},
              {"value": 90, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "id": 5,
      "title": "Top 10 Processes by CPU",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 16},
      "targets": [
        {
          "expr": "topk(10, system_process_cpu_usage_percent and (time() - timestamp(system_process_cpu_usage_percent) < 30))",
          "format": "table",
          "instant": true,
          "refId": "Process"
        }
      ],
      "transformations": [
        {
          "id": "organize",
          "options": {
            "excludeByName": {"Time": true, "__name__": true, "job": true, "instance": true, "exported_job": true, "otel_scope_name": true},
            "indexByName": {"name": 0, "pid": 1, "Value": 2},
            "renameByName": {"name": "Process Name", "pid": "PID", "Value": "CPU %"}
          }
        }
      ],
      "options": {
        "showHeader": true,
        "sortBy": [{"displayName": "CPU %", "desc": true}]
      },
      "fieldConfig": {
        "defaults": {"custom": {"align": "auto", "displayMode": "auto"}},
        "overrides": [
          {
            "matcher": {"id": "byName", "options": "CPU %"},
            "properties": [
              {"id": "unit", "value": "percent"},
              {"id": "custom.displayMode", "value": "color-background"},
              {
                "id": "thresholds",
                "value": {
                  "mode": "absolute",
                  "steps": [
                    {"value": 0, "color": "green"},
                    {"value": 50, "color": "yellow"},
                    {"value": 80, "color": "red"}
                  ]
                }
              }
            ]
          }
        ]
      }
    },
    {
      "id": 6,
      "title": "Top 10 Processes by Memory",
      "type": "table",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 16},
      "targets": [
        {
          "expr": "topk(10, system_process_memory_usage_bytes and (time() - timestamp(system_process_memory_usage_bytes) < 30))",
          "format": "table",
          "instant": true,
          "refId": "Process"
        }
      ],
      "transformations": [
        {
          "id": "organize",
          "options": {
            "excludeByName": {"Time": true, "__name__": true, "job": true, "instance": true, "exported_job": true, "otel_scope_name": true},
            "indexByName": {"name": 0, "pid": 1, "Value": 2},
            "renameByName": {"name": "Process Name", "pid": "PID", "Value": "Memory"}
          }
        }
      ],
      "options": {
        "showHeader": true,
        "sortBy": [{"displayName": "Memory", "desc": true}]
      },
      "fieldConfig": {
        "defaults": {"custom": {"align": "auto", "displayMode": "auto"}},
        "overrides": [
          {
            "matcher": {"id": "byName", "options": "Memory"},
            "properties": [
              {"id": "unit", "value": "bytes"},
              {"id": "custom.displayMode", "value": "color-background"},
              {
                "id": "thresholds",
                "value": {
                  "mode": "absolute",
                  "steps": [
                    {"value": 0, "color": "green"},
                    {"value": 1073741824, "color": "yellow"},
                    {"value": 2147483648, "color": "red"}
                  ]
                }
              }
            ]
          }
        ]
      }
    }
  ]
}
@@ -0,0 +1,12 @@
apiVersion: 1

providers:
  - name: 'Bottom Dashboards'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
@@ -0,0 +1,12 @@
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: 10s
      queryTimeout: 60s

docker-compose/otel-collector-config.yml (new file, 31 lines)
@@ -0,0 +1,31 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  metricsgeneration: {}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus, debug]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

docker-compose/processes-example.toml (new file, 67 lines)
@@ -0,0 +1,67 @@
# Example process filter configuration file
# This file can be included from the main bottom config to keep
# server-specific process lists separate.
#
# Usage in bottom-config.toml:
#   [opentelemetry.metrics.process_filter]
#   include = "processes.toml"

# Filter mode: "whitelist" or "blacklist"
# - whitelist: Only export metrics for processes in the lists below
# - blacklist: Export metrics for all processes EXCEPT those in the lists
filter_mode = "whitelist"

# Process names to monitor (case-insensitive substring match)
# Examples for common server processes:
names = [
    # Web servers
    "nginx",
    "apache",
    "httpd",

    # Databases
    "postgres",
    "mysql",
    "redis",
    "mongodb",

    # Application servers
    "java",
    "node",
    "python",

    # Your custom applications
    # "myapp",
]

# Regex patterns to match process names (case-sensitive)
# More powerful than simple substring matching
patterns = [
    # Match specific versions
    # "^nginx-[0-9.]+$",
    # "^node-v[0-9]+",

    # Match Java applications with a specific main class
    # "java.*MyApplication",

    # Match processes with a specific format
    # "^gunicorn: worker",

    # Match kernel threads (for blacklist)
    # "^\\[.*\\]$",
]

# Specific process PIDs to monitor (optional)
# Useful for monitoring specific long-running processes
pids = []

# Example blacklist configuration:
# filter_mode = "blacklist"
# names = [
#     "systemd",   # Exclude system processes
#     "kworker",
#     "migration",
# ]
# patterns = [
#     "^\\[.*\\]$",  # Exclude all kernel threads
# ]

docker-compose/prometheus.yml (new file, 21 lines)
@@ -0,0 +1,21 @@
global:
  scrape_interval: 10s      # How often to scrape targets
  evaluation_interval: 10s

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Job 1: Monitor that Prometheus itself is up
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Job 2: Scrape the OpenTelemetry Collector
  - job_name: 'otel-collector'
    # The Collector exposes metrics for scraping on its port 8889
    metrics_path: '/metrics'
    static_configs:
      # Reach the Collector via its Docker service name
      - targets: ['otel-collector:8889']

docker-compose/rules/bottom_rules.yml (new file, 15 lines)
@@ -0,0 +1,15 @@
groups:
  - name: bottom_process_metrics
    interval: 30s
    rules:
      - record: system_process_cpu_usage_percent:recent
        expr: |
          system_process_cpu_usage_percent
          and on(pid, name)
          (time() - timestamp(system_process_cpu_usage_percent) < 120)

      - record: system_process_memory_usage_bytes:recent
        expr: |
          system_process_memory_usage_bytes
          and on(pid, name)
          (time() - timestamp(system_process_memory_usage_bytes) < 120)

docker-compose/symon-config-example.toml (new file, 61 lines)
@@ -0,0 +1,61 @@
# Example Symon configuration file for OpenTelemetry export
# Copy this file and customize it for your needs

# Collection interval in seconds
collection_interval_secs = 5

# OTLP configuration
[otlp]
# OTLP endpoint (gRPC)
# For local docker-compose setup: http://localhost:4317
# For remote collector: http://your-collector-host:4317
endpoint = "http://localhost:4317"

# Export interval in seconds
export_interval_secs = 10

# Service name that will appear in metrics
service_name = "symon"

# Service version
service_version = "0.1.0"

# Export timeout in seconds
export_timeout_secs = 30

# Additional resource attributes (key-value pairs)
[otlp.resource_attributes]
environment = "production"
host = "server-01"

# Metrics configuration - enable/disable specific metric types
[metrics]
cpu = true          # CPU usage per core and average
memory = true       # RAM, swap usage
network = true      # Network RX/TX
disk = true         # Disk usage
temperature = true  # CPU/GPU temperatures
processes = true    # Top 10 processes by CPU/Memory

# Process filtering configuration
[metrics.process_filter]
# Option 1: Use an external file for server-specific process lists
# This allows different servers to monitor different processes
# Path can be relative to this config file or absolute
#include = "processes.toml"

# Option 2: Configure inline
# Filter mode: "whitelist" (only listed processes) or "blacklist" (exclude listed)
filter_mode = "whitelist"

# List of process names to filter (case-insensitive substring match)
# Examples: ["nginx", "postgres", "redis", "myapp"]
names = ["nginx", "postgres", "redis"]

# List of regex patterns to match process names (case-sensitive)
# More powerful than substring matching
# Examples: ["^nginx-[0-9.]+$", "java.*MyApp", "^gunicorn: worker"]
patterns = []

# List of specific process PIDs to filter
pids = []

docker-compose/test-stack.sh (new executable file, 80 lines)
@@ -0,0 +1,80 @@
#!/bin/bash
# Test script to verify the observability stack is running correctly

set -e

echo "🔍 Testing Bottom OpenTelemetry Stack..."
echo ""

# Colors
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# Test OTEL Collector gRPC endpoint
echo -n "Testing OTEL Collector gRPC (port 4317)... "
if nc -zv localhost 4317 2>&1 | grep -q "succeeded\|open"; then
    echo -e "${GREEN}✓ OK${NC}"
else
    echo -e "${RED}✗ FAILED${NC}"
    exit 1
fi

# Test OTEL Collector HTTP endpoint
echo -n "Testing OTEL Collector HTTP (port 4318)... "
if nc -zv localhost 4318 2>&1 | grep -q "succeeded\|open"; then
    echo -e "${GREEN}✓ OK${NC}"
else
    echo -e "${RED}✗ FAILED${NC}"
    exit 1
fi

# Test OTEL Collector metrics endpoint
echo -n "Testing OTEL Collector metrics (port 8889)... "
if curl -s http://localhost:8889/metrics > /dev/null; then
    echo -e "${GREEN}✓ OK${NC}"
else
    echo -e "${RED}✗ FAILED${NC}"
    exit 1
fi

# Test Prometheus
echo -n "Testing Prometheus (port 9090)... "
if curl -s http://localhost:9090/-/healthy | grep -q "Prometheus"; then
    echo -e "${GREEN}✓ OK${NC}"
else
    echo -e "${RED}✗ FAILED${NC}"
    exit 1
fi

# Test Prometheus targets
echo -n "Testing Prometheus targets... "
TARGETS=$(curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"up"' | wc -l)
if [ "$TARGETS" -gt 0 ]; then
    echo -e "${GREEN}✓ OK${NC} (${TARGETS} targets up)"
else
    echo -e "${YELLOW}⚠ WARNING${NC} (no targets up yet - this is normal if just started)"
fi

# Test Grafana
echo -n "Testing Grafana (port 3000)... "
if curl -s http://localhost:3000/api/health | grep -q "ok"; then
    echo -e "${GREEN}✓ OK${NC}"
else
    echo -e "${RED}✗ FAILED${NC}"
    exit 1
fi

echo ""
echo -e "${GREEN}✓ All tests passed!${NC}"
echo ""
echo "📊 Access points:"
echo "  - Prometheus: http://localhost:9090"
echo "  - Grafana: http://localhost:3000 (admin/admin)"
echo "  - OTEL Collector metrics: http://localhost:8889/metrics"
echo ""
echo "💡 Next steps:"
echo "  1. Build bottom with: cargo build --release --features opentelemetry"
echo "  2. Run in headless mode: ./target/release/btm --headless"
echo "  3. Check metrics in Prometheus: http://localhost:9090/graph"