Alert Me on Disk Full (Telegraf + InfluxDB + Grafana)
Problem
I have about 40 Linux containers running, and I want to be notified when any of them is close to filling up its disk.
Solution
I’ve decided to use the Telegraf + InfluxDB + Grafana stack again. I previously used it to alert me on high temperatures.
So here goes configuring another 40-plus Linux containers…
Look how beautiful my alerts are :)
ToC
This article assumes you have already set up and are familiar with InfluxDB and Grafana.
For each Linux container, we would do the following steps: 1. InfluxDB, 2. Telegraf, 3. Grafana.
Or we can take the automated route: Auto Telegraf & InfluxDB, then Auto Grafana.
1. InfluxDB
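In the InfluxDB web console, create a bucket for the container's metrics and an API token that can write to it. Both go into the Telegraf configuration in the next step.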
2. Telegraf
Download the Telegraf binary from its GitHub releases page.
The telegraf.conf is simple because we only monitor the root disk (make sure to replace the placeholder values below!):
[global_tags]
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = "0s"
[[inputs.disk]]
mount_points = ["/"]
[[outputs.influxdb_v2]]
urls = ["$INFLUXDB_URL_HERE"]
organization = "$INFLUXDB_ORG_NAME_HERE"
bucket = "$INFLUXDB_BUCKET_NAME_HERE"
token = "$INFLUXDB_API_TOKEN_HERE"
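The four placeholders happen to be valid `string.Template` names, so one way to fill them in without hand-editing is a small Python sketch like the one below. The template is abbreviated to the output section, and the values in the example call are made up for illustration.

```python
from string import Template

# Abbreviated copy of the [[outputs.influxdb_v2]] section from telegraf.conf.
OUTPUT_SECTION = Template("""[[outputs.influxdb_v2]]
urls = ["$INFLUXDB_URL_HERE"]
organization = "$INFLUXDB_ORG_NAME_HERE"
bucket = "$INFLUXDB_BUCKET_NAME_HERE"
token = "$INFLUXDB_API_TOKEN_HERE"
""")


def render_output_section(url, org, bucket, token):
    """Return the output section with all four placeholders filled in."""
    return OUTPUT_SECTION.substitute(
        INFLUXDB_URL_HERE=url,
        INFLUXDB_ORG_NAME_HERE=org,
        INFLUXDB_BUCKET_NAME_HERE=bucket,
        INFLUXDB_API_TOKEN_HERE=token,
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(render_output_section(
        "http://influxdb.example:8086", "my-org", "telegraf-lxc-adguard", "my-token"))
```

Using `substitute` rather than `safe_substitute` means a misspelled placeholder fails loudly instead of ending up in the config.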
Run telegraf with this configuration:
telegraf -config telegraf.conf
Verify in the InfluxDB web UI that metrics are being received.
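If you would rather check from a terminal than from the web UI, the following is a rough sketch that asks the InfluxDB v2 query endpoint for the newest `used_percent` sample. The 5-minute lookback and the function names are my own choices, and the parser assumes the plain CSV shape InfluxDB returns for a Flux query.

```python
import csv
import io
import urllib.parse
import urllib.request


def build_flux(bucket):
    """Flux query for the newest used_percent sample from the last 5 minutes."""
    return (
        f'from(bucket: "{bucket}")\n'
        '  |> range(start: -5m)\n'
        '  |> filter(fn: (r) => r["_measurement"] == "disk")\n'
        '  |> filter(fn: (r) => r["_field"] == "used_percent")\n'
        '  |> last()\n'
    )


def parse_values(csv_text):
    """Extract the _value column, skipping annotations, blanks, and repeated headers."""
    lines = [ln for ln in csv_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    rows = csv.DictReader(io.StringIO("\n".join(lines)))
    return [float(r["_value"]) for r in rows if r["_value"] != "_value"]


def latest_used_percent(url, org, token, bucket):
    """POST the query to /api/v2/query and return the parsed values."""
    req = urllib.request.Request(
        f"{url}/api/v2/query?org={urllib.parse.quote(org)}",
        data=build_flux(bucket).encode(),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/vnd.flux",
            "Accept": "application/csv",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return parse_values(resp.read().decode())
```

An empty list from `latest_used_percent` means no disk metrics arrived in the lookback window, which usually points at the Telegraf output settings.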
3. Grafana
Here we will be creating an alert for a single linux container.
Let’s create a new rule:
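Conceptually, the rule takes the most recent `used_percent` sample in the evaluation window and fires when it is above a threshold (90 in the provisioned version later in this article). Here is a rough Python sketch of that logic. It is not how Grafana implements it, and it ignores the Pending state.

```python
THRESHOLD_PERCENT = 90  # same threshold as the provisioned rule later in this article


def reduce_last(samples):
    """Reduce step: keep only the most recent sample, or None if there is no data."""
    return samples[-1] if samples else None


def evaluate(samples, threshold=THRESHOLD_PERCENT):
    """Threshold step: map the reduced value to an alert state name."""
    last = reduce_last(samples)
    if last is None:
        return "NoData"
    # The comparison is strictly greater-than, matching the `gt` evaluator.
    return "Alerting" if last > threshold else "Normal"
```

Because only the last sample counts, a disk that briefly crossed the threshold and then dropped back evaluates as normal again.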
Automate for 40 Linux Containers
Repeating the three steps above 40 times is too much. I’m lazy, so I’ve written a script to automate all of this.
Here it is on GitHub: https://github.com/TheRealMarcusChiu/proxmox-scripts
Auto Telegraf & InfluxDB
To install Telegraf on all your Linux containers and automatically create their InfluxDB buckets, create a file proxmox-server-setup-lxcs.sh on your Proxmox server:
#!/bin/bash
INFLUXDB_URL="REPLACE_ME"
INFLUXDB_API_TOKEN="REPLACE_ME"
INFLUXDB_ORG_ID="REPLACE_ME"
INFLUXDB_ORG_NAME="REPLACE_ME"
INFLUXDB_BUCKET_NAME_PREFIX="telegraf-lxc-"

# Loop over running containers only (column 2 of `pct list` is the status)
pct list | awk '$2 == "running"' | while read -r line; do
  ID=$(echo "$line" | awk '{print $1}')
  # The name is the last column; the Lock column before it may be empty
  NAME=$(echo "$line" | awk '{print $NF}')
  INFLUXDB_BUCKET_NAME="$INFLUXDB_BUCKET_NAME_PREFIX$NAME"

  # Create telegraf bucket (604800 seconds = 7 days of retention)
  curl -X POST "$INFLUXDB_URL/api/v2/buckets" \
    -H "Authorization: Token $INFLUXDB_API_TOKEN" \
    -H "Content-type: application/json" \
    -d "{
      \"name\": \"$INFLUXDB_BUCKET_NAME\",
      \"orgID\": \"$INFLUXDB_ORG_ID\",
      \"retentionRules\": [{
        \"type\": \"expire\",
        \"everySeconds\": 604800
      }]
    }"

  echo "Updating container: $ID $NAME"
  # </dev/null guards against commands in the container consuming the loop's input
  pct exec "$ID" -- bash -c "apt-get update && apt-get install -y git" </dev/null
  # Clone to an explicit path, and only if the repo is not there yet
  pct exec "$ID" -- bash -c "[ -d /root/proxmox-scripts ] || git clone https://github.com/TheRealMarcusChiu/proxmox-scripts.git /root/proxmox-scripts" </dev/null
  pct exec "$ID" -- bash -c "cd /root/proxmox-scripts && git pull" </dev/null
  pct exec "$ID" -- bash -c "export INFLUXDB_URL=\"$INFLUXDB_URL\" && export INFLUXDB_API_TOKEN=\"$INFLUXDB_API_TOKEN\" && export INFLUXDB_ORG_NAME=\"$INFLUXDB_ORG_NAME\" && export INFLUXDB_BUCKET_NAME=\"$INFLUXDB_BUCKET_NAME\" && cd /root/proxmox-scripts/telegraf && /root/proxmox-scripts/telegraf/setup.sh > /root/proxmox-scripts/telegraf/log.txt" </dev/null
  pct exec "$ID" -- bash -c "systemctl show -p SubState,ActiveState,Result telegraf > /root/proxmox-scripts/telegraf/output.txt" </dev/null
  echo "Finished updating container: $ID $NAME"
done
Make it executable, then execute it:
chmod +x proxmox-server-setup-lxcs.sh
./proxmox-server-setup-lxcs.sh
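The bucket-creation request is the part of the script most prone to quoting mistakes, because the JSON body is escaped by hand inside a shell string. As a sketch of the same request in Python, letting `json.dumps` handle the escaping (the function names and the `retention_days` parameter are my own, not part of the script):

```python
import json
import urllib.request

SECONDS_PER_DAY = 24 * 60 * 60


def bucket_payload(name, org_id, retention_days=7):
    """Build the JSON body for POST /api/v2/buckets (7 days = 604800 seconds)."""
    return json.dumps({
        "name": name,
        "orgID": org_id,
        "retentionRules": [
            {"type": "expire", "everySeconds": retention_days * SECONDS_PER_DAY}
        ],
    })


def create_bucket(url, token, name, org_id):
    """Send the request; urlopen raises HTTPError on a non-2xx response."""
    req = urllib.request.Request(
        f"{url}/api/v2/buckets",
        data=bucket_payload(name, org_id).encode(),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Note that the API expects the organization ID here, not the organization name, which is why the script carries both `INFLUXDB_ORG_ID` and `INFLUXDB_ORG_NAME`.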
Auto Grafana
To auto-create alert rules in Grafana, create the following Python file, create-grafana-alert-group.py:
import hashlib

header = """apiVersion: 1
groups:
  - orgId: 1
    name: evaluation-group-disk-almost-full
    folder: Disk Almost Full
    interval: 1m
    rules:
"""

alert_rule_template = """      - uid: ALERT_UID_HERE
        title: BUCKET_NAME_HERE
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 60
              to: 0
            datasourceUid: deu40om3f7xfkc
            model:
              intervalMs: 1000
              maxDataPoints: 43200
              query: |-
                from(bucket: "BUCKET_NAME_HERE")
                  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
                  |> filter(fn: (r) => r["_measurement"] == "disk")
                  |> filter(fn: (r) => r["_field"] == "used_percent")
                  |> aggregateWindow(every: v.windowPeriod, fn: last, createEmpty: false)
                  |> yield(name: "last")
              refId: A
          - refId: B
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params: []
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - B
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              reducer: last
              refId: B
              type: reduce
          - refId: C
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 90
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - C
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: B
              intervalMs: 1000
              maxDataPoints: 43200
              refId: C
              type: threshold
        noDataState: NoData
        execErrState: Error
        annotations:
          description: optional description
          summary: optional summary
        isPaused: false
        notification_settings:
          receiver: Discord
"""

bucket_names = [
    "telegraf-lxc-13ft",
    "telegraf-lxc-adguard"
]

with open("disk-almost-full.yaml", "w") as file:
    file.write(header)
    for bucket_name in bucket_names:
        hashed = hashlib.sha256(bucket_name.encode()).hexdigest()
        short_hash = hashed[:14]
        new_text = alert_rule_template.replace("ALERT_UID_HERE", short_hash, 1)
        new_text = new_text.replace("BUCKET_NAME_HERE", bucket_name, 2)
        file.write(new_text)
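In this file, modify bucket_names to match your buckets. Also replace the datasourceUid value and the Discord receiver with the data source UID and contact point name from your own Grafana instance.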
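With 40 containers, typing out bucket_names by hand gets tedious. One option is to derive the list from `pct list` on the Proxmox server, since the setup script names each bucket as the prefix plus the container name. This is a sketch assuming `pct list` prints the columns VMID, Status, Lock, and Name, with the name last. The helper names are hypothetical.

```python
import subprocess

BUCKET_PREFIX = "telegraf-lxc-"  # same value as INFLUXDB_BUCKET_NAME_PREFIX


def bucket_names_from_pct(output):
    """Turn `pct list` output into bucket names for running containers."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        # Need at least VMID, Status, Name. The header row is skipped
        # because its second field is "Status", not "running".
        if len(fields) >= 3 and fields[1] == "running":
            names.append(BUCKET_PREFIX + fields[-1])
    return names


def read_pct_list():
    """Run `pct list` on the Proxmox host and return its raw output."""
    result = subprocess.run(
        ["pct", "list"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the Proxmox host, `bucket_names_from_pct(read_pct_list())` gives a list you can paste into the script.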
Run this:
python create-grafana-alert-group.py
This should output a file, disk-almost-full.yaml.
On the Grafana server, copy this file to /etc/grafana/provisioning/alerting/disk-almost-full.yaml.
Then restart the Grafana server:
systemctl restart grafana-server
Grafana provisions the alert rules when it starts. Verify them in the Grafana UI.
Optional: Delete the Alert Rules
Create a Python file, create-grafana-alert-group-deletion.py:
import hashlib

header = """apiVersion: 1
deleteRules:
"""

alert_rule_template = """  - orgId: 1
    uid: ALERT_UID_HERE
"""

bucket_names = [
    "telegraf-lxc-13ft",
    "telegraf-lxc-adguard"
]

with open("disk-almost-full-deletion.yaml", "w") as file:
    file.write(header)
    for bucket_name in bucket_names:
        hashed = hashlib.sha256(bucket_name.encode()).hexdigest()
        short_hash = hashed[:14]
        new_text = alert_rule_template.replace("ALERT_UID_HERE", short_hash, 1)
        file.write(new_text)
In this file, set bucket_names to the same list you used when creating the rules.
Run this:
python create-grafana-alert-group-deletion.py
This should output a file, disk-almost-full-deletion.yaml.
On the Grafana server, copy this file to /etc/grafana/provisioning/alerting/disk-almost-full-deletion.yaml and remove disk-almost-full.yaml from that directory.
Then restart the Grafana server:
systemctl restart grafana-server
This deletes all the auto-provisioned alert rules. Verify in the Grafana UI.
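Deletion works without any stored state because both scripts derive each rule's UID from the bucket name in the same way. The sketch below isolates that derivation so you can look up which UID belongs to which bucket when checking the Grafana UI. The function names are mine.

```python
import hashlib


def alert_uid(bucket_name):
    """First 14 hex characters of the SHA-256 digest, as in both scripts."""
    return hashlib.sha256(bucket_name.encode()).hexdigest()[:14]


def uid_table(bucket_names):
    """Map each bucket name to the UID of its alert rule."""
    return {name: alert_uid(name) for name in bucket_names}


if __name__ == "__main__":
    for name, uid in uid_table(["telegraf-lxc-13ft", "telegraf-lxc-adguard"]).items():
        print(f"{uid}  {name}")
```

The flip side is that renaming a container changes its bucket name and therefore its UID, so delete the old rule before regenerating the files.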