Anomaly Alerts for monitoring using Grafana and Prometheus

p. “I started off by setting up threshold alerts, and soon the alerts became too noisy, as all these nodes are very small, ranging from 1-2 GB of memory with 1 vCPU.”

p. “Most of the stress was on the node due to its limited capacity, and the alerts would be for load, CPU, memory and disk at the node level and not the service level.”

p. “The 3-sigma rule states that approximately all our “normal” data should be within 3 standard deviations of the average value of your data.”
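
A minimal Python sketch of the rule quoted above, for illustration only. It assumes a z-score defined as (value - mean) / standard deviation and uses the population standard deviation, which is also what Prometheus' stddev_over_time computes. The function names and the baseline readings are hypothetical and do not come from the article.

```python
from statistics import mean, pstdev


def z_score(value, baseline):
    """How many standard deviations `value` sits from the baseline mean."""
    mu = mean(baseline)
    sigma = pstdev(baseline)  # population standard deviation
    return (value - mu) / sigma


def is_anomaly(value, baseline, k=3.0):
    """3-sigma rule: flag values more than k standard deviations from the mean."""
    return abs(z_score(value, baseline)) > k


# Hypothetical load readings, made up for this example.
baseline = [10, 12, 11, 13, 12, 11, 10, 12]

print(is_anomaly(12.5, baseline))  # close to the usual range
print(is_anomaly(20, baseline))    # far outside the usual range
```

Unlike a fixed threshold, the cut-off here moves with the baseline: a noisy node gets a wider "normal" band than a quiet one.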

CPU

p. “(avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[$__rate_interval])”

p. “- avg_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d]))”

p. “/ stddev_over_time(node_cpu_seconds_total{instance="mark-00-sin", job="node-exporter-mark-00-sin", mode="idle"}[1d])”
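
One caveat, offered as my own reading rather than something the article states: node_cpu_seconds_total is a cumulative counter, so averaging its raw value mostly tracks how long the node has been up. A z-score is usually easier to interpret on the per-second rate instead (in PromQL that would mean wrapping the selector in rate() and averaging it over a subquery). The sketch below shows the idea in Python with made-up samples; counter_to_rates is a hypothetical helper and is much simpler than Prometheus' own rate().

```python
from statistics import mean, pstdev


def counter_to_rates(samples):
    """Turn (timestamp_seconds, counter_value) pairs into per-second rates.

    Pairs where the counter went backwards (a reset) are skipped for
    simplicity; Prometheus' rate() treats resets differently.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 > t0 and v1 >= v0:
            rates.append((v1 - v0) / (t1 - t0))
    return rates


# Hypothetical idle-CPU counter scraped every 15 s; the last interval is
# mostly busy, so the idle rate drops sharply.
samples = [
    (0, 1000.0), (15, 1013.5), (30, 1026.5),
    (45, 1040.0), (60, 1053.2), (75, 1056.2),
]

rates = counter_to_rates(samples)
history, current = rates[:-1], rates[-1]

# z-score of the newest rate against the earlier ones.
z = (current - mean(history)) / pstdev(history)
print(f"idle rate z-score: {z:.1f}")
```

On the raw counter the same drop would be nearly invisible, because the value keeps climbing either way.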

Memory

p. “(avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[$__rate_interval])”

p. “- avg_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d]))”

p. “/ (stddev_over_time(node_memory_MemAvailable_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d]))”

Load

p. “(avg_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[$__rate_interval])”

p. “- avg_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d]))”

p. “/ stddev_over_time(node_load15{instance="mark-00-sin", job="node-exporter-mark-00-sin"}[1d])”

Disk

p. “(avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[$__rate_interval])”

p. “- avg_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[1d]))”

p. “/ (stddev_over_time(node_filesystem_avail_bytes{instance="mark-00-sin", job="node-exporter-mark-00-sin", device="/dev/sda"}[1d]))”
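
All four queries share one shape: (short-window average - one-day average) / one-day standard deviation. As a sketch of that pattern, the hypothetical Python helper below builds the expression for any metric and label set. zscore_query and its parameters are my own names, not part of Prometheus or Grafana. The explicit parentheses around the subtraction matter because PromQL binds / more tightly than -, so without them only the one-day average would be divided.

```python
def zscore_query(metric, labels, short_window="$__rate_interval", long_window="1d"):
    """Build a z-score PromQL expression for one metric (hypothetical helper)."""
    selector = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    series = f"{metric}{{{selector}}}"
    return (
        f"(avg_over_time({series}[{short_window}])"
        f" - avg_over_time({series}[{long_window}]))"
        f" / stddev_over_time({series}[{long_window}])"
    )


# Label values taken from the quoted queries.
node = {"instance": "mark-00-sin", "job": "node-exporter-mark-00-sin"}

queries = {
    "cpu": zscore_query("node_cpu_seconds_total", {**node, "mode": "idle"}),
    "memory": zscore_query("node_memory_MemAvailable_bytes", node),
    "load": zscore_query("node_load15", node),
    "disk": zscore_query("node_filesystem_avail_bytes", {**node, "device": "/dev/sda"}),
}

for name, query in queries.items():
    print(f"{name}: {query}")
```

Generating the strings this way keeps the windows and label sets consistent across metrics when more nodes are added.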

p. “That's a lot of PromQL theory, but does it work any better than threshold alerts and bring some sanity to alerting?”

p. “As always, the effectiveness of anomaly detection depends on the quality and consistency of the data being collected, in my case using Prometheus, and you may need to adjust thresholds or use more advanced techniques based on your specific use case and system characteristics.”

Simple threshold-based vs. z-score-based anomaly detection: too much still has to be pinned down (metric quality, sample count, ...) before either approach can be called better.
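
To make the dependence on data quality and sample count concrete, here is a small Python sketch of a detector that declines to give a verdict when the baseline is too short or completely flat (a flat series, such as free disk space that never changes, has zero standard deviation and makes the z-score undefined). The defaults k=3.0, min_samples=30 and min_sigma=1e-9 are arbitrary assumptions chosen for illustration, not recommendations from the article.

```python
from statistics import mean, pstdev


def detect(value, baseline, k=3.0, min_samples=30, min_sigma=1e-9):
    """Return True or False for an anomaly, or None when no verdict is possible.

    None means the baseline has too few samples or is effectively constant.
    """
    if len(baseline) < min_samples:
        return None
    sigma = pstdev(baseline)
    if sigma < min_sigma:
        return None
    return abs(value - mean(baseline)) / sigma > k


# Hypothetical baselines, made up for this example.
short_baseline = [1, 2, 3]
flat_baseline = [1.0] * 40
usable_baseline = [10 + (i % 5) for i in range(40)]

print(detect(5, short_baseline))    # too few samples
print(detect(5, flat_baseline))     # no variation to measure against
print(detect(13, usable_baseline))  # within the normal band
print(detect(30, usable_baseline))  # well outside it
```

Returning None instead of False keeps "not enough data" distinct from "looks normal", which matters when deciding whether to fall back to a plain threshold.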

dwywdo.xyz
