mindef-overdracht/grafana/ubiops-sre/README.md
2026-06-02 11:46:19 +02:00


# UbiOps Deployments Dashboard
Grafana dashboard (`dashboard.json`) for monitoring UbiOps deployment pods on Kubernetes — health, resource usage, restarts, and limits. Data comes from Prometheus (`kube-state-metrics` + cAdvisor `container_*` metrics).
## Variables
| Variable | Source | Purpose |
|----------|--------|---------|
| `datasource` | Prometheus datasource picker | Select the Prometheus instance |
| `namespace` | `label_values(kube_pod_info, namespace)` | Namespace to scope to |
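| `deployment` | `label_values(kube_deployment_metadata_generation{namespace="$namespace"}, deployment)` | Deployment to inspect (defaults to all, `.*`) |

Pods are matched with `pod=~"$deployment.*"`, a prefix match on the deployment name, so selecting a deployment covers all of its pods. The same pattern also matches pods of any other deployment whose name starts with the selected name (for example, `api` also matches `api-gateway` pods).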
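
Because `pod=~"$deployment.*"` is a prefix match, it can help to check which pods a given selection would pick up. The following is a minimal sketch, assuming PromQL's fully anchored regex matching (emulated here with `re.fullmatch`) and assuming the variable value is interpolated as a raw regex; the function name and pod names are hypothetical.

```python
import re


def pods_for_deployment(deployment: str, pods: list[str]) -> list[str]:
    """Return the pods selected by pod=~"<deployment>.*".

    PromQL anchors regex matchers at both ends, which re.fullmatch emulates.
    The deployment value is used as a raw regex (an assumption about how
    Grafana interpolates the variable).
    """
    pattern = re.compile(f"{deployment}.*")
    return [pod for pod in pods if pattern.fullmatch(pod)]


# Hypothetical pod names, for illustration only.
PODS = [
    "api-7d9f8c6b5-x2k4q",
    "api-gateway-5c7d9-abcde",
    "worker-6f4b7-zzzzz",
]

print(pods_for_deployment("api", PODS))     # api pods and api-gateway pods
print(pods_for_deployment("worker", PODS))  # only the worker pod
print(pods_for_deployment(".*", PODS))      # the "all" default matches every pod
```

If two deployments share a name prefix, selecting the shorter name shows pods from both.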
## Rows & panels
**Overview**: at-a-glance stat tiles: Running / Pending / Failed pods, Restarts (1h), OOMKilled (1h), Waiting containers.

**Resource Usage**: CPU and memory working-set usage per pod over time.

**Deployment Status**: desired vs. available replicas, and container restart rate.

**Resource Limits**: usage vs. limits for CPU and memory (aggregate and per-pod), plus per-pod limits and **% of limit** (green/yellow/red at 70%/90%) to spot pods approaching OOM.

**Pod Details**: table of every pod with restart count and memory % of limit, sorted by restarts.
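
The **% of limit** coloring can be read as a simple banding rule. This is a minimal sketch of that rule, assuming each band starts at its threshold (70% turns yellow, 90% turns red); the function names and sample values are hypothetical and not taken from `dashboard.json`.

```python
from typing import Optional


def percent_of_limit(usage: float, limit: float) -> Optional[float]:
    """Usage as a percentage of the limit; None when no limit is set."""
    if limit <= 0:
        return None
    return 100.0 * usage / limit


def threshold_color(percent: float) -> str:
    """Map a percentage to the 70% / 90% color bands.

    Assumes a band applies from its threshold upward (>=).
    """
    if percent >= 90:
        return "red"
    if percent >= 70:
        return "yellow"
    return "green"


MIB = 1024 * 1024
# Hypothetical pod: 460 MiB working set against a 512 MiB memory limit.
pct = percent_of_limit(usage=460 * MIB, limit=512 * MIB)
if pct is not None:
    print(f"{pct:.1f}% of limit: {threshold_color(pct)}")
```

A pod that stays in the red band for memory is a likely candidate for an OOM kill or a higher limit.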
## Usage
Default time range is the last 1h with 30s auto-refresh. Import into Grafana (schema `dashboard.grafana.app/v2`, built on Grafana v13), then pick a datasource, namespace, and deployment.
## Key things to watch
- **OOMKilled (1h)** and **Memory % of Limit** — memory pressure / under-provisioned limits.
- **Restarts** and **Container Restart Rate** — crash loops.
- **Pending / Failed pods** — scheduling or startup problems.
- **Replicas** (desired vs. available) — incomplete rollouts.
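
For spot checks outside Grafana, some of these signals can be approximated with ad-hoc PromQL. The sketch below only builds query strings. The expressions are assumptions based on standard `kube-state-metrics` and cAdvisor metric names, not the queries stored in `dashboard.json`, and the helper names, namespace, and deployment values are hypothetical.

```python
def selector(namespace: str, deployment: str) -> str:
    """Label matchers mirroring the dashboard's namespace and pod scoping."""
    return f'namespace="{namespace}", pod=~"{deployment}.*"'


def restarts_last_hour(namespace: str, deployment: str) -> str:
    """Container restarts per pod over the last hour (assumed expression)."""
    sel = selector(namespace, deployment)
    return (
        "sum by (pod) (increase("
        f"kube_pod_container_status_restarts_total{{{sel}}}[1h]))"
    )


def memory_percent_of_limit(namespace: str, deployment: str) -> str:
    """Memory working set as a percentage of the limit, per pod (assumed)."""
    sel = selector(namespace, deployment)
    usage = (
        "sum by (pod) ("
        f'container_memory_working_set_bytes{{{sel}, container!=""}})'
    )
    limit = (
        "sum by (pod) ("
        f'kube_pod_container_resource_limits{{{sel}, resource="memory"}})'
    )
    return f"100 * {usage} / {limit}"


# Hypothetical namespace and deployment names.
print(restarts_last_hour("prod", "api"))
print(memory_percent_of_limit("prod", "api"))
```

The printed expressions can be pasted into the Prometheus expression browser or Grafana Explore to cross-check what the panels show.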