# UbiOps Deployments Dashboard

Grafana dashboard (`dashboard.json`) for monitoring UbiOps deployment pods on Kubernetes — health, resource usage, restarts, and limits. Data comes from Prometheus (`kube-state-metrics` + cAdvisor `container_*` metrics).

## Variables

| Variable | Source | Purpose |
|----------|--------|---------|
| `datasource` | Prometheus datasource picker | Select the Prometheus instance |
| `namespace` | `label_values(kube_pod_info, namespace)` | Namespace to scope to |
| `deployment` | `label_values(kube_deployment_metadata_generation{namespace="$namespace"}, deployment)` | Deployment to inspect (defaults to All, `.*`) |

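Pods are matched by `pod=~"$deployment.*"`, a prefix match on the pod name, so selecting a deployment covers all of its pods. The same prefix also matches pods of any other deployment whose name starts with the selected one (for example, selecting `api` also matches `api-gateway` pods).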
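
Prometheus anchors regex matchers at both ends, which is why the trailing `.*` is needed to reach the generated pod names. As a rough sketch (not copied from `dashboard.json`), the first query shows the selector applied to a kube-state-metrics series. The second is a hypothetical stricter variant that assumes the usual `<deployment>-<replicaset-hash>-<suffix>` pod naming and narrows the match to that shape.

```promql
# Selector as described above: every pod whose name starts with the deployment name.
kube_pod_info{namespace="$namespace", pod=~"$deployment.*"}

# Hypothetical stricter variant (assumption, not in the dashboard):
# require a ReplicaSet hash and a pod suffix after the deployment name.
kube_pod_info{namespace="$namespace", pod=~"$deployment-[a-z0-9]+-[a-z0-9]+"}
```

The stricter form still works with the All value (`.*`), since `.*-[a-z0-9]+-[a-z0-9]+` matches any pod name that ends in two hyphen-separated segments.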

## Rows & panels

**Overview** — at-a-glance stat tiles: Running / Pending / Failed pods, Restarts (1h), OOMKilled (1h), Waiting containers.
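
The exact expressions live in `dashboard.json`. The queries below are only a plausible sketch of how such tiles can be built from standard kube-state-metrics series; the metric choices and the `max_over_time` approach for OOM kills are assumptions, not the dashboard's confirmed definitions.

```promql
# Running pods (swap phase for "Pending" or "Failed" for the other tiles).
sum(kube_pod_status_phase{namespace="$namespace", pod=~"$deployment.*", phase="Running"})

# Container restarts over the last hour.
sum(increase(kube_pod_container_status_restarts_total{namespace="$namespace", pod=~"$deployment.*"}[1h]))

# Containers whose last termination reason was OOMKilled at some point in the last hour.
sum(max_over_time(kube_pod_container_status_last_terminated_reason{namespace="$namespace", pod=~"$deployment.*", reason="OOMKilled"}[1h]))

# Containers currently in the Waiting state.
sum(kube_pod_container_status_waiting{namespace="$namespace", pod=~"$deployment.*"})
```

Note that the OOM query counts affected containers rather than individual kill events, so a container that was OOM-killed several times within the hour counts once.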

**Resource Usage** — CPU and memory working-set usage per pod over time.
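
A minimal sketch of per-pod usage queries on the cAdvisor metrics, assuming the common convention of filtering `container!=""` to drop the pod-level cgroup series and avoid double counting. The rate window uses Grafana's `$__rate_interval`; the dashboard itself may use a fixed window.

```promql
# CPU cores used per pod (rate of cumulative CPU seconds).
sum by (pod) (
  rate(container_cpu_usage_seconds_total{namespace="$namespace", pod=~"$deployment.*", container!=""}[$__rate_interval])
)

# Memory working set per pod, in bytes.
sum by (pod) (
  container_memory_working_set_bytes{namespace="$namespace", pod=~"$deployment.*", container!=""}
)
```

Working-set memory is the figure the kubelet considers for eviction and the one that tracks most closely against the memory limit, which makes it a more useful signal here than RSS or total usage.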

**Deployment Status** — desired vs. available replicas, and container restart rate.
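
As an illustration only, the replica and restart panels could be driven by queries like the following. The metric names are standard kube-state-metrics series, but whether the dashboard uses exactly these expressions is an assumption.

```promql
# Desired replicas, from the deployment spec.
kube_deployment_spec_replicas{namespace="$namespace", deployment=~"$deployment"}

# Replicas currently available.
kube_deployment_status_replicas_available{namespace="$namespace", deployment=~"$deployment"}

# Container restart rate per pod (restarts per second).
sum by (pod) (
  rate(kube_pod_container_status_restarts_total{namespace="$namespace", pod=~"$deployment.*"}[$__rate_interval])
)
```

The deployment-level series carry a `deployment` label, so they match on the variable directly rather than through the pod-name prefix.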

**Resource Limits** — usage vs. limits for CPU and memory (aggregate and per-pod), plus per-pod limits and **% of limit** (green/yellow/red at 70%/90%) to spot pods approaching OOM.
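
The **% of limit** value can be sketched as working-set memory divided by the configured limit, per pod. This assumes kube-state-metrics v2, which exposes limits as `kube_pod_container_resource_limits` with a `resource` label; older versions use different metric names, and the dashboard's own expression may differ.

```promql
# Memory usage as a percentage of the memory limit, per pod.
100 *
  sum by (pod) (
    container_memory_working_set_bytes{namespace="$namespace", pod=~"$deployment.*", container!=""}
  )
/
  sum by (pod) (
    kube_pod_container_resource_limits{namespace="$namespace", pod=~"$deployment.*", resource="memory"}
  )
```

Pods without a memory limit have no series on the right-hand side and therefore drop out of the result instead of showing 0%.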

**Pod Details** — table of every pod with restart count and memory % of limit, sorted by restarts.

## Usage

Import `dashboard.json` into Grafana (schema `dashboard.grafana.app/v2`, built on Grafana v13), then pick a datasource, namespace, and deployment. The default time range is the last hour, with a 30s auto-refresh.

## Key things to watch

- **OOMKilled (1h)** and **Memory % of Limit** — memory pressure / under-provisioned limits.
- **Restarts** and **Container Restart Rate** — crash loops.
- **Pending / Failed pods** — scheduling or startup problems.
- **Replicas** (desired vs. available) — incomplete rollouts.