# UbiOps Deployments Dashboard

Grafana dashboard (`dashboard.json`) for monitoring UbiOps deployment pods on Kubernetes — health, resource usage, restarts, and limits. Data comes from Prometheus (`kube-state-metrics` + cAdvisor `container_*` metrics).

## Variables

| Variable | Source | Purpose |
|----------|--------|---------|
| `datasource` | Prometheus datasource picker | Select the Prometheus instance |
| `namespace` | `label_values(kube_pod_info, namespace)` | Namespace to scope to |
| `deployment` | `label_values(kube_deployment_metadata_generation{namespace=$namespace}, deployment)` | Deployment to inspect (defaults to all, `.*`) |

Pods are matched by `pod=~"$deployment.*"`, so a deployment selection covers all of its pods.
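
A minimal sketch of how the `pod=~"$deployment.*"` matcher behaves, written in Python for illustration. It assumes Prometheus's fully anchored regex semantics, mirrored here with `re.fullmatch`; the helper names and pod names are hypothetical and are not taken from `dashboard.json`.

```python
import re


def pod_matcher(deployment: str) -> str:
    """Build the label matcher described above: pod=~"<deployment>.*"."""
    return f'pod=~"{deployment}.*"'


def matches(deployment: str, pod: str) -> bool:
    """Approximate the matcher locally.

    Prometheus regex matchers are fully anchored, so re.fullmatch is the
    closest Python equivalent for simple patterns like this one.
    """
    return re.fullmatch(f"{deployment}.*", pod) is not None


# Hypothetical pod names, for illustration only.
pods = [
    "api-7c9d8f6b5-x2k4q",
    "api-gateway-5d4f7b9c8-q8w2z",
    "worker-6b7c8d9f4-lm3np",
]

print(pod_matcher("api"))
print([p for p in pods if matches("api", p)])
print([p for p in pods if matches(".*", p)])  # the "all" default
```

Because this is a prefix match, selecting a deployment named `api` would also pick up pods of a deployment named `api-gateway` in the same namespace. That is worth keeping in mind when deployment names share a prefix.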

## Rows & panels

**Overview** — at-a-glance stat tiles: Running / Pending / Failed pods, Restarts (1h), OOMKilled (1h), Waiting containers.

**Resource Usage** — CPU and memory working-set usage per pod over time.

**Deployment Status** — desired vs. available replicas, and container restart rate.

**Resource Limits** — usage vs. limits for CPU and memory (aggregate and per-pod), plus per-pod limits and **% of limit** (green/yellow/red at 70%/90%) to spot pods approaching OOM.

**Pod Details** — table of every pod with restart count and memory % of limit, sorted by restarts.
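
The **% of limit** coloring in the Resource Limits row can be read as a simple calculation. The sketch below, in Python, assumes the percentage is working-set memory divided by the memory limit, and that a value exactly at a threshold takes the higher color. The function names and sample numbers are illustrative and are not taken from `dashboard.json`.

```python
def percent_of_limit(usage_bytes: float, limit_bytes: float) -> float:
    """Return usage as a percentage of the configured limit."""
    if limit_bytes <= 0:
        # Assumption: pods without a limit are rejected here; the README
        # does not say how the dashboard displays them.
        raise ValueError("limit must be positive")
    return 100.0 * usage_bytes / limit_bytes


def threshold_color(percent: float, warn: float = 70.0, crit: float = 90.0) -> str:
    """Map a percentage to green/yellow/red using the 70%/90% thresholds."""
    if percent >= crit:
        return "red"
    if percent >= warn:
        return "yellow"
    return "green"


MIB = 1024 * 1024

# Hypothetical pod: 384 MiB working set against a 512 MiB limit.
pct = percent_of_limit(384 * MIB, 512 * MIB)
print(pct, threshold_color(pct))  # 75.0 yellow
```

A pod that stays in the red band is running close to its memory limit, which is where OOM kills become likely.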

## Usage

Default time range is the last 1h with 30s auto-refresh. Import into Grafana (schema `dashboard.grafana.app/v2`, built on Grafana v13), then pick a datasource, namespace, and deployment.
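
To spot-check what the Overview tiles report without opening Grafana, the same kind of question can be put to Prometheus directly. The Python sketch below builds two PromQL expressions and an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The expressions use standard `kube-state-metrics` metric names but are assumptions about what the panels compute, not copies of the queries in `dashboard.json`; the base URL and the namespace and deployment values are hypothetical.

```python
from urllib.parse import urlencode


def selector(namespace: str, deployment: str) -> str:
    """Label selector scoped the same way as the dashboard variables."""
    return f'namespace="{namespace}", pod=~"{deployment}.*"'


def restarts_last_hour(namespace: str, deployment: str) -> str:
    """Assumed equivalent of the 'Restarts (1h)' tile."""
    return (
        "sum(increase(kube_pod_container_status_restarts_total"
        f"{{{selector(namespace, deployment)}}}[1h]))"
    )


def pending_pods(namespace: str, deployment: str) -> str:
    """Assumed equivalent of the 'Pending' tile."""
    return (
        "sum(kube_pod_status_phase"
        f'{{{selector(namespace, deployment)}, phase="Pending"}})'
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a GET URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Prometheus address and variable values.
BASE_URL = "http://prometheus.example:9090"
query = restarts_last_hour("prod", "api")
print(query)
print(pending_pods("prod", "api"))
print(instant_query_url(BASE_URL, query))
```

The printed URL can be fetched with any HTTP client. The sketch stops at building it so that it runs without a live Prometheus instance.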

## Key things to watch

- **OOMKilled (1h)** and **Memory % of Limit** — memory pressure / under-provisioned limits.
- **Restarts** and **Container Restart Rate** — crash loops.
- **Pending / Failed pods** — scheduling or startup problems.
- **Replicas** (desired vs. available) — incomplete rollouts.