Import part 1

kvanbezouw 2026-06-02 11:46:19 +02:00
commit 00e2c83beb
6 changed files with 5522 additions and 0 deletions

3
.gitignore vendored Normal file

@ -0,0 +1,3 @@
# Large benchmark output logs — reproducible, not versioned
llm-throughput-tests-mindef-metadateren/results/**/benchmark_io.log
*.log


@ -0,0 +1,36 @@
# UbiOps Deployments Dashboard
Grafana dashboard (`dashboard.json`) for monitoring UbiOps deployment pods on Kubernetes — health, resource usage, restarts, and limits. Data comes from Prometheus (`kube-state-metrics` + cAdvisor `container_*` metrics).
## Variables
| Variable | Source | Purpose |
|----------|--------|---------|
| `datasource` | Prometheus datasource picker | Select the Prometheus instance |
| `namespace` | `label_values(kube_pod_info, namespace)` | Namespace to scope to |
| `deployment` | `label_values(kube_deployment_metadata_generation{namespace="$namespace"}, deployment)` | Deployment to inspect (defaults to all, `.*`) |
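Pods are matched by `pod=~"$deployment.*"`, so a deployment selection covers all of its pods. Because this is a prefix match, it also picks up pods of any other deployment whose name starts with the same string.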
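The matcher can be checked offline with a small sketch. Prometheus anchors regex matchers at both ends, which `re.fullmatch` mirrors here. The pod names and the `pods_for_deployment` helper are hypothetical, and the sketch assumes the selected value is interpolated into the pattern as-is.

```python
import re

def pods_for_deployment(deployment: str, pods: list) -> list:
    """Return the pods that a matcher like pod=~"<deployment>.*" would select."""
    # Prometheus regex matchers are fully anchored, hence fullmatch.
    pattern = re.compile(f"{deployment}.*")
    return [p for p in pods if pattern.fullmatch(p)]

# Hypothetical pod names, for illustration only.
pods = [
    "api-7d9f8b6c5-x2x1z",
    "api-gateway-5c4d7f9b8-abcde",
    "worker-6b7c8d9e0-qwert",
]

# Selecting "api" also matches the "api-gateway" pod (shared prefix).
matched = pods_for_deployment("api", pods)

# The default ".*" selection matches every pod.
all_pods = pods_for_deployment(".*", pods)
```

If two deployments share a prefix, picking the longer name (here `api-gateway`) narrows the selection to its own pods.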
## Rows & panels
**Overview** — at-a-glance stat tiles: Running / Pending / Failed pods, Restarts (1h), OOMKilled (1h), Waiting containers.
**Resource Usage** — CPU and memory working-set usage per pod over time.
**Deployment Status** — desired vs. available replicas, and container restart rate.
**Resource Limits** — usage vs. limits for CPU and memory (aggregate and per-pod), plus per-pod limits and **% of limit** (green/yellow/red at 70%/90%) to spot pods approaching OOM.
**Pod Details** — table of every pod with restart count and memory % of limit, sorted by restarts.
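As a rough illustration of the **% of limit** coloring, the sketch below computes usage as a percentage of the limit and maps it to a color. It assumes the thresholds mean yellow from 70% and red from 90%, with each boundary inclusive; the function names are hypothetical and are not taken from `dashboard.json`.

```python
from typing import Optional

def percent_of_limit(usage: float, limit: float) -> Optional[float]:
    """Usage as a percentage of the limit, or None when no limit is set."""
    if limit <= 0:
        return None
    return 100.0 * usage / limit

def threshold_color(percent: float) -> str:
    """Map a percentage to a color (assumed: yellow >= 70, red >= 90)."""
    if percent >= 90.0:
        return "red"
    if percent >= 70.0:
        return "yellow"
    return "green"

# Example: 768 MiB working set against a 1024 MiB limit is 75%.
pct = percent_of_limit(768.0, 1024.0)
color = threshold_color(pct)
```

A pod without a configured limit yields `None` here, so it has no percentage to color.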
## Usage
Default time range is the last 1h with 30s auto-refresh. Import into Grafana (schema `dashboard.grafana.app/v2`, built on Grafana v13), then pick a datasource, namespace, and deployment.
## Key things to watch
- **OOMKilled (1h)** and **Memory % of Limit** — memory pressure / under-provisioned limits.
- **Restarts** and **Container Restart Rate** — crash loops.
- **Pending / Failed pods** — scheduling or startup problems.
- **Replicas** (desired vs. available) — incomplete rollouts.

File diff suppressed because it is too large

Binary file not shown. Size: 155 KiB


@ -0,0 +1,37 @@
# vLLM Performance Dashboard
Grafana dashboard for monitoring [vLLM](https://github.com/vllm-project/vllm) inference servers running as UbiOps deployments — request throughput, queue depth, KV cache pressure, and token rates. Fed by the `vllm:*` Prometheus metrics that vLLM exposes.
> **Note:** `dashboard.json` is currently empty (0 bytes) — the export did not save. These docs are reconstructed from `image.png`; re-export the dashboard to capture the panel/query definitions.
## Variables
- **Data Source** — Prometheus instance.
- **Namespace** — Kubernetes namespace (e.g. `default`).
- **Deployment** — the vLLM deployment / served model (e.g. `gpt-oss-120b`).
## Rows & panels
**Request Stats**
- *Requests Running* — requests currently being executed by the engine (prefill or decode).
- *Requests Waiting* — requests queued for a slot.
- *KV Cache Usage* — % of the GPU KV cache block pool in use (saturation → queuing).
- *Request Rate* — incoming requests over time.
- *Tokens Generated/sec* — output token throughput.
- *Request States Over Time* — running vs. waiting (and swapped) requests as a timeseries.
- *KV Cache Usage Over Time* — KV cache utilization trend.
**Per-Minute Metrics (RPM / ITPM / OTPM)**
- *Requests Per Minute (RPM)*.
- *Input Tokens Per Minute (ITPM)* — prompt token volume.
- *Output Tokens Per Minute (OTPM)* — generated token volume.
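Since the panel queries are not available, the following is only a sketch of how a per-minute figure such as RPM, ITPM, or OTPM is typically derived from two samples of a monotonically increasing counter. The `per_minute_rate` helper and the sample values are hypothetical; which `vllm:*` counters the panels actually read is not known from the screenshot.

```python
def per_minute_rate(first_value: float, first_ts: float,
                    last_value: float, last_ts: float) -> float:
    """Average per-minute increase of a counter between two samples.

    Timestamps are in seconds. A decrease is treated as a counter reset
    (for example, a pod restart), in which case the counter is assumed
    to have restarted from zero.
    """
    elapsed = last_ts - first_ts
    if elapsed <= 0:
        raise ValueError("last_ts must be later than first_ts")
    delta = last_value - first_value
    if delta < 0:
        delta = last_value  # counter reset: count only the growth since zero
    return delta / elapsed * 60.0

# Hypothetical samples: an output-token counter goes from 1200 to 1500 in 120 s.
otpm = per_minute_rate(1200.0, 0.0, 1500.0, 120.0)

# Hypothetical reset: the counter drops from 1000 to 50 within 60 s.
after_reset = per_minute_rate(1000.0, 0.0, 50.0, 60.0)
```

The same arithmetic applies to a request counter (RPM) or a prompt-token counter (ITPM).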
## Key things to watch
- **KV Cache Usage** near 100% with rising **Requests Waiting** — the server is capacity-bound; scale up or shorten contexts.
- **Tokens Generated/sec** / **OTPM** dropping while RPM holds — degraded decode throughput.
- Sustained **Requests Waiting** — queue backlog and latency.
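The first signal above can be expressed as a simple check over a window of samples. This is a sketch under stated assumptions: the `Sample` type, the `is_capacity_bound` helper, and the 0.95 cutoff for "near 100%" are all hypothetical and should be tuned per deployment.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Sample:
    kv_cache_usage: float   # fraction of the KV cache in use, 0.0 to 1.0
    requests_waiting: int   # queued requests at this point in time

def is_capacity_bound(samples: Sequence[Sample],
                      kv_threshold: float = 0.95) -> bool:
    """True when the KV cache stays saturated while the queue grows."""
    if len(samples) < 2:
        return False  # a trend needs at least two points
    cache_saturated = all(s.kv_cache_usage >= kv_threshold for s in samples)
    queue_rising = samples[-1].requests_waiting > samples[0].requests_waiting
    return cache_saturated and queue_rising

# Hypothetical window: cache pinned near full while the queue grows.
window = [Sample(0.97, 2), Sample(0.99, 5), Sample(0.98, 9)]
bound = is_capacity_bound(window)

# Hypothetical window: queue grows but the cache still has headroom.
headroom = is_capacity_bound([Sample(0.60, 2), Sample(0.65, 9)])
```

A growing queue with cache headroom points to a cause other than KV cache saturation.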
## Usage
Default range in the screenshot is the last 2 days with auto-refresh. Import into Grafana, then select datasource, namespace, and deployment.

File diff suppressed because it is too large