
# vLLM Performance Dashboard

Grafana dashboard for monitoring [vLLM](https://github.com/vllm-project/vllm) inference servers running as UbiOps deployments. It covers request throughput, queue depth, KV cache pressure, and token rates, and is fed by the `vllm:*` Prometheus metrics that vLLM exposes.

> **Note:** `dashboard.json` is currently empty (0 bytes) — the export did not save. These docs are reconstructed from `image.png`; re-export the dashboard to capture the panel/query definitions.
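
Since the export can silently come out empty, it may help to sanity-check the file after re-exporting. The sketch below is not part of this repository: `check_dashboard_export` is a hypothetical helper, and it assumes the export follows the usual Grafana dashboard JSON layout with a top-level `panels` list.

```python
import json
from pathlib import Path


def check_dashboard_export(path):
    """Return (ok, message) describing whether a Grafana export looks usable.

    Hypothetical helper; assumes a top-level "panels" key in the export.
    """
    file = Path(path)
    if not file.is_file():
        return False, "file not found"
    if file.stat().st_size == 0:
        return False, "file is empty (0 bytes)"
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict) or "panels" not in data:
        return False, "JSON parsed but has no top-level 'panels' key"
    return True, f"{len(data['panels'])} top-level panels found"


if __name__ == "__main__":
    ok, message = check_dashboard_export("dashboard.json")
    print("OK" if ok else "PROBLEM", "-", message)
```

Running it next to `dashboard.json` prints a one-line verdict; in its current state the file would be reported as empty.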

## Variables

- **Data Source** — Prometheus instance.
- **Namespace** — Kubernetes namespace (e.g. `default`).
- **Deployment** — the vLLM deployment / served model (e.g. `gpt-oss-120b`).
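
In a typical Grafana setup, variables like these end up as label matchers in each panel's PromQL query. Because the panel definitions were not captured, the sketch below is only an illustration: the label names `namespace` and `model_name` are assumptions, and `label_selector` is a hypothetical helper.

```python
def label_selector(namespace, deployment):
    """Build a PromQL label selector from the dashboard variable values.

    Assumption: the series carry `namespace` and `model_name` labels.
    The real dashboard may filter on different labels.
    """
    def quote(value):
        # Escape backslashes and double quotes for a PromQL string literal.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    return f'{{namespace="{quote(namespace)}",model_name="{quote(deployment)}"}}'


if __name__ == "__main__":
    print(label_selector("default", "gpt-oss-120b"))
```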

## Rows & panels

**Request Stats**
- *Requests Running* — requests currently executing on the GPU (prefill or decode).
- *Requests Waiting* — requests queued for a slot.
- *KV Cache Usage* — % of the GPU KV cache block pool in use (saturation → queuing).
- *Request Rate* — incoming requests over time.
- *Tokens Generated/sec* — output token throughput.
- *Request States Over Time* — running vs. waiting (and swapped) requests as a timeseries.
- *KV Cache Usage Over Time* — KV cache utilization trend.
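
The actual panel queries are unknown until the dashboard is re-exported. As a starting point for rebuilding them, the sketch below maps the stat panels to PromQL expressions that could plausibly back them. The metric names are ones vLLM publishes, but the aggregations, the 5-minute window, the `$namespace` / `$deployment` variable names, and the label names are all assumptions. Note also that `vllm:request_success_total` counts completed requests, which only approximates an incoming request rate.

```python
def candidate_queries(
    selector='{namespace="$namespace", model_name="$deployment"}',
    window="5m",
):
    """Map panel titles to PromQL expressions that might back them.

    These are guesses for reconstruction, not the dashboard's real queries.
    """
    def gauge(metric):
        # Sum a gauge across all matching replicas.
        return f"sum({metric}{selector})"

    def per_second(counter):
        # Per-second rate of a counter, summed across replicas.
        return f"sum(rate({counter}{selector}[{window}]))"

    return {
        "Requests Running": gauge("vllm:num_requests_running"),
        "Requests Waiting": gauge("vllm:num_requests_waiting"),
        # Reported as a 0-1 fraction; some newer vLLM releases expose it
        # as vllm:kv_cache_usage_perc instead.
        "KV Cache Usage": f"100 * avg(vllm:gpu_cache_usage_perc{selector})",
        "Request Rate": per_second("vllm:request_success_total"),
        "Tokens Generated/sec": per_second("vllm:generation_tokens_total"),
    }


if __name__ == "__main__":
    for panel, query in candidate_queries().items():
        print(f"{panel}: {query}")
```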

**Per-Minute Metrics (RPM / ITPM / OTPM)**
- *Requests Per Minute (RPM)*.
- *Input Tokens Per Minute (ITPM)* — prompt token volume.
- *Output Tokens Per Minute (OTPM)* — generated token volume.
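
Per-minute figures such as these are normally derived from monotonically increasing counters: take the increase over a window and scale it to 60 seconds. The sketch below shows that arithmetic on two raw counter samples. The function name `per_minute` and the simplified counter-reset handling are illustrative assumptions, not the dashboard's actual query.

```python
def per_minute(count_start, count_end, window_seconds):
    """Convert two counter samples into a per-minute rate.

    Hypothetical helper showing the arithmetic behind RPM / ITPM / OTPM.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = count_end - count_start
    if increase < 0:
        # Counter reset (e.g. server restart): count only what accumulated
        # since the reset. This is a simplification.
        increase = count_end
    return increase * 60.0 / window_seconds


if __name__ == "__main__":
    # 600 new output tokens over a 5-minute window.
    print(per_minute(1000, 1600, 300))
```

In PromQL the same idea is usually written as `rate(counter[5m]) * 60` or `increase(counter[1m])`.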

## Key things to watch

- **KV Cache Usage** near 100% with rising **Requests Waiting** — the server is capacity-bound; scale up or shorten contexts.
- **Tokens Generated/sec** or **OTPM** dropping while **RPM** holds steady — decode throughput has degraded, or responses have become shorter.
- Sustained **Requests Waiting** — a queue backlog is building, which adds latency to every queued request.
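
The rules of thumb above can be sketched as a small triage function over two consecutive readings. Everything here is hypothetical: the `Snapshot` fields, the finding labels, and the thresholds (90% KV cache, a 20% OTPM drop, a 10% RPM tolerance) are illustrative assumptions to tune per deployment, not values taken from the dashboard.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Current and previous readings of the key panels (hypothetical shape)."""
    kv_cache_pct: float
    waiting: float
    waiting_prev: float
    rpm: float
    rpm_prev: float
    otpm: float
    otpm_prev: float


def diagnose(s, kv_threshold=90.0, drop_ratio=0.8, rpm_tolerance=0.1):
    """Return a list of findings; all thresholds are assumed defaults."""
    findings = []
    # KV cache near full while the queue grows: capacity-bound.
    if s.kv_cache_pct >= kv_threshold and s.waiting > s.waiting_prev:
        findings.append("capacity-bound")
    # Output tokens fall while request volume stays roughly flat.
    rpm_steady = s.rpm_prev > 0 and abs(s.rpm - s.rpm_prev) <= rpm_tolerance * s.rpm_prev
    if rpm_steady and s.otpm_prev > 0 and s.otpm < drop_ratio * s.otpm_prev:
        findings.append("degraded-decode")
    # Requests were queued in both readings: sustained backlog.
    if s.waiting > 0 and s.waiting_prev > 0:
        findings.append("queue-backlog")
    return findings


if __name__ == "__main__":
    print(diagnose(Snapshot(97.0, 12, 5, 100, 100, 3000, 6000)))
```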

## Usage

Import `dashboard.json` into Grafana (once it has been re-exported), then select the data source, namespace, and deployment. The screenshot shows a time range of the last 2 days with auto-refresh enabled.