kvanbezouw 83e2efdb43 Update ReadMes		2026-06-02 11:49:50 +02:00
dashboard.json	Import part 1	2026-06-02 11:46:19 +02:00
image.png	Import part 2	2026-06-02 11:46:20 +02:00
README.md	Update ReadMes	2026-06-02 11:49:50 +02:00

vLLM Performance Dashboard

Grafana dashboard for monitoring vLLM inference servers running as UbiOps deployments: request throughput, queue depth, KV cache pressure, and token rates. It is fed by the vllm:* Prometheus metrics that vLLM exposes.

Variables

Data Source — Prometheus instance.
Namespace — Kubernetes namespace (e.g. default).
Deployment — the vLLM deployment / served model (e.g. gpt-oss-120b).
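
As a rough illustration of how the Namespace and Deployment variables might scope a query, the sketch below builds a PromQL selector from the two values. The label names `namespace` and `model_name` are assumptions (they depend on the scrape configuration and on the queries in dashboard.json), and `build_selector` is a hypothetical helper, not part of the dashboard.

```python
def build_selector(metric: str, namespace: str, deployment: str) -> str:
    """Return a PromQL selector scoped to one namespace and deployment.

    Assumption: the scraped series carry `namespace` and `model_name` labels.
    """
    labels = {"namespace": namespace, "model_name": deployment}
    parts = []
    for key, value in labels.items():
        # Escape backslashes and double quotes so the value stays a valid PromQL string.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(key + '="' + escaped + '"')
    return metric + "{" + ", ".join(parts) + "}"


selector = build_selector("vllm:num_requests_running", "default", "gpt-oss-120b")
print(selector)
```

With the example values from the list above, this prints `vllm:num_requests_running{namespace="default", model_name="gpt-oss-120b"}`.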

Rows & panels

Request Stats

Requests Running — requests currently being decoded.
Requests Waiting — requests queued for a slot.
KV Cache Usage — % of the GPU KV cache block pool in use (saturation → queuing).
Request Rate — incoming requests over time.
Tokens Generated/sec — output token throughput.
Request States Over Time — running vs. waiting (and swapped) requests as a timeseries.
KV Cache Usage Over Time — KV cache utilization trend.
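
The stat panels above could plausibly be backed by queries like the ones sketched here. This is not taken from dashboard.json: the metric names are assumptions based on commonly exposed vLLM metrics and differ between vLLM versions (the KV cache gauge in particular has been renamed), the `$namespace` and `$deployment` variable names are guesses, and completed requests are used as a stand-in for the incoming request rate. Check the server's /metrics output before relying on any of them.

```python
# Assumed label names and Grafana variable names; adjust to the actual dashboard.
SELECTOR = '{namespace="$namespace", model_name="$deployment"}'

# $__rate_interval is Grafana's built-in range variable for Prometheus rate() queries.
RANGE = "[$__rate_interval]"

PANEL_QUERIES = {
    # Gauges: current number of requests in each state.
    "Requests Running": f"sum(vllm:num_requests_running{SELECTOR})",
    "Requests Waiting": f"sum(vllm:num_requests_waiting{SELECTOR})",
    # Assumed to be a 0..1 fraction, scaled to a percentage.
    "KV Cache Usage": f"avg(vllm:gpu_cache_usage_perc{SELECTOR}) * 100",
    # Counters: per-second rates over the panel's rate interval.
    "Request Rate": f"sum(rate(vllm:request_success_total{SELECTOR}{RANGE}))",
    "Tokens Generated/sec": f"sum(rate(vllm:generation_tokens_total{SELECTOR}{RANGE}))",
}

for panel, query in PANEL_QUERIES.items():
    print(f"{panel}: {query}")
```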

Per-Minute Metrics (RPM / ITPM / OTPM)

Requests Per Minute (RPM).
Input Tokens Per Minute (ITPM) — prompt token volume.
Output Tokens Per Minute (OTPM) — generated token volume.
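
The per-minute figures are rates derived from monotonically increasing counters. The sketch below shows the arithmetic on raw counter samples, treating any drop in value as a counter reset (for example after a server restart). It is a simplification for illustration: Prometheus's own `rate()` and `increase()` also extrapolate to the window edges, and `counter_increase` and `per_minute` are hypothetical helper names.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from 0, so the new value is the increase.
        total += curr - prev if curr >= prev else curr
    return total


def per_minute(samples: Sequence[Sample]) -> float:
    """Average per-minute rate over the time span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) * 60.0 / elapsed


# Output-token counter sampled every 30 seconds.
otpm = per_minute([(0.0, 1000.0), (30.0, 1600.0), (60.0, 2200.0)])
print(otpm)  # 1200.0
```

The same calculation applies to RPM and ITPM; only the underlying counter changes.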

Usage

Import dashboard.json into Grafana, then select the data source, namespace, and deployment from the dashboard variables. The default time range, as shown in the screenshot (image.png), is the last 2 days with auto-refresh enabled.
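
For scripted setups, the import could in principle go through Grafana's HTTP API instead of the UI. The sketch below only builds the request body for `POST /api/dashboards/db` and does not send anything. `build_import_payload` is a hypothetical helper, and it assumes dashboard.json is a plain dashboard model; an export that declares `__inputs` may need the UI import flow instead.

```python
import json


def build_import_payload(dashboard_json: str, overwrite: bool = False) -> dict:
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db."""
    dashboard = json.loads(dashboard_json)
    # A null id asks Grafana to create a new dashboard instead of updating by numeric id.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": overwrite}


# Stand-in for the contents of dashboard.json.
example = '{"id": 7, "title": "vLLM Performance Dashboard", "panels": []}'
payload = build_import_payload(example)
print(json.dumps(payload, indent=2))
```

The resulting body would be sent with an authenticated request to the Grafana instance. The variables still need to be selected afterwards.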