vLLM Performance Dashboard
Grafana dashboard for monitoring vLLM inference servers running as UbiOps deployments. It tracks request throughput, queue depth, KV cache pressure, and token rates, all derived from the vllm:* Prometheus metrics that vLLM exposes.
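To see the raw series the dashboard is built on, you can read a vLLM server's Prometheus endpoint directly (vLLM's OpenAI-compatible server serves it at /metrics; how that endpoint is reachable inside a UbiOps deployment is an assumption not covered here). The sketch below is a minimal, non-authoritative parser for the Prometheus text format that keeps only vllm:* samples. The function name parse_vllm_metrics and the sample payload are hypothetical, and the parser deliberately ignores edge cases such as "}" inside label values.

```python
import re
from typing import Dict, List, Tuple

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} 12.5 [optional_timestamp]
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+\S+)?$"
)
# label="value" pairs; escaped quotes inside values are matched but not unescaped
LABEL_RE = re.compile(r'([a-zA-Z_]\w*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_vllm_metrics(text: str, prefix: str = "vllm:") -> List[Sample]:
    """Return (name, labels, value) for every sample whose name starts with prefix."""
    samples: List[Sample] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue  # unparsable line or a non-vLLM metric
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical excerpt of a /metrics response
example = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="gpt-oss-120b"} 3.0
vllm:num_requests_waiting{model_name="gpt-oss-120b"} 7.0
process_cpu_seconds_total 12.5
"""

for name, labels, value in parse_vllm_metrics(example):
    print(name, labels, value)
```

In practice Prometheus does this scraping for you; a parser like this is only useful for spot-checking which vllm:* series and labels your server version actually emits.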
Variables
- Data Source: Prometheus instance.
- Namespace: Kubernetes namespace (e.g. default).
- Deployment: the vLLM deployment / served model (e.g. gpt-oss-120b).
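The Namespace and Deployment variables narrow every panel to one set of series, which in PromQL means a label matcher. The sketch below builds such a matcher from concrete values; the label names are assumptions (namespace is commonly attached by Kubernetes service discovery, and model_name is the label vLLM puts on its own metrics), and the helper names are hypothetical. Inside the dashboard itself the values would come from Grafana's $namespace and $deployment variables instead.

```python
def escape_label_value(value: str) -> str:
    """Escape a string for use inside a double-quoted PromQL label matcher."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def variable_selector(namespace: str, deployment: str,
                      namespace_label: str = "namespace",
                      deployment_label: str = "model_name") -> str:
    """Build the label matcher implied by the Namespace and Deployment variables.

    The default label names are assumptions; adjust them to whatever labels
    your Prometheus scrape configuration attaches to the vllm:* series.
    """
    return (f'{{{namespace_label}="{escape_label_value(namespace)}",'
            f'{deployment_label}="{escape_label_value(deployment)}"}}')


# The dashboard would express the same filter with $namespace / $deployment.
print("vllm:num_requests_waiting" + variable_selector("default", "gpt-oss-120b"))
```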
Rows & panels
Request Stats
- Requests Running: requests currently in the running batch (being prefilled or decoded).
- Requests Waiting: requests queued for a slot in the running batch.
- KV Cache Usage: percentage of the GPU KV cache block pool in use; once it saturates, new requests queue instead of running.
- Request Rate: incoming request rate over time.
- Tokens Generated/sec: output token throughput.
- Request States Over Time: running vs. waiting (and swapped) requests as a time series.
- KV Cache Usage Over Time: KV cache utilization trend.
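The Request Stats panels mix two kinds of metrics: gauges (running, waiting, KV cache usage) that are read as-is, and counters (requests, generated tokens) that only make sense as rates. The sketch below shows PromQL of the kind such panels typically use; it is not taken from dashboard.json. The metric names are ones vLLM has commonly exported but can differ between vLLM versions, so check your server's /metrics output. vllm:request_success_total counts finished requests, so using it for Request Rate is an assumption that approximates incoming traffic. The function name and the 5m window are hypothetical.

```python
from typing import Dict


def request_stats_queries(selector: str, window: str = "5m") -> Dict[str, str]:
    """Hypothetical PromQL for the Request Stats panels.

    selector is a label matcher such as '{namespace="default"}'.
    """
    return {
        # Gauges: current value, summed across replicas
        "Requests Running": f"sum(vllm:num_requests_running{selector})",
        "Requests Waiting": f"sum(vllm:num_requests_waiting{selector})",
        # vLLM reports KV cache usage as a 0-1 fraction, so scale to percent
        "KV Cache Usage": f"100 * avg(vllm:gpu_cache_usage_perc{selector})",
        # Counters: per-second rate over a sliding window
        "Request Rate": f"sum(rate(vllm:request_success_total{selector}[{window}]))",
        "Tokens Generated/sec": f"sum(rate(vllm:generation_tokens_total{selector}[{window}]))",
    }


for panel, query in request_stats_queries('{namespace="default"}').items():
    print(f"{panel}: {query}")
```

Averaging KV cache usage across replicas can hide a single saturated replica; max() is a reasonable alternative if you care about the worst case.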
Per-Minute Metrics (RPM / ITPM / OTPM)
- Requests Per Minute (RPM): request volume per minute.
- Input Tokens Per Minute (ITPM): prompt token volume.
- Output Tokens Per Minute (OTPM): generated token volume.
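Per-minute figures like these are usually derived from monotonically increasing counters (for tokens, vLLM exports vllm:prompt_tokens_total and vllm:generation_tokens_total; the RPM source is an assumption). The arithmetic is "how much did the counter grow, normalized to 60 seconds", with a drop in the counter treated as a restart from zero. The sketch below does this on raw (timestamp, value) scrapes; unlike Prometheus's increase(), it does not extrapolate to the edges of the window. Function names and sample values are hypothetical.

```python
from typing import List, Tuple

# (unix_timestamp_seconds, counter_value) pairs, oldest first
Samples = List[Tuple[float, float]]


def counter_increase(samples: Samples) -> float:
    """Total increase of a monotonic counter; a drop is treated as a reset to 0."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a server restart the counter starts again from 0,
        # so the new value itself is the increase since the reset.
        total += cur - prev if cur >= prev else cur
    return total


def per_minute(samples: Samples) -> float:
    """Average increase per 60 s over the sampled window (no edge extrapolation)."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) * 60.0 / elapsed


# Hypothetical scrapes of vllm:prompt_tokens_total, 30 s apart.
# Roughly what PromQL computes with: sum(increase(vllm:prompt_tokens_total[1m]))
prompt_tokens = [(0.0, 1000.0), (30.0, 1600.0), (60.0, 2500.0)]
print("ITPM:", per_minute(prompt_tokens))
```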
Usage
Import dashboard.json into Grafana, then select the data source, namespace, and deployment from the variable drop-downs. The screenshot shows the default time range (the last 2 days) with auto-refresh enabled.
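If you prefer scripting the import over the UI, Grafana's HTTP API accepts a dashboard model via POST /api/dashboards/db. The sketch below only prepares that request with the standard library; the Grafana URL and token are placeholders, and it assumes dashboard.json selects its data source through a dashboard variable rather than export-time "__inputs" placeholders (if it has those, the UI import flow is the simpler route). Helper names are hypothetical.

```python
import json
import urllib.request


def load_dashboard(path: str = "dashboard.json") -> dict:
    """Read the dashboard model from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_import_request(grafana_url: str, api_token: str, dashboard: dict,
                         overwrite: bool = True) -> urllib.request.Request:
    """Prepare (but do not send) a POST to Grafana's /api/dashboards/db."""
    model = dict(dashboard)
    model["id"] = None  # let Grafana assign its own internal id
    body = json.dumps({"dashboard": model, "overwrite": overwrite}).encode("utf-8")
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/dashboards/db",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",  # e.g. a service account token
        },
        method="POST",
    )


# Hypothetical usage; URL and token are placeholders:
# req = build_import_request("http://grafana.example:3000", "<token>",
#                            load_dashboard("dashboard.json"))
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Setting overwrite to True makes re-running the script replace an existing dashboard with the same uid instead of failing.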