vLLM Performance Dashboard

Grafana dashboard for monitoring vLLM inference servers running as UbiOps deployments. It covers request throughput, queue depth, KV cache pressure, and token rates, and is fed by the vllm:* Prometheus metrics that vLLM exposes.
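To check which vllm:* series a server exposes before importing the dashboard, the metrics text can be filtered with a few lines of Python. This is a minimal sketch, not part of this repository: the sample text is made up for illustration, and the parser only handles the simple `name{labels} value` lines of the Prometheus text format.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r'^(vllm:[A-Za-z0-9_:]+)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$')


def parse_vllm_samples(text):
    """Return (name, labels, value) tuples for every vllm:* sample in the text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


# Illustrative input only; real output depends on the vLLM version and model.
EXAMPLE_TEXT = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="gpt-oss-120b"} 3.0
vllm:num_requests_waiting{model_name="gpt-oss-120b"} 1.0
process_cpu_seconds_total 12.5
"""

for name, labels, value in parse_vllm_samples(EXAMPLE_TEXT):
    print(name, labels, value)
```

In practice the text would come from the server's metrics endpoint or from Prometheus itself; only the vllm:* lines matter for this dashboard.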

Variables

  • Data Source — Prometheus instance.
  • Namespace — Kubernetes namespace (e.g. default).
  • Deployment — the vLLM deployment / served model (e.g. gpt-oss-120b).
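The Namespace and Deployment variables presumably end up as label matchers in each panel query. The sketch below shows that idea under stated assumptions: the label names `namespace` and `model_name` are guesses for illustration, and the matchers actually used are defined in dashboard.json.

```python
def label_selector(namespace, deployment,
                   namespace_label="namespace",      # assumed label name
                   deployment_label="model_name"):   # assumed label name
    """Build a PromQL label selector from the two dashboard variables."""

    def quote(value):
        # Escape backslashes and double quotes inside a PromQL string literal.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return "{%s=%s,%s=%s}" % (
        namespace_label, quote(namespace),
        deployment_label, quote(deployment),
    )


selector = label_selector("default", "gpt-oss-120b")
print("vllm:num_requests_running" + selector)
```

The Data Source variable is not part of the selector; it only decides which Prometheus instance the queries are sent to.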

Rows & panels

Request Stats

  • Requests Running — requests currently being decoded.
  • Requests Waiting — requests queued for a slot.
  • KV Cache Usage — % of the GPU KV cache block pool in use (saturation → queuing).
  • Request Rate — incoming requests over time.
  • Tokens Generated/sec — output token throughput.
  • Request States Over Time — running vs. waiting (and swapped) requests as a timeseries.
  • KV Cache Usage Over Time — KV cache utilization trend.
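As a rough guide to what sits behind these panels, the sketch below maps a few panel titles to plausible PromQL expressions. The metric names are ones vLLM has exposed, but they vary between vLLM versions (for example, the KV cache gauge has been renamed), and the 1m windows and the choice of `vllm:request_success_total` for the request rate are assumptions. The queries in dashboard.json are authoritative.

```python
# Panel title -> PromQL template. "%SEL%" is replaced by a label selector.
# All expressions are illustrative guesses, not copied from dashboard.json.
PANEL_QUERIES = {
    "Requests Running": "sum(vllm:num_requests_running%SEL%)",
    "Requests Waiting": "sum(vllm:num_requests_waiting%SEL%)",
    # The gauge is a fraction between 0 and 1, so scale it to a percentage.
    "KV Cache Usage": "avg(vllm:gpu_cache_usage_perc%SEL%) * 100",
    "Request Rate": "sum(rate(vllm:request_success_total%SEL%[1m]))",
    "Tokens Generated/sec": "sum(rate(vllm:generation_tokens_total%SEL%[1m]))",
}


def render_query(panel, selector=""):
    """Return the PromQL expression for a panel with the selector filled in."""
    return PANEL_QUERIES[panel].replace("%SEL%", selector)


print(render_query("KV Cache Usage", '{model_name="gpt-oss-120b"}'))
```

Pasting a rendered expression into Grafana's Explore view is a quick way to confirm that a metric exists on a given server before troubleshooting an empty panel.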

Per-Minute Metrics (RPM / ITPM / OTPM)

  • Requests Per Minute (RPM).
  • Input Tokens Per Minute (ITPM) — prompt token volume.
  • Output Tokens Per Minute (OTPM) — generated token volume.
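The per-minute figures are derived from monotonically increasing counters. The arithmetic can be sketched as follows; this illustrates the general idea (a per-second rate scaled by 60), not the exact expression used in dashboard.json, and the reset handling is simplified compared with Prometheus.

```python
def per_minute(count_start, count_end, elapsed_seconds):
    """Convert two readings of a counter into a per-minute rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end >= count_start:
        delta = count_end - count_start
    else:
        # The counter restarted (e.g. the server was redeployed); assume it
        # started again from zero, so the new reading is the whole increase.
        delta = count_end
    return delta * 60 / elapsed_seconds


# 600 requests over 120 seconds works out to 300 requests per minute.
print(per_minute(1000, 1600, 120))
```

The same conversion applies to the prompt and generation token counters for ITPM and OTPM.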

Usage

Import dashboard.json into Grafana, then select the data source, namespace, and deployment. The default time range, as shown in the screenshot (image.png), is the last 2 days with auto-refresh enabled.