# vLLM Performance Dashboard

Grafana dashboard for monitoring [vLLM](https://github.com/vllm-project/vllm) inference servers running as UbiOps deployments: request throughput, queue depth, KV cache pressure, and token rates. Fed by the `vllm:*` Prometheus metrics that vLLM exposes.

## Variables

- **Data Source**: Prometheus instance.
- **Namespace**: Kubernetes namespace (e.g. `default`).
- **Deployment**: the vLLM deployment / served model (e.g. `gpt-oss-120b`).

## Rows & panels

**Request Stats**

- *Requests Running*: requests currently being decoded.
- *Requests Waiting*: requests queued for a slot.
- *KV Cache Usage*: % of the GPU KV cache block pool in use (saturation → queuing).
- *Request Rate*: incoming requests over time.
- *Tokens Generated/sec*: output token throughput.
- *Request States Over Time*: running vs. waiting (and swapped) requests as a timeseries.
- *KV Cache Usage Over Time*: KV cache utilization trend.

**Per-Minute Metrics (RPM / ITPM / OTPM)**

- *Requests Per Minute (RPM)*.
- *Input Tokens Per Minute (ITPM)*: prompt token volume.
- *Output Tokens Per Minute (OTPM)*: generated token volume.

## Usage

Default range in the screenshot is the last 2 days with auto-refresh. Import into Grafana, then select datasource, namespace, and deployment.
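
As a rough illustration of how the per-minute panels (RPM / ITPM / OTPM) relate to the `vllm:*` counters, the sketch below builds PromQL strings in Python. It is not the dashboard's actual query set. The metric names are the ones commonly exposed by vLLM and may differ between vLLM versions, and the `namespace` and `model_name` label filters, the `5m` window, and the helper names are assumptions made for this example.

```python
# Sketch only: builds PromQL strings similar in spirit to the per-minute panels.
# Assumptions (not taken from the dashboard JSON): the label names
# `namespace` and `model_name`, the 5m rate window, and the metric names below.

# Counter metrics commonly exposed by vLLM; names may vary by vLLM version.
PER_MINUTE_COUNTERS = {
    "RPM": "vllm:request_success_total",
    "ITPM": "vllm:prompt_tokens_total",
    "OTPM": "vllm:generation_tokens_total",
}


def label_selector(namespace: str, deployment: str) -> str:
    """Return a PromQL label matcher list for the chosen dashboard variables."""
    return f'namespace="{namespace}", model_name="{deployment}"'


def per_minute_query(metric: str, namespace: str, deployment: str,
                     window: str = "5m") -> str:
    """Per-minute rate of a counter: per-second rate() multiplied by 60."""
    sel = label_selector(namespace, deployment)
    return f"sum(rate({metric}{{{sel}}}[{window}])) * 60"


def build_queries(namespace: str, deployment: str) -> dict[str, str]:
    """Map each per-minute panel name to its illustrative PromQL expression."""
    return {
        panel: per_minute_query(metric, namespace, deployment)
        for panel, metric in PER_MINUTE_COUNTERS.items()
    }


if __name__ == "__main__":
    for panel, query in build_queries("default", "gpt-oss-120b").items():
        print(f"{panel}: {query}")
```

Pasting one of the printed expressions into Grafana's Explore view against the same Prometheus data source is a quick way to check whether the label names match what your vLLM pods actually export before relying on the panels.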