# vLLM Performance Dashboard
A Grafana dashboard for monitoring [vLLM](https://github.com/vllm-project/vllm) inference servers running as UbiOps deployments. It covers request throughput, queue depth, KV cache pressure, and token rates, and is fed by the `vllm:*` Prometheus metrics that vLLM exposes.
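
As a rough illustration of the data behind the panels, the sketch below filters `vllm:`-prefixed samples out of a Prometheus text-format scrape. The sample payload, the metric and label names in it, and the helper `parse_vllm_samples` are assumptions for illustration; the exact metric set depends on the vLLM version.

```python
import re

# Hypothetical scrape excerpt; names and values are illustrative only.
SAMPLE_SCRAPE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="gpt-oss-120b"} 3.0
vllm:num_requests_waiting{model_name="gpt-oss-120b"} 1.0
process_cpu_seconds_total 12.5
"""

# Metric name, optional {label set}, then the sample value.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_vllm_samples(text):
    """Return (name, labels, value) tuples for metrics prefixed with 'vllm:'."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or not match.group(1).startswith("vllm:"):
            continue
        labels = dict(_LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


for name, labels, value in parse_vllm_samples(SAMPLE_SCRAPE):
    print(name, labels, value)
```

In practice Prometheus does the scraping and the dashboard queries Prometheus, so a parser like this is only useful for checking that a server exposes the expected series.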
## Variables
- **Data Source** — Prometheus instance.
- **Namespace** — Kubernetes namespace (e.g. `default`).
- **Deployment** — the vLLM deployment / served model (e.g. `gpt-oss-120b`).
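
Dashboard variables like these usually end up as label matchers in each panel's PromQL query. The sketch below shows one way such a selector could be assembled. The label names `namespace` and `model_name` and the helper `promql_selector` are assumptions, since the dashboard JSON is not shown here; check the panel queries for the labels actually used.

```python
def promql_selector(metric, **labels):
    """Build a PromQL instant-vector selector with exact-match label matchers."""

    def escape(value):
        # Escape characters that are special inside a PromQL string literal.
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    matchers = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}" if matchers else metric


# Values mirror the examples given for the Namespace and Deployment variables.
print(promql_selector(
    "vllm:num_requests_waiting",
    namespace="default",
    model_name="gpt-oss-120b",
))
```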
## Rows & panels
**Request Stats**
- *Requests Running* — requests currently being processed by the engine (prefill or decode).
- *Requests Waiting* — requests queued for a slot.
- *KV Cache Usage* — % of the GPU KV cache block pool in use (saturation → queuing).
- *Request Rate* — incoming requests over time.
- *Tokens Generated/sec* — output token throughput.
- *Request States Over Time* — running vs. waiting (and swapped) requests as a timeseries.
- *KV Cache Usage Over Time* — KV cache utilization trend.

**Per-Minute Metrics (RPM / ITPM / OTPM)**
- *Requests Per Minute (RPM)*.
- *Input Tokens Per Minute (ITPM)* — prompt token volume.
- *Output Tokens Per Minute (OTPM)* — generated token volume.
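
Per-minute figures like these are typically derived from cumulative counters. The sketch below shows the underlying arithmetic on raw `(timestamp_seconds, counter_value)` samples, with counter-reset handling similar in spirit to Prometheus's `increase()` but without its extrapolation. The function names and sample values are hypothetical, and the dashboard's actual queries may differ.

```python
def counter_increase(samples):
    """Total increase of a cumulative counter across ordered samples."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a server restart),
        # so the new value is counted as growth from zero.
        total += cur - prev if cur >= prev else cur
    return total


def per_minute(samples):
    """Average per-minute rate over the window covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed * 60.0


# Hypothetical output-token counter sampled every 30 seconds.
output_tokens = [(0, 100), (30, 160), (60, 220)]
print(per_minute(output_tokens))
```

The same calculation applies to RPM, ITPM, and OTPM; only the source counter changes (requests, prompt tokens, or generated tokens).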
## Usage
Import the dashboard into Grafana, then select the data source, namespace, and deployment. The default time range shown in the screenshot is the last 2 days, with auto-refresh enabled.
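
If a scripted import is preferred over the Grafana UI, one option is Grafana's `POST /api/dashboards/db` HTTP endpoint. The sketch below only builds the request and sends nothing. The URL, the token, the stand-in dashboard dict, and the helper `build_import_request` are placeholders and assumptions; a dashboard exported for external sharing may also need its data source placeholders resolved first.

```python
import json
import urllib.request


def build_import_request(grafana_url, api_token, dashboard):
    """Build (but do not send) a request that creates or updates a dashboard."""
    payload = {
        # A null id asks Grafana to create the dashboard rather than update by id.
        "dashboard": {**dashboard, "id": None},
        "overwrite": True,
    }
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; a real call would load the exported dashboard JSON
# from disk and pass the request to urllib.request.urlopen().
request = build_import_request(
    "http://localhost:3000",
    "REPLACE_WITH_TOKEN",
    {"title": "vLLM Performance Dashboard", "panels": []},
)
print(request.get_method(), request.full_url)
```

The data source, namespace, and deployment variables still need to be selected in the dashboard after import.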