mindef-overdracht/llm-throughput-tests-mindef-metadateren
2026-06-02 11:46:26 +02:00

LLM Benchmarking Tool

The following benchmarks were used to (1) measure the throughput of the configured models on the available hardware (NVIDIA RTX 6000 PRO GPUs) and (2) debug connection issues that arose during the configuration of the pipelines.

Benchmarks were created for Qwen 3.5 and GPT-OSS. GPT-OSS was the main model used during the project because of its throughput and output quality.


How to Benchmark

The tool benchmarks LLM deployments with a batch request pattern: it sends N requests simultaneously to measure concurrent throughput.

Installation

pip install -r requirements.txt

Dataset Generation (Optional)

You have 3 input options:

1. Generated Prompts (Default)

Automatically generates synthetic text to match token counts.

2. Real Conversations

Use conversations from HuggingFace datasets:

# Generate conversation dataset (takes ~5 minutes)
python create_test_dataset.py

# Custom buckets
python create_test_dataset.py --buckets 1000 5000 10000 --chains_per_bucket 64

# Output to custom location
python create_test_dataset.py --output data/conversations.json

This creates a JSON file with real conversations bucketed by token count. The benchmark will cycle through these conversations instead of repeating the same synthetic prompt.
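
The bucketing and cycling described above can be sketched as follows. This is a minimal illustration, not the code in create_test_dataset.py: the nearest-bucket rule, the whitespace token counter, and the helper names bucket_for and bucket_conversations are all assumptions made for the example.

```python
from itertools import cycle


def bucket_for(token_count, buckets):
    """Return the bucket size closest to token_count (assumed rule)."""
    return min(buckets, key=lambda b: abs(b - token_count))


def bucket_conversations(conversations, buckets, count_tokens):
    """Group conversations under their nearest token-count bucket."""
    grouped = {b: [] for b in buckets}
    for text in conversations:
        grouped[bucket_for(count_tokens(text), buckets)].append(text)
    return grouped


def count_tokens(text):
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


conversations = ["a b c", "a b c d e f g h i", "a b"]
grouped = bucket_conversations(conversations, [3, 10], count_tokens)

# Cycle through one bucket so consecutive requests get different prompts.
prompts = cycle(grouped[3])
first_four = [next(prompts) for _ in range(4)]
```

With a real tokenizer and the default buckets, the same cycling gives each request in a batch a different conversation of roughly the target length.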

3. Custom Text

Provide your own text directly:

# Via CLI
python benchmark_llm.py --text "Your custom text here..."

# Or in config file
text: "Analyze this large document about..."

Quick Start

1. Create Configuration File

endpoint:
  url: https://b5cee612-b599-4524-a893-7698c9e75948.services.ubiops.development.vlam.ai
  api_key: your-api-key
  model_name: your-model

benchmark:
  input_tokens: [1000, 5000, 10000]
  batch_sizes: [16, 32, 64, 128]
  num_batches: 10
  output_tokens: 256
  dataset: test_conversations.json  # Optional: real conversations
  text: null  # Optional: custom text input

runtime:
  request_timeout: 300
  delay_between_runs: 5
  log_io: false
  wait_for_ready: true

2. Run Benchmark

python benchmark_llm.py --config benchmark_config.yaml

3. Generate Visualizations

python visualize_results.py --input results/results_your-model/benchmark_results.json

Usage

Configuration File

python benchmark_llm.py --config benchmark_config.yaml

CLI Arguments

# With dataset
python benchmark_llm.py \
  --endpoint_url https://api.example.com/v1 \
  --api_key YOUR_KEY \
  --model_name gpt-4 \
  --input_tokens 1000 5000 10000 \
  --batch_sizes 16 32 64 128 \
  --num_batches 10 \
  --output_tokens 256 \
  --dataset test_conversations.json

# With custom text
python benchmark_llm.py \
  --endpoint_url https://api.example.com/v1 \
  --api_key YOUR_KEY \
  --model_name gpt-4 \
  --batch_sizes 32 \
  --num_batches 10 \
  --text "Analyze the following document about cloud architecture..."

How It Works

Batch Execution

The tool sends batches of N requests simultaneously:

Batch 0: [Req 1, Req 2, ..., Req 32]  ← All start at the same time
         [Wait for all to complete]

Batch 1: [Req 33, Req 34, ..., Req 64] ← All start at the same time
         [Wait for all to complete]

This ensures:

  • All requests in a batch have identical time_created timestamps
  • Concurrent load testing
  • Accurate burst performance measurement
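
The batch pattern above can be sketched with asyncio. This is an illustrative sketch under stated assumptions, not the implementation in benchmark_llm.py: fake_request is a hypothetical stand-in that sleeps instead of calling the endpoint, and run_batch and run_benchmark are names chosen for the example.

```python
import asyncio
import time


async def fake_request(request_id, latency_s):
    """Stand-in for one HTTP call to the endpoint (no network involved)."""
    await asyncio.sleep(latency_s)
    return request_id


async def run_batch(batch_index, batch_size, latency_s=0.01):
    """Start batch_size requests together and wait for all of them."""
    first_id = batch_index * batch_size + 1
    tasks = [fake_request(first_id + i, latency_s) for i in range(batch_size)]
    started = time.perf_counter()
    results = await asyncio.gather(*tasks)  # requests overlap in time
    return results, time.perf_counter() - started


async def run_benchmark(num_batches, batch_size):
    """Run batches one after another; only requests inside a batch overlap."""
    completed = []
    for batch_index in range(num_batches):
        results, _elapsed = await run_batch(batch_index, batch_size)
        completed.extend(results)
    return completed


completed = asyncio.run(run_benchmark(num_batches=2, batch_size=32))
```

Because each batch is awaited in full before the next one starts, a single slow request delays the following batch, which is what makes this a burst test rather than a steady-rate test.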

Request Calculation

total_requests = num_batches × batch_size

Example:

batch_sizes: [32]
num_batches: 10

# Result: 10 batches × 32 requests = 320 total requests
# Each batch sends 32 requests simultaneously

Key Metrics

Throughput

  • Tokens/second across all requests in a batch
  • Measures the system's ability to handle concurrent load
  • Higher is better

Time to First Token (TTFT)

  • Latency until first content token appears
  • Critical for user experience
  • Lower is better
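
One way to measure TTFT on a streamed response is to time the first non-empty chunk. The sketch below assumes the response arrives as an iterable of text chunks; fake_stream is a hypothetical stand-in for a streaming client, and measure_ttft is a name chosen for the example.

```python
import time


def fake_stream(first_token_delay_s, tokens):
    """Stand-in for a streaming response: waits, then yields text chunks."""
    time.sleep(first_token_delay_s)
    for token in tokens:
        yield token


def measure_ttft(stream):
    """Return (ttft_seconds, total_seconds, full_text) for a chunk stream."""
    started = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        # Empty chunks do not count as the first content token.
        if ttft is None and chunk:
            ttft = time.perf_counter() - started
        chunks.append(chunk)
    total = time.perf_counter() - started
    return ttft, total, "".join(chunks)


ttft, total, text = measure_ttft(fake_stream(0.02, ["", "Hel", "lo"]))
```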

Latency Percentiles

  • P50 (median): Typical request latency
  • P95: 95% of requests complete faster
  • P99: 99% of requests complete faster
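
For reference, percentiles like these can be computed from the per-request latencies with the nearest-rank method. The README does not state which percentile method the tool uses, so treat this as one reasonable definition rather than the tool's own; percentile is a name chosen for the example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies of 1.0 .. 100.0 seconds, used only to show the ranks.
latencies = [float(i) for i in range(1, 101)]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
```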

Batch Metrics

{
  "batch_metrics": {
    "num_batches": 10,
    "avg_batch_throughput": 2456.78,
    "min_batch_throughput": 2301.45,
    "max_batch_throughput": 2589.12
  }
}
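
A summary with these fields could be derived from per-batch throughput values as sketched below. The README does not say whether the tool counts output tokens only or input plus output tokens, so the token counts here are simply whatever the caller passes in; batch_throughput and summarize_batches are hypothetical helper names.

```python
def batch_throughput(token_counts, batch_seconds):
    """Tokens per second across all requests in one batch."""
    return sum(token_counts) / batch_seconds


def summarize_batches(throughputs):
    """Build a batch_metrics-style summary from per-batch throughputs."""
    return {
        "num_batches": len(throughputs),
        "avg_batch_throughput": round(sum(throughputs) / len(throughputs), 2),
        "min_batch_throughput": min(throughputs),
        "max_batch_throughput": max(throughputs),
    }


# 32 requests of 256 tokens each, finishing in 4 seconds of wall time.
one_batch = batch_throughput([256] * 32, 4.0)
summary = summarize_batches([2000.0, one_batch, 2096.0])
```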

Output Structure

results/
└── results_your-model/
    ├── benchmark_results.json      # Raw benchmark data
    ├── benchmark_io.log             # I/O logs (if enabled)
    ├── config_used.yaml             # Config copy (API key redacted)
    ├── throughput.png               # Throughput vs batch size
    ├── ttft.png                     # TTFT vs batch size
    └── latency_percentiles.png      # Latency distribution
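
The redaction noted for config_used.yaml can be sketched as a small helper that masks the key before the config is written out. This is an assumption about how such redaction might work, not the tool's code; redact_api_key and the placeholder string are hypothetical.

```python
import copy


def redact_api_key(config, placeholder="***REDACTED***"):
    """Return a copy of config with endpoint.api_key replaced."""
    redacted = copy.deepcopy(config)  # leave the caller's config untouched
    endpoint = redacted.get("endpoint")
    if isinstance(endpoint, dict) and "api_key" in endpoint:
        endpoint["api_key"] = placeholder
    return redacted


config = {
    "endpoint": {"url": "https://api.example.com/v1", "api_key": "secret"},
    "benchmark": {"batch_sizes": [32], "num_batches": 10},
}
safe_config = redact_api_key(config)
```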

Configuration Reference

Endpoint Configuration

endpoint:
  url: string              # OpenAI-compatible endpoint URL
  api_key: string          # API authentication key
  model_name: string       # Model identifier

Benchmark Configuration

benchmark:
  input_tokens: list[int]     # Token counts to test, e.g. [1000, 5000, 10000]
  batch_sizes: list[int]      # Batch sizes to test, e.g. [16, 32, 64, 128]
  num_batches: int            # Number of batches per config (default: 10)
  output_tokens: int          # Max output tokens (default: 256)
  dataset: string             # Optional: path to a conversation dataset
  text: string                # Optional: custom text input

Understanding batch_sizes:

  • batch_sizes: [16] → Sends 16 requests simultaneously
  • batch_sizes: [32] → Sends 32 requests simultaneously
  • batch_sizes: [16, 32, 64] → Tests 3 different batch sizes

Runtime Configuration

runtime:
  request_timeout: int        # Timeout per request in seconds (default: 300)
  delay_between_runs: int     # Delay between configs in seconds (default: 5)
  log_io: bool                # Enable I/O logging (default: false)
  wait_for_ready: bool        # Wait for model init (default: true)
  max_init_retries: int       # Max init attempts (default: 10)
  init_retry_delay: int       # Delay between init attempts (default: 30)
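
The interaction between wait_for_ready, max_init_retries and init_retry_delay can be pictured as a simple retry loop. The sketch below is an assumption about the behavior, not the tool's implementation: probe is a hypothetical callable that returns True once the model answers, and sleep is injectable so the example runs without waiting.

```python
import time


def wait_for_ready(probe, max_init_retries=10, init_retry_delay=30,
                   sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number.

    Raises TimeoutError if the model is still not ready after
    max_init_retries attempts.
    """
    for attempt in range(1, max_init_retries + 1):
        try:
            if probe():
                return attempt
        except OSError:
            pass  # connection errors are treated as "not ready yet"
        if attempt < max_init_retries:
            sleep(init_retry_delay)
    raise TimeoutError(f"model not ready after {max_init_retries} attempts")


# Fake probe: not ready twice, then ready. Delays are recorded, not slept.
answers = iter([False, False, True])
delays = []
attempts = wait_for_ready(lambda: next(answers), sleep=delays.append)
```

With the defaults listed above, such a loop would wait up to 9 × 30 = 270 seconds between its ten attempts before giving up.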

Example Output

Starting benchmark: 10 batches × 32 requests/batch = 320 total
Input: 5000 tokens, Output: 256 tokens
============================================================

Batch 0: 32/32 successful, 12.34s, 2456.78 tok/s
Batch 1: 32/32 successful, 12.45s, 2401.23 tok/s
Batch 2: 32/32 successful, 12.56s, 2389.45 tok/s
...

✓ Benchmark complete in 125.67s
  Success: 100% (320/320)
  P95 Latency: 13.45s
  Throughput: 2428.56 tokens/s
  Avg Batch Throughput: 2429.01 tokens/s

Use Cases

1. Finding Optimal Batch Size

Test multiple batch sizes to find the sweet spot:

batch_sizes: [16, 32, 64, 128, 256]
num_batches: 10

Inspect throughput.png to see where throughput peaks.
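
The same comparison can be done programmatically once throughput per batch size is known. The numbers below are made up purely to illustrate the selection and are not measurements; best_batch_size is a hypothetical helper, and the layout of benchmark_results.json is not assumed here.

```python
def best_batch_size(throughput_by_batch_size):
    """Return the batch size with the highest throughput (tokens/s)."""
    return max(throughput_by_batch_size, key=throughput_by_batch_size.get)


# Illustrative values only: throughput rises, peaks, then drops off.
illustrative = {16: 1200.0, 32: 2100.0, 64: 2600.0, 128: 2550.0, 256: 2300.0}
peak = best_batch_size(illustrative)
```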

2. Stress Testing

Test maximum burst capacity:

batch_sizes: [256]
num_batches: 5

This sends 256 simultaneous requests per batch, for 5 × 256 = 1280 requests in total.

3. Performance Profiling

Test different input sizes at various batch sizes:

input_tokens: [1000, 2500, 5000, 10000]
batch_sizes: [16, 32, 64, 128]

This produces a performance matrix covering every combination of input size and batch size (here 4 × 4 = 16 configurations).
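
To gauge how large such a run is before starting it, the configurations can be enumerated with itertools.product. This sketch assumes every input size is combined with every batch size and that each configuration runs num_batches batches; build_matrix is a name chosen for the example.

```python
from itertools import product


def build_matrix(input_tokens, batch_sizes, num_batches):
    """List every (input size, batch size) configuration and its request count."""
    return [
        {
            "input_tokens": tokens,
            "batch_size": size,
            "total_requests": num_batches * size,
        }
        for tokens, size in product(input_tokens, batch_sizes)
    ]


matrix = build_matrix(
    input_tokens=[1000, 2500, 5000, 10000],
    batch_sizes=[16, 32, 64, 128],
    num_batches=10,
)
grand_total = sum(row["total_requests"] for row in matrix)
```

Here the 16 configurations add up to 9600 requests, which helps when deciding on request_timeout and delay_between_runs.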

Advanced Usage

Enable I/O Logging

Log all input prompts and outputs for debugging:

# Set log_io: true in the config file, then run:
python benchmark_llm.py --config benchmark_config.yaml

Or:

python benchmark_llm.py --log_io ...

The logged prompts and outputs are saved to benchmark_io.log in the results directory.

Skip Model Initialization

If the model is already warm, skip the initialization wait:

python benchmark_llm.py --config benchmark_config.yaml --skip_init_wait

Custom Timeout

For large batches or slow responses:

python benchmark_llm.py --request_timeout 600 ...