Import part 2

This commit is contained in:
kvanbezouw 2026-06-02 11:46:20 +02:00
parent 00e2c83beb
commit 603841412b
8 changed files with 2590 additions and 0 deletions

Binary file not shown.



@ -0,0 +1,326 @@
# LLM Benchmarking Tool
These benchmarks served two purposes: 1) measuring the throughput of the configured models on the available hardware (NVIDIA RTX 6000 PRO GPUs), and 2) debugging connection issues that arose while configuring the pipelines.
Benchmarks were created for Qwen 3.5 and GPT OSS. GPT OSS was the main model used during the project because of its throughput and output quality.
------------------
# How-to benchmark:
Benchmark LLM deployments using **batch request patterns**: the tool sends N requests simultaneously to measure concurrent throughput.
## Installation
```bash
pip install -r requirements.txt
```
## Dataset Generation (Optional)
You have **3 input options**:
### 1. Generated Prompts (Default)
Automatically generates synthetic text to match token counts.
### 2. Real Conversations
Use conversations from HuggingFace datasets:
```bash
# Generate conversation dataset (takes ~5 minutes)
python create_test_dataset.py
# Custom buckets
python create_test_dataset.py --buckets 1000 5000 10000 --chains_per_bucket 64
# Output to custom location
python create_test_dataset.py --output data/conversations.json
```
This creates a JSON file with real conversations bucketed by token count. The benchmark will cycle through these conversations instead of repeating the same synthetic prompt.
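As a rough illustration of "cycling through conversations", the sketch below reads the structure that `create_test_dataset.py` writes (a JSON object keyed by bucket size as a string, each entry holding a `messages` list) and loops over one bucket endlessly. The helper name `conversation_cycle` and the inline sample data are assumptions made for this example. How `benchmark_llm.py` actually consumes the file is not shown here.

```python
import itertools
import json

# Tiny stand-in for test_conversations.json (same keys as the generator writes).
SAMPLE_JSON = """
{
  "1000": [
    {"messages": [{"role": "user", "content": "First question"}],
     "token_count": 980, "bucket": 1000, "original_index": 0, "synthetic": false},
    {"messages": [{"role": "user", "content": "Second question"}],
     "token_count": 1015, "bucket": 1000, "original_index": 1, "synthetic": false}
  ]
}
"""


def conversation_cycle(dataset: dict, input_tokens: int):
    """Return an endless iterator over the message lists of one bucket (hypothetical helper)."""
    chains = dataset.get(str(input_tokens), [])
    if not chains:
        raise KeyError(f"no conversations for bucket {input_tokens}")
    return itertools.cycle([chain["messages"] for chain in chains])


dataset = json.loads(SAMPLE_JSON)
prompts = conversation_cycle(dataset, 1000)
# Three requests drawn from a two-conversation bucket wrap around to the first one.
first_three = [next(prompts)[0]["content"] for _ in range(3)]
```

With more requests than conversations in a bucket, prompts repeat, which is why the generator aims for many unique conversations per bucket.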
### 3. Custom Text
Provide your own text directly:
```bash
# Via CLI
python benchmark_llm.py --text "Your custom text here..."
# Or in config file
text: "Analyze this large document about..."
```
## Quick Start
### 1. Create Configuration File
```yaml
endpoint:
  url: https://b5cee612-b599-4524-a893-7698c9e75948.services.ubiops.development.vlam.ai
  api_key: your-api-key
  model_name: your-model
benchmark:
  input_tokens: [1000, 5000, 10000]
  batch_sizes: [16, 32, 64, 128]
  num_batches: 10
  output_tokens: 256
  dataset: test_conversations.json  # Optional: real conversations
  text: null  # Optional: custom text input
runtime:
  request_timeout: 300
  delay_between_runs: 5
  log_io: false
  wait_for_ready: true
```
### 2. Run Benchmark
```bash
python benchmark_llm.py --config benchmark_config.yaml
```
### 3. Generate Visualizations
```bash
python visualize_results.py --input results/results_your-model/benchmark_results.json
```
## Usage
### Configuration File
```bash
python benchmark_llm.py --config benchmark_config.yaml
```
### CLI Arguments
```bash
# With dataset
python benchmark_llm.py \
--endpoint_url https://api.example.com/v1 \
--api_key YOUR_KEY \
--model_name gpt-4 \
--input_tokens 1000 5000 10000 \
--batch_sizes 16 32 64 128 \
--num_batches 10 \
--output_tokens 256 \
--dataset test_conversations.json
# With custom text
python benchmark_llm.py \
--endpoint_url https://api.example.com/v1 \
--api_key YOUR_KEY \
--model_name gpt-4 \
--batch_sizes 32 \
--num_batches 10 \
--text "Analyze the following document about cloud architecture..."
```
## How It Works
### Batch Execution
The tool sends batches of N requests **simultaneously**:
```
Batch 0: [Req 1, Req 2, ..., Req 32] ← All start at exact same time
[Wait for all to complete]
Batch 1: [Req 33, Req 34, ..., Req 64] ← All start at exact same time
[Wait for all to complete]
```
This ensures:
- All requests in a batch have **identical** `time_created` timestamps
- Concurrent load testing
- Accurate burst performance measurement
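The batch pattern above can be sketched with `asyncio`. This is a minimal illustration, not the tool's implementation: `fake_request` is a hypothetical stand-in that sleeps instead of calling the endpoint, and `run_batches` is a name chosen for this example.

```python
import asyncio
import time


async def fake_request(request_id: int, latency_s: float = 0.01) -> dict:
    """Hypothetical stand-in for one HTTP call to the model endpoint."""
    started = time.perf_counter()
    await asyncio.sleep(latency_s)  # simulates waiting for the model
    return {"id": request_id, "latency": time.perf_counter() - started}


async def run_batches(num_batches: int, batch_size: int) -> list:
    """Start batch_size requests together, wait for all, then start the next batch."""
    all_results = []
    next_id = 1
    for _ in range(num_batches):
        tasks = [fake_request(next_id + i) for i in range(batch_size)]
        next_id += batch_size
        # gather() schedules every coroutine before any result is awaited,
        # so the whole batch is in flight at once.
        all_results.append(await asyncio.gather(*tasks))
    return all_results


results = asyncio.run(run_batches(num_batches=2, batch_size=32))
```

Each inner list holds one batch, in request order, so `results[1][0]` corresponds to "Req 33" in the diagram.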
### Request Calculation
```
total_requests = num_batches × batch_size
```
**Example:**
```yaml
batch_sizes: [32]
num_batches: 10
# Result: 10 batches × 32 requests = 320 total requests
# Each batch sends 32 requests simultaneously
```
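When several `input_tokens` and `batch_sizes` values are configured, every combination is run, so the request count multiplies. The sketch below works that out for the Quick Start configuration. `plan_runs` is a hypothetical helper name used only for this illustration.

```python
from itertools import product


def plan_runs(input_tokens: list, batch_sizes: list, num_batches: int) -> list:
    """List every (input_tokens, batch_size) run with its total request count."""
    return [
        {
            "input_tokens": tokens,
            "batch_size": size,
            "num_batches": num_batches,
            "total_requests": num_batches * size,
        }
        for tokens, size in product(input_tokens, batch_sizes)
    ]


plan = plan_runs([1000, 5000, 10000], [16, 32, 64, 128], num_batches=10)
grand_total = sum(run["total_requests"] for run in plan)
```

Here `plan` contains 12 runs and `grand_total` is 7200 requests, which is worth checking before starting a long benchmark.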
## Key Metrics
### Throughput
- **Tokens/second** across all requests in a batch
- Measures system's ability to handle concurrent load
- Higher is better
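As a worked example of the throughput figure, the sketch below divides generated tokens by wall-clock time. The sample results in this commit appear consistent with this definition (8192 tokens over 16.285 s is reported as 503.04 tokens/s), but the exact formula in `benchmark_llm.py` is an assumption here, and `tokens_per_second` is a name made up for the example.

```python
def tokens_per_second(total_tokens: int, wall_time_s: float) -> float:
    """Aggregate throughput: tokens generated by all requests per second of wall time."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return total_tokens / wall_time_s


# 16 requests x 512 output tokens, finished in 16.285 s of wall time.
tps = tokens_per_second(16 * 512, 16.285)
```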
### Time to First Token (TTFT)
- Latency until first content token appears
- Critical for user experience
- Lower is better
### Latency Percentiles
- **P50 (median)**: Typical request latency
- **P95**: 95% of requests complete faster
- **P99**: 99% of requests complete faster
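A percentile can be computed by sorting the latencies and interpolating linearly between neighbouring values. The sketch below is a plain-Python illustration of that method. Whether the tool uses exactly this interpolation is an assumption, and `percentile` is a helper written for this example.

```python
def percentile(values: list, pct: float) -> float:
    """Linear-interpolation percentile of values, with pct between 0 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100  # fractional index into the sorted list
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


latencies = [3.187, 15.123]  # two request latencies in seconds
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With only two samples, P95 lands close to the slower request, which shows why percentiles from very small runs should be read with care.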
### Batch Metrics
```json
{
"batch_metrics": {
"num_batches": 10,
"avg_batch_throughput": 2456.78,
"min_batch_throughput": 2301.45,
"max_batch_throughput": 2589.12
}
}
```
## Output Structure
```
results/
└── results_your-model/
    ├── benchmark_results.json     # Raw benchmark data
    ├── benchmark_io.log           # I/O logs (if enabled)
    ├── config_used.yaml           # Config copy (API key redacted)
    ├── throughput.png             # Throughput vs batch size
    ├── ttft.png                   # TTFT vs batch size
    └── latency_percentiles.png    # Latency distribution
```
## Configuration Reference
### Endpoint Configuration
```yaml
endpoint:
  url: string         # OpenAI-compatible endpoint URL
  api_key: string     # API authentication key
  model_name: string  # Model identifier
```
### Benchmark Configuration
```yaml
benchmark:
  input_tokens: list[int]  # Token counts to test [1000, 5000, 10000]
  batch_sizes: list[int]   # Batch sizes to test [16, 32, 64, 128]
  num_batches: int         # Number of batches per config (default: 10)
  output_tokens: int       # Max output tokens (default: 256)
```
**Understanding batch_sizes:**
- `batch_sizes: [16]` → Sends 16 requests simultaneously
- `batch_sizes: [32]` → Sends 32 requests simultaneously
- `batch_sizes: [16, 32, 64]` → Tests 3 different batch sizes
### Runtime Configuration
```yaml
runtime:
  request_timeout: int     # Timeout per request in seconds (default: 300)
  delay_between_runs: int  # Delay between configs in seconds (default: 5)
  log_io: bool             # Enable I/O logging (default: false)
  wait_for_ready: bool     # Wait for model init (default: true)
  max_init_retries: int    # Max init attempts (default: 10)
  init_retry_delay: int    # Delay between init attempts (default: 30)
```
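The interplay of `wait_for_ready`, `max_init_retries` and `init_retry_delay` can be pictured as a simple polling loop. This is a sketch under assumptions: `wait_until_ready` and the `probe` callback are hypothetical names, and the real readiness check in `benchmark_llm.py` may differ.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    max_init_retries: int = 10,
    init_retry_delay: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to max_init_retries times, pausing between failed attempts."""
    for attempt in range(1, max_init_retries + 1):
        if probe():
            return True
        if attempt < max_init_retries:
            sleep(init_retry_delay)  # no pause after the final attempt
    return False


# Example: a model that reports ready on the third check.
calls = {"count": 0}


def flaky_probe() -> bool:
    calls["count"] += 1
    return calls["count"] >= 3


delays = []  # record the pauses instead of actually sleeping
ready = wait_until_ready(flaky_probe, sleep=delays.append)
```

With the defaults, a model that never becomes ready is given 10 attempts and nine 30-second pauses before the loop gives up.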
## Example Output
```
Starting benchmark: 10 batches × 32 requests/batch = 320 total
Input: 5000 tokens, Output: 256 tokens
============================================================
Batch 0: 32/32 successful, 12.34s, 2456.78 tok/s
Batch 1: 32/32 successful, 12.45s, 2401.23 tok/s
Batch 2: 32/32 successful, 12.56s, 2389.45 tok/s
...
✓ Benchmark complete in 125.67s
Success: 100% (320/320)
P95 Latency: 13.45s
Throughput: 2428.56 tokens/s
Avg Batch Throughput: 2429.01 tokens/s
```
## Use Cases
### 1. Finding Optimal Batch Size
Test multiple batch sizes to find the sweet spot:
```yaml
batch_sizes: [16, 32, 64, 128, 256]
num_batches: 10
```
Compare the `throughput.png` to see where throughput peaks.
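The same comparison can be made directly from `benchmark_results.json`. The sketch below assumes the layout of the sample results file in this commit (`results[].config.batch_size` and `results[].throughput.concurrent_total_tps`). `best_batch_sizes` is a hypothetical helper, and the inline data is a three-entry excerpt of that file.

```python
def best_batch_sizes(results: dict) -> dict:
    """Map each input_tokens value to the batch size with the highest throughput."""
    best = {}
    for run in results["results"]:
        tokens = run["config"]["input_tokens"]
        tps = run["throughput"]["concurrent_total_tps"]
        if tokens not in best or tps > best[tokens][1]:
            best[tokens] = (run["config"]["batch_size"], tps)
    return {tokens: batch for tokens, (batch, _) in best.items()}


# Excerpt of the 50000-token runs from the sample results file.
sample = {
    "results": [
        {"config": {"input_tokens": 50000, "batch_size": 8},
         "throughput": {"concurrent_total_tps": 177.76}},
        {"config": {"input_tokens": 50000, "batch_size": 32},
         "throughput": {"concurrent_total_tps": 225.4}},
        {"config": {"input_tokens": 50000, "batch_size": 64},
         "throughput": {"concurrent_total_tps": 182.43}},
    ]
}
best = best_batch_sizes(sample)
```

Throughput alone does not show failed requests, so check `success_metrics` for the chosen batch size as well.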
### 2. Stress Testing
Test maximum burst capacity:
```yaml
batch_sizes: [256]
num_batches: 5
```
Sends 256 simultaneous requests per batch.
### 3. Performance Profiling
Test different input sizes at various batch sizes:
```yaml
input_tokens: [1000, 2500, 5000, 10000]
batch_sizes: [16, 32, 64, 128]
```
Comprehensive performance matrix across configurations.
## Advanced Usage
### Enable I/O Logging
Log all input prompts and outputs for debugging:
```bash
python benchmark_llm.py --config benchmark_config.yaml
# Set log_io: true in config
```
Or:
```bash
python benchmark_llm.py --log_io ...
```
Results saved to `benchmark_io.log`.
### Skip Model Initialization
If model is already warm:
```bash
python benchmark_llm.py --config benchmark_config.yaml --skip_init_wait
```
### Custom Timeout
For large batches or slow responses:
```bash
python benchmark_llm.py --request_timeout 600 ...
```


@ -0,0 +1,69 @@
endpoint:
  # internal litellm ubiops
  #url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1
  #api_key:
  #model_name: openai-gpt-oss-120b-max-16
  #url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1
  #api_key:
  #model_name: openai-gpt-oss-120b
  url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1
  api_key:
  model_name: openai-gpt-oss-120b-2x
  #url: https://b60dd657-9ce2-4ba0-ad45-754b5be29238.services.external.0a71m37v.ubiops.io/v1
  #api_key:
  #model_name: openai/gpt-oss-120b
  # staging litellm
  #url: https://f1dfa3fc-3314-4d49-be06-98bfd3d1f5fd.services.staging.ubiops.dev/v1
  #api_key:
  #model_name: llama-1b
  # staging vllm
  #url: https://dde9ea35-6a02-4242-a3f3-5a7e7e29e7a7.services.staging.ubiops.dev/v1
  #api_key:
  #model_name: meta-llama/Llama-3.2-1B-Instruct
benchmark:
  # Input token counts to test
  input_tokens: [50000]
  # Batch sizes to test (number of simultaneous requests per batch)
  # Each batch sends N requests at the exact same time
  batch_sizes: [64]
  num_batches: 1
  # Maximum output tokens per request
  output_tokens: 1024
  # Optional: Path to conversation dataset JSON file
  # Generate with: python create_test_dataset.py
  # If not provided, uses synthetic prompts
  dataset: test_conversations.json  # or "test_conversations.json"
  # Optional: Custom text to use as input for all requests
  # Uses the same text for every request (ignores input_tokens)
  # Priority: text > dataset > generated prompts
  # Example: "Analyze this document about machine learning..."
  text: null
runtime:
  # Timeout for each request (seconds)
  request_timeout: 1800
  # Delay between benchmark runs (seconds)
  delay_between_runs: 5
  # Enable detailed I/O logging (input prompts + outputs)
  log_io: true
  # Wait for model initialization before starting
  wait_for_ready: true
  # Maximum initialization check attempts
  max_init_retries: 10
  # Delay between initialization checks (seconds)
  init_retry_delay: 30

File diff suppressed because it is too large.


@ -0,0 +1,338 @@
#!/usr/bin/env python3
"""
Create bucketed test dataset for LLM benchmarking.
Uses multiple strategies to fill all token buckets:
1. Natural conversations from UltraChat dataset
2. Concatenation of shorter conversations for larger buckets
Buckets aligned with benchmark input_tokens: 100, 500, 1k, 2k, 5k, 10k
Outputs 128 unique conversations per bucket for comprehensive testing.
Usage:
python create_test_dataset.py
python create_test_dataset.py --output test_conversations.json
python create_test_dataset.py --buckets 1000 5000 10000 --chains_per_bucket 64
"""
import argparse
import json
import random
from collections import defaultdict
from pathlib import Path
import tiktoken
from datasets import load_dataset
# Default buckets aligned with typical benchmark configurations
DEFAULT_BUCKETS = [100, 500, 1_000, 2_000, 5_000, 10_000]
CHAINS_PER_BUCKET = 128
DATASET_NAME = "HuggingFaceH4/ultrachat_200k"
ENCODING_NAME = "cl100k_base"


def count_tokens(messages: list[dict], encoding: tiktoken.Encoding) -> int:
    """Count total tokens in a conversation chain."""
    total = 0
    for msg in messages:
        content = msg.get("content", "") or ""
        role = msg.get("role", "") or ""
        total += len(encoding.encode(content, disallowed_special=()))
        total += len(encoding.encode(role, disallowed_special=()))
        total += 4  # Message formatting overhead
    total += 2  # Conversation formatting overhead
    return total


def get_bucket(token_count: int, buckets: list[int]) -> int | None:
    """Find the appropriate bucket for a token count (within 20% of target)."""
    for bucket in buckets:
        if bucket * 0.8 <= token_count <= bucket * 1.2:
            return bucket
    return None


def format_ultrachat_messages(messages: list[dict]) -> list[dict]:
    """Format UltraChat conversations to OpenAI chat format."""
    formatted = []
    for msg in messages:
        role = msg.get("role", "user")
        if role not in ["user", "assistant", "system"]:
            role = "user"
        content = msg.get("content", "") or ""
        if content:
            formatted.append({"role": role, "content": content})
    return formatted


def concatenate_conversations(
    conversations: list[list[dict]],
    target_tokens: int,
    encoding: tiktoken.Encoding,
    tolerance: float = 0.2
) -> list[dict] | None:
    """Concatenate multiple conversations to reach target token count."""
    result = []
    current_tokens = 0
    target_min = target_tokens * (1 - tolerance)
    target_max = target_tokens * (1 + tolerance)
    # Note: shuffles the caller's list in place, so pass a copy
    random.shuffle(conversations)
    for conv in conversations:
        conv_tokens = count_tokens(conv, encoding)
        # Skip if this would exceed target
        if current_tokens + conv_tokens > target_max:
            continue
        # Add separator between conversations
        if result and conv:
            separator = {"role": "user", "content": "---\nNew conversation:\n---"}
            result.append(separator)
            current_tokens += 10  # Approximate tokens for separator
        result.extend(conv)
        current_tokens += conv_tokens
        # Check if we've reached target
        if current_tokens >= target_min:
            break
    # Reject results that still fall short of the tolerance window
    # (the original compared against target_min * 0.8, which accepted
    # chains as small as 64% of the target)
    if current_tokens < target_min:
        return None
    return result


def main():
    parser = argparse.ArgumentParser(
        description="Create bucketed test dataset for LLM benchmarking",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Default configuration (128 conversations per bucket)
  python create_test_dataset.py

  # Custom buckets
  python create_test_dataset.py --buckets 1000 5000 10000

  # Fewer conversations per bucket
  python create_test_dataset.py --chains_per_bucket 64

  # Custom output location
  python create_test_dataset.py --output data/conversations.json
"""
    )
    parser.add_argument(
        "--output",
        type=str,
        default="test_conversations.json",
        help="Output file path (default: test_conversations.json)"
    )
    parser.add_argument(
        "--buckets",
        type=int,
        nargs='+',
        default=DEFAULT_BUCKETS,
        help="Token count buckets (default: 100 500 1000 2000 5000 10000)"
    )
    parser.add_argument(
        "--chains_per_bucket",
        type=int,
        default=CHAINS_PER_BUCKET,
        help=f"Number of conversations per bucket (default: {CHAINS_PER_BUCKET})"
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed for reproducibility (default: 42)"
    )
    parser.add_argument(
        "--dataset",
        type=str,
        default=DATASET_NAME,
        help=f"HuggingFace dataset name (default: {DATASET_NAME})"
    )
    args = parser.parse_args()
    random.seed(args.seed)
    buckets = sorted(args.buckets)

    print("="*60)
    print("LLM Benchmark Dataset Generator")
    print("="*60)
    print(f"Output: {args.output}")
    print(f"Buckets: {buckets}")
    print(f"Conversations per bucket: {args.chains_per_bucket}")
    print(f"Random seed: {args.seed}")
    print("="*60)

    print(f"\nLoading dataset: {args.dataset}")
    try:
        dataset = load_dataset(args.dataset, split="train_sft")
    except Exception as e:
        print(f"Error loading dataset: {e}")
        print("Make sure you have internet connection and the 'datasets' package installed:")
        print(" pip install datasets")
        return

    print(f"Initializing tokenizer: {ENCODING_NAME}")
    try:
        encoding = tiktoken.get_encoding(ENCODING_NAME)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        print("Make sure you have 'tiktoken' installed:")
        print(" pip install tiktoken")
        return

    bucketed_chains: dict[int, list[dict]] = defaultdict(list)
    all_conversations: list[list[dict]] = []

    print(f"\nProcessing {len(dataset)} conversation chains...")
    for idx, row in enumerate(dataset):
        messages = row.get("messages", [])
        if not messages:
            continue
        formatted = format_ultrachat_messages(messages)
        if not formatted:
            continue
        token_count = count_tokens(formatted, encoding)
        bucket = get_bucket(token_count, buckets)
        all_conversations.append(formatted)
        if bucket is not None:
            bucketed_chains[bucket].append(
                {
                    "messages": formatted,
                    "token_count": token_count,
                    "bucket": bucket,
                    "original_index": idx,
                    "synthetic": False,
                }
            )
        if (idx + 1) % 50000 == 0:
            print(f" Processed {idx + 1:,} chains...")

    print(f"\nTotal conversations collected: {len(all_conversations):,}")
    print("\nNatural bucket distribution:")
    print("-" * 60)
    for bucket in buckets:
        count = len(bucketed_chains[bucket])
        status = "!" if count >= args.chains_per_bucket else f" need {args.chains_per_bucket - count} more"
        print(f" {bucket:>6,} tokens: {count:>5,} chains {status}")

    # Generate synthetic conversations for sparse buckets
    print("\nGenerating synthetic chains for sparse buckets...")
    large_buckets = [b for b in buckets if len(bucketed_chains[b]) < args.chains_per_bucket]
    for bucket in large_buckets:
        needed = args.chains_per_bucket - len(bucketed_chains[bucket])
        if needed <= 0:
            continue
        print(f" Creating {needed} synthetic chains for {bucket:,} token bucket...")
        attempts = 0
        max_attempts = needed * 20
        created = 0
        while len(bucketed_chains[bucket]) < args.chains_per_bucket and attempts < max_attempts:
            attempts += 1
            synthetic = concatenate_conversations(
                [c.copy() for c in all_conversations],
                bucket,
                encoding
            )
            if synthetic:
                token_count = count_tokens(synthetic, encoding)
                bucketed_chains[bucket].append(
                    {
                        "messages": synthetic,
                        "token_count": token_count,
                        "bucket": bucket,
                        "original_index": -1,
                        "synthetic": True,
                    }
                )
                created += 1
        if created < needed:
            print(f" Only created {created}/{needed} synthetic chains")

    print("\nFinal bucket distribution:")
    print("-" * 60)
    final_dataset = {}
    total_natural = 0
    total_synthetic = 0
    for bucket in buckets:
        chains = bucketed_chains[bucket]
        count = len(chains)
        if count >= args.chains_per_bucket:
            selected = random.sample(chains, args.chains_per_bucket)
        else:
            print(f" {bucket:>6,} tokens: {count:>5,} chains insufficient (target: {args.chains_per_bucket})")
            selected = chains  # Use what we have
        natural = sum(1 for c in selected if not c.get("synthetic", False))
        synthetic = len(selected) - natural
        total_natural += natural
        total_synthetic += synthetic
        print(f" {bucket:>6,} tokens: {len(selected):>3} chains ({natural} natural, {synthetic} synthetic)")
        final_dataset[str(bucket)] = selected

    # Save dataset
    output_path = Path(args.output)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(final_dataset, f, indent=2, ensure_ascii=False)
    print("-" * 60)
    print(f"\n Dataset saved to: {output_path}")

    total_chains = sum(len(chains) for chains in final_dataset.values())
    print(f"\nTotal chains: {total_chains:,}")
    print(f"Natural conversations: {total_natural:,}")
    print(f"Synthetic conversations: {total_synthetic:,}")

    print("\nBucket summary:")
    for bucket in buckets:
        chains = final_dataset.get(str(bucket), [])
        if chains:
            avg_tokens = sum(c["token_count"] for c in chains) / len(chains)
            min_tokens = min(c["token_count"] for c in chains)
            max_tokens = max(c["token_count"] for c in chains)
            print(f" {bucket:>6,} tokens: {len(chains):>3} chains, "
                  f"avg={avg_tokens:>6,.0f}, min={min_tokens:>6,}, max={max_tokens:>6,}")

    print("\n" + "="*60)
    print("To use this dataset with benchmark:")
    print("="*60)
    print(f" python benchmark_llm.py --dataset {args.output} ...")
    print("="*60)


if __name__ == "__main__":
    main()


@ -0,0 +1,9 @@
openai>=1.0.0
httpx>=0.24.0
pyyaml>=6.0
matplotlib>=3.7.0
seaborn>=0.12.0
numpy>=1.24.0
tiktoken>=0.5.0
datasets>=2.14.0
httpx[http2]


@ -0,0 +1,630 @@
{
"timestamp": "2026-03-11T11:10:08.245541",
"model_name": "QuantTrio/Qwen3.5-35B-A3B-AWQ",
"results": [
{
"config": {
"input_tokens": 1000,
"output_tokens": 512,
"batch_size": 1,
"num_batches": 2,
"total_requests": 2,
"actual_input_tokens": 1140
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 2,
"failed_requests": 0
},
"latency": {
"mean": 9.155,
"std": 5.968,
"min": 3.187,
"max": 15.123,
"p50": 9.155,
"p95": 14.526,
"p99": 15.003,
"ci_95_lower": 0.884,
"ci_95_upper": 17.426
},
"ttft": {
"mean": 9.155,
"std": 5.968,
"p50": 9.155,
"p90": 13.929
},
"tokens": {
"total_generated": 1024,
"content_tokens": 1024,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 55.62,
"concurrent_content_tps": 55.62,
"requests_per_second": 0.11,
"actual_wall_time": 18.412,
"efficiency_percent": 57.18
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 1.0,
"avg_batch_throughput": 97.26,
"min_batch_throughput": 33.86,
"max_batch_throughput": 160.67
}
},
{
"config": {
"input_tokens": 1000,
"output_tokens": 512,
"batch_size": 8,
"num_batches": 2,
"total_requests": 16,
"actual_input_tokens": 1003
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 16,
"failed_requests": 0
},
"latency": {
"mean": 8.081,
"std": 2.287,
"min": 5.772,
"max": 10.373,
"p50": 8.085,
"p95": 10.372,
"p99": 10.373,
"ci_95_lower": 6.961,
"ci_95_upper": 9.202
},
"ttft": {
"mean": 8.081,
"std": 2.287,
"p50": 8.085,
"p90": 10.37
},
"tokens": {
"total_generated": 8192,
"content_tokens": 8192,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 503.04,
"concurrent_content_tps": 503.04,
"requests_per_second": 0.98,
"actual_wall_time": 16.285,
"efficiency_percent": 91.31
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 8.0,
"avg_batch_throughput": 549.93,
"min_batch_throughput": 394.83,
"max_batch_throughput": 705.03
}
},
{
"config": {
"input_tokens": 1000,
"output_tokens": 512,
"batch_size": 32,
"num_batches": 2,
"total_requests": 64,
"actual_input_tokens": 1028
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 64,
"failed_requests": 0
},
"latency": {
"mean": 8.686,
"std": 0.017,
"min": 8.636,
"max": 8.732,
"p50": 8.688,
"p95": 8.71,
"p99": 8.721,
"ci_95_lower": 8.682,
"ci_95_upper": 8.691
},
"ttft": {
"mean": 8.595,
"std": 0.727,
"p50": 8.687,
"p90": 8.707
},
"tokens": {
"total_generated": 32768,
"content_tokens": 32768,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 1865.45,
"concurrent_content_tps": 1865.45,
"requests_per_second": 3.64,
"actual_wall_time": 17.566,
"efficiency_percent": 98.9
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 32.0,
"avg_batch_throughput": 1876.54,
"min_batch_throughput": 1870.97,
"max_batch_throughput": 1882.11
}
},
{
"config": {
"input_tokens": 1000,
"output_tokens": 512,
"batch_size": 64,
"num_batches": 2,
"total_requests": 128,
"actual_input_tokens": 1028
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 128,
"failed_requests": 0
},
"latency": {
"mean": 12.207,
"std": 0.04,
"min": 12.108,
"max": 12.283,
"p50": 12.211,
"p95": 12.263,
"p99": 12.273,
"ci_95_lower": 12.2,
"ci_95_upper": 12.214
},
"ttft": {
"mean": 12.044,
"std": 1.066,
"p50": 12.205,
"p90": 12.257
},
"tokens": {
"total_generated": 65536,
"content_tokens": 65536,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 2654.48,
"concurrent_content_tps": 2654.48,
"requests_per_second": 5.18,
"actual_wall_time": 24.689,
"efficiency_percent": 98.89
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 64.0,
"avg_batch_throughput": 2665.65,
"min_batch_throughput": 2658.45,
"max_batch_throughput": 2672.85
}
},
{
"config": {
"input_tokens": 10000,
"output_tokens": 512,
"batch_size": 1,
"num_batches": 2,
"total_requests": 2,
"actual_input_tokens": 8871
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 2,
"failed_requests": 0
},
"latency": {
"mean": 3.533,
"std": 0.026,
"min": 3.507,
"max": 3.559,
"p50": 3.533,
"p95": 3.557,
"p99": 3.559,
"ci_95_lower": 3.497,
"ci_95_upper": 3.569
},
"ttft": {
"mean": 3.533,
"std": 0.026,
"p50": 3.533,
"p90": 3.554
},
"tokens": {
"total_generated": 1024,
"content_tokens": 1024,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 142.85,
"concurrent_content_tps": 142.85,
"requests_per_second": 0.28,
"actual_wall_time": 7.168,
"efficiency_percent": 98.57
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 1.0,
"avg_batch_throughput": 144.92,
"min_batch_throughput": 143.85,
"max_batch_throughput": 145.99
}
},
{
"config": {
"input_tokens": 10000,
"output_tokens": 512,
"batch_size": 8,
"num_batches": 2,
"total_requests": 16,
"actual_input_tokens": 8895
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 16,
"failed_requests": 0
},
"latency": {
"mean": 7.325,
"std": 0.144,
"min": 7.142,
"max": 7.493,
"p50": 7.333,
"p95": 7.489,
"p99": 7.492,
"ci_95_lower": 7.254,
"ci_95_upper": 7.395
},
"ttft": {
"mean": 7.325,
"std": 0.144,
"p50": 7.333,
"p90": 7.487
},
"tokens": {
"total_generated": 8192,
"content_tokens": 8192,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 550.76,
"concurrent_content_tps": 550.76,
"requests_per_second": 1.08,
"actual_wall_time": 14.874,
"efficiency_percent": 98.45
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 8.0,
"avg_batch_throughput": 554.82,
"min_batch_throughput": 543.43,
"max_batch_throughput": 566.21
}
},
{
"config": {
"input_tokens": 10000,
"output_tokens": 512,
"batch_size": 32,
"num_batches": 2,
"total_requests": 64,
"actual_input_tokens": 8842
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 64,
"failed_requests": 0
},
"latency": {
"mean": 16.085,
"std": 2.082,
"min": 13.822,
"max": 18.383,
"p50": 16.109,
"p95": 18.273,
"p99": 18.329,
"ci_95_lower": 15.575,
"ci_95_upper": 16.595
},
"ttft": {
"mean": 15.996,
"std": 2.114,
"p50": 14.22,
"p90": 18.248
},
"tokens": {
"total_generated": 32768,
"content_tokens": 32768,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 995.46,
"concurrent_content_tps": 995.46,
"requests_per_second": 1.94,
"actual_wall_time": 32.917,
"efficiency_percent": 96.09
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 32.0,
"avg_batch_throughput": 1015.38,
"min_batch_throughput": 885.0,
"max_batch_throughput": 1145.76
}
},
{
"config": {
"input_tokens": 10000,
"output_tokens": 512,
"batch_size": 64,
"num_batches": 2,
"total_requests": 128,
"actual_input_tokens": 8842
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 128,
"failed_requests": 0
},
"latency": {
"mean": 14.781,
"std": 0.143,
"min": 14.277,
"max": 15.099,
"p50": 14.781,
"p95": 15.032,
"p99": 15.096,
"ci_95_lower": 14.756,
"ci_95_upper": 14.806
},
"ttft": {
"mean": 14.781,
"std": 0.143,
"p50": 14.781,
"p90": 14.972
},
"tokens": {
"total_generated": 65536,
"content_tokens": 65536,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 2166.53,
"concurrent_content_tps": 2166.53,
"requests_per_second": 4.23,
"actual_wall_time": 30.249,
"efficiency_percent": 97.72
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 64.0,
"avg_batch_throughput": 2174.01,
"min_batch_throughput": 2164.24,
"max_batch_throughput": 2183.78
}
},
{
"config": {
"input_tokens": 50000,
"output_tokens": 512,
"batch_size": 1,
"num_batches": 2,
"total_requests": 2,
"actual_input_tokens": 42229
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 2,
"failed_requests": 0
},
"latency": {
"mean": 6.101,
"std": 0.019,
"min": 6.082,
"max": 6.12,
"p50": 6.101,
"p95": 6.118,
"p99": 6.12,
"ci_95_lower": 6.074,
"ci_95_upper": 6.128
},
"ttft": {
"mean": 6.101,
"std": 0.019,
"p50": 6.101,
"p90": 6.117
},
"tokens": {
"total_generated": 1024,
"content_tokens": 1024,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 83.22,
"concurrent_content_tps": 83.22,
"requests_per_second": 0.16,
"actual_wall_time": 12.305,
"efficiency_percent": 99.16
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 1.0,
"avg_batch_throughput": 83.92,
"min_batch_throughput": 83.66,
"max_batch_throughput": 84.19
}
},
{
"config": {
"input_tokens": 50000,
"output_tokens": 512,
"batch_size": 8,
"num_batches": 2,
"total_requests": 16,
"actual_input_tokens": 42048
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 16,
"failed_requests": 0
},
"latency": {
"mean": 22.685,
"std": 2.474,
"min": 20.003,
"max": 25.463,
"p50": 22.588,
"p95": 25.387,
"p99": 25.448,
"ci_95_lower": 21.473,
"ci_95_upper": 23.897
},
"ttft": {
"mean": 22.685,
"std": 2.474,
"p50": 22.588,
"p90": 25.295
},
"tokens": {
"total_generated": 8192,
"content_tokens": 8192,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 177.76,
"concurrent_content_tps": 177.76,
"requests_per_second": 0.35,
"actual_wall_time": 46.085,
"efficiency_percent": 97.28
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 8.0,
"avg_batch_throughput": 180.32,
"min_batch_throughput": 160.6,
"max_batch_throughput": 200.04
}
},
{
"config": {
"input_tokens": 50000,
"output_tokens": 512,
"batch_size": 32,
"num_batches": 2,
"total_requests": 64,
"actual_input_tokens": 41752
},
"success_metrics": {
"success_rate": 100.0,
"successful_requests": 64,
"failed_requests": 0
},
"latency": {
"mean": 70.626,
"std": 18.722,
"min": 48.439,
"max": 90.756,
"p50": 70.358,
"p95": 90.447,
"p99": 90.677,
"ci_95_lower": 66.039,
"ci_95_upper": 75.213
},
"ttft": {
"mean": 70.626,
"std": 18.722,
"p50": 70.358,
"p90": 90.064
},
"tokens": {
"total_generated": 32768,
"content_tokens": 32768,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 225.4,
"concurrent_content_tps": 225.4,
"requests_per_second": 0.44,
"actual_wall_time": 145.377,
"efficiency_percent": 90.31
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 32.0,
"avg_batch_throughput": 241.37,
"min_batch_throughput": 179.6,
"max_batch_throughput": 303.14
}
},
{
"config": {
"input_tokens": 50000,
"output_tokens": 512,
"batch_size": 64,
"num_batches": 2,
"total_requests": 128,
"actual_input_tokens": 41810
},
"success_metrics": {
"success_rate": 63.28,
"successful_requests": 81,
"failed_requests": 47
},
"latency": {
"mean": 111.228,
"std": 2.973,
"min": 106.149,
"max": 115.385,
"p50": 112.37,
"p95": 114.998,
"p99": 115.289,
"ci_95_lower": 110.581,
"ci_95_upper": 111.876
},
"ttft": {
"mean": 111.228,
"std": 2.973,
"p50": 112.37,
"p90": 114.818
},
"tokens": {
"total_generated": 41472,
"content_tokens": 41472,
"reasoning_tokens": 0,
"avg_per_request": 512.0
},
"throughput": {
"concurrent_total_tps": 182.43,
"concurrent_content_tps": 182.43,
"requests_per_second": 0.36,
"actual_wall_time": 227.333,
"efficiency_percent": 61.88
},
"batch_metrics": {
"num_batches": 2,
"avg_batch_size": 40.5,
"avg_batch_throughput": 181.97,
"min_batch_throughput": 162.11,
"max_batch_throughput": 201.84
}
}
]
}


@ -0,0 +1,25 @@
endpoint:
  url: https://0e799c11-4b01-4acd-a91c-5e43deaae940.services.external.0a71m37v.ubiops.io/v1
  api_key: <REDACTED>
  model_name: QuantTrio/Qwen3.5-35B-A3B-AWQ
benchmark:
  input_tokens:
    - 1000
    - 10000
    - 50000
  batch_sizes:
    - 1
    - 8
    - 32
    - 64
  num_batches: 2
  output_tokens: 512
  dataset: test_conversations.json
  text: null
runtime:
  request_timeout: 300
  delay_between_runs: 5
  log_io: true
  wait_for_ready: true
  max_init_retries: 10
  init_retry_delay: 30