diff --git a/grafana/vllm-metrics/image.png b/grafana/vllm-metrics/image.png new file mode 100644 index 0000000..f139d99 Binary files /dev/null and b/grafana/vllm-metrics/image.png differ diff --git a/llm-throughput-tests-mindef-metadateren/README.md b/llm-throughput-tests-mindef-metadateren/README.md new file mode 100644 index 0000000..d20bcf2 --- /dev/null +++ b/llm-throughput-tests-mindef-metadateren/README.md @@ -0,0 +1,326 @@ +# LLM Benchmarking Tool + +The following benchmarks were used to (1) measure the throughput of the configured models on the available hardware (NVIDIA RTX 6000 PRO GPUs) and (2) debug connection issues that arose during the configuration of the pipelines. + +Benchmarks were created for Qwen 3.5 and GPT OSS. GPT OSS was mainly used during the project (because of its throughput and +output quality). + + + +------------------ +# How to Benchmark + +Benchmark LLM deployments using **batch request patterns**: the tool sends N requests simultaneously to measure concurrent throughput. + + +## Installation + +```bash +pip install -r requirements.txt +``` + +## Dataset Generation (Optional) + +You have **3 input options**: + +### 1. Generated Prompts (Default) +Automatically generates synthetic text to match token counts. + +### 2. Real Conversations +Use conversations from Hugging Face datasets: + +```bash +# Generate conversation dataset (takes ~5 minutes) +python create_test_dataset.py + +# Custom buckets +python create_test_dataset.py --buckets 1000 5000 10000 --chains_per_bucket 64 + +# Output to custom location +python create_test_dataset.py --output data/conversations.json +``` + +This creates a JSON file with real conversations bucketed by token count. The benchmark will cycle through these conversations instead of repeating the same synthetic prompt. + +### 3. Custom Text +Provide your own text directly: + +```bash +# Via CLI +python benchmark_llm.py --text "Your custom text here..." + +# Or in config file +text: "Analyze this large document about..."
+``` + +## Quick Start + +### 1. Create Configuration File + +```yaml +endpoint: + url: https://b5cee612-b599-4524-a893-7698c9e75948.services.ubiops.development.vlam.ai + api_key: your-api-key + model_name: your-model + +benchmark: + input_tokens: [1000, 5000, 10000] + batch_sizes: [16, 32, 64, 128] + num_batches: 10 + output_tokens: 256 + dataset: test_conversations.json # Optional: real conversations + text: null # Optional: custom text input + +runtime: + request_timeout: 300 + delay_between_runs: 5 + log_io: false + wait_for_ready: true +``` + +### 2. Run Benchmark + +```bash +python benchmark_llm.py --config benchmark_config.yaml +``` + +### 3. Generate Visualizations + +```bash +python visualize_results.py --input results/results_your-model/benchmark_results.json +``` + +## Usage + +### Configuration File + +```bash +python benchmark_llm.py --config benchmark_config.yaml +``` + +### CLI Arguments + +```bash +# With dataset +python benchmark_llm.py \ + --endpoint_url https://api.example.com/v1 \ + --api_key YOUR_KEY \ + --model_name gpt-4 \ + --input_tokens 1000 5000 10000 \ + --batch_sizes 16 32 64 128 \ + --num_batches 10 \ + --output_tokens 256 \ + --dataset test_conversations.json + +# With custom text +python benchmark_llm.py \ + --endpoint_url https://api.example.com/v1 \ + --api_key YOUR_KEY \ + --model_name gpt-4 \ + --batch_sizes 32 \ + --num_batches 10 \ + --text "Analyze the following document about cloud architecture..." 
+``` + +## How It Works + +### Batch Execution + +The tool sends batches of N requests **simultaneously**: + +``` +Batch 0: [Req 1, Req 2, ..., Req 32] ← All start at exact same time + [Wait for all to complete] + +Batch 1: [Req 33, Req 34, ..., Req 64] ← All start at exact same time + [Wait for all to complete] +``` + +This ensures: +- All requests in a batch have **identical** `time_created` timestamps +- Concurrent load testing +- Accurate burst performance measurement + +### Request Calculation + +``` +total_requests = num_batches × batch_size +``` + +**Example:** +```yaml +batch_sizes: [32] +num_batches: 10 + +# Result: 10 batches × 32 requests = 320 total requests +# Each batch sends 32 requests simultaneously +``` + +## Key Metrics + +### Throughput +- **Tokens/second** across all requests in a batch +- Measures system's ability to handle concurrent load +- Higher is better + +### Time to First Token (TTFT) +- Latency until first content token appears +- Critical for user experience +- Lower is better + +### Latency Percentiles +- **P50 (median)**: Typical request latency +- **P95**: 95% of requests complete faster +- **P99**: 99% of requests complete faster + +### Batch Metrics +```json +{ + "batch_metrics": { + "num_batches": 10, + "avg_batch_throughput": 2456.78, + "min_batch_throughput": 2301.45, + "max_batch_throughput": 2589.12 + } +} +``` + +## Output Structure + +``` +results/ +└── results_your-model/ + ├── benchmark_results.json # Raw benchmark data + ├── benchmark_io.log # I/O logs (if enabled) + ├── config_used.yaml # Config copy (API key redacted) + ├── throughput.png # Throughput vs batch size + ├── ttft.png # TTFT vs batch size + └── latency_percentiles.png # Latency distribution +``` + +## Configuration Reference + +### Endpoint Configuration + +```yaml +endpoint: + url: string # OpenAI-compatible endpoint URL + api_key: string # API authentication key + model_name: string # Model identifier +``` + +### Benchmark Configuration + +```yaml 
+benchmark: + input_tokens: list[int] # Token counts to test [1000, 5000, 10000] + batch_sizes: list[int] # Batch sizes to test [16, 32, 64, 128] + num_batches: int # Number of batches per config (default: 10) + output_tokens: int # Max output tokens (default: 256) +``` + +**Understanding batch_sizes:** +- `batch_sizes: [16]` → Sends 16 requests simultaneously +- `batch_sizes: [32]` → Sends 32 requests simultaneously +- `batch_sizes: [16, 32, 64]` → Tests 3 different batch sizes + +### Runtime Configuration + +```yaml +runtime: + request_timeout: int # Timeout per request in seconds (default: 300) + delay_between_runs: int # Delay between configs in seconds (default: 5) + log_io: bool # Enable I/O logging (default: false) + wait_for_ready: bool # Wait for model init (default: true) + max_init_retries: int # Max init attempts (default: 10) + init_retry_delay: int # Delay between init attempts (default: 30) +``` + +## Example Output + +``` +Starting benchmark: 10 batches × 32 requests/batch = 320 total +Input: 5000 tokens, Output: 256 tokens +============================================================ + +Batch 0: 32/32 successful, 12.34s, 2456.78 tok/s +Batch 1: 32/32 successful, 12.45s, 2401.23 tok/s +Batch 2: 32/32 successful, 12.56s, 2389.45 tok/s +... + +✓ Benchmark complete in 125.67s + Success: 100% (320/320) + P95 Latency: 13.45s + Throughput: 2428.56 tokens/s + Avg Batch Throughput: 2429.01 tokens/s +``` + +## Use Cases + +### 1. Finding Optimal Batch Size + +Test multiple batch sizes to find the sweet spot: + +```yaml +batch_sizes: [16, 32, 64, 128, 256] +num_batches: 10 +``` + +Compare the `throughput.png` to see where throughput peaks. + +### 2. Stress Testing + +Test maximum burst capacity: + +```yaml +batch_sizes: [256] +num_batches: 5 +``` + +Sends 256 simultaneous requests per batch. + +### 3. 
Performance Profiling + +Test different input sizes at various batch sizes: + +```yaml +input_tokens: [1000, 2500, 5000, 10000] +batch_sizes: [16, 32, 64, 128] +``` + +Comprehensive performance matrix across configurations. + +## Advanced Usage + +### Enable I/O Logging + +Log all input prompts and outputs for debugging: + +```bash +python benchmark_llm.py --config benchmark_config.yaml +# Set log_io: true in config +``` + +Or: + +```bash +python benchmark_llm.py --log_io ... +``` + +Results saved to `benchmark_io.log`. + +### Skip Model Initialization + +If model is already warm: + +```bash +python benchmark_llm.py --config benchmark_config.yaml --skip_init_wait +``` + +### Custom Timeout + +For large batches or slow responses: + +```bash +python benchmark_llm.py --request_timeout 600 ... +``` diff --git a/llm-throughput-tests-mindef-metadateren/benchmark_config.yaml b/llm-throughput-tests-mindef-metadateren/benchmark_config.yaml new file mode 100644 index 0000000..26aa238 --- /dev/null +++ b/llm-throughput-tests-mindef-metadateren/benchmark_config.yaml @@ -0,0 +1,69 @@ +endpoint: + # internal litellm ubiops + #url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1 + #api_key: + #model_name: openai-gpt-oss-120b-max-16 + + #url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1 + #api_key: + #model_name: openai-gpt-oss-120b + + url: https://46e73bba-0ed9-4853-b2b0-d4509aaab06b.services.external.0a71m37v.ubiops.io/v1 + api_key: + model_name: openai-gpt-oss-120b-2x + + #url: https://b60dd657-9ce2-4ba0-ad45-754b5be29238.services.external.0a71m37v.ubiops.io/v1 + #api_key: + #model_name: openai/gpt-oss-120b + + + # staging litellm + #url: https://f1dfa3fc-3314-4d49-be06-98bfd3d1f5fd.services.staging.ubiops.dev/v1 + #api_key: + #model_name: llama-1b + + # staging vllm + #url: https://dde9ea35-6a02-4242-a3f3-5a7e7e29e7a7.services.staging.ubiops.dev/v1 + #api_key: + #model_name: 
meta-llama/Llama-3.2-1B-Instruct +benchmark: + # Input token counts to test + input_tokens: [50000] + + # Batch sizes to test (number of simultaneous requests per batch) + # Each batch sends N requests at the exact same time + batch_sizes: [64] + + num_batches: 1 + # Maximum output tokens per request + output_tokens: 1024 + + # Optional: Path to conversation dataset JSON file + # Generate with: python create_test_dataset.py + # If not provided, uses synthetic prompts + dataset: test_conversations.json # or "test_conversations.json" + + # Optional: Custom text to use as input for all requests + # Uses the same text for every request (ignores input_tokens) + # Priority: text > dataset > generated prompts + # Example: "Analyze this document about machine learning..." + text: null + +runtime: + # Timeout for each request (seconds) + request_timeout: 1800 + + # Delay between benchmark runs (seconds) + delay_between_runs: 5 + + # Enable detailed I/O logging (input prompts + outputs) + log_io: true + + # Wait for model initialization before starting + wait_for_ready: true + + # Maximum initialization check attempts + max_init_retries: 10 + + # Delay between initialization checks (seconds) + init_retry_delay: 30 diff --git a/llm-throughput-tests-mindef-metadateren/benchmark_llm.py b/llm-throughput-tests-mindef-metadateren/benchmark_llm.py new file mode 100644 index 0000000..278de96 --- /dev/null +++ b/llm-throughput-tests-mindef-metadateren/benchmark_llm.py @@ -0,0 +1,1193 @@ + +""" +LLM Benchmarking Tool + +Benchmarks LLM deployments via OpenAI-compatible endpoints with async concurrency. +Supports YAML config files and CLI arguments.
+""" + +import asyncio +import time +import argparse +import json +import yaml +import logging +import httpx +from pathlib import Path +from datetime import datetime +from dataclasses import dataclass, asdict +from typing import List, Dict, Optional, Tuple +from openai import AsyncOpenAI, RateLimitError + +# ============================================================================ +# CONSTANTS +# ============================================================================ + +TOKEN_TO_CHAR_RATIO = 7 +CONFIDENCE_95_Z_SCORE = 1.96 +MIN_HTTP_CONNECTIONS = 200 +CONNECTION_MULTIPLIER = 2 +TOKEN_VALIDATION_THRESHOLD = 100 +SHUTDOWN_SENTINEL = object() + +# ============================================================================ +# DATA MODELS +# ============================================================================ + +@dataclass +class BenchmarkConfig: + """Configuration for a single benchmark run.""" + input_tokens: int + batch_size: int + num_batches: int + output_tokens: int + + +@dataclass +class RequestResult: + """Results from a single request.""" + total_tokens: int + content_tokens: int + reasoning_tokens: int + elapsed_time: float + time_to_first_token: float + prompt_tokens: int # Actual input tokens from API response + start_time: float + end_time: float + success: bool = True + error_message: Optional[str] = None + batch_id: Optional[int] = None # Batch identifier if using batching + requests_in_batch: Optional[int] = None # Number of requests in this batch + + @property + def tokens_per_second(self) -> float: + """Calculate tokens per second for this request.""" + return self.total_tokens / self.elapsed_time if self.elapsed_time > 0 else 0 + + @property + def content_tokens_per_second(self) -> float: + return self.content_tokens / self.elapsed_time if self.elapsed_time > 0 else 0 + + +# ============================================================================ +# LOGGING SETUP +# 
============================================================================ + +def setup_logging(results_dir: Path, log_io: bool = False) -> Optional[logging.Logger]: + """Setup logging with optional I/O logging.""" + # Configure root logger for console output + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + handlers=[ + logging.StreamHandler() # Ensure console output + ] + ) + # logging.getLogger("httpx").setLevel(logging.DEBUG) + # logging.getLogger("httpcore").setLevel(logging.DEBUG) + + if not log_io: + return None + + # Setup separate I/O logger for detailed request/response logging + io_logger = logging.getLogger('io_logger') + io_logger.handlers.clear() + + io_log_path = results_dir / 'benchmark_io.log' + io_handler = logging.FileHandler(io_log_path) + io_handler.setFormatter(logging.Formatter('%(asctime)s - %(message)s')) + io_logger.addHandler(io_handler) + io_logger.setLevel(logging.INFO) + io_logger.propagate = False + + logging.info(f"I/O logging enabled: {io_log_path}") + + return io_logger + + +# ============================================================================ +# PROMPT GENERATION +# ============================================================================ + +def generate_prompt(target_tokens: int) -> str: + """ + Generate a prompt with approximately the target number of tokens. + + Parameters: + target_tokens: Target number of tokens for the generated prompt + + Returns: + Generated prompt string + """ + chars_needed = target_tokens * TOKEN_TO_CHAR_RATIO + + base_text = ( + "You are an AI assistant analyzing complex systems. Consider the following " + "context about modern computing, AI architectures, distributed systems, " + "cloud infrastructure, ML model deployment, data pipelines, scalability, " + "microservices, containerization, Kubernetes, monitoring, observability, " + "performance optimization, security, cost optimization, and emerging AI/ML trends. 
" + ) + + base_len = len(base_text) + full_reps = chars_needed // base_len + remainder = chars_needed % base_len + + expanded_text = base_text * full_reps + base_text[:remainder] + question = "\n\nBased on the context above, provide a comprehensive analysis." + + return expanded_text + question + + +def load_conversation_dataset(dataset_path: str) -> dict: + """Load pre-generated conversation dataset from JSON file.""" + try: + with open(dataset_path, 'r', encoding='utf-8') as f: + dataset = json.load(f) + + logging.info(f"Loaded conversation dataset from {dataset_path}") + + # Log bucket information + for bucket, conversations in dataset.items(): + logging.info(f" Bucket {bucket}: {len(conversations)} conversations") + + return dataset + except FileNotFoundError: + logging.error(f"Dataset file not found: {dataset_path}") + logging.error("Generate dataset first using: python create_test_dataset.py") + raise + except json.JSONDecodeError as e: + logging.error(f"Invalid JSON in dataset file: {e}") + raise + + +def get_conversation_for_tokens( + dataset: dict, + target_tokens: int, + request_index: int +) -> list[dict]: + """ + Get a conversation from the dataset for the given token count. + + Uses request_index to cycle through available conversations. 
+ """ + bucket_key = str(target_tokens) + + if bucket_key not in dataset: + # Find closest bucket + available_buckets = [int(k) for k in dataset.keys()] + closest_bucket = min(available_buckets, key=lambda x: abs(x - target_tokens)) + bucket_key = str(closest_bucket) + logging.debug(f"No exact bucket for {target_tokens} tokens, using {closest_bucket}") + + conversations = dataset[bucket_key] + + if not conversations: + raise ValueError(f"No conversations available for bucket {bucket_key}") + + # Cycle through conversations using request index + conversation_idx = request_index % len(conversations) + conversation = conversations[conversation_idx] + + messages = list(conversation["messages"]) + + # Ensure the conversation ends with a user message so the API can generate a response + if messages and messages[-1].get("role") == "assistant": + messages.append({"role": "user", "content": "Please continue."}) + + return messages + + +# ============================================================================ +# REQUEST PROCESSING +# ============================================================================ + +async def process_stream(stream, log_io: bool, io_logger, request_id: int) -> Tuple: + """Process streaming response and capture metrics.""" + first_token_time = None + output_text = "" + reasoning_text = "" + + completion_tokens = 0 + prompt_tokens = 0 + content_tokens = 0 + reasoning_tokens = 0 + + try: + async for chunk in stream: + if hasattr(chunk, 'choices') and chunk.choices: + delta = chunk.choices[0].delta + + if hasattr(delta, 'reasoning_content') and delta.reasoning_content: + if log_io: + reasoning_text += delta.reasoning_content + + if hasattr(delta, 'content') and delta.content: + if first_token_time is None: + first_token_time = time.time() + output_text += delta.content + + # Get token counts from API + if hasattr(chunk, 'usage') and chunk.usage: + if hasattr(chunk.usage, 'completion_tokens'): + completion_tokens = chunk.usage.completion_tokens + 
if hasattr(chunk.usage, 'prompt_tokens'): + prompt_tokens = chunk.usage.prompt_tokens + + if hasattr(chunk.usage, 'completion_tokens_details') and chunk.usage.completion_tokens_details: + details = chunk.usage.completion_tokens_details + if hasattr(details, 'reasoning_tokens'): + reasoning_tokens = details.reasoning_tokens + if hasattr(details, 'content_tokens'): + content_tokens = details.content_tokens + + except Exception as e: + logging.error(f"Error processing stream: {e}") + raise + + # Fallback calculation + if content_tokens == 0 and reasoning_tokens == 0 and completion_tokens > 0: + content_tokens = completion_tokens + elif reasoning_tokens == 0 and content_tokens > 0 and content_tokens < completion_tokens: + reasoning_tokens = completion_tokens - content_tokens + elif content_tokens == 0 and reasoning_tokens > 0 and completion_tokens > reasoning_tokens: + content_tokens = completion_tokens - reasoning_tokens + + return ( + first_token_time, + completion_tokens, + prompt_tokens, + content_tokens, + reasoning_tokens, + output_text, + reasoning_text + ) + + +async def make_request( + client: AsyncOpenAI, + config: BenchmarkConfig, + model_name: str, + request_timeout: int, + log_io: bool = False, + io_logger: Optional[logging.Logger] = None, + request_id: Optional[int] = None, + dataset: Optional[dict] = None, + text_content: Optional[str] = None, + stream: bool = True +) -> Optional[RequestResult]: + """Make a single request to the model.""" + start_time = time.time() + + # Determine input source priority: text_content > dataset > generated + if text_content is not None: + # Use provided text directly (ignores input_tokens, uses actual text length) + messages = [{'role': 'user', 'content': text_content}] + elif dataset is not None: + try: + messages = get_conversation_for_tokens(dataset, config.input_tokens, request_id) + except (ValueError, KeyError) as e: + logging.warning(f"Failed to get conversation from dataset: {e}. 
Falling back to generated prompt.") + prompt = generate_prompt(config.input_tokens) + messages = [{'role': 'user', 'content': prompt}] + else: + # Generate synthetic prompt + prompt = generate_prompt(config.input_tokens) + messages = [{'role': 'user', 'content': prompt}] + + if log_io and io_logger: + io_logger.info(f"\n{'='*80}") + io_logger.info(f"REQUEST {request_id} - Target: {config.input_tokens} tokens") + io_logger.info(f"Model: {model_name}, Batch size: {config.batch_size}") + if text_content: + io_logger.info(f"Source: Custom text") + elif dataset: + io_logger.info(f"Source: Conversation dataset") + else: + io_logger.info(f"Source: Generated prompt") + io_logger.info(f"{'='*80}") + io_logger.info(f"MESSAGES:\n{json.dumps(messages, indent=2)}") + io_logger.info(f"{'-'*80}") + + try: + # Streaming is controlled by the caller through the `stream` argument + logging.debug(f"Streaming: {stream}") + + if stream: + stream_obj = await client.chat.completions.create( + model=model_name, + messages=messages, + max_tokens=config.output_tokens, + stream=True, + stream_options={"include_usage": True} + ) + + result = await asyncio.wait_for( + process_stream(stream_obj, log_io, io_logger, request_id), + timeout=request_timeout + ) + + ( + first_token_time, + completion_tokens, + prompt_tokens, + content_tokens, + reasoning_tokens, + output_text, + reasoning_text + ) = result + + #logging.warning(f"[STREAM RESPONSE] {output_text}") + + end_time = time.time() + elapsed_time = end_time - start_time + ttft = first_token_time - start_time if first_token_time else elapsed_time + else: + response = await asyncio.wait_for( + client.chat.completions.create( + model=model_name, + messages=messages, + max_tokens=config.output_tokens, + stream=False + ), + timeout=request_timeout + ) + # logging.warning(f"Response object: {response}") + + end_time = time.time() + elapsed_time = end_time - start_time + ttft = elapsed_time # No first-token event in non-streaming mode + + completion_tokens = response.usage.completion_tokens if response.usage else
0 + prompt_tokens = response.usage.prompt_tokens if response.usage else 0 + content_tokens = completion_tokens + reasoning_tokens = 0 + output_text = response.choices[0].message.content or "" if response.choices else "" + reasoning_text = "" + + if hasattr(response.usage, 'completion_tokens_details') and response.usage.completion_tokens_details: + details = response.usage.completion_tokens_details + if hasattr(details, 'reasoning_tokens') and details.reasoning_tokens: + reasoning_tokens = details.reasoning_tokens + content_tokens = completion_tokens - reasoning_tokens + + if log_io and io_logger: + if reasoning_text: + io_logger.info(f"REASONING ({reasoning_tokens} tokens):\n{reasoning_text}") + io_logger.info(f"{'-'*80}") + io_logger.info(f"CONTENT ({content_tokens} tokens):\n{output_text}") + io_logger.info(f"{'-'*80}") + io_logger.info(f"TTFT: {ttft:.3f}s | Latency: {elapsed_time:.3f}s") + io_logger.info(f"Throughput: {completion_tokens/elapsed_time:.2f} tok/s") + io_logger.info(f"Token accuracy: target={config.input_tokens}, actual={prompt_tokens}") + io_logger.info(f"{'='*80}\n") + + # Validate token counts + if content_tokens == 0: + logging.warning("Request completed but got no content tokens") + return RequestResult( + total_tokens=completion_tokens, + content_tokens=content_tokens, + reasoning_tokens=reasoning_tokens, + elapsed_time=elapsed_time, + time_to_first_token=ttft, + prompt_tokens=prompt_tokens, + start_time=start_time, + end_time=end_time, + success=False, + error_message="No content tokens generated" + ) + + token_diff = abs(prompt_tokens - config.input_tokens) + if token_diff > TOKEN_VALIDATION_THRESHOLD: + logging.warning( + f"Token count difference: target={config.input_tokens}, " + f"actual={prompt_tokens}, diff={token_diff}" + ) + + return RequestResult( + total_tokens=completion_tokens, + content_tokens=content_tokens, + reasoning_tokens=reasoning_tokens, + elapsed_time=elapsed_time, + time_to_first_token=ttft, + 
prompt_tokens=prompt_tokens, + start_time=start_time, + end_time=end_time, + success=True + ) + + except asyncio.TimeoutError: + end_time = time.time() + logging.warning(f"Request {request_id} timed out after {request_timeout}s") + if log_io and io_logger: + io_logger.info(f"REQUEST {request_id} - TIMEOUT after {request_timeout}s\n") + return RequestResult( + total_tokens=0, + content_tokens=0, + reasoning_tokens=0, + elapsed_time=request_timeout, + time_to_first_token=request_timeout, + prompt_tokens=config.input_tokens, + start_time=start_time, + end_time=end_time, + success=False, + error_message=f"Timeout after {request_timeout}s" + ) + + except RateLimitError as e: + end_time = time.time() + elapsed = end_time - start_time + error_msg = f"429 Rate Limit: {str(e)}" + logging.warning(f"Request {request_id} got 429 (rate limited) after {elapsed:.3f}s") + if log_io and io_logger: + io_logger.info(f"REQUEST {request_id} - 429 RATE LIMITED after {elapsed:.3f}s\n") + return RequestResult( + total_tokens=0, + content_tokens=0, + reasoning_tokens=0, + elapsed_time=elapsed, + time_to_first_token=0, + prompt_tokens=config.input_tokens, + start_time=start_time, + end_time=end_time, + success=False, + error_message=error_msg + ) + + except Exception as e: + end_time = time.time() + elapsed = end_time - start_time + error_msg = f"{type(e).__name__}: {str(e)}" + logging.error(f"Request {request_id} error after {elapsed:.3f}s: {error_msg}", exc_info=True) + if log_io and io_logger: + io_logger.info(f"REQUEST {request_id} - ERROR after {elapsed:.3f}s: {error_msg}\n") + return RequestResult( + total_tokens=0, + content_tokens=0, + reasoning_tokens=0, + elapsed_time=elapsed, + time_to_first_token=0, + prompt_tokens=config.input_tokens, + start_time=start_time, + end_time=end_time, + success=False, + error_message=error_msg + ) + + +async def make_batch_request( + client: AsyncOpenAI, + config: BenchmarkConfig, + model_name: str, + request_timeout: int, + batch_size: int, + 
batch_id: int, + log_io: bool = False, + io_logger: Optional[logging.Logger] = None, + dataset: Optional[dict] = None, + text_content: Optional[str] = None, + stream: bool = True +) -> List[RequestResult]: + """ + Make a batch of requests simultaneously and return results. + + This sends multiple independent requests at the exact same time + and measures their collective performance. + """ + batch_start_time = time.time() + + if log_io and io_logger: + io_logger.info(f"\n{'#'*80}") + io_logger.info(f"BATCH {batch_id} - Sending {batch_size} requests") + io_logger.info(f"Model: {model_name}, Input: {config.input_tokens} tokens") + io_logger.info(f"{'#'*80}") + + # Create all requests simultaneously + tasks = [] + for i in range(batch_size): + request_id = batch_id * batch_size + i + task = make_request( + client=client, + config=config, + model_name=model_name, + request_timeout=request_timeout, + log_io=log_io, + io_logger=io_logger, + request_id=request_id, + dataset=dataset, + text_content=text_content, + stream=stream + ) + tasks.append(task) + + # Execute all requests simultaneously + results = await asyncio.gather(*tasks) + + batch_end_time = time.time() + batch_elapsed = batch_end_time - batch_start_time + + # Add batch metadata to results + enhanced_results = [] + for result in results: + if result: + result.batch_id = batch_id + result.requests_in_batch = batch_size + enhanced_results.append(result) + + successful = sum(1 for r in enhanced_results if r.success) + total_tokens = sum(r.total_tokens for r in enhanced_results if r.success) + batch_throughput = total_tokens / batch_elapsed if batch_elapsed > 0 else 0 + + if log_io and io_logger: + io_logger.info(f"\n{'#'*80}") + io_logger.info(f"BATCH {batch_id} COMPLETE") + io_logger.info(f" Duration: {batch_elapsed:.3f}s") + io_logger.info(f" Successful: {successful}/{batch_size}") + io_logger.info(f" Batch Throughput: {batch_throughput:.2f} tokens/s") + io_logger.info(f"{'#'*80}\n") + + logging.info( + f"Batch 
{batch_id}: {successful}/{batch_size} successful, " + f"{batch_elapsed:.2f}s, {batch_throughput:.2f} tok/s" + ) + + return enhanced_results + + +# ============================================================================ +# STATISTICS CALCULATION +# ============================================================================ + +def calculate_statistics(results: List[RequestResult], config: BenchmarkConfig) -> Dict: + """Calculate aggregate statistics from benchmark results.""" + import numpy as np + + successful = [r for r in results if r.success] + failed = [r for r in results if not r.success] + + if not successful: + return {'success_rate': 0, 'error': 'No successful requests'} + + success_rate = len(successful) / len(results) * 100 + + # Latency stats + latencies = [r.elapsed_time for r in successful] + avg_lat = float(np.mean(latencies)) + std_lat = float(np.std(latencies)) + margin = CONFIDENCE_95_Z_SCORE * std_lat / np.sqrt(len(successful)) + + # TTFT stats + ttft_values = [r.time_to_first_token for r in successful if r.time_to_first_token] + + # Throughput calculation + actual_start = min(r.start_time for r in successful) + actual_end = max(r.end_time for r in successful) + wall_time = actual_end - actual_start + + total_output_tokens = sum(r.total_tokens for r in successful) + total_content_tokens = sum(r.content_tokens for r in successful) + + concurrent_throughput = total_output_tokens / wall_time if wall_time > 0 else 0 + content_throughput = total_content_tokens / wall_time if wall_time > 0 else 0 + + # Efficiency + per_request_throughputs = [r.tokens_per_second for r in successful] + avg_per_request = float(np.mean(per_request_throughputs)) + theoretical_max = config.batch_size * avg_per_request + efficiency = min((concurrent_throughput / theoretical_max * 100) if theoretical_max > 0 else 0, 100) + + # Batch metrics + batch_metrics = None + batch_ids = [r.batch_id for r in successful if r.batch_id is not None] + if batch_ids: + unique_batches = 
set(batch_ids) + batch_sizes = {} + batch_throughputs = {} + + for batch_id in unique_batches: + batch_results = [r for r in successful if r.batch_id == batch_id] + if batch_results: + batch_start = min(r.start_time for r in batch_results) + batch_end = max(r.end_time for r in batch_results) + batch_time = batch_end - batch_start + batch_tokens = sum(r.total_tokens for r in batch_results) + + batch_sizes[batch_id] = len(batch_results) + batch_throughputs[batch_id] = batch_tokens / batch_time if batch_time > 0 else 0 + + throughput_values = list(batch_throughputs.values()) + batch_metrics = { + 'num_batches': len(unique_batches), + 'avg_batch_size': round(float(np.mean(list(batch_sizes.values()))), 2), + 'avg_batch_throughput': round(float(np.mean(throughput_values)), 2), + 'min_batch_throughput': round(float(np.min(throughput_values)), 2), + 'max_batch_throughput': round(float(np.max(throughput_values)), 2), + } + + stats_dict = { + 'config': { + 'input_tokens': config.input_tokens, + 'output_tokens': config.output_tokens, + 'batch_size': config.batch_size, + 'num_batches': config.num_batches, + 'total_requests': len(results), + 'actual_input_tokens': round(float(np.mean([r.prompt_tokens for r in successful]))) + }, + 'success_metrics': { + 'success_rate': round(success_rate, 2), + 'successful_requests': len(successful), + 'failed_requests': len(failed) + }, + 'latency': { + 'mean': round(avg_lat, 3), + 'std': round(std_lat, 3), + 'min': round(float(np.min(latencies)), 3), + 'max': round(float(np.max(latencies)), 3), + 'p50': round(float(np.percentile(latencies, 50)), 3), + 'p95': round(float(np.percentile(latencies, 95)), 3), + 'p99': round(float(np.percentile(latencies, 99)), 3), + 'ci_95_lower': round(avg_lat - margin, 3), + 'ci_95_upper': round(avg_lat + margin, 3) + }, + 'ttft': { + 'mean': round(float(np.mean(ttft_values)), 3) if ttft_values else None, + 'std': round(float(np.std(ttft_values)), 3) if ttft_values else None, + 'p50': 
round(float(np.percentile(ttft_values, 50)), 3) if ttft_values else None, + 'p90': round(float(np.percentile(ttft_values, 90)), 3) if ttft_values else None + }, + 'tokens': { + 'total_generated': total_output_tokens, + 'content_tokens': total_content_tokens, + 'reasoning_tokens': sum(r.reasoning_tokens for r in successful), + 'avg_per_request': round(total_output_tokens / len(successful), 2) + }, + 'throughput': { + 'concurrent_total_tps': round(concurrent_throughput, 2), + 'concurrent_content_tps': round(content_throughput, 2), + 'requests_per_second': round(len(successful) / wall_time, 2) if wall_time > 0 else 0, + 'actual_wall_time': round(wall_time, 3), + 'efficiency_percent': round(efficiency, 2) + } + } + + # Add batch metrics if available + if batch_metrics: + stats_dict['batch_metrics'] = batch_metrics + + return stats_dict + + +# ============================================================================ +# HTTP CLIENT +# ============================================================================ + +def create_http_client(concurrent_requests: int, request_timeout: int) -> httpx.AsyncClient: + """Create HTTP client configured for high concurrency.""" + max_connections = max(concurrent_requests * CONNECTION_MULTIPLIER, MIN_HTTP_CONNECTIONS) + + logging.info(f"Creating HTTP client with max_connections={max_connections}") + + return httpx.AsyncClient( + # http2=True, + limits=httpx.Limits( + max_keepalive_connections=max_connections, + max_connections=max_connections, + keepalive_expiry=1800 + ), + timeout=httpx.Timeout( + timeout=request_timeout, + connect=1800.0, + read=request_timeout, + write=1800.0, + pool=5.0 + ) + ) + + +# ============================================================================ +# MODEL INITIALIZATION +# ============================================================================ + +async def wait_for_model( + endpoint_url: str, + api_key: str, + model_name: str, + max_retries: int = 10, + retry_delay: int = 30 +) -> bool: + 
"""Wait for model to be ready.""" + logging.info(f"Waiting for model '{model_name}' to initialize...") + + http_client = create_http_client(1, 60) + client = AsyncOpenAI(base_url=endpoint_url, api_key=api_key, http_client=http_client, max_retries=0) + + try: + for attempt in range(1, max_retries + 1): + try: + logging.info(f"Initialization check {attempt}/{max_retries}...") + + await client.chat.completions.create( + model=model_name, + messages=[{'role': 'user', 'content': 'Hello'}], + max_tokens=10, + stream=False + ) + + logging.info("Model is ready!") + return True + + except Exception as e: + error_msg = str(e).lower() + + if any(kw in error_msg for kw in ['initializing', 'loading', 'starting', 'not ready', 'unavailable']): + if attempt < max_retries: + logging.info(f"Model initializing... waiting {retry_delay}s") + await asyncio.sleep(retry_delay) + else: + logging.error(f"Model failed to initialize after {max_retries} attempts") + return False + else: + logging.error(f"Error: {e}") + return False + finally: + await http_client.aclose() + + return False + + +# ============================================================================ +# BENCHMARK EXECUTION +# ============================================================================ + +async def run_benchmark( + config: BenchmarkConfig, + endpoint_url: str, + api_key: str, + model_name: str, + request_timeout: int, + log_io: bool = False, + io_logger: Optional[logging.Logger] = None, + dataset: Optional[dict] = None, + text_content: Optional[str] = None, + stream: bool = True +) -> Optional[Dict]: + """Run a single benchmark configuration in batch mode.""" + total_requests = config.num_batches * config.batch_size + + logging.info(f"\n{'='*60}") + logging.info(f"Starting benchmark: {config.num_batches} batches x " + f"{config.batch_size} requests/batch = {total_requests} total") + logging.info(f"Target: {config.input_tokens} tokens, Output: {config.output_tokens} tokens") + if text_content: +
logging.info(f"Using custom text input") + elif dataset: + logging.info(f"Using conversation dataset") + logging.info(f"{'='*60}") + + http_client = create_http_client(config.batch_size, request_timeout) + + try: + client = AsyncOpenAI(base_url=endpoint_url, api_key=api_key, http_client=http_client, max_retries=0) + results = [] + + start_time = time.time() + + # Send batches of requests + for batch_id in range(config.num_batches): + batch_start = time.time() + + logging.info(f" Starting batch {batch_id + 1}/{config.num_batches} " + f"({config.batch_size} simultaneous requests)...") + + batch_results = await make_batch_request( + client=client, + config=config, + model_name=model_name, + request_timeout=request_timeout, + batch_size=config.batch_size, + batch_id=batch_id, + log_io=log_io, + io_logger=io_logger, + dataset=dataset, + text_content=text_content, + stream=stream + ) + results.extend(batch_results) + + # Log batch completion with detailed metrics + batch_time = time.time() - batch_start + successful = sum(1 for r in batch_results if r.success) + failed = len(batch_results) - successful + batch_throughput = sum(r.total_tokens for r in batch_results if r.success) / batch_time if batch_time > 0 else 0 + + # Calculate average TTFT for this batch + ttfts = [r.time_to_first_token for r in batch_results if r.success and r.time_to_first_token] + avg_ttft = sum(ttfts) / len(ttfts) if ttfts else 0 + + # Calculate average latency for this batch + latencies = [r.elapsed_time for r in batch_results if r.success] + avg_latency = sum(latencies) / len(latencies) if latencies else 0 + + status_icon = "✓" if failed == 0 else "⚠" + logging.info(f"{status_icon} Batch {batch_id + 1}/{config.num_batches} complete: " + f"{successful}/{len(batch_results)} successful " + f"| {batch_time:.2f}s total " + f"| {batch_throughput:.0f} tok/s " + f"| TTFT: {avg_ttft:.3f}s " + f"| Latency: {avg_latency:.2f}s") + + if failed > 0: + logging.warning(f" {failed} request(s) failed in batch 
{batch_id + 1}") + + # Small delay between batches to avoid overwhelming the system + if batch_id < config.num_batches - 1: + await asyncio.sleep(0.1) + + end_time = time.time() + + if not results: + logging.error("No results collected!") + return None + + stats = calculate_statistics(results, config) + + logging.info(f"✓ Benchmark complete in {end_time - start_time:.2f}s") + if 'success_metrics' in stats: + logging.info(f" Success: {stats['success_metrics']['success_rate']}% " + f"({stats['success_metrics']['successful_requests']}/{total_requests})") + logging.info(f" P95 Latency: {stats['latency']['p95']}s") + logging.info(f" Throughput: {stats['throughput']['concurrent_total_tps']} tokens/s") + + if 'batch_metrics' in stats: + logging.info(f" Avg Batch Throughput: {stats['batch_metrics']['avg_batch_throughput']} tokens/s") + else: + logging.error(f" All {total_requests} requests failed. Error: {stats.get('error', 'unknown')}") + + return stats + + finally: + await http_client.aclose() + + +async def run_all_benchmarks( + configs: List[BenchmarkConfig], + endpoint_url: str, + api_key: str, + model_name: str, + request_timeout: int, + delay_between_runs: int = 5, + log_io: bool = False, + io_logger: Optional[logging.Logger] = None, + wait_for_ready: bool = True, + max_init_retries: int = 10, + init_retry_delay: int = 30, + dataset: Optional[dict] = None, + text_content: Optional[str] = None, + stream: bool = True +) -> List[Dict]: + """Run all benchmark configurations.""" + if wait_for_ready: + if not await wait_for_model(endpoint_url, api_key, model_name, max_init_retries, init_retry_delay): + logging.error("Model initialization failed. 
Aborting.") + return [] + + all_results = [] + + for i, config in enumerate(configs, 1): + logging.info(f"\n--- Benchmark {i}/{len(configs)} ---") + result = await run_benchmark( + config, endpoint_url, api_key, model_name, request_timeout, + log_io, io_logger, dataset, text_content, stream + ) + + if result: + all_results.append(result) + else: + logging.error(f"Benchmark {i} failed") + + if i < len(configs): + logging.info(f"Waiting {delay_between_runs}s before next run...") + await asyncio.sleep(delay_between_runs) + + # Log summary + if all_results: + successful = len(all_results) + total = len(configs) + logging.info(f"\n{'='*60}") + logging.info(f"ALL BENCHMARKS COMPLETE") + logging.info(f"{'='*60}") + logging.info(f"Completed: {successful}/{total} benchmarks") + + # Overall stats + total_requests = sum(r['config']['total_requests'] for r in all_results) + total_successful = sum(r['success_metrics']['successful_requests'] for r in all_results) + overall_success_rate = (total_successful / total_requests * 100) if total_requests > 0 else 0 + + logging.info(f"Total requests: {total_requests}") + logging.info(f"Successful: {total_successful} ({overall_success_rate:.1f}%)") + logging.info(f"{'='*60}") + + return all_results + + +# ============================================================================ +# FILE I/O +# ============================================================================ + +def create_results_directory(model_name: str) -> Tuple[Path, str]: + """Create results directory.""" + if 'ubiops-deployment/' in model_name: + parts = model_name.split('/') + if len(parts) >= 4: + model_name = parts[-1] + + safe_name = "".join(c if c.isalnum() or c in ('-', '_') else '_' for c in model_name) + results_dir = Path('results') / f'results_{safe_name}' + results_dir.mkdir(parents=True, exist_ok=True) + + logging.info(f"Results directory: {results_dir}") + return results_dir, safe_name + + +def save_results(results: List[Dict], results_dir: Path, 
model_name: str) -> Path: + """Save benchmark results to JSON.""" + output_path = results_dir / 'benchmark_results.json' + + output_data = { + 'timestamp': datetime.now().isoformat(), + 'model_name': model_name, + 'results': results + } + + with open(output_path, 'w') as f: + json.dump(output_data, f, indent=2) + + logging.info(f"Results saved: {output_path}") + return output_path + + +def load_config_file(config_path: str) -> Dict: + """Load configuration from YAML file.""" + with open(config_path, 'r') as f: + return yaml.safe_load(f) + + +def save_config_copy(config_data: Dict, results_dir: Path) -> None: + """Save a copy of the config with the API key blanked out.""" + if 'endpoint' in config_data and 'api_key' in config_data['endpoint']: + config_data['endpoint']['api_key'] = '' + + config_path = results_dir / 'config_used.yaml' + with open(config_path, 'w') as f: + yaml.dump(config_data, f, default_flow_style=False, sort_keys=False) + + logging.info(f"Config saved: {config_path}") + + +# ============================================================================ +# CLI INTERFACE +# ============================================================================ + +def parse_args(): + """Parse command-line arguments.""" + parser = argparse.ArgumentParser( + description="Benchmark LLM deployments via OpenAI-compatible endpoints", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Using config file + python benchmark_llm.py --config benchmark_config.yaml + + # Using CLI arguments + python benchmark_llm.py --endpoint_url https://api.example.com/v1 \\ + --api_key YOUR_KEY \\ + --model_name gpt-4 \\ + --input_tokens 1000 5000 10000 \\ + --batch_sizes 1 8 16 32 + """ + ) + + parser.add_argument('--config', type=str, help="Path to YAML config file") + parser.add_argument('--endpoint_url', type=str, help="OpenAI-compatible endpoint URL") + parser.add_argument('--api_key', type=str, help="API key") + parser.add_argument('--model_name', type=str, help="Model 
name") + parser.add_argument('--input_tokens', type=int, nargs='+', default=[1000, 5000, 10000], + help="Input token counts to test") + parser.add_argument('--batch_sizes', type=int, nargs='+', default=[16, 32, 64, 128], + help="Batch sizes to test (number of simultaneous requests)") + parser.add_argument('--num_batches', type=int, default=10, + help="Number of batches to send per configuration") + parser.add_argument('--output_tokens', type=int, default=256, + help="Output tokens per request") + parser.add_argument('--dataset', type=str, default=None, + help="Path to conversation dataset JSON file (optional)") + parser.add_argument('--text', type=str, default=None, + help="Custom text to use as input for all requests (same text repeated)") + parser.add_argument('--request_timeout', type=int, default=900, + help="Request timeout (seconds)") + parser.add_argument('--delay_between_runs', type=int, default=5, + help="Delay between runs (seconds)") + parser.add_argument('--no_stream', action='store_true', + help="Use non-streaming requests instead of streaming") + parser.add_argument('--log_io', action='store_true', + help="Log all input/output") + parser.add_argument('--skip_init_wait', action='store_true', + help="Skip model initialization wait") + parser.add_argument('--max_init_retries', type=int, default=10, + help="Max initialization retries") + parser.add_argument('--init_retry_delay', type=int, default=30, + help="Delay between init retries") + + return parser.parse_args() + + +def generate_configs( + input_tokens: List[int], + batch_sizes: List[int], + num_batches: int, + output_tokens: int +) -> List[BenchmarkConfig]: + """Generate benchmark configurations.""" + configs = [] + for input_tok in input_tokens: + for batch_size in batch_sizes: + configs.append(BenchmarkConfig( + input_tokens=input_tok, + batch_size=batch_size, + num_batches=num_batches, + output_tokens=output_tokens + )) + return configs + + +# 
============================================================================ +# MAIN +# ============================================================================ + +async def main(): + """Main entry point.""" + args = parse_args() + + # Load configuration + if args.config: + config_data = load_config_file(args.config) + endpoint_url = config_data['endpoint']['url'] + api_key = config_data['endpoint']['api_key'] + model_name = config_data['endpoint']['model_name'] + + bench_config = config_data['benchmark'] + input_tokens = bench_config['input_tokens'] + batch_sizes = bench_config['batch_sizes'] + num_batches = bench_config['num_batches'] + output_tokens = bench_config['output_tokens'] + dataset_path = bench_config.get('dataset', None) + custom_text = bench_config.get('text', None) + + runtime = config_data.get('runtime', {}) + request_timeout = runtime.get('request_timeout', 900) + delay_between_runs = runtime.get('delay_between_runs', 5) + log_io = runtime.get('log_io', False) + stream = runtime.get('stream', True) + wait_for_ready = runtime.get('wait_for_ready', True) + max_init_retries = runtime.get('max_init_retries', 10) + init_retry_delay = runtime.get('init_retry_delay', 30) + else: + # Use CLI arguments + if not all([args.endpoint_url, args.api_key, args.model_name]): + logging.error("Must provide --config or all of --endpoint_url, --api_key, --model_name") + return + + config_data = None + endpoint_url = args.endpoint_url + api_key = args.api_key + model_name = args.model_name + input_tokens = args.input_tokens + batch_sizes = args.batch_sizes + num_batches = args.num_batches + output_tokens = args.output_tokens + dataset_path = args.dataset + custom_text = args.text + request_timeout = args.request_timeout + delay_between_runs = args.delay_between_runs + log_io = args.log_io + stream = not args.no_stream + wait_for_ready = not args.skip_init_wait + max_init_retries = args.max_init_retries + init_retry_delay = args.init_retry_delay + + # Load 
conversation dataset if provided + dataset = None + if dataset_path: + try: + dataset = load_conversation_dataset(dataset_path) + except Exception as e: + logging.error(f"Failed to load dataset: {e}") + logging.error("Continuing with generated prompts...") + + # Use custom text if provided (takes priority over everything) + text_content = custom_text + + # Create results directory + results_dir, safe_name = create_results_directory(model_name) + io_logger = setup_logging(results_dir, log_io) + + if config_data: + save_config_copy(config_data, results_dir) + + # Generate benchmark configurations + configs = generate_configs(input_tokens, batch_sizes, num_batches, output_tokens) + + logging.info(f"\n{'='*60}") + logging.info(f"BENCHMARK CONFIGURATION") + logging.info(f"{'='*60}") + logging.info(f"Model: {model_name}") + logging.info(f"Endpoint: {endpoint_url}") + logging.info(f"Mode: BATCH ({'streaming' if stream else 'non-streaming'})") + if text_content: + logging.info(f"Input: Custom text ({len(text_content)} chars)") + elif dataset: + logging.info(f"Input: {dataset_path} (real conversations)") + else: + logging.info(f"Input: Generated prompts") + logging.info(f"Total configurations: {len(configs)}") + logging.info(f"Input tokens: {input_tokens}") + logging.info(f"Batch sizes: {batch_sizes}") + logging.info(f"Batches per config: {num_batches}") + logging.info(f"Output tokens: {output_tokens}") + logging.info(f"{'='*60}\n") + + # Run benchmarks + results = await run_all_benchmarks( + configs=configs, + endpoint_url=endpoint_url, + api_key=api_key, + model_name=model_name, + request_timeout=request_timeout, + delay_between_runs=delay_between_runs, + log_io=log_io, + io_logger=io_logger, + wait_for_ready=wait_for_ready, + max_init_retries=max_init_retries, + init_retry_delay=init_retry_delay, + dataset=dataset, + text_content=text_content, + stream=stream + ) + + if results: + results_path = save_results(results, results_dir, model_name) + + 
logging.info(f"\n{'='*60}") + logging.info("BENCHMARK COMPLETE!") + logging.info(f"{'='*60}") + logging.info(f"Results: {results_path}") + if log_io: + logging.info(f"I/O log: {results_dir / 'benchmark_io.log'}") + logging.info(f"\nVisualize: python visualize_results.py --input {results_path}") + else: + logging.error("No results generated") + + +if __name__ == '__main__': + asyncio.run(main()) diff --git a/llm-throughput-tests-mindef-metadateren/create_test_dataset.py b/llm-throughput-tests-mindef-metadateren/create_test_dataset.py new file mode 100644 index 0000000..2b3399a --- /dev/null +++ b/llm-throughput-tests-mindef-metadateren/create_test_dataset.py @@ -0,0 +1,338 @@ +#!/usr/bin/env python3 +""" +Create bucketed test dataset for LLM benchmarking. + +Uses multiple strategies to fill all token buckets: +1. Natural conversations from UltraChat dataset +2. Concatenation of shorter conversations for larger buckets + +Buckets aligned with benchmark input_tokens: 100, 500, 1k, 2k, 5k, 10k +Outputs 128 unique conversations per bucket for comprehensive testing. 
+ +Usage: + python create_test_dataset.py + python create_test_dataset.py --output test_conversations.json + python create_test_dataset.py --buckets 1000 5000 10000 --chains_per_bucket 64 +""" + +import argparse +import json +import random +from collections import defaultdict +from pathlib import Path + +import tiktoken +from datasets import load_dataset + +# Default buckets aligned with typical benchmark configurations +DEFAULT_BUCKETS = [100, 500, 1_000, 2_000, 5_000, 10_000] +CHAINS_PER_BUCKET = 128 +DATASET_NAME = "HuggingFaceH4/ultrachat_200k" +ENCODING_NAME = "cl100k_base" + + +def count_tokens(messages: list[dict], encoding: tiktoken.Encoding) -> int: + """Count total tokens in a conversation chain.""" + total = 0 + for msg in messages: + content = msg.get("content", "") or "" + role = msg.get("role", "") or "" + total += len(encoding.encode(content, disallowed_special=())) + total += len(encoding.encode(role, disallowed_special=())) + total += 4 # Message formatting overhead + total += 2 # Conversation formatting overhead + return total + + +def get_bucket(token_count: int, buckets: list[int]) -> int | None: + """Find the appropriate bucket for a token count (within 20% of target).""" + for bucket in buckets: + if bucket * 0.8 <= token_count <= bucket * 1.2: + return bucket + return None + + +def format_ultrachat_messages(messages: list[dict]) -> list[dict]: + """Format UltraChat conversations to OpenAI chat format.""" + formatted = [] + for msg in messages: + role = msg.get("role", "user") + if role not in ["user", "assistant", "system"]: + role = "user" + content = msg.get("content", "") or "" + if content: + formatted.append({"role": role, "content": content}) + return formatted + + +def concatenate_conversations( + conversations: list[list[dict]], + target_tokens: int, + encoding: tiktoken.Encoding, + tolerance: float = 0.2 +) -> list[dict] | None: + """Concatenate multiple conversations to reach target token count.""" + result = [] + current_tokens = 0 
+ target_min = target_tokens * (1 - tolerance) + target_max = target_tokens * (1 + tolerance) + + random.shuffle(conversations) + + for conv in conversations: + conv_tokens = count_tokens(conv, encoding) + + # Skip if this would exceed target + if current_tokens + conv_tokens > target_max: + continue + + # Add separator between conversations + if result and conv: + separator = {"role": "user", "content": "---\nNew conversation:\n---"} + result.append(separator) + current_tokens += 10 # Approximate tokens for separator + + result.extend(conv) + current_tokens += conv_tokens + + # Check if we've reached target + if current_tokens >= target_min: + break + + # Verify we're within acceptable range + if current_tokens < target_min * 0.8: + return None + + return result + + +def main(): + parser = argparse.ArgumentParser( + description="Create bucketed test dataset for LLM benchmarking", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Default configuration (128 conversations per bucket) + python create_test_dataset.py + + # Custom buckets + python create_test_dataset.py --buckets 1000 5000 10000 + + # Fewer conversations per bucket + python create_test_dataset.py --chains_per_bucket 64 + + # Custom output location + python create_test_dataset.py --output data/conversations.json + """ + ) + + parser.add_argument( + "--output", + type=str, + default="test_conversations.json", + help="Output file path (default: test_conversations.json)" + ) + + parser.add_argument( + "--buckets", + type=int, + nargs='+', + default=DEFAULT_BUCKETS, + help="Token count buckets (default: 100 500 1000 2000 5000 10000)" + ) + + parser.add_argument( + "--chains_per_bucket", + type=int, + default=CHAINS_PER_BUCKET, + help=f"Number of conversations per bucket (default: {CHAINS_PER_BUCKET})" + ) + + parser.add_argument( + "--seed", + type=int, + default=42, + help="Random seed for reproducibility (default: 42)" + ) + + parser.add_argument( + "--dataset", + 
type=str, + default=DATASET_NAME, + help=f"HuggingFace dataset name (default: {DATASET_NAME})" + ) + + args = parser.parse_args() + + random.seed(args.seed) + buckets = sorted(args.buckets) + + print("="*60) + print("LLM Benchmark Dataset Generator") + print("="*60) + print(f"Output: {args.output}") + print(f"Buckets: {buckets}") + print(f"Conversations per bucket: {args.chains_per_bucket}") + print(f"Random seed: {args.seed}") + print("="*60) + + print(f"\nLoading dataset: {args.dataset}") + try: + dataset = load_dataset(args.dataset, split="train_sft") + except Exception as e: + print(f"Error loading dataset: {e}") + print("Make sure you have internet connection and the 'datasets' package installed:") + print(" pip install datasets") + return + + print(f"Initializing tokenizer: {ENCODING_NAME}") + try: + encoding = tiktoken.get_encoding(ENCODING_NAME) + except Exception as e: + print(f"Error loading tokenizer: {e}") + print("Make sure you have 'tiktoken' installed:") + print(" pip install tiktoken") + return + + bucketed_chains: dict[int, list[dict]] = defaultdict(list) + all_conversations: list[list[dict]] = [] + + print(f"\nProcessing {len(dataset)} conversation chains...") + + for idx, row in enumerate(dataset): + messages = row.get("messages", []) + if not messages: + continue + + formatted = format_ultrachat_messages(messages) + if not formatted: + continue + + token_count = count_tokens(formatted, encoding) + bucket = get_bucket(token_count, buckets) + + all_conversations.append(formatted) + + if bucket is not None: + bucketed_chains[bucket].append( + { + "messages": formatted, + "token_count": token_count, + "bucket": bucket, + "original_index": idx, + "synthetic": False, + } + ) + + if (idx + 1) % 50000 == 0: + print(f" Processed {idx + 1:,} chains...") + + print(f"\nTotal conversations collected: {len(all_conversations):,}") + print("\nNatural bucket distribution:") + print("-" * 60) + + for bucket in buckets: + count = len(bucketed_chains[bucket]) + 
status = "OK" if count >= args.chains_per_bucket else f" need {args.chains_per_bucket - count} more" + print(f" {bucket:>6,} tokens: {count:>5,} chains {status}") + + # Generate synthetic conversations for sparse buckets + print("\nGenerating synthetic chains for sparse buckets...") + sparse_buckets = [b for b in buckets if len(bucketed_chains[b]) < args.chains_per_bucket] + + for bucket in sparse_buckets: + needed = args.chains_per_bucket - len(bucketed_chains[bucket]) + if needed <= 0: + continue + + print(f" Creating {needed} synthetic chains for {bucket:,} token bucket...") + attempts = 0 + max_attempts = needed * 20 + created = 0 + + while len(bucketed_chains[bucket]) < args.chains_per_bucket and attempts < max_attempts: + attempts += 1 + synthetic = concatenate_conversations( + [c.copy() for c in all_conversations], + bucket, + encoding + ) + + if synthetic: + token_count = count_tokens(synthetic, encoding) + bucketed_chains[bucket].append( + { + "messages": synthetic, + "token_count": token_count, + "bucket": bucket, + "original_index": -1, + "synthetic": True, + } + ) + created += 1 + + if created < needed: + print(f" Only created {created}/{needed} synthetic chains") + + print("\nFinal bucket distribution:") + print("-" * 60) + + final_dataset = {} + total_natural = 0 + total_synthetic = 0 + + for bucket in buckets: + chains = bucketed_chains[bucket] + count = len(chains) + + if count >= args.chains_per_bucket: + selected = random.sample(chains, args.chains_per_bucket) + else: + selected = chains + if count < args.chains_per_bucket: + print(f" {bucket:>6,} tokens: {count:>5,} chains insufficient (target: {args.chains_per_bucket})") + selected = chains # Use what we have + + natural = sum(1 for c in selected if not c.get("synthetic", False)) + synthetic = len(selected) - natural + total_natural += natural + total_synthetic += synthetic + + print(f" {bucket:>6,} tokens: {len(selected):>3} chains ({natural} natural, {synthetic} synthetic)") + + 
final_dataset[str(bucket)] = selected + + # Save dataset + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + + with open(output_path, "w", encoding="utf-8") as f: + json.dump(final_dataset, f, indent=2, ensure_ascii=False) + + print("-" * 60) + print(f"\n Dataset saved to: {output_path}") + + total_chains = sum(len(chains) for chains in final_dataset.values()) + print(f"\nTotal chains: {total_chains:,}") + print(f"Natural conversations: {total_natural:,}") + print(f"Synthetic conversations: {total_synthetic:,}") + + print("\nBucket summary:") + for bucket in buckets: + chains = final_dataset.get(str(bucket), []) + if chains: + avg_tokens = sum(c["token_count"] for c in chains) / len(chains) + min_tokens = min(c["token_count"] for c in chains) + max_tokens = max(c["token_count"] for c in chains) + print(f" {bucket:>6,} tokens: {len(chains):>3} chains, " + f"avg={avg_tokens:>6,.0f}, min={min_tokens:>6,}, max={max_tokens:>6,}") + + print("\n" + "="*60) + print("To use this dataset with benchmark:") + print("="*60) + print(f" python benchmark_llm.py --dataset {args.output} ...") + print("="*60) + + +if __name__ == "__main__": + main() diff --git a/llm-throughput-tests-mindef-metadateren/requirements.txt b/llm-throughput-tests-mindef-metadateren/requirements.txt new file mode 100644 index 0000000..0f260ac --- /dev/null +++ b/llm-throughput-tests-mindef-metadateren/requirements.txt @@ -0,0 +1,9 @@ +openai>=1.0.0 +httpx>=0.24.0 +pyyaml>=6.0 +matplotlib>=3.7.0 +seaborn>=0.12.0 +numpy>=1.24.0 +tiktoken>=0.5.0 +datasets>=2.14.0 +httpx[http2] \ No newline at end of file diff --git a/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/benchmark_results.json b/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/benchmark_results.json new file mode 100644 index 0000000..8e763ec --- /dev/null +++ 
b/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/benchmark_results.json
@@ -0,0 +1,630 @@
+{
+  "timestamp": "2026-03-11T11:10:08.245541",
+  "model_name": "QuantTrio/Qwen3.5-35B-A3B-AWQ",
+  "results": [
+    {
+      "config": {
+        "input_tokens": 1000,
+        "output_tokens": 512,
+        "batch_size": 1,
+        "num_batches": 2,
+        "total_requests": 2,
+        "actual_input_tokens": 1140
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 2,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 9.155,
+        "std": 5.968,
+        "min": 3.187,
+        "max": 15.123,
+        "p50": 9.155,
+        "p95": 14.526,
+        "p99": 15.003,
+        "ci_95_lower": 0.884,
+        "ci_95_upper": 17.426
+      },
+      "ttft": {
+        "mean": 9.155,
+        "std": 5.968,
+        "p50": 9.155,
+        "p90": 13.929
+      },
+      "tokens": {
+        "total_generated": 1024,
+        "content_tokens": 1024,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 55.62,
+        "concurrent_content_tps": 55.62,
+        "requests_per_second": 0.11,
+        "actual_wall_time": 18.412,
+        "efficiency_percent": 57.18
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 1.0,
+        "avg_batch_throughput": 97.26,
+        "min_batch_throughput": 33.86,
+        "max_batch_throughput": 160.67
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 1000,
+        "output_tokens": 512,
+        "batch_size": 8,
+        "num_batches": 2,
+        "total_requests": 16,
+        "actual_input_tokens": 1003
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 16,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 8.081,
+        "std": 2.287,
+        "min": 5.772,
+        "max": 10.373,
+        "p50": 8.085,
+        "p95": 10.372,
+        "p99": 10.373,
+        "ci_95_lower": 6.961,
+        "ci_95_upper": 9.202
+      },
+      "ttft": {
+        "mean": 8.081,
+        "std": 2.287,
+        "p50": 8.085,
+        "p90": 10.37
+      },
+      "tokens": {
+        "total_generated": 8192,
+        "content_tokens": 8192,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 503.04,
+        "concurrent_content_tps": 503.04,
+        "requests_per_second": 0.98,
+        "actual_wall_time": 16.285,
+        "efficiency_percent": 91.31
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 8.0,
+        "avg_batch_throughput": 549.93,
+        "min_batch_throughput": 394.83,
+        "max_batch_throughput": 705.03
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 1000,
+        "output_tokens": 512,
+        "batch_size": 32,
+        "num_batches": 2,
+        "total_requests": 64,
+        "actual_input_tokens": 1028
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 64,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 8.686,
+        "std": 0.017,
+        "min": 8.636,
+        "max": 8.732,
+        "p50": 8.688,
+        "p95": 8.71,
+        "p99": 8.721,
+        "ci_95_lower": 8.682,
+        "ci_95_upper": 8.691
+      },
+      "ttft": {
+        "mean": 8.595,
+        "std": 0.727,
+        "p50": 8.687,
+        "p90": 8.707
+      },
+      "tokens": {
+        "total_generated": 32768,
+        "content_tokens": 32768,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 1865.45,
+        "concurrent_content_tps": 1865.45,
+        "requests_per_second": 3.64,
+        "actual_wall_time": 17.566,
+        "efficiency_percent": 98.9
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 32.0,
+        "avg_batch_throughput": 1876.54,
+        "min_batch_throughput": 1870.97,
+        "max_batch_throughput": 1882.11
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 1000,
+        "output_tokens": 512,
+        "batch_size": 64,
+        "num_batches": 2,
+        "total_requests": 128,
+        "actual_input_tokens": 1028
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 128,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 12.207,
+        "std": 0.04,
+        "min": 12.108,
+        "max": 12.283,
+        "p50": 12.211,
+        "p95": 12.263,
+        "p99": 12.273,
+        "ci_95_lower": 12.2,
+        "ci_95_upper": 12.214
+      },
+      "ttft": {
+        "mean": 12.044,
+        "std": 1.066,
+        "p50": 12.205,
+        "p90": 12.257
+      },
+      "tokens": {
+        "total_generated": 65536,
+        "content_tokens": 65536,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 2654.48,
+        "concurrent_content_tps": 2654.48,
+        "requests_per_second": 5.18,
+        "actual_wall_time": 24.689,
+        "efficiency_percent": 98.89
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 64.0,
+        "avg_batch_throughput": 2665.65,
+        "min_batch_throughput": 2658.45,
+        "max_batch_throughput": 2672.85
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 10000,
+        "output_tokens": 512,
+        "batch_size": 1,
+        "num_batches": 2,
+        "total_requests": 2,
+        "actual_input_tokens": 8871
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 2,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 3.533,
+        "std": 0.026,
+        "min": 3.507,
+        "max": 3.559,
+        "p50": 3.533,
+        "p95": 3.557,
+        "p99": 3.559,
+        "ci_95_lower": 3.497,
+        "ci_95_upper": 3.569
+      },
+      "ttft": {
+        "mean": 3.533,
+        "std": 0.026,
+        "p50": 3.533,
+        "p90": 3.554
+      },
+      "tokens": {
+        "total_generated": 1024,
+        "content_tokens": 1024,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 142.85,
+        "concurrent_content_tps": 142.85,
+        "requests_per_second": 0.28,
+        "actual_wall_time": 7.168,
+        "efficiency_percent": 98.57
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 1.0,
+        "avg_batch_throughput": 144.92,
+        "min_batch_throughput": 143.85,
+        "max_batch_throughput": 145.99
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 10000,
+        "output_tokens": 512,
+        "batch_size": 8,
+        "num_batches": 2,
+        "total_requests": 16,
+        "actual_input_tokens": 8895
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 16,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 7.325,
+        "std": 0.144,
+        "min": 7.142,
+        "max": 7.493,
+        "p50": 7.333,
+        "p95": 7.489,
+        "p99": 7.492,
+        "ci_95_lower": 7.254,
+        "ci_95_upper": 7.395
+      },
+      "ttft": {
+        "mean": 7.325,
+        "std": 0.144,
+        "p50": 7.333,
+        "p90": 7.487
+      },
+      "tokens": {
+        "total_generated": 8192,
+        "content_tokens": 8192,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 550.76,
+        "concurrent_content_tps": 550.76,
+        "requests_per_second": 1.08,
+        "actual_wall_time": 14.874,
+        "efficiency_percent": 98.45
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 8.0,
+        "avg_batch_throughput": 554.82,
+        "min_batch_throughput": 543.43,
+        "max_batch_throughput": 566.21
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 10000,
+        "output_tokens": 512,
+        "batch_size": 32,
+        "num_batches": 2,
+        "total_requests": 64,
+        "actual_input_tokens": 8842
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 64,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 16.085,
+        "std": 2.082,
+        "min": 13.822,
+        "max": 18.383,
+        "p50": 16.109,
+        "p95": 18.273,
+        "p99": 18.329,
+        "ci_95_lower": 15.575,
+        "ci_95_upper": 16.595
+      },
+      "ttft": {
+        "mean": 15.996,
+        "std": 2.114,
+        "p50": 14.22,
+        "p90": 18.248
+      },
+      "tokens": {
+        "total_generated": 32768,
+        "content_tokens": 32768,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 995.46,
+        "concurrent_content_tps": 995.46,
+        "requests_per_second": 1.94,
+        "actual_wall_time": 32.917,
+        "efficiency_percent": 96.09
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 32.0,
+        "avg_batch_throughput": 1015.38,
+        "min_batch_throughput": 885.0,
+        "max_batch_throughput": 1145.76
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 10000,
+        "output_tokens": 512,
+        "batch_size": 64,
+        "num_batches": 2,
+        "total_requests": 128,
+        "actual_input_tokens": 8842
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 128,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 14.781,
+        "std": 0.143,
+        "min": 14.277,
+        "max": 15.099,
+        "p50": 14.781,
+        "p95": 15.032,
+        "p99": 15.096,
+        "ci_95_lower": 14.756,
+        "ci_95_upper": 14.806
+      },
+      "ttft": {
+        "mean": 14.781,
+        "std": 0.143,
+        "p50": 14.781,
+        "p90": 14.972
+      },
+      "tokens": {
+        "total_generated": 65536,
+        "content_tokens": 65536,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 2166.53,
+        "concurrent_content_tps": 2166.53,
+        "requests_per_second": 4.23,
+        "actual_wall_time": 30.249,
+        "efficiency_percent": 97.72
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 64.0,
+        "avg_batch_throughput": 2174.01,
+        "min_batch_throughput": 2164.24,
+        "max_batch_throughput": 2183.78
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 50000,
+        "output_tokens": 512,
+        "batch_size": 1,
+        "num_batches": 2,
+        "total_requests": 2,
+        "actual_input_tokens": 42229
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 2,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 6.101,
+        "std": 0.019,
+        "min": 6.082,
+        "max": 6.12,
+        "p50": 6.101,
+        "p95": 6.118,
+        "p99": 6.12,
+        "ci_95_lower": 6.074,
+        "ci_95_upper": 6.128
+      },
+      "ttft": {
+        "mean": 6.101,
+        "std": 0.019,
+        "p50": 6.101,
+        "p90": 6.117
+      },
+      "tokens": {
+        "total_generated": 1024,
+        "content_tokens": 1024,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 83.22,
+        "concurrent_content_tps": 83.22,
+        "requests_per_second": 0.16,
+        "actual_wall_time": 12.305,
+        "efficiency_percent": 99.16
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 1.0,
+        "avg_batch_throughput": 83.92,
+        "min_batch_throughput": 83.66,
+        "max_batch_throughput": 84.19
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 50000,
+        "output_tokens": 512,
+        "batch_size": 8,
+        "num_batches": 2,
+        "total_requests": 16,
+        "actual_input_tokens": 42048
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 16,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 22.685,
+        "std": 2.474,
+        "min": 20.003,
+        "max": 25.463,
+        "p50": 22.588,
+        "p95": 25.387,
+        "p99": 25.448,
+        "ci_95_lower": 21.473,
+        "ci_95_upper": 23.897
+      },
+      "ttft": {
+        "mean": 22.685,
+        "std": 2.474,
+        "p50": 22.588,
+        "p90": 25.295
+      },
+      "tokens": {
+        "total_generated": 8192,
+        "content_tokens": 8192,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 177.76,
+        "concurrent_content_tps": 177.76,
+        "requests_per_second": 0.35,
+        "actual_wall_time": 46.085,
+        "efficiency_percent": 97.28
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 8.0,
+        "avg_batch_throughput": 180.32,
+        "min_batch_throughput": 160.6,
+        "max_batch_throughput": 200.04
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 50000,
+        "output_tokens": 512,
+        "batch_size": 32,
+        "num_batches": 2,
+        "total_requests": 64,
+        "actual_input_tokens": 41752
+      },
+      "success_metrics": {
+        "success_rate": 100.0,
+        "successful_requests": 64,
+        "failed_requests": 0
+      },
+      "latency": {
+        "mean": 70.626,
+        "std": 18.722,
+        "min": 48.439,
+        "max": 90.756,
+        "p50": 70.358,
+        "p95": 90.447,
+        "p99": 90.677,
+        "ci_95_lower": 66.039,
+        "ci_95_upper": 75.213
+      },
+      "ttft": {
+        "mean": 70.626,
+        "std": 18.722,
+        "p50": 70.358,
+        "p90": 90.064
+      },
+      "tokens": {
+        "total_generated": 32768,
+        "content_tokens": 32768,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 225.4,
+        "concurrent_content_tps": 225.4,
+        "requests_per_second": 0.44,
+        "actual_wall_time": 145.377,
+        "efficiency_percent": 90.31
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 32.0,
+        "avg_batch_throughput": 241.37,
+        "min_batch_throughput": 179.6,
+        "max_batch_throughput": 303.14
+      }
+    },
+    {
+      "config": {
+        "input_tokens": 50000,
+        "output_tokens": 512,
+        "batch_size": 64,
+        "num_batches": 2,
+        "total_requests": 128,
+        "actual_input_tokens": 41810
+      },
+      "success_metrics": {
+        "success_rate": 63.28,
+        "successful_requests": 81,
+        "failed_requests": 47
+      },
+      "latency": {
+        "mean": 111.228,
+        "std": 2.973,
+        "min": 106.149,
+        "max": 115.385,
+        "p50": 112.37,
+        "p95": 114.998,
+        "p99": 115.289,
+        "ci_95_lower": 110.581,
+        "ci_95_upper": 111.876
+      },
+      "ttft": {
+        "mean": 111.228,
+        "std": 2.973,
+        "p50": 112.37,
+        "p90": 114.818
+      },
+      "tokens": {
+        "total_generated": 41472,
+        "content_tokens": 41472,
+        "reasoning_tokens": 0,
+        "avg_per_request": 512.0
+      },
+      "throughput": {
+        "concurrent_total_tps": 182.43,
+        "concurrent_content_tps": 182.43,
+        "requests_per_second": 0.36,
+        "actual_wall_time": 227.333,
+        "efficiency_percent": 61.88
+      },
+      "batch_metrics": {
+        "num_batches": 2,
+        "avg_batch_size": 40.5,
+        "avg_batch_throughput": 181.97,
+        "min_batch_throughput": 162.11,
+        "max_batch_throughput": 201.84
+      }
+    }
+  ]
+}
\ No newline at end of file
diff --git a/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/config_used.yaml b/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/config_used.yaml
new file mode 100644
index 0000000..191478e
--- /dev/null
+++ b/llm-throughput-tests-mindef-metadateren/results/results_QuantTrio_Qwen3_5-35B-A3B-AWQ/config_used.yaml
@@ -0,0 +1,25 @@
+endpoint:
+  url: https://0e799c11-4b01-4acd-a91c-5e43deaae940.services.external.0a71m37v.ubiops.io/v1
+  api_key:
+  model_name: QuantTrio/Qwen3.5-35B-A3B-AWQ
+benchmark:
+  input_tokens:
+  - 1000
+  - 10000
+  - 50000
+  batch_sizes:
+  - 1
+  - 8
+  - 32
+  - 64
+  num_batches: 2
+  output_tokens: 512
+  dataset: test_conversations.json
+  text: null
+runtime:
+  request_timeout: 300
+  delay_between_runs: 5
+  log_io: true
+  wait_for_ready: true
+  max_init_retries: 10
+  init_retry_delay: 30