Introduction
Inference Lab is a simulation framework designed to evaluate and analyze LLM workloads.
It uses discrete-event simulation to model the behavior of a multi-GPU node serving LLM inference requests with the vLLM library. It contains a facsimile of the vLLM queueing, scheduling, and execution logic, with only the actual model inference replaced by a performance model based on the supplied GPU specs and model architecture.
Within each simulation step, the simulator:
- Processes any newly arrived requests, adding them to the scheduling queue.
- Schedules requests to serve based on the selected scheduling policy.
- Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute the workload on the specified hardware.
- Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
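The execution-time calculation can be pictured as a simple roofline estimate: a step takes as long as the slower of its compute work and its memory traffic. The Rust sketch below is purely illustrative; the type and field names are hypothetical, and the simulator's actual cost model is more detailed.
// Illustrative roofline-style step cost (hypothetical names, not the crate's API).
struct Gpu { compute_flops: f64, memory_bandwidth: f64 }

fn step_time(flops_needed: f64, bytes_moved: f64, gpu: &Gpu) -> f64 {
    // The step is bounded by whichever of compute or memory traffic takes longer;
    // kernel-launch and multi-GPU communication overheads are ignored (see the caveats below).
    (flops_needed / gpu.compute_flops).max(bytes_moved / gpu.memory_bandwidth)
}

fn main() {
    let h100 = Gpu { compute_flops: 1.513e15, memory_bandwidth: 3.35e12 };
    // Example: a decode step for a 70B bf16 model with 256 sequences in the batch,
    // approximating 2 FLOPs per parameter per token and one full weight read (~140 GB).
    let t = step_time(2.0 * 70e9 * 256.0, 140e9, &h100);
    println!("estimated step time: {:.1} ms", t * 1e3); // memory-bound: ~41.8 ms
}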
Caveats:
- The performance model assumes perfectly optimized GPU execution, ignoring kernel launch overheads, poorly optimized kernels, application overhead, thermals, and similar effects.
- We simulate tensor parallel execution, but don’t model multi-GPU communication overheads.
Features
- Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
- Multiple Scheduling Policies: FCFS, Priority, SJF, and more
- Chunked Prefill: Simulates realistic request interleaving
- KV Cache Management: Models GPU memory and KV cache utilization
- Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
- WebAssembly Support: Run simulations in the browser via WASM
Quick Start
See the Getting Started guide to begin using Inference Lab.
Getting Started
This guide will help you get started with Inference Lab.
Installation
Install from crates.io:
cargo install --locked inference-lab
Or build from source:
cargo build --release
Running Your First Simulation
inference-lab -c config.toml
Next Steps
- Learn about configuration options
- Explore running simulations
Running Simulations
This guide covers how to run simulations and interpret results.
Basic Usage
Run a simulation with a configuration file:
inference-lab -c config.toml
For dataset mode, add tokenizer and chat template:
inference-lab -c config.toml \
--tokenizer tokenizer.json \
--chat-template None
See Configuration for details on configuring workloads, policies, and hardware.
Output Modes
Console Output (Default)
By default, the simulator displays:
- Real-time progress bar
- Current simulation time
- Queue status (running/waiting requests)
- KV cache utilization
Final output includes:
- Latency metrics (TTFT, E2E, per-token)
- Throughput metrics (tokens/sec, requests/sec)
- Utilization statistics (KV cache, FLOPS, bandwidth)
- Preemption statistics
JSON Output
Save results to a file:
inference-lab -c config.toml -o results.json
Combine with -q for batch processing:
inference-lab -c config.toml -q -o results.json
Running Multiple Experiments
Comparing Policies
for policy in fcfs sof sif lof; do
sed "s/policy = .*/policy = \"$policy\"/" config.toml > config_$policy.toml
inference-lab -c config_$policy.toml -q -o results_$policy.json
done
Sweeping Parameters
for batch_size in 4096 8192 16384; do
sed "s/max_num_batched_tokens = .*/max_num_batched_tokens = $batch_size/" \
config.toml > config_$batch_size.toml
inference-lab -c config_$batch_size.toml -o results_$batch_size.json
done
Multiple Seeds
Override the seed for reproducibility testing:
for seed in {1..10}; do
inference-lab -c config.toml --seed $seed -q -o results_$seed.json
done
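Once the per-seed files exist, they can be aggregated with any JSON tooling. Below is a minimal Rust sketch using serde_json; the field name ttft_mean is hypothetical, so inspect an actual results.json for the real key names before relying on it.
// Hypothetical aggregation sketch (add serde_json to Cargo.toml).
// "ttft_mean" is an assumed key; check a real results.json for actual field names.
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut values = Vec::new();
    for seed in 1..=10 {
        let text = fs::read_to_string(format!("results_{seed}.json"))?;
        let json: serde_json::Value = serde_json::from_str(&text)?;
        if let Some(v) = json.get("ttft_mean").and_then(|v| v.as_f64()) {
            values.push(v);
        }
    }
    println!("mean TTFT across {} seeds: {:.3}",
             values.len(),
             values.iter().sum::<f64>() / values.len() as f64);
    Ok(())
}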
Understanding Results
Latency Metrics
Time to First Token (TTFT)
- Time from request arrival to first token generation
- Lower is better for interactive applications
- Affected by: queue wait time, prefill computation
End-to-End (E2E) Latency
- Total time from request arrival to completion
- Includes prefill and all decode steps
- Key metric for overall user experience
Per-Token Latency
- Average time between consecutive output tokens
- Lower is better for streaming applications
- Primarily affected by batch size and model size
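As a concrete reading of these three definitions, the sketch below derives them from per-request timestamps. The struct is hypothetical and the per-token formula (decode time divided by the number of decode steps) is an assumption; the simulator's exact definitions may differ.
// Illustrative latency metrics from hypothetical per-request timestamps (seconds).
struct RequestTiming { arrival: f64, first_token: f64, completion: f64, output_tokens: u64 }

fn metrics(r: &RequestTiming) -> (f64, f64, f64) {
    let ttft = r.first_token - r.arrival;   // Time to First Token
    let e2e = r.completion - r.arrival;     // End-to-End latency
    // Assumed per-token definition: decode time spread over the remaining output tokens.
    let decode_tokens = r.output_tokens.saturating_sub(1).max(1) as f64;
    let per_token = (r.completion - r.first_token) / decode_tokens;
    (ttft, e2e, per_token)
}

fn main() {
    let r = RequestTiming { arrival: 0.0, first_token: 0.8, completion: 4.0, output_tokens: 65 };
    let (ttft, e2e, per_token) = metrics(&r);
    println!("TTFT {ttft:.2} s, E2E {e2e:.2} s, per-token {:.0} ms", per_token * 1e3);
}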
Throughput Metrics
Input Tokens/sec
- Rate of processing prompt tokens
- Indicates prefill throughput
Output Tokens/sec
- Rate of generating output tokens
- Indicates decode throughput
Requests/sec
- Overall request completion rate
- Key metric for capacity planning
Utilization Metrics
KV Cache
- Percentage of KV cache memory in use
- High utilization may lead to preemptions
FLOPS
- Percentage of compute capacity utilized
- Low FLOPS may indicate memory bottleneck
Bandwidth
- Percentage of memory bandwidth utilized
- High bandwidth utilization indicates memory-bound workload
Preemption Statistics
Preemptions occur when new requests need memory but the KV cache is full:
- Total number of preemptions
- Average preemptions per request
- Can significantly impact TTFT for preempted requests
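Conceptually, the check is block accounting over the KV cache. The sketch below is a simplified illustration under assumed names; the simulator's actual block bookkeeping and victim selection may differ.
// Simplified illustration of KV block accounting and the preemption condition.
// Names and the exact rule are assumptions, not the crate's implementation.
fn blocks_for(seq_len: u64, block_size: u64) -> u64 {
    (seq_len + block_size - 1) / block_size // ceiling division
}

fn main() {
    let block_size = 16;
    // A request holding 48 cached tokens occupies 3 blocks; appending token 49 needs a 4th.
    let current = blocks_for(48, block_size);
    let needed = blocks_for(49, block_size) - current;
    let free_blocks = 0;
    if needed > free_blocks {
        println!("no free blocks: a running request would be preempted and recomputed later");
    }
}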
Troubleshooting
Simulation running slowly?
- Reduce num_requests or use the -q flag
- Increase log_interval in the config
Too many preemptions?
- Increase kv_cache_capacity in the hardware config
- Reduce max_num_seqs or max_num_batched_tokens in the scheduler config
Dataset loading errors?
- Verify the --tokenizer and --chat-template flags are provided
- Check that the JSONL format matches the OpenAI batch API format
For more details, see CLI Reference and Configuration.
Configuration
Inference Lab uses TOML configuration files to define your simulation parameters. A configuration file has five main sections: hardware, model, scheduler, workload, and simulation.
Configuration Sections Overview
- [hardware] - GPU specifications (compute, memory, bandwidth)
- [model] - LLM architecture (layers, parameters, dimensions)
- [scheduler] - Scheduling policy and batching behavior
- [workload] - Request arrival patterns and distributions
- [simulation] - Logging and output options
Quick Start Example
Here’s a minimal configuration to get started:
[hardware]
name = "H100"
compute_flops = 1.513e15 # 1513 TFLOPS bf16
memory_bandwidth = 3.35e12 # 3.35 TB/s
memory_capacity = 85899345920 # 80 GB
bytes_per_param = 2 # bf16
[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8 # GQA with 8 KV heads
max_seq_len = 8192
[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16
[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0 # 5 requests/sec
num_requests = 100
seed = 42
[workload.input_len_dist]
type = "lognormal"
mean = 6.9 # ~1000 tokens median
std_dev = 0.7
[workload.output_len_dist]
type = "lognormal"
mean = 5.3 # ~200 tokens median
std_dev = 0.8
[simulation]
log_interval = 5
Hardware Configuration
The hardware section defines your GPU specifications:
[hardware]
name = "H100"
compute_flops = 1.513e15 # bf16, in FLOPS (1513 TFLOPS)
memory_bandwidth = 3.35e12 # bytes/sec
memory_capacity = 85899345920 # 80 GB
bytes_per_param = 2 # 2 for bf16, 1 for fp8
Optional fields:
- kv_cache_capacity - Explicit KV cache size (otherwise computed automatically)
- gpu_memory_utilization - Fraction of memory to use (default: 0.9)
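As a worked example of the automatic sizing (the Configuration File Reference gives the formula (memory_capacity * gpu_memory_utilization) - model_size), the sketch below uses a hypothetical 8B bf16 model on the H100 config above, plus a common per-token KV estimate that may differ from the simulator's exact accounting.
// Worked example of default KV cache sizing (hypothetical 8B bf16 model).
// The per-token KV formula is a common estimate, not necessarily the simulator's.
fn main() {
    let memory_capacity = 85899345920.0_f64;   // from the H100 example
    let gpu_memory_utilization = 0.9;          // default
    let bytes_per_param = 2.0;                 // bf16
    let model_size = 8.0e9 * bytes_per_param;  // 16 GB of weights

    let kv_cache_capacity = memory_capacity * gpu_memory_utilization - model_size;

    // Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_param
    let (layers, kv_heads, head_dim) = (32.0, 8.0, 128.0);
    let kv_bytes_per_token = 2.0 * layers * kv_heads * head_dim * bytes_per_param;

    println!("KV cache capacity: {:.1} GB", kv_cache_capacity / 1e9);                 // ~61.3 GB
    println!("max cacheable tokens: ~{:.0}", kv_cache_capacity / kv_bytes_per_token); // ~468,000
}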
Model Configuration
Define your LLM architecture:
[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8 # For GQA (omit for MHA)
max_seq_len = 8192
Grouped Query Attention (GQA)
For models using GQA, set num_kv_heads to the number of KV heads:
num_kv_heads = 8 # Llama 3 uses 8 KV heads
Omit num_kv_heads for standard multi-head attention (MHA) models.
Mixture of Experts (MoE)
For MoE models, specify active parameters separately:
num_parameters = 140000000000 # Total params
num_active_parameters = 12000000000 # Active per forward pass
Sliding Window Attention
For models like GPT-OSS with sliding window attention:
sliding_window = 4096
num_sliding_layers = 28 # Number of layers using sliding window
Scheduler Configuration
Control request scheduling and batching:
[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16
Scheduling Policies
Available policies:
- fcfs - First-Come-First-Served (default)
- sof - Shortest Output First
- sif - Shortest Input First
- stf - Shortest Total First
- lif - Longest Input First
- lof - Longest Output First
- ltf - Longest Total First
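Each policy is essentially a different ordering of the waiting queue; the length-based policies rely on the fact that the simulator knows request lengths from the workload generator. The sketch below shows the idea for a few of them with assumed field names; it is not the crate's scheduler code.
// Illustrative only: policies expressed as sort keys over the waiting queue.
struct Queued { arrival: f64, input_len: u64, output_len: u64 }

fn sort_waiting(queue: &mut [Queued], policy: &str) {
    match policy {
        "fcfs" => queue.sort_by(|a, b| a.arrival.total_cmp(&b.arrival)), // earliest arrival first
        "sif" => queue.sort_by_key(|r| r.input_len),                     // Shortest Input First
        "sof" => queue.sort_by_key(|r| r.output_len),                    // Shortest Output First
        "stf" => queue.sort_by_key(|r| r.input_len + r.output_len),      // Shortest Total First
        _ => {} // lif / lof / ltf reverse the corresponding keys
    }
}

fn main() {
    let mut q = vec![
        Queued { arrival: 0.0, input_len: 900, output_len: 40 },
        Queued { arrival: 0.1, input_len: 100, output_len: 300 },
    ];
    sort_waiting(&mut q, "sif");
    println!("first to schedule: input_len = {}", q[0].input_len); // 100
}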
Chunked Prefill
Enable chunked prefill to allow interleaving prompt processing with generation:
enable_chunked_prefill = true
long_prefill_token_threshold = 512 # Optional: chunk size limit
max_num_partial_prefills = 1 # Max concurrent partial prefills
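One way to picture chunked prefill is as a per-iteration token budget: running decodes consume one token each, and a waiting prompt receives whatever budget remains, capped by long_prefill_token_threshold. The sketch below is a simplified illustration, not the actual scheduler logic.
// Simplified illustration of chunked-prefill budgeting (not the real scheduler).
fn prefill_chunk(remaining_prompt: u32, budget_left: u32, threshold: u32) -> u32 {
    // In this sketch a threshold of 0 means "no per-request cap".
    let cap = if threshold > 0 { threshold } else { remaining_prompt };
    remaining_prompt.min(budget_left).min(cap)
}

fn main() {
    // With max_num_batched_tokens = 8192 and 256 running decodes (one token each),
    // 7936 tokens of budget remain; a long prompt is chunked 512 tokens at a time.
    println!("{}", prefill_chunk(20_000, 8192 - 256, 512)); // 512
}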
Preemption-Free Mode
Enable conservative admission control to guarantee zero preemptions:
enable_preemption_free = true
Workload Configuration
Define how requests arrive and their characteristics.
Synthetic Workload
[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42
[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7
[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8
Arrival Patterns
- poisson - Poisson process with exponential inter-arrival times
- uniform - Uniform random inter-arrival times
- burst - Bursty traffic
- fixed_rate - Fixed interval between requests
- closed_loop - Fixed number of concurrent users
- batched - Requests arrive in batches
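For reference, the poisson pattern draws exponentially distributed inter-arrival times with mean 1 / arrival_rate. The sketch below mirrors that math via inverse-transform sampling; it uses the rand crate and is not the simulator's own generator.
// Exponential inter-arrival times for a Poisson arrival process (illustrative).
// Requires the rand crate.
fn main() {
    let arrival_rate = 5.0; // requests per second, as in the example config
    let mut t = 0.0_f64;
    for i in 0..5 {
        let u: f64 = rand::random();         // uniform in [0, 1)
        t += -(1.0 - u).ln() / arrival_rate; // mean gap = 1 / arrival_rate = 0.2 s
        println!("request {i} arrives at t = {t:.3} s");
    }
}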
Length Distributions
Four distribution types are supported:
Fixed:
[workload.input_len_dist]
type = "fixed"
value = 1000
Uniform:
[workload.input_len_dist]
type = "uniform"
min = 100
max = 2000
Normal:
[workload.input_len_dist]
type = "normal"
mean = 1000.0
std_dev = 200.0
LogNormal:
[workload.input_len_dist]
type = "lognormal"
mean = 6.9 # ln(1000)
std_dev = 0.7
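Note that for the lognormal type, mean and std_dev parameterize the underlying normal in log space (hence the ln(1000) comment above), so the median length is exp(mean). A quick standard-library check:
// Lognormal parameters are in log space: median = exp(mu), mean = exp(mu + sigma^2 / 2).
fn main() {
    let (mu, sigma): (f64, f64) = (6.9, 0.7);
    println!("median input length: {:.0} tokens", mu.exp());                         // ~992
    println!("mean input length:   {:.0} tokens", (mu + sigma * sigma / 2.0).exp()); // ~1268
}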
Dataset Mode
Use real request traces instead of synthetic workloads:
[workload]
dataset_path = "path/to/dataset.jsonl"
arrival_pattern = "poisson"
arrival_rate = 1.0
# In dataset mode, input lengths come from the dataset; output_len_dist still
# samples the actual generation length
input_len_dist = { type = "fixed", value = 100 }  # Ignored in dataset mode
output_len_dist = { type = "fixed", value = 50 }  # Samples where generation ends (EOS)
Dataset Format: JSONL file in OpenAI batch API format. Each line should be a JSON object with a messages field containing an array of message objects.
Example:
{"custom_id": "req-1", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}
Tokenizer: Dataset mode requires a tokenizer file to convert text to tokens. You’ll need to provide this via the --tokenizer flag:
inference-lab -c config.toml --tokenizer tokenizer.json
The tokenizer should be a HuggingFace tokenizers JSON file (typically tokenizer.json from the model repository).
Chat Template: You’ll also need to specify how to format messages via --chat-template:
- Use "None" for simple concatenation of messages
- Use a Jinja2 template string for custom formatting (e.g., "{{user}}\n{{assistant}}")
- Most models have their own chat template format
Example with no template:
inference-lab -c config.toml \
--tokenizer tokenizer.json \
--chat-template None
Closed-Loop Workload
Simulate a fixed number of concurrent users:
[workload]
arrival_pattern = "closed_loop"
num_concurrent_users = 10
# ... length distributions ...
Simulation Configuration
Control logging and output:
[simulation]
log_interval = 5 # Log every 5 iterations
Common Configuration Patterns
High Throughput Setup
Maximize batch size and token throughput:
[scheduler]
max_num_batched_tokens = 16384
max_num_seqs = 512
enable_chunked_prefill = true
Low Latency Setup
Prioritize request completion speed:
[scheduler]
max_num_batched_tokens = 4096
max_num_seqs = 64
policy = "sof" # Shortest Output First
Memory-Constrained Setup
Limit KV cache usage:
[hardware]
kv_cache_capacity = 34359738368 # 32 GB explicit limit
[scheduler]
max_num_seqs = 128
Next Steps
- See the Configuration Reference for exhaustive field documentation
- Learn about Running Simulations
CLI Reference
Command-line interface reference for Inference Lab.
Usage
inference-lab [OPTIONS]
Options
Required Options
None - all options have defaults or are optional.
Configuration
-c, --config <PATH>
Path to the TOML configuration file.
- Default: config.toml
- Example: inference-lab -c my-config.toml
Dataset Mode
-t, --tokenizer <PATH>
Path to tokenizer file (required for dataset mode).
- Required when using dataset_path in the configuration
- Example: inference-lab -c config.toml --tokenizer tokenizer.json
--chat-template <TEMPLATE>
Chat template for formatting messages in dataset mode.
- Required when using datasets
- Use "None" for simple message concatenation (no template)
- Example: inference-lab --tokenizer tokenizer.json --chat-template None
- Example with template: inference-lab --tokenizer tokenizer.json --chat-template "{{system}}\n{{user}}\n{{assistant}}"
Output Options
-o, --output <PATH>
Path to output JSON file for results.
- If not specified, results are only displayed to console
- Example:
inference-lab -c config.toml -o results.json
-q, --quiet
Suppress progress output (only show final results).
- Example:
inference-lab -c config.toml -q
-v, --verbose
Enable verbose output.
- Example:
inference-lab -c config.toml -v
--debug
Enable debug logging.
- Example:
inference-lab -c config.toml --debug
--no-color
Disable colored output.
- Useful for logging to files or CI environments
- Example:
inference-lab -c config.toml --no-color
Simulation Options
--seed <NUMBER>
Override the random seed from configuration.
- Useful for reproducible runs with different seeds
- Example:
inference-lab -c config.toml --seed 12345
Examples
Basic Simulation
inference-lab -c config.toml
Dataset Mode
inference-lab -c config.toml \
--tokenizer tokenizer.json \
--chat-template None
Save Results to File
inference-lab -c config.toml -o results.json
Quiet Mode with Output
inference-lab -c config.toml -q -o results.json
Multiple Runs with Different Seeds
for seed in 42 43 44; do
inference-lab -c config.toml --seed $seed -o results_$seed.json
done
Exit Codes
- 0 - Simulation completed successfully
- 1 - Error occurred (configuration error, file not found, etc.)
Configuration File Reference
Complete field-by-field reference for Inference Lab configuration files.
Top-Level Structure
[hardware]
# ... hardware configuration ...
[model]
# ... model configuration ...
[scheduler]
# ... scheduler configuration ...
[workload]
# ... workload configuration ...
[simulation]
# ... simulation configuration ...
[hardware]
GPU and accelerator specifications.
Required Fields
| Field | Type | Description |
|---|---|---|
| name | String | Accelerator name (e.g., "H100", "A100") |
| compute_flops | Float | Compute capacity in FLOPS for the specified precision |
| memory_bandwidth | Float | Memory bandwidth in bytes/second |
| memory_capacity | U64 | Total GPU memory capacity in bytes |
| bytes_per_param | U32 | Bytes per parameter (1 for fp8, 2 for bf16/fp16) |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| kv_cache_capacity | U64 | Computed | KV cache capacity in bytes. If not specified, calculated as (memory_capacity * gpu_memory_utilization) - model_size |
| gpu_memory_utilization | Float | 0.9 | Fraction of GPU memory to use. Used to compute kv_cache_capacity if not explicitly set |
Example
[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2
[model]
LLM architecture parameters.
Required Fields
| Field | Type | Description |
|---|---|---|
| name | String | Model name |
| num_parameters | U64 | Total number of parameters (for MoE: all experts) |
| num_layers | U32 | Number of transformer layers |
| hidden_dim | U32 | Hidden dimension size |
| num_heads | U32 | Number of attention heads |
| max_seq_len | U32 | Maximum sequence length supported by the model |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| num_active_parameters | U64 | num_parameters | Active parameters per forward pass (for MoE models with sparse activation) |
| num_kv_heads | U32 | num_heads | Number of KV heads. Set for GQA/MQA, omit for MHA |
| sliding_window | U32 | None | Sliding window size for sliding window attention layers |
| num_sliding_layers | U32 | 0 | Number of layers using sliding window attention (rest use full attention) |
Example
[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192
[scheduler]
Request scheduling and batching configuration.
Required Fields
| Field | Type | Description |
|---|---|---|
| max_num_batched_tokens | U32 | Maximum number of tokens processed in a single iteration |
| max_num_seqs | U32 | Maximum number of sequences that can run concurrently |
| policy | String | Scheduling policy: "fcfs", "sof", "sif", "stf", "lif", "lof", or "ltf" |
| enable_chunked_prefill | Bool | Enable chunked prefilling to interleave prompt processing with generation |
| block_size | U32 | Block size for KV cache management (in tokens) |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| long_prefill_token_threshold | U32 | 0 or 4% of max_seq_len | Maximum tokens to prefill in a single iteration. Defaults to 0 (no chunking within request) unless max_num_partial_prefills > 1, then defaults to 4% of max_seq_len |
| max_num_partial_prefills | U32 | 1 | Maximum number of sequences that can be partially prefilled concurrently. Limits how many new waiting requests can start prefilling per iteration |
| enable_preemption_free | Bool | false | Enable preemption-free scheduling mode with conservative admission control |
Scheduling Policy Values
- fcfs - First-Come-First-Served
- sof - Shortest Output First
- sif - Shortest Input First
- stf - Shortest Total First
- lif - Longest Input First
- lof - Longest Output First
- ltf - Longest Total First
Example
[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16
[workload]
Request arrival patterns and length distributions.
Required Fields
| Field | Type | Description |
|---|---|---|
| arrival_pattern | String | Arrival pattern: "poisson", "uniform", "burst", "fixed_rate", "closed_loop", or "batched" |
| arrival_rate | Float | Mean arrival rate in requests per second |
| input_len_dist | Distribution | Input sequence length distribution (ignored in dataset mode) |
| output_len_dist | Distribution | Output sequence length distribution (in dataset mode: samples actual generation length) |
| seed | U64 | Random seed for reproducibility |
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| dataset_path | String | None | Path to dataset file in OpenAI batch API format (JSONL). If provided, uses dataset mode instead of a synthetic workload |
| num_requests | Usize | None | Total number of requests to simulate. If None, runs until duration_secs |
| duration_secs | Float | None | Simulation duration in seconds. If None, runs until num_requests |
| num_concurrent_users | Usize | None | Number of concurrent users for the closed_loop pattern. Each user immediately sends a new request when their previous one completes |
Length Distribution Types
Distributions are specified using TOML tables with a type field:
Fixed:
input_len_dist = { type = "fixed", value = 1000 }
Uniform:
input_len_dist = { type = "uniform", min = 100, max = 2000 }
Normal:
input_len_dist = { type = "normal", mean = 1000.0, std_dev = 200.0 }
LogNormal:
input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }
Or using TOML section syntax:
[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7
Example
[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42
[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7
[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8
[simulation]
Simulation control and logging.
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| log_interval | U64 | 100 | Log progress every N iterations |
Example
[simulation]
log_interval = 5
Type Reference
- String: Text string
- Float: 64-bit floating point number
- U32: 32-bit unsigned integer
- U64: 64-bit unsigned integer
- Usize: Platform-dependent unsigned integer
- Bool: Boolean (true or false)
- Distribution: Length distribution object (see Length Distribution Types)