
Introduction

Inference Lab is a simulation framework designed to evaluate and analyze LLM workloads.

It uses discrete-event simulation to model the behavior of a multi-GPU node serving LLM inference requests with the vLLM library. It contains a facsimile of the vLLM queueing, scheduling, and execution logic, with only the actual model inference replaced by a performance model based on the supplied GPU specs and model architecture.

Within each simulation step, the simulator:

  • Processes any newly arrived requests, adding them to the scheduling queue.
  • Schedules requests to serve based on the selected scheduling policy.
  • Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute that workload on the specified hardware (see the sketch after this list).
  • Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
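
As a rough mental model of the execution-time calculation above (an assumption about the performance model, not its exact formula), the time for a scheduled batch can be thought of as roofline-style:

  step_time ≈ max(required_FLOPs / compute_flops, bytes_moved / memory_bandwidth)

where required_FLOPs and bytes_moved are derived from the model architecture and the tokens scheduled in that step.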

Caveats:

  • The performance model assumes perfectly optimized GPU execution: it ignores kernel launch overhead, suboptimal kernels, application overhead, thermal throttling, and similar effects.
  • We simulate tensor parallel execution, but don’t model multi-GPU communication overheads.

Features

  • Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
  • Multiple Scheduling Policies: FCFS, Priority, SJF, and more
  • Chunked Prefill: Simulates realistic request interleaving
  • KV Cache Management: Models GPU memory and KV cache utilization
  • Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
  • WebAssembly Support: Run simulations in the browser via WASM

Quick Start

See the Getting Started guide to begin using Inference Lab.

Getting Started

This guide will help you get started with Inference Lab.

Installation

Install from crates.io:

cargo install --locked inference-lab

Or build from source:

cargo build --release
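
When building from source, the resulting binary lives under target/release (assuming the binary keeps the package name):

./target/release/inference-lab -c config.toml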

Running Your First Simulation

inference-lab -c config.toml

Next Steps

See Running Simulations for how to run simulations and interpret results, and Configuration for setting up workloads, policies, and hardware.

Running Simulations

This guide covers how to run simulations and interpret results.

Basic Usage

Run a simulation with a configuration file:

inference-lab -c config.toml

For dataset mode, add tokenizer and chat template:

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

See Configuration for details on configuring workloads, policies, and hardware.

Output Modes

Console Output (Default)

By default, the simulator displays:

  • Real-time progress bar
  • Current simulation time
  • Queue status (running/waiting requests)
  • KV cache utilization

Final output includes:

  • Latency metrics (TTFT, E2E, per-token)
  • Throughput metrics (tokens/sec, requests/sec)
  • Utilization statistics (KV cache, FLOPS, bandwidth)
  • Preemption statistics

JSON Output

Save results to a file:

inference-lab -c config.toml -o results.json

Combine with -q for batch processing:

inference-lab -c config.toml -q -o results.json

Running Multiple Experiments

Comparing Policies

for policy in fcfs sof sif lof; do
  sed "s/policy = .*/policy = \"$policy\"/" config.toml > config_$policy.toml
  inference-lab -c config_$policy.toml -q -o results_$policy.json
done

Sweeping Parameters

for batch_size in 4096 8192 16384; do
  sed "s/max_num_batched_tokens = .*/max_num_batched_tokens = $batch_size/" \
    config.toml > config_$batch_size.toml
  inference-lab -c config_$batch_size.toml -o results_$batch_size.json
done

Multiple Seeds

Override the seed for reproducibility testing:

for seed in {1..10}; do
  inference-lab -c config.toml --seed $seed -q -o results_$seed.json
done

Understanding Results

Latency Metrics

Time to First Token (TTFT)

  • Time from request arrival to first token generation
  • Lower is better for interactive applications
  • Affected by: queue wait time, prefill computation

End-to-End (E2E) Latency

  • Total time from request arrival to completion
  • Includes prefill and all decode steps
  • Key metric for overall user experience

Per-Token Latency

  • Average time between consecutive output tokens
  • Lower is better for streaming applications
  • Primarily affected by batch size and model size
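
One common way these latency metrics relate (an assumption about this simulator's exact definition, offered only as a rule of thumb):

  per-token latency ≈ (E2E latency - TTFT) / (output tokens - 1)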

Throughput Metrics

Input Tokens/sec

  • Rate of processing prompt tokens
  • Indicates prefill throughput

Output Tokens/sec

  • Rate of generating output tokens
  • Indicates decode throughput

Requests/sec

  • Overall request completion rate
  • Key metric for capacity planning

Utilization Metrics

KV Cache

  • Percentage of KV cache memory in use
  • High utilization may lead to preemptions

FLOPS

  • Percentage of compute capacity utilized
  • Low FLOPS may indicate memory bottleneck

Bandwidth

  • Percentage of memory bandwidth utilized
  • High bandwidth utilization indicates memory-bound workload
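
As a back-of-the-envelope illustration with the H100 / Llama-3-70B example used elsewhere in these docs (ignoring KV cache reads and other traffic): each decode step must stream all weights at least once, so

  70e9 params x 2 bytes/param = 140 GB, and 140e9 bytes / 3.35e12 bytes/s ≈ 42 ms per step

which is why small-batch decoding typically shows high bandwidth utilization and low FLOPS utilization.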

Preemption Statistics

Preemptions occur when new requests need memory but the KV cache is full:

  • Total number of preemptions
  • Average preemptions per request
  • Can significantly impact TTFT for preempted requests

Troubleshooting

Simulation running slowly?

  • Reduce num_requests or use -q flag
  • Increase log_interval in config

Too many preemptions?

  • Increase kv_cache_capacity in hardware config
  • Reduce max_num_seqs or max_num_batched_tokens in scheduler config (see the example below)
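
For example (illustrative values only):

[hardware]
kv_cache_capacity = 68719476736   # raise the explicit KV cache limit to 64 GB

[scheduler]
max_num_seqs = 128                # admit fewer concurrent sequences
max_num_batched_tokens = 4096     # schedule fewer tokens per iteration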

Dataset loading errors?

  • Verify --tokenizer and --chat-template flags are provided
  • Check JSONL format matches OpenAI batch API format

For more details, see CLI Reference and Configuration.

Configuration

Inference Lab uses TOML configuration files to define your simulation parameters. A configuration file has five main sections: hardware, model, scheduler, workload, and simulation.

Configuration Sections Overview

  • [hardware] - GPU specifications (compute, memory, bandwidth)
  • [model] - LLM architecture (layers, parameters, dimensions)
  • [scheduler] - Scheduling policy and batching behavior
  • [workload] - Request arrival patterns and distributions
  • [simulation] - Logging and output options

Quick Start Example

Here’s a minimal configuration to get started:

[hardware]
name = "H100"
compute_flops = 1.513e15        # 1513 TFLOPS bf16
memory_bandwidth = 3.35e12      # 3.35 TB/s
memory_capacity = 85899345920   # 80 GB
bytes_per_param = 2             # bf16

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8                # GQA with 8 KV heads
max_seq_len = 8192

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0              # 5 requests/sec
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9                      # ~1000 tokens median
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3                      # ~200 tokens median
std_dev = 0.8

[simulation]
log_interval = 5

Hardware Configuration

The hardware section defines your GPU specifications:

[hardware]
name = "H100"
compute_flops = 1.513e15        # bf16 TFLOPS
memory_bandwidth = 3.35e12      # bytes/sec
memory_capacity = 85899345920   # 80 GB
bytes_per_param = 2             # 2 for bf16, 1 for fp8

Optional fields:

  • kv_cache_capacity - Explicit KV cache size (otherwise computed automatically)
  • gpu_memory_utilization - Fraction of memory to use (default: 0.9)
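
For example, a sketch of the optional fields (values are illustrative; an explicit kv_cache_capacity replaces the computed value):

[hardware]
# ... required fields ...
gpu_memory_utilization = 0.9        # use 90% of GPU memory for weights + KV cache
# kv_cache_capacity = 34359738368   # or pin the KV cache to 32 GB explicitly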

Model Configuration

Define your LLM architecture:

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8                # For GQA (omit for MHA)
max_seq_len = 8192

Grouped Query Attention (GQA)

For models using GQA, set num_kv_heads to the number of KV heads:

num_kv_heads = 8  # Llama 3 uses 8 KV heads

Omit num_kv_heads for standard multi-head attention (MHA) models.

Mixture of Experts (MoE)

For MoE models, specify active parameters separately:

num_parameters = 140000000000      # Total params
num_active_parameters = 12000000000 # Active per forward pass

Sliding Window Attention

For models like GPT-OSS with sliding window attention:

sliding_window = 4096
num_sliding_layers = 28  # Number of layers using sliding window

Scheduler Configuration

Control request scheduling and batching:

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

Scheduling Policies

Available policies:

  • fcfs - First-Come-First-Served (default)
  • sof - Shortest Output First
  • sif - Shortest Input First
  • stf - Shortest Total First
  • lif - Longest Input First
  • lof - Longest Output First
  • ltf - Longest Total First

Chunked Prefill

Enable chunked prefill to allow interleaving prompt processing with generation:

enable_chunked_prefill = true
long_prefill_token_threshold = 512  # Optional: chunk size limit
max_num_partial_prefills = 1        # Max concurrent partial prefills

Preemption-Free Mode

Enable conservative admission control to guarantee zero preemptions:

enable_preemption_free = true

Workload Configuration

Define how requests arrive and their characteristics.

Synthetic Workload

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8

Arrival Patterns

  • poisson - Poisson process with exponential inter-arrival times
  • uniform - Uniform random inter-arrival times
  • burst - Bursty traffic
  • fixed_rate - Fixed interval between requests
  • closed_loop - Fixed number of concurrent users
  • batched - Requests arrive in batches
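
For example, a minimal sketch of a fixed-rate workload (assuming arrival_rate is interpreted as the fixed request rate for this pattern):

[workload]
arrival_pattern = "fixed_rate"
arrival_rate = 2.0                  # one request every 0.5 s
num_requests = 50
seed = 42

[workload.input_len_dist]
type = "fixed"
value = 1000

[workload.output_len_dist]
type = "fixed"
value = 200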

Length Distributions

Four distribution types are supported:

Fixed:

[workload.input_len_dist]
type = "fixed"
value = 1000

Uniform:

[workload.input_len_dist]
type = "uniform"
min = 100
max = 2000

Normal:

[workload.input_len_dist]
type = "normal"
mean = 1000.0
std_dev = 200.0

LogNormal:

[workload.input_len_dist]
type = "lognormal"
mean = 6.9      # ln(1000)
std_dev = 0.7

Dataset Mode

Use real request traces instead of synthetic workloads:

[workload]
dataset_path = "path/to/dataset.jsonl"
arrival_pattern = "poisson"
arrival_rate = 1.0

# In dataset mode, input lengths come from the dataset itself;
# output_len_dist is still used to sample each request's actual generation length
input_len_dist = { type = "fixed", value = 100 }  # ignored in dataset mode
output_len_dist = { type = "fixed", value = 50 }  # samples where generation ends (EOS)

Dataset Format: JSONL file in OpenAI batch API format. Each line should be a JSON object with a messages field containing an array of message objects.

Example:

{"custom_id": "req-1", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}

Tokenizer: Dataset mode requires a tokenizer file to convert text to tokens. You’ll need to provide this via the --tokenizer flag:

inference-lab -c config.toml --tokenizer tokenizer.json

The tokenizer should be a HuggingFace tokenizers JSON file (typically tokenizer.json from the model repository).

Chat Template: You’ll also need to specify how to format messages via --chat-template:

  • Use "None" for simple concatenation of messages
  • Use a Jinja2 template string for custom formatting (e.g., "{{user}}\n{{assistant}}")
  • Most models have their own chat template format

Example with no template:

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

Closed-Loop Workload

Simulate a fixed number of concurrent users:

[workload]
arrival_pattern = "closed_loop"
num_concurrent_users = 10
# ... length distributions ...

Simulation Configuration

Control logging and output:

[simulation]
log_interval = 5  # Log every 5 iterations

Common Configuration Patterns

High Throughput Setup

Maximize batch size and token throughput:

[scheduler]
max_num_batched_tokens = 16384
max_num_seqs = 512
enable_chunked_prefill = true

Low Latency Setup

Prioritize request completion speed:

[scheduler]
max_num_batched_tokens = 4096
max_num_seqs = 64
policy = "sof"  # Shortest Output First

Memory-Constrained Setup

Limit KV cache usage:

[hardware]
kv_cache_capacity = 34359738368  # 32 GB explicit limit

[scheduler]
max_num_seqs = 128

Next Steps

See the CLI Reference for all command-line options and the Configuration File Reference for a complete field-by-field listing.

CLI Reference

Command-line interface reference for Inference Lab.

Usage

inference-lab [OPTIONS]

Options

Required Options

None - all options have defaults or are optional.

Configuration

-c, --config <PATH>

Path to the TOML configuration file.

  • Default: config.toml
  • Example: inference-lab -c my-config.toml

Dataset Mode

-t, --tokenizer <PATH>

Path to tokenizer file (required for dataset mode).

  • Required when using dataset_path in configuration
  • Example: inference-lab -c config.toml --tokenizer tokenizer.json

--chat-template <TEMPLATE>

Chat template for formatting messages in dataset mode.

  • Required when using datasets
  • Use "None" for simple message concatenation (no template)
  • Example: inference-lab --tokenizer tokenizer.json --chat-template None
  • Example with template: inference-lab --tokenizer tokenizer.json --chat-template "{{system}}\n{{user}}\n{{assistant}}"

Output Options

-o, --output <PATH>

Path to output JSON file for results.

  • If not specified, results are only displayed to console
  • Example: inference-lab -c config.toml -o results.json

-q, --quiet

Suppress progress output (only show final results).

  • Example: inference-lab -c config.toml -q

-v, --verbose

Enable verbose output.

  • Example: inference-lab -c config.toml -v

--debug

Enable debug logging.

  • Example: inference-lab -c config.toml --debug

--no-color

Disable colored output.

  • Useful for logging to files or CI environments
  • Example: inference-lab -c config.toml --no-color

Simulation Options

--seed <NUMBER>

Override the random seed from configuration.

  • Useful for reproducible runs with different seeds
  • Example: inference-lab -c config.toml --seed 12345

Examples

Basic Simulation

inference-lab -c config.toml

Dataset Mode

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

Save Results to File

inference-lab -c config.toml -o results.json

Quiet Mode with Output

inference-lab -c config.toml -q -o results.json

Multiple Runs with Different Seeds

for seed in 42 43 44; do
  inference-lab -c config.toml --seed $seed -o results_$seed.json
done

Exit Codes

  • 0 - Simulation completed successfully
  • 1 - Error occurred (configuration error, file not found, etc.)
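
The exit code can be used for simple scripting:

if ! inference-lab -c config.toml -q -o results.json; then
  echo "simulation failed" >&2
  exit 1
fi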

Configuration File Reference

Complete field-by-field reference for Inference Lab configuration files.

Top-Level Structure

[hardware]
# ... hardware configuration ...

[model]
# ... model configuration ...

[scheduler]
# ... scheduler configuration ...

[workload]
# ... workload configuration ...

[simulation]
# ... simulation configuration ...

[hardware]

GPU and accelerator specifications.

Required Fields

  • name (String): Accelerator name (e.g., "H100", "A100")
  • compute_flops (Float): Compute capacity in FLOPS for the specified precision
  • memory_bandwidth (Float): Memory bandwidth in bytes/second
  • memory_capacity (U64): Total GPU memory capacity in bytes
  • bytes_per_param (U32): Bytes per parameter (1 for fp8, 2 for bf16/fp16)

Optional Fields

  • kv_cache_capacity (U64, default: computed): KV cache capacity in bytes. If not specified, calculated as (memory_capacity * gpu_memory_utilization) - model_size
  • gpu_memory_utilization (Float, default: 0.9): Fraction of GPU memory to use. Used to compute kv_cache_capacity if not explicitly set

Example

[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[model]

LLM architecture parameters.

Required Fields

  • name (String): Model name
  • num_parameters (U64): Total number of parameters (for MoE: all experts)
  • num_layers (U32): Number of transformer layers
  • hidden_dim (U32): Hidden dimension size
  • num_heads (U32): Number of attention heads
  • max_seq_len (U32): Maximum sequence length supported by the model

Optional Fields

  • num_active_parameters (U64, default: num_parameters): Active parameters per forward pass (for MoE models with sparse activation)
  • num_kv_heads (U32, default: num_heads): Number of KV heads. Set for GQA/MQA, omit for MHA
  • sliding_window (U32, default: none): Sliding window size for sliding window attention layers
  • num_sliding_layers (U32, default: 0): Number of layers using sliding window attention (the rest use full attention)

Example

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192

[scheduler]

Request scheduling and batching configuration.

Required Fields

  • max_num_batched_tokens (U32): Maximum number of tokens processed in a single iteration
  • max_num_seqs (U32): Maximum number of sequences that can run concurrently
  • policy (String): Scheduling policy: "fcfs", "sof", "sif", "stf", "lif", "lof", or "ltf"
  • enable_chunked_prefill (Bool): Enable chunked prefilling to interleave prompt processing with generation
  • block_size (U32): Block size for KV cache management (in tokens)

Optional Fields

  • long_prefill_token_threshold (U32): Maximum tokens to prefill in a single iteration. Defaults to 0 (no chunking within a request) unless max_num_partial_prefills > 1, in which case it defaults to 4% of max_seq_len
  • max_num_partial_prefills (U32, default: 1): Maximum number of sequences that can be partially prefilled concurrently. Limits how many new waiting requests can start prefilling per iteration
  • enable_preemption_free (Bool, default: false): Enable preemption-free scheduling mode with conservative admission control

Scheduling Policy Values

  • fcfs - First-Come-First-Served
  • sof - Shortest Output First
  • sif - Shortest Input First
  • stf - Shortest Total First
  • lif - Longest Input First
  • lof - Longest Output First
  • ltf - Longest Total First

Example

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]

Request arrival patterns and length distributions.

Required Fields

  • arrival_pattern (String): Arrival pattern: "poisson", "uniform", "burst", "fixed_rate", "closed_loop", or "batched"
  • arrival_rate (Float): Mean arrival rate in requests per second
  • input_len_dist (Distribution): Input sequence length distribution (ignored in dataset mode)
  • output_len_dist (Distribution): Output sequence length distribution (in dataset mode: samples the actual generation length)
  • seed (U64): Random seed for reproducibility

Optional Fields

  • dataset_path (String, default: none): Path to a dataset file in OpenAI batch API format (JSONL). If provided, dataset mode is used instead of a synthetic workload
  • num_requests (Usize, default: none): Total number of requests to simulate. If unset, the simulation runs until duration_secs
  • duration_secs (Float, default: none): Simulation duration in seconds. If unset, the simulation runs until num_requests
  • num_concurrent_users (Usize, default: none): Number of concurrent users for the closed_loop pattern. Each user immediately sends a new request when their previous one completes
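
For example, a time-bounded run (a minimal sketch using the fields above):

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
duration_secs = 60.0                # run for 60 seconds of simulated time instead of a fixed request count
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8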

Length Distribution Types

Distributions are specified using TOML tables with a type field:

Fixed:

input_len_dist = { type = "fixed", value = 1000 }

Uniform:

input_len_dist = { type = "uniform", min = 100, max = 2000 }

Normal:

input_len_dist = { type = "normal", mean = 1000.0, std_dev = 200.0 }

LogNormal:

input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }

Or using TOML section syntax:

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

Example

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8

[simulation]

Simulation control and logging.

Optional Fields

  • log_interval (U64, default: 100): Log progress every N iterations

Example

[simulation]
log_interval = 5

Type Reference

  • String: Text string
  • Float: 64-bit floating point number
  • U32: 32-bit unsigned integer
  • U64: 64-bit unsigned integer
  • Usize: Platform-dependent unsigned integer
  • Bool: Boolean (true or false)
  • Distribution: Length distribution object (see Length Distribution Types)