
Introduction

Inference Lab is a simulation framework designed to evaluate and analyze LLM workloads.

It uses discrete-event simulation to model the behavior of a multi-GPU node serving LLM inference requests with the vLLM library. It contains a facsimile of the vLLM queueing, scheduling, and execution logic, with only the actual model inference replaced by a performance model based on the supplied GPU specs and model architecture.

Within each simulation step, the simulator:

  • Processes any newly arrived requests, adding them to the scheduling queue.
  • Schedules requests to serve based on the selected scheduling policy.
  • Calculates the compute and memory bandwidth usage for the workload that the scheduled requests represent, and the theoretical time required to execute that workload on the specified hardware (see the sketch after this list).
  • Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
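
As a rough mental model of the execution-time calculation above (an assumption about the performance model, not its exact formula), the time for a scheduled batch can be thought of as roofline-style:

  step_time ≈ max(required_FLOPs / compute_flops, bytes_moved / memory_bandwidth)

where required_FLOPs and bytes_moved are derived from the model architecture and the tokens scheduled in that step.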

Caveats:

  • The performance model assumes perfectly optimized GPU execution: it ignores kernel launch overhead, suboptimal kernels, application overhead, thermal throttling, and similar effects.
  • We simulate tensor parallel execution, but don’t model multi-GPU communication overheads.

Features

  • Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
  • Multiple Scheduling Policies: FCFS, Priority, SJF, and more
  • Chunked Prefill: Simulates realistic request interleaving
  • KV Cache Management: Models GPU memory and KV cache utilization
  • Workload Generation: Supports Poisson, Gamma, and closed-loop patterns
  • WebAssembly Support: Run simulations in the browser via WASM

Quick Start

See the Getting Started guide to begin using Inference Lab.

Getting Started

This guide will help you get started with Inference Lab.

Installation

Install from crates.io:

cargo install --locked inference-lab

Or build from source:

cargo build --release
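
When building from source, the resulting binary lives under target/release (assuming the binary keeps the package name):

./target/release/inference-lab -c config.toml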

Running Your First Simulation

inference-lab -c config.toml

Next Steps

See Running Simulations for how to run simulations and interpret results, and Configuration for setting up workloads, policies, and hardware.

Running Simulations

This guide covers how to run simulations and interpret results.

Basic Usage

Run a simulation with a configuration file:

inference-lab -c config.toml

For dataset mode, add tokenizer and chat template:

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

See Configuration for details on configuring workloads, policies, and hardware.

Output Modes

Console Output (Default)

By default, the simulator displays:

  • Real-time progress bar
  • Current simulation time
  • Queue status (running/waiting requests)
  • KV cache utilization

Final output includes:

  • Latency metrics (TTFT, E2E, per-token)
  • Throughput metrics (tokens/sec, requests/sec)
  • Utilization statistics (KV cache, FLOPS, bandwidth)
  • Preemption statistics

JSON Output

Save results to a file:

inference-lab -c config.toml -o results.json

Combine with -q for batch processing:

inference-lab -c config.toml -q -o results.json

Running Multiple Experiments

Comparing Policies

for policy in fcfs sof sif lof; do
  sed "s/policy = .*/policy = \"$policy\"/" config.toml > config_$policy.toml
  inference-lab -c config_$policy.toml -q -o results_$policy.json
done

Sweeping Parameters

for batch_size in 4096 8192 16384; do
  sed "s/max_num_batched_tokens = .*/max_num_batched_tokens = $batch_size/" \
    config.toml > config_$batch_size.toml
  inference-lab -c config_$batch_size.toml -o results_$batch_size.json
done

Multiple Seeds

Override the seed for reproducibility testing:

for seed in {1..10}; do
  inference-lab -c config.toml --seed $seed -q -o results_$seed.json
done

Understanding Results

Latency Metrics

Time to First Token (TTFT)

  • Time from request arrival to first token generation
  • Lower is better for interactive applications
  • Affected by: queue wait time, prefill computation

End-to-End (E2E) Latency

  • Total time from request arrival to completion
  • Includes prefill and all decode steps
  • Key metric for overall user experience

Per-Token Latency

  • Average time between consecutive output tokens
  • Lower is better for streaming applications
  • Primarily affected by batch size and model size
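
One common way these latency metrics relate (an assumption about this simulator's exact definition, offered only as a rule of thumb):

  per-token latency ≈ (E2E latency - TTFT) / (output tokens - 1)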

Throughput Metrics

Input Tokens/sec

  • Rate of processing prompt tokens
  • Indicates prefill throughput

Output Tokens/sec

  • Rate of generating output tokens
  • Indicates decode throughput

Requests/sec

  • Overall request completion rate
  • Key metric for capacity planning

Utilization Metrics

KV Cache

  • Percentage of KV cache memory in use
  • High utilization may lead to preemptions

FLOPS

  • Percentage of compute capacity utilized
  • Low FLOPS may indicate memory bottleneck

Bandwidth

  • Percentage of memory bandwidth utilized
  • High bandwidth utilization indicates memory-bound workload
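
As a back-of-the-envelope illustration with the H100 / Llama-3-70B example used elsewhere in these docs (ignoring KV cache reads and other traffic): each decode step must stream all weights at least once, so

  70e9 params x 2 bytes/param = 140 GB, and 140e9 bytes / 3.35e12 bytes/s ≈ 42 ms per step

which is why small-batch decoding typically shows high bandwidth utilization and low FLOPS utilization.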

Preemption Statistics

Preemptions occur when new requests need memory but the KV cache is full:

  • Total number of preemptions
  • Average preemptions per request
  • Can significantly impact TTFT for preempted requests

Troubleshooting

Simulation running slowly?

  • Reduce num_requests or use -q flag
  • Increase log_interval in config

Too many preemptions?

  • Increase kv_cache_capacity in hardware config
  • Reduce max_num_seqs or max_num_batched_tokens in scheduler config (see the example below)
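
For example (illustrative values only):

[hardware]
kv_cache_capacity = 68719476736   # raise the explicit KV cache limit to 64 GB

[scheduler]
max_num_seqs = 128                # admit fewer concurrent sequences
max_num_batched_tokens = 4096     # schedule fewer tokens per iteration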

Dataset loading errors?

  • Verify --tokenizer and --chat-template flags are provided
  • Check JSONL format matches OpenAI batch API format

For more details, see CLI Reference and Configuration.

Configuration

Inference Lab uses TOML configuration files to define your simulation parameters. A configuration file has five main sections: hardware, model, scheduler, workload, and simulation.

Configuration Sections Overview

  • [hardware] - GPU specifications (compute, memory, bandwidth)
  • [model] - LLM architecture (layers, parameters, dimensions)
  • [scheduler] - Scheduling policy and batching behavior
  • [workload] - Request arrival patterns and distributions
  • [simulation] - Logging and output options

Quick Start Example

Here’s a minimal configuration to get started:

[hardware]
name = "H100"
compute_flops = 1.513e15        # 1513 TFLOPS bf16
memory_bandwidth = 3.35e12      # 3.35 TB/s
memory_capacity = 85899345920   # 80 GB
bytes_per_param = 2             # bf16

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8                # GQA with 8 KV heads
max_seq_len = 8192

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0              # 5 requests/sec
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9                      # ~1000 tokens median
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3                      # ~200 tokens median
std_dev = 0.8

[simulation]
log_interval = 5

Hardware Configuration

The hardware section defines your GPU specifications:

[hardware]
name = "H100"
compute_flops = 1.513e15        # bf16 TFLOPS
memory_bandwidth = 3.35e12      # bytes/sec
memory_capacity = 85899345920   # 80 GB
bytes_per_param = 2             # 2 for bf16, 1 for fp8

Optional fields:

  • kv_cache_capacity - Explicit KV cache size (otherwise computed automatically)
  • gpu_memory_utilization - Fraction of memory to use (default: 0.9)
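
For example, a sketch of the optional fields (values are illustrative; an explicit kv_cache_capacity replaces the computed value):

[hardware]
# ... required fields ...
gpu_memory_utilization = 0.9        # use 90% of GPU memory for weights + KV cache
# kv_cache_capacity = 34359738368   # or pin the KV cache to 32 GB explicitly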

Model Configuration

Define your LLM architecture:

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8                # For GQA (omit for MHA)
max_seq_len = 8192

Grouped Query Attention (GQA)

For models using GQA, set num_kv_heads to the number of KV heads:

num_kv_heads = 8  # Llama 3 uses 8 KV heads

Omit num_kv_heads for standard multi-head attention (MHA) models.

Mixture of Experts (MoE)

For MoE models, specify active parameters separately:

num_parameters = 140000000000      # Total params
num_active_parameters = 12000000000 # Active per forward pass

Sliding Window Attention

For models like GPT-OSS with sliding window attention:

sliding_window = 4096
num_sliding_layers = 28  # Number of layers using sliding window

Scheduler Configuration

Control request scheduling and batching:

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

Scheduling Policies

Available policies:

  • fcfs - First-Come-First-Served (default)
  • sof - Shortest Output First
  • sif - Shortest Input First
  • stf - Shortest Total First
  • lif - Longest Input First
  • lof - Longest Output First
  • ltf - Longest Total First

Chunked Prefill

Enable chunked prefill to allow interleaving prompt processing with generation:

enable_chunked_prefill = true
long_prefill_token_threshold = 512  # Optional: chunk size limit
max_num_partial_prefills = 1        # Max concurrent partial prefills

Preemption-Free Mode

Enable conservative admission control to guarantee zero preemptions:

enable_preemption_free = true

Workload Configuration

Define how requests arrive and their characteristics.

Synthetic Workload

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8

Arrival Patterns

  • poisson - Poisson process with exponential inter-arrival times
  • uniform - Uniform random inter-arrival times
  • burst - Bursty traffic
  • fixed_rate - Fixed interval between requests
  • closed_loop - Fixed number of concurrent users
  • batched - Requests arrive in batches
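
For example, a minimal sketch of a fixed-rate workload (assuming arrival_rate is interpreted as the fixed request rate for this pattern):

[workload]
arrival_pattern = "fixed_rate"
arrival_rate = 2.0                  # one request every 0.5 s
num_requests = 50
seed = 42

[workload.input_len_dist]
type = "fixed"
value = 1000

[workload.output_len_dist]
type = "fixed"
value = 200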

Length Distributions

Four distribution types are supported:

Fixed:

[workload.input_len_dist]
type = "fixed"
value = 1000

Uniform:

[workload.input_len_dist]
type = "uniform"
min = 100
max = 2000

Normal:

[workload.input_len_dist]
type = "normal"
mean = 1000.0
std_dev = 200.0

LogNormal:

[workload.input_len_dist]
type = "lognormal"
mean = 6.9      # ln(1000)
std_dev = 0.7

Dataset Mode

Use real request traces instead of synthetic workloads:

[workload]
dataset_path = "path/to/dataset.jsonl"
arrival_pattern = "poisson"
arrival_rate = 1.0

# In dataset mode, input lengths come from the dataset itself;
# output_len_dist is still used to sample each request's actual generation length
input_len_dist = { type = "fixed", value = 100 }  # ignored in dataset mode
output_len_dist = { type = "fixed", value = 50 }  # samples where generation ends (EOS)

Dataset Format: JSONL file in OpenAI batch API format. Each line should be a JSON object with a messages field containing an array of message objects.

Example:

{"custom_id": "req-1", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}

Tokenizer: Dataset mode requires a tokenizer file to convert text to tokens. You’ll need to provide this via the --tokenizer flag:

inference-lab -c config.toml --tokenizer tokenizer.json

The tokenizer should be a HuggingFace tokenizers JSON file (typically tokenizer.json from the model repository).

Chat Template: You’ll also need to specify how to format messages via --chat-template:

  • Use "None" for simple concatenation of messages
  • Use a Jinja2 template string for custom formatting (e.g., "{{user}}\n{{assistant}}")
  • Most models have their own chat template format

Example with no template:

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

Closed-Loop Workload

Simulate a fixed number of concurrent users:

[workload]
arrival_pattern = "closed_loop"
num_concurrent_users = 10
# ... length distributions ...

Simulation Configuration

Control logging and output:

[simulation]
log_interval = 5  # Log every 5 iterations

Common Configuration Patterns

High Throughput Setup

Maximize batch size and token throughput:

[scheduler]
max_num_batched_tokens = 16384
max_num_seqs = 512
enable_chunked_prefill = true

Low Latency Setup

Prioritize request completion speed:

[scheduler]
max_num_batched_tokens = 4096
max_num_seqs = 64
policy = "sof"  # Shortest Output First

Memory-Constrained Setup

Limit KV cache usage:

[hardware]
kv_cache_capacity = 34359738368  # 32 GB explicit limit

[scheduler]
max_num_seqs = 128

Next Steps

See the CLI Reference for all command-line options and the Configuration File Reference for a complete field-by-field listing.

CLI Reference

Command-line interface reference for Inference Lab.

Usage

inference-lab [OPTIONS]

Options

Required Options

None - all options have defaults or are optional.

Configuration

-c, --config <PATH>

Path to the TOML configuration file.

  • Default: config.toml
  • Example: inference-lab -c my-config.toml

Dataset Mode

-t, --tokenizer <PATH>

Path to tokenizer file (required for dataset mode).

  • Required when using dataset_path in configuration
  • Example: inference-lab -c config.toml --tokenizer tokenizer.json

--chat-template <TEMPLATE>

Chat template for formatting messages in dataset mode.

  • Required when using datasets
  • Use "None" for simple message concatenation (no template)
  • Example: inference-lab --tokenizer tokenizer.json --chat-template None
  • Example with template: inference-lab --tokenizer tokenizer.json --chat-template "{{system}}\n{{user}}\n{{assistant}}"

Output Options

-o, --output <PATH>

Path to output JSON file for results.

  • If not specified, results are only displayed to console
  • Example: inference-lab -c config.toml -o results.json

-q, --quiet

Suppress progress output (only show final results).

  • Example: inference-lab -c config.toml -q

-v, --verbose

Enable verbose output.

  • Example: inference-lab -c config.toml -v

--debug

Enable debug logging.

  • Example: inference-lab -c config.toml --debug

--no-color

Disable colored output.

  • Useful for logging to files or CI environments
  • Example: inference-lab -c config.toml --no-color

Simulation Options

--seed <NUMBER>

Override the random seed from configuration.

  • Useful for reproducible runs with different seeds
  • Example: inference-lab -c config.toml --seed 12345

Examples

Basic Simulation

inference-lab -c config.toml

Dataset Mode

inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None

Save Results to File

inference-lab -c config.toml -o results.json

Quiet Mode with Output

inference-lab -c config.toml -q -o results.json

Multiple Runs with Different Seeds

for seed in 42 43 44; do
  inference-lab -c config.toml --seed $seed -o results_$seed.json
done

Exit Codes

  • 0 - Simulation completed successfully
  • 1 - Error occurred (configuration error, file not found, etc.)
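
The exit code can be used for simple scripting:

if ! inference-lab -c config.toml -q -o results.json; then
  echo "simulation failed" >&2
  exit 1
fi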

Configuration File Reference

Complete field-by-field reference for Inference Lab configuration files.

Top-Level Structure

[hardware]
# ... hardware configuration ...

[model]
# ... model configuration ...

[scheduler]
# ... scheduler configuration ...

[workload]
# ... workload configuration ...

[simulation]
# ... simulation configuration ...

[hardware]

GPU and accelerator specifications.

Required Fields

  • name (String): Accelerator name (e.g., "H100", "A100")
  • compute_flops (Float): Compute capacity in FLOPS for the specified precision
  • memory_bandwidth (Float): Memory bandwidth in bytes/second
  • memory_capacity (U64): Total GPU memory capacity in bytes
  • bytes_per_param (U32): Bytes per parameter (1 for fp8, 2 for bf16/fp16)

Optional Fields

  • kv_cache_capacity (U64, default: computed): KV cache capacity in bytes. If not specified, calculated as (memory_capacity * gpu_memory_utilization) - model_size
  • gpu_memory_utilization (Float, default: 0.9): Fraction of GPU memory to use. Used to compute kv_cache_capacity if not explicitly set

Example

[hardware]
name = "H100"
compute_flops = 1.513e15
memory_bandwidth = 3.35e12
memory_capacity = 85899345920
bytes_per_param = 2

[model]

LLM architecture parameters.

Required Fields

  • name (String): Model name
  • num_parameters (U64): Total number of parameters (for MoE: all experts)
  • num_layers (U32): Number of transformer layers
  • hidden_dim (U32): Hidden dimension size
  • num_heads (U32): Number of attention heads
  • max_seq_len (U32): Maximum sequence length supported by the model

Optional Fields

  • num_active_parameters (U64, default: num_parameters): Active parameters per forward pass (for MoE models with sparse activation)
  • num_kv_heads (U32, default: num_heads): Number of KV heads. Set for GQA/MQA, omit for MHA
  • sliding_window (U32, default: none): Sliding window size for sliding window attention layers
  • num_sliding_layers (U32, default: 0): Number of layers using sliding window attention (the rest use full attention)

Example

[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192

[scheduler]

Request scheduling and batching configuration.

Required Fields

  • max_num_batched_tokens (U32): Maximum number of tokens processed in a single iteration
  • max_num_seqs (U32): Maximum number of sequences that can run concurrently
  • policy (String): Scheduling policy: "fcfs", "sof", "sif", "stf", "lif", "lof", or "ltf"
  • enable_chunked_prefill (Bool): Enable chunked prefilling to interleave prompt processing with generation
  • block_size (U32): Block size for KV cache management (in tokens)

Optional Fields

  • long_prefill_token_threshold (U32): Maximum tokens to prefill in a single iteration. Defaults to 0 (no chunking within a request) unless max_num_partial_prefills > 1, in which case it defaults to 4% of max_seq_len
  • max_num_partial_prefills (U32, default: 1): Maximum number of sequences that can be partially prefilled concurrently. Limits how many new waiting requests can start prefilling per iteration
  • enable_preemption_free (Bool, default: false): Enable preemption-free scheduling mode with conservative admission control

Scheduling Policy Values

  • fcfs - First-Come-First-Served
  • sof - Shortest Output First
  • sif - Shortest Input First
  • stf - Shortest Total First
  • lif - Longest Input First
  • lof - Longest Output First
  • ltf - Longest Total First

Example

[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16

[workload]

Request arrival patterns and length distributions.

Required Fields

  • arrival_pattern (String): Arrival pattern: "poisson", "uniform", "burst", "fixed_rate", "closed_loop", or "batched"
  • arrival_rate (Float): Mean arrival rate in requests per second
  • input_len_dist (Distribution): Input sequence length distribution (ignored in dataset mode)
  • output_len_dist (Distribution): Output sequence length distribution (in dataset mode: samples the actual generation length)
  • seed (U64): Random seed for reproducibility

Optional Fields

  • dataset_path (String, default: none): Path to a dataset file in OpenAI batch API format (JSONL). If provided, dataset mode is used instead of a synthetic workload
  • num_requests (Usize, default: none): Total number of requests to simulate. If unset, the simulation runs until duration_secs
  • duration_secs (Float, default: none): Simulation duration in seconds. If unset, the simulation runs until num_requests
  • num_concurrent_users (Usize, default: none): Number of concurrent users for the closed_loop pattern. Each user immediately sends a new request when their previous one completes
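
For example, a time-bounded run (a minimal sketch using the fields above):

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
duration_secs = 60.0                # run for 60 seconds of simulated time instead of a fixed request count
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8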

Length Distribution Types

Distributions are specified using TOML tables with a type field:

Fixed:

input_len_dist = { type = "fixed", value = 1000 }

Uniform:

input_len_dist = { type = "uniform", min = 100, max = 2000 }

Normal:

input_len_dist = { type = "normal", mean = 1000.0, std_dev = 200.0 }

LogNormal:

input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }

Or using TOML section syntax:

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

Example

[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8

[simulation]

Simulation control and logging.

Optional Fields

  • log_interval (U64, default: 100): Log progress every N iterations

Example

[simulation]
log_interval = 5

Type Reference

  • String: Text string
  • Float: 64-bit floating point number
  • U32: 32-bit unsigned integer
  • U64: 64-bit unsigned integer
  • Usize: Platform-dependent unsigned integer
  • Bool: Boolean (true or false)
  • Distribution: Length distribution object (see Length Distribution Types)