
Configuration File Reference

Complete field-by-field reference for Inference Lab configuration files.

Top-Level Structure

```toml
[hardware]
# ... hardware configuration ...

[model]
# ... model configuration ...

[scheduler]
# ... scheduler configuration ...

[workload]
# ... workload configuration ...

[simulation]
# ... simulation configuration ...
```

[hardware]

GPU and accelerator specifications.

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| `name` | String | Accelerator name (e.g., `"H100"`, `"A100"`) |
| `compute_flops` | Float | Compute capacity in FLOPS for the specified precision |
| `memory_bandwidth` | Float | Memory bandwidth in bytes/second |
| `memory_capacity` | U64 | Total GPU memory capacity in bytes |
| `bytes_per_param` | U32 | Bytes per parameter (1 for fp8, 2 for bf16/fp16) |

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `kv_cache_capacity` | U64 | Computed | KV cache capacity in bytes. If not specified, calculated as `(memory_capacity * gpu_memory_utilization) - model_size` |
| `gpu_memory_utilization` | Float | 0.9 | Fraction of GPU memory to use. Used to compute `kv_cache_capacity` if not explicitly set |
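The default `kv_cache_capacity` computation can be sketched as follows. This is a minimal illustration; approximating `model_size` as `num_parameters * bytes_per_param` is an assumption about how the simulator sizes the weights, not something stated above:

```python
def default_kv_cache_capacity(memory_capacity: int,
                              gpu_memory_utilization: float,
                              num_parameters: int,
                              bytes_per_param: int) -> int:
    """Sketch of the documented default: usable GPU memory minus model weights."""
    # Assumption: model_size is the weight footprint, one entry per parameter.
    model_size = num_parameters * bytes_per_param
    return int(memory_capacity * gpu_memory_utilization) - model_size

# An 80 GiB H100 at 0.9 utilization hosting an 8B-parameter model in bf16
# leaves roughly 61.3 GB for KV cache:
cap = default_kv_cache_capacity(85899345920, 0.9, 8_000_000_000, 2)
```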

Example

```toml
[hardware]
name = "H100"
compute_flops = 1.513e15       # FLOPS at the configured precision
memory_bandwidth = 3.35e12     # 3.35 TB/s
memory_capacity = 85899345920  # 80 GiB
bytes_per_param = 2            # bf16/fp16
```
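A quick sanity check on the byte values above (illustrative only):

```python
GIB = 1024 ** 3  # bytes per GiB

# memory_capacity in the example is exactly 80 GiB.
assert 85899345920 == 80 * GIB
# memory_bandwidth is 3.35 TB/s expressed in bytes/second.
assert 3.35e12 / 1e12 == 3.35
```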

[model]

LLM architecture parameters.

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| `name` | String | Model name |
| `num_parameters` | U64 | Total number of parameters (for MoE: all experts) |
| `num_layers` | U32 | Number of transformer layers |
| `hidden_dim` | U32 | Hidden dimension size |
| `num_heads` | U32 | Number of attention heads |
| `max_seq_len` | U32 | Maximum sequence length supported by the model |

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `num_active_parameters` | U64 | `num_parameters` | Active parameters per forward pass (for MoE models with sparse activation) |
| `num_kv_heads` | U32 | `num_heads` | Number of KV heads. Set for GQA/MQA, omit for MHA |
| `sliding_window` | U32 | None | Sliding window size for sliding window attention layers |
| `num_sliding_layers` | U32 | 0 | Number of layers using sliding window attention (rest use full attention) |
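These fields determine the per-token KV cache footprint. A sketch using the standard formula, with two assumptions not stated in this section: `head_dim = hidden_dim / num_heads`, and a bytes-per-parameter value taken from the `[hardware]` section:

```python
def kv_bytes_per_token(num_layers: int, hidden_dim: int, num_heads: int,
                       num_kv_heads: int, bytes_per_param: int) -> int:
    # Assumption: head_dim = hidden_dim / num_heads (standard transformer layout).
    head_dim = hidden_dim // num_heads
    # K and V each store num_kv_heads * head_dim values per layer, hence the 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_param

# The Llama-3-70B example below: 80 layers, 8192 hidden, 64 heads, 8 KV heads, bf16.
per_token = kv_bytes_per_token(80, 8192, 64, 8, 2)  # 327680 bytes = 320 KiB per token
```

With only 8 KV heads instead of 64, GQA cuts the cache to an eighth of its MHA size.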

Example

```toml
[model]
name = "Llama-3-70B"
num_parameters = 70000000000
num_layers = 80
hidden_dim = 8192
num_heads = 64
num_kv_heads = 8
max_seq_len = 8192
```

[scheduler]

Request scheduling and batching configuration.

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| `max_num_batched_tokens` | U32 | Maximum number of tokens processed in a single iteration |
| `max_num_seqs` | U32 | Maximum number of sequences that can run concurrently |
| `policy` | String | Scheduling policy: `"fcfs"`, `"sof"`, `"sif"`, `"stf"`, `"lif"`, `"lof"`, or `"ltf"` |
| `enable_chunked_prefill` | Bool | Enable chunked prefilling to interleave prompt processing with generation |
| `block_size` | U32 | Block size for KV cache management (in tokens) |

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `long_prefill_token_threshold` | U32 | 0 or 4% of `max_seq_len` | Maximum tokens to prefill in a single iteration. Defaults to 0 (no chunking within request) unless `max_num_partial_prefills > 1`, then defaults to 4% of `max_seq_len` |
| `max_num_partial_prefills` | U32 | 1 | Maximum number of sequences that can be partially prefilled concurrently. Limits how many new waiting requests can start prefilling per iteration |
| `enable_preemption_free` | Bool | false | Enable preemption-free scheduling mode with conservative admission control |

Scheduling Policy Values

- `fcfs`: First-Come-First-Served
- `sof`: Shortest Output First
- `sif`: Shortest Input First
- `stf`: Shortest Total First
- `lif`: Longest Input First
- `lof`: Longest Output First
- `ltf`: Longest Total First

Example

```toml
[scheduler]
max_num_batched_tokens = 8192
max_num_seqs = 256
policy = "fcfs"
enable_chunked_prefill = true
block_size = 16
```
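One way to picture how `max_num_batched_tokens` and `enable_chunked_prefill` interact: each iteration has a token budget, running decode sequences consume it first, and a prompt that does not fit can be split into a chunk that does. The sketch below is purely illustrative, not the simulator's actual scheduler; charging one token per decoding sequence is an assumption:

```python
def plan_iteration(decode_seqs: int, waiting_prompts: list[int],
                   max_num_batched_tokens: int,
                   enable_chunked_prefill: bool) -> list[int]:
    """Return the prefill token counts admitted this iteration."""
    # Assumption: each decoding sequence consumes one token of the budget.
    budget = max_num_batched_tokens - decode_seqs
    admitted = []
    for prompt_len in waiting_prompts:
        if budget <= 0:
            break
        if prompt_len <= budget:
            admitted.append(prompt_len)  # the full prefill fits
            budget -= prompt_len
        elif enable_chunked_prefill:
            admitted.append(budget)      # prefill a chunk now, the rest later
            budget = 0
        else:
            break                        # without chunking, the prompt must wait
    return admitted

# With 100 decoding sequences the budget is 8092, so a 9000-token prompt is chunked:
chunks = plan_iteration(100, [9000], 8192, enable_chunked_prefill=True)  # [8092]
```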

[workload]

Request arrival patterns and length distributions.

Required Fields

| Field | Type | Description |
| --- | --- | --- |
| `arrival_pattern` | String | Arrival pattern: `"poisson"`, `"uniform"`, `"burst"`, `"fixed_rate"`, `"closed_loop"`, or `"batched"` |
| `arrival_rate` | Float | Mean arrival rate in requests per second |
| `input_len_dist` | Distribution | Input sequence length distribution (ignored in dataset mode) |
| `output_len_dist` | Distribution | Output sequence length distribution (in dataset mode: samples actual generation length) |
| `seed` | U64 | Random seed for reproducibility |

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `dataset_path` | String | None | Path to dataset file in OpenAI batch API format (JSONL). If provided, uses dataset mode instead of synthetic workload |
| `num_requests` | Usize | None | Total number of requests to simulate. If None, runs until `duration_secs` |
| `duration_secs` | Float | None | Simulation duration in seconds. If None, runs until `num_requests` |
| `num_concurrent_users` | Usize | None | Number of concurrent users for the `closed_loop` pattern. Each user immediately sends a new request when their previous one completes |

Length Distribution Types

Distributions are specified as TOML tables with a `type` field:

Fixed:

```toml
input_len_dist = { type = "fixed", value = 1000 }
```

Uniform:

```toml
input_len_dist = { type = "uniform", min = 100, max = 2000 }
```

Normal:

```toml
input_len_dist = { type = "normal", mean = 1000.0, std_dev = 200.0 }
```

LogNormal:

```toml
input_len_dist = { type = "lognormal", mean = 6.9, std_dev = 0.7 }
```

Or using TOML section syntax:

```toml
[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7
```
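A quick sanity check on the lognormal example, assuming (as is conventional, though not stated above) that `mean` and `std_dev` parameterize the underlying normal in log space: the median length is `exp(mean)`, so `mean = 6.9` yields input lengths centered near 992 tokens, and `mean = 5.3` near 200:

```python
import math
import random

# Assumption: mean/std_dev are log-space parameters, matching the
# (mu, sigma) convention of random.lognormvariate.
rng = random.Random(42)
samples = sorted(rng.lognormvariate(6.9, 0.7) for _ in range(100_000))
median = samples[len(samples) // 2]
# Theoretical median = exp(6.9) ≈ 992 tokens; the sample median lands close to it.
```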

Example

```toml
[workload]
arrival_pattern = "poisson"
arrival_rate = 5.0
num_requests = 100
seed = 42

[workload.input_len_dist]
type = "lognormal"
mean = 6.9
std_dev = 0.7

[workload.output_len_dist]
type = "lognormal"
mean = 5.3
std_dev = 0.8
```
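For `arrival_pattern = "poisson"`, inter-arrival times are exponentially distributed with mean `1 / arrival_rate`. An illustrative generator for the example above (a sketch, not the simulator's code):

```python
import random

def poisson_arrivals(arrival_rate: float, num_requests: int, seed: int) -> list[float]:
    """Arrival timestamps in seconds, with exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(arrival_rate)  # mean gap = 1 / arrival_rate
        times.append(t)
    return times

# 100 requests at 5 req/s span roughly 20 seconds of simulated time.
times = poisson_arrivals(5.0, 100, seed=42)
```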

[simulation]

Simulation control and logging.

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `log_interval` | U64 | 100 | Log progress every N iterations |

Example

```toml
[simulation]
log_interval = 5
```

Type Reference

- `String`: Text string
- `Float`: 64-bit floating point number
- `U32`: 32-bit unsigned integer
- `U64`: 64-bit unsigned integer
- `Usize`: Platform-dependent unsigned integer
- `Bool`: Boolean (`true` or `false`)
- `Distribution`: Length distribution object (see Length Distribution Types)