# Running Simulations

This guide covers how to run simulations and interpret results.
## Basic Usage

Run a simulation with a configuration file:

```bash
inference-lab -c config.toml
```
For dataset mode, add a tokenizer and chat template:

```bash
inference-lab -c config.toml \
  --tokenizer tokenizer.json \
  --chat-template None
```
See Configuration for details on workloads, policies, and hardware settings.
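For orientation, a minimal config might look like the sketch below. Only the key names that appear elsewhere in this guide (`policy`, `max_num_batched_tokens`, `max_num_seqs`, `kv_cache_capacity`, `num_requests`, `log_interval`, `seed`) are taken from these docs; the section layout and all values are assumptions, so treat the Configuration page as authoritative.

```toml
# Hypothetical sketch -- section layout and values are assumptions;
# only the key names appear elsewhere in this guide.
seed = 42
log_interval = 100                 # mentioned under Troubleshooting

[workload]
num_requests = 1000                # mentioned under Troubleshooting

[scheduler]
policy = "fcfs"                    # fcfs | sof | sif | lof (see Comparing Policies)
max_num_seqs = 256
max_num_batched_tokens = 8192      # swept under Sweeping Parameters

[hardware]
kv_cache_capacity = 137438953472   # 128 GiB in bytes (unit is an assumption)
```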
## Output Modes

### Console Output (Default)

By default, the simulator displays:

- Real-time progress bar
- Current simulation time
- Queue status (running/waiting requests)
- KV cache utilization

Final output includes:

- Latency metrics (TTFT, E2E, per-token)
- Throughput metrics (tokens/sec, requests/sec)
- Utilization statistics (KV cache, FLOPS, bandwidth)
- Preemption statistics
### JSON Output

Save results to a file:

```bash
inference-lab -c config.toml -o results.json
```

Combine with `-q` for batch processing:

```bash
inference-lab -c config.toml -q -o results.json
```
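Once results are in JSON they can be post-processed with standard tools such as `jq`. The field names below (`latency.ttft_mean`, `throughput.output_tokens_per_sec`) are hypothetical placeholders; inspect the file to find the actual schema:

```bash
# Field names are hypothetical -- list the real top-level keys first:
jq 'keys' results.json
# ...then pull out the metrics you care about, e.g.:
jq '{ttft: .latency.ttft_mean, out_tps: .throughput.output_tokens_per_sec}' results.json
```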
## Running Multiple Experiments

### Comparing Policies

```bash
for policy in fcfs sof sif lof; do
  sed "s/policy = .*/policy = \"$policy\"/" config.toml > config_$policy.toml
  inference-lab -c config_$policy.toml -q -o results_$policy.json
done
```
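To line the policies up afterwards, a loop over the result files works; `.latency.e2e_mean` is again a hypothetical field name:

```bash
# Print one (hypothetical) metric per policy for a quick comparison
for policy in fcfs sof sif lof; do
  echo "$policy: $(jq '.latency.e2e_mean' results_$policy.json)"
done
```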
### Sweeping Parameters

```bash
for batch_size in 4096 8192 16384; do
  sed "s/max_num_batched_tokens = .*/max_num_batched_tokens = $batch_size/" \
    config.toml > config_$batch_size.toml
  inference-lab -c config_$batch_size.toml -o results_$batch_size.json
done
```
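The same pattern summarizes a sweep; adjust the (hypothetical) jq path to whatever metric you are studying:

```bash
# Throughput vs. batch size; the jq path is a hypothetical placeholder
for batch_size in 4096 8192 16384; do
  echo "$batch_size: $(jq '.throughput.output_tokens_per_sec' results_$batch_size.json)"
done
```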
### Multiple Seeds

Override the seed for reproducibility testing:

```bash
for seed in {1..10}; do
  inference-lab -c config.toml --seed $seed -q -o results_$seed.json
done
```
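To gauge run-to-run variance, aggregate one metric across the seed files; the field name is a hypothetical placeholder:

```bash
# Slurp all seed results and report mean/min/max of a hypothetical TTFT field
jq -s 'map(.latency.ttft_mean) | {mean: (add/length), min: min, max: max}' results_{1..10}.json
```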
## Understanding Results

### Latency Metrics

**Time to First Token (TTFT)**

- Time from request arrival to first token generation
- Lower is better for interactive applications
- Affected by queue wait time and prefill computation

**End-to-End (E2E) Latency**

- Total time from request arrival to completion
- Includes prefill and all decode steps
- Key metric for overall user experience

**Per-Token Latency**

- Average time between consecutive output tokens
- Lower is better for streaming applications
- Primarily affected by batch size and model size
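These three metrics are related: for a streamed response, E2E latency is roughly TTFT plus per-token latency times the remaining output tokens. A back-of-envelope check with made-up numbers:

```bash
# E2E ≈ TTFT + (output_tokens - 1) * per-token latency
# e.g. TTFT = 0.5 s, 256 output tokens, 20 ms/token (illustrative numbers)
awk 'BEGIN { printf "%.2f s\n", 0.5 + 255 * 0.020 }'   # -> 5.60 s
```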
### Throughput Metrics

**Input Tokens/sec**

- Rate of processing prompt tokens
- Indicates prefill throughput

**Output Tokens/sec**

- Rate of generating output tokens
- Indicates decode throughput

**Requests/sec**

- Overall request completion rate
- Key metric for capacity planning
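At steady state these rates are linked: requests/sec is approximately output tokens/sec divided by the mean output length. With illustrative numbers:

```bash
# 10,000 output tokens/sec at a mean output length of 250 tokens
awk 'BEGIN { printf "%.1f req/s\n", 10000 / 250 }'   # -> 40.0 req/s
```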
### Utilization Metrics

**KV Cache**

- Percentage of KV cache memory in use
- High utilization may lead to preemptions

**FLOPS**

- Percentage of compute capacity utilized
- Low FLOPS utilization may indicate a memory bottleneck

**Bandwidth**

- Percentage of memory bandwidth utilized
- High bandwidth utilization indicates a memory-bound workload
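Read FLOPS and bandwidth together as a rough roofline: the ridge point is peak FLOPS divided by peak bandwidth, and workloads whose arithmetic intensity (FLOPs per byte moved) falls below it are memory-bound, which is typical of decode-heavy traffic. With illustrative hardware numbers:

```bash
# Ridge point = peak FLOPS / peak bandwidth (illustrative 300 TFLOP/s, 2 TB/s)
awk 'BEGIN { printf "%.0f FLOPs/byte\n", 300e12 / 2e12 }'   # -> 150 FLOPs/byte
```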
### Preemption Statistics

Preemptions occur when new requests need memory but the KV cache is full. The simulator reports:

- Total number of preemptions
- Average preemptions per request

Preemptions can significantly impact TTFT for the requests that get preempted.
## Troubleshooting

**Simulation running slowly?**

- Reduce `num_requests` or use the `-q` flag
- Increase `log_interval` in the config

**Too many preemptions?**

- Increase `kv_cache_capacity` in the hardware config (see the sketch after this list)
- Reduce `max_num_seqs` or `max_num_batched_tokens` in the scheduler config

**Dataset loading errors?**

- Verify the `--tokenizer` and `--chat-template` flags are provided
- Check that the JSONL format matches the OpenAI batch API format
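Following the sed pattern used for sweeps above, a one-off capacity bump might look like this; the value and its unit are assumptions, so match whatever format your config uses:

```bash
# kv_cache_capacity value is hypothetical -- 128 GiB in bytes here
sed 's/kv_cache_capacity = .*/kv_cache_capacity = 137438953472/' config.toml > config_bigkv.toml
inference-lab -c config_bigkv.toml
```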
For more details, see CLI Reference and Configuration.