Introduction

Inference Lab is a simulation framework designed to evaluate and analyze LLM workloads.

It uses discrete-event simulation to model the behavior of a multi-GPU node serving LLM inference requests with the vLLM library. It contains a facsimile of the vLLM queueing, scheduling, and execution logic, with only the actual model inference replaced by a performance model based on the supplied GPU specs and model architecture.
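As an illustration of the kind of estimate such a performance model produces, the sketch below applies a roofline-style bound: a batch of work takes as long as the slower of its compute and memory-bandwidth requirements. The language (Rust), names, and numbers here are assumptions for illustration, not Inference Lab's actual API.

```rust
/// Hypothetical GPU spec; field names and values are illustrative,
/// not Inference Lab's actual API.
struct GpuSpec {
    peak_flops: f64,    // peak compute throughput, FLOP/s
    mem_bandwidth: f64, // peak memory bandwidth, bytes/s
}

/// Roofline-style bound: a step takes as long as its slower resource.
fn step_time_seconds(spec: &GpuSpec, flops: f64, bytes_moved: f64) -> f64 {
    let compute_time = flops / spec.peak_flops;
    let memory_time = bytes_moved / spec.mem_bandwidth;
    compute_time.max(memory_time)
}

fn main() {
    // Illustrative A100-class numbers: ~312 TFLOP/s, ~2 TB/s HBM bandwidth.
    let gpu = GpuSpec { peak_flops: 312e12, mem_bandwidth: 2.0e12 };
    let t = step_time_seconds(&gpu, 5.0e12, 1.5e12);
    println!("estimated step time: {:.4} s", t);
}
```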

Within each simulation step, the simulator does the following (a code sketch follows the list):

  • Processes any newly arrived requests, adding them to the scheduling queue.
  • Schedules requests to serve based on the selected scheduling policy.
  • Calculates the compute and memory-bandwidth demands of the scheduled requests, and the theoretical time required to execute that workload on the specified hardware.
  • Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
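A minimal sketch of that loop, assuming plain FCFS scheduling and fixed per-token costs; every type, name, and number below is hypothetical, not taken from the actual codebase:

```rust
/// Hypothetical request and simulator state; illustrative only.
struct Request {
    remaining_tokens: u64, // decode tokens still to be generated
}

struct Simulator {
    clock: f64,          // simulation time, in seconds
    queue: Vec<Request>, // waiting and running requests, in FCFS order
}

impl Simulator {
    /// One discrete-event step, mirroring the four stages listed above.
    fn step(&mut self, arrivals: Vec<Request>, max_batch: usize) {
        // 1. Admit newly arrived requests into the scheduling queue.
        self.queue.extend(arrivals);

        // 2. Schedule requests to serve (plain FCFS up to max_batch here;
        //    the real simulator applies the selected scheduling policy).
        let scheduled = self.queue.len().min(max_batch);

        // 3. Estimate the cost of the scheduled work. The per-token numbers
        //    are placeholders; a real model would derive FLOPs and bytes
        //    moved from the model architecture and the scheduled tokens.
        let tokens = scheduled as f64; // one decode token per request
        let compute_time = tokens * 1.0e9 / 312e12; // FLOPs / peak FLOP/s
        let memory_time = tokens * 1.0e6 / 2.0e12;  // bytes / bandwidth
        let exec_time = compute_time.max(memory_time);

        // 4. Advance the clock and update request state.
        self.clock += exec_time;
        for req in self.queue.iter_mut().take(scheduled) {
            req.remaining_tokens = req.remaining_tokens.saturating_sub(1);
        }
        self.queue.retain(|r| r.remaining_tokens > 0);
    }
}

fn main() {
    let mut sim = Simulator { clock: 0.0, queue: Vec::new() };
    sim.step(vec![Request { remaining_tokens: 8 }], 256);
    println!("t = {:.6} s, {} request(s) in flight", sim.clock, sim.queue.len());
}
```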

Caveats:

  • The performance model assumes perfectly optimized GPU execution, ignoring kernel launch overheads, poorly optimized kernels, application overhead, thermal throttling, and so on.
  • We simulate tensor parallel execution, but don’t model multi-GPU communication overheads.

Features

  • Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
  • Multiple Scheduling Policies: FCFS, Priority, SJF, and more
  • Chunked Prefill: Simulates realistic request interleaving
  • KV Cache Management: Models GPU memory and KV cache utilization
  • Workload Generation: Supports Poisson, Gamma, and closed-loop arrival patterns (see the sketch after this list)
  • WebAssembly Support: Run simulations in the browser via WASM
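As an example of workload generation, Poisson arrivals can be produced by sampling exponentially distributed inter-arrival gaps. The sketch below assumes Rust with the rand and rand_distr crates; the function name and signature are illustrative, not the project's actual API.

```rust
// Assumed Cargo dependencies: rand = "0.8", rand_distr = "0.4"
use rand_distr::{Distribution, Exp};

/// Generate `n` arrival timestamps for a Poisson process with the given
/// request rate (requests per second). Inter-arrival gaps of a Poisson
/// process are exponentially distributed with parameter `rate`.
fn poisson_arrivals(rate: f64, n: usize) -> Vec<f64> {
    let exp = Exp::new(rate).expect("rate must be positive");
    let mut rng = rand::thread_rng();
    let mut t = 0.0;
    (0..n)
        .map(|_| {
            t += exp.sample(&mut rng);
            t
        })
        .collect()
}

fn main() {
    // Example: 10 arrivals at an average rate of 4 requests per second.
    for (i, t) in poisson_arrivals(4.0, 10).iter().enumerate() {
        println!("request {i} arrives at t = {t:.3} s");
    }
}
```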

Quick Start

See the Getting Started guide to begin using Inference Lab.