Introduction

Inference Lab is a simulation framework designed to evaluate and analyze LLM workloads.

It uses discrete-event simulation to model the behavior of a multi-GPU node serving LLM inference requests with the vLLM library. It contains a facsimile of the vLLM queueing, scheduling, and execution logic, with only the actual model inference replaced by a performance model based on the supplied GPU specs and model architecture.
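As an illustration of the kind of estimate such a performance model produces, the sketch below applies a roofline-style bound: a batch of work takes as long as the slower of its compute and memory-bandwidth requirements. The language (Rust), names, and numbers here are assumptions for illustration, not Inference Lab's actual API.

```rust
/// Hypothetical GPU spec; field names and values are illustrative,
/// not Inference Lab's actual API.
struct GpuSpec {
    peak_flops: f64,    // peak compute throughput, FLOP/s
    mem_bandwidth: f64, // peak memory bandwidth, bytes/s
}

/// Roofline-style bound: a step takes as long as its slower resource.
fn step_time_seconds(spec: &GpuSpec, flops: f64, bytes_moved: f64) -> f64 {
    let compute_time = flops / spec.peak_flops;
    let memory_time = bytes_moved / spec.mem_bandwidth;
    compute_time.max(memory_time)
}

fn main() {
    // Illustrative A100-class numbers: ~312 TFLOP/s, ~2 TB/s HBM bandwidth.
    let gpu = GpuSpec { peak_flops: 312e12, mem_bandwidth: 2.0e12 };
    let t = step_time_seconds(&gpu, 5.0e12, 1.5e12);
    println!("estimated step time: {:.4} s", t);
}
```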

Within each simulation step, the simulator does the following (a code sketch follows the list):

  • Processes any newly arrived requests, adding them to the scheduling queue.
  • Schedules requests to serve based on the selected scheduling policy.
  • Calculates the compute and memory-bandwidth demands of the scheduled requests, and the theoretical time required to execute that workload on the specified hardware.
  • Increments the simulation time by the calculated execution time, updating the state of all requests accordingly.
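A minimal sketch of that loop, assuming plain FCFS scheduling and fixed per-token costs; every type, name, and number below is hypothetical, not taken from the actual codebase:

```rust
/// Hypothetical request and simulator state; illustrative only.
struct Request {
    remaining_tokens: u64, // decode tokens still to be generated
}

struct Simulator {
    clock: f64,          // simulation time, in seconds
    queue: Vec<Request>, // waiting and running requests, in FCFS order
}

impl Simulator {
    /// One discrete-event step, mirroring the four stages listed above.
    fn step(&mut self, arrivals: Vec<Request>, max_batch: usize) {
        // 1. Admit newly arrived requests into the scheduling queue.
        self.queue.extend(arrivals);

        // 2. Schedule requests to serve (plain FCFS up to max_batch here;
        //    the real simulator applies the selected scheduling policy).
        let scheduled = self.queue.len().min(max_batch);

        // 3. Estimate the cost of the scheduled work. The per-token numbers
        //    are placeholders; a real model would derive FLOPs and bytes
        //    moved from the model architecture and the scheduled tokens.
        let tokens = scheduled as f64; // one decode token per request
        let compute_time = tokens * 1.0e9 / 312e12; // FLOPs / peak FLOP/s
        let memory_time = tokens * 1.0e6 / 2.0e12;  // bytes / bandwidth
        let exec_time = compute_time.max(memory_time);

        // 4. Advance the clock and update request state.
        self.clock += exec_time;
        for req in self.queue.iter_mut().take(scheduled) {
            req.remaining_tokens = req.remaining_tokens.saturating_sub(1);
        }
        self.queue.retain(|r| r.remaining_tokens > 0);
    }
}

fn main() {
    let mut sim = Simulator { clock: 0.0, queue: Vec::new() };
    sim.step(vec![Request { remaining_tokens: 8 }], 256);
    println!("t = {:.6} s, {} request(s) in flight", sim.clock, sim.queue.len());
}
```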

Caveats:

  • The performance model assumes perfectly optimized GPU execution, ignoring kernel launch overheads, poorly optimized kernels, application overhead, thermal throttling, and so on.
  • We simulate tensor parallel execution, but don’t model multi-GPU communication overheads.

Features

  • Accurate Performance Modeling: Models compute (FLOPS) and memory bandwidth constraints
  • Multiple Scheduling Policies: FCFS, Priority, SJF, and more
  • Chunked Prefill: Simulates realistic request interleaving
  • KV Cache Management: Models GPU memory and KV cache utilization
  • Workload Generation: Supports Poisson, Gamma, and closed-loop arrival patterns (see the sketch after this list)
  • WebAssembly Support: Run simulations in the browser via WASM
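As an example of workload generation, Poisson arrivals can be produced by sampling exponentially distributed inter-arrival gaps. The sketch below assumes Rust with the rand and rand_distr crates; the function name and signature are illustrative, not the project's actual API.

```rust
// Assumed Cargo dependencies: rand = "0.8", rand_distr = "0.4"
use rand_distr::{Distribution, Exp};

/// Generate `n` arrival timestamps for a Poisson process with the given
/// request rate (requests per second). Inter-arrival gaps of a Poisson
/// process are exponentially distributed with parameter `rate`.
fn poisson_arrivals(rate: f64, n: usize) -> Vec<f64> {
    let exp = Exp::new(rate).expect("rate must be positive");
    let mut rng = rand::thread_rng();
    let mut t = 0.0;
    (0..n)
        .map(|_| {
            t += exp.sample(&mut rng);
            t
        })
        .collect()
}

fn main() {
    // Example: 10 arrivals at an average rate of 4 requests per second.
    for (i, t) in poisson_arrivals(4.0, 10).iter().enumerate() {
        println!("request {i} arrives at t = {t:.3} s");
    }
}
```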

Quick Start

See the Getting Started guide to begin using Inference Lab.