Research Context
This page is optional. It exists to position vllm-sr-sim relative to adjacent
research systems and planning tools; most users can skip it and start with
Getting started or Capacity planning scenarios.
vllm-sr-sim sits at the intersection of several active research threads.
Each related system answers a different question from the one this simulator addresses.
Mélange — heterogeneous GPU type selection
Griggs et al., UC Berkeley, 2024 · arXiv:2404.14527
Mélange shows that the optimal GPU type is determined by three interacting factors: request size (short requests favour cheap GPUs; long ones favour high-end GPUs), arrival rate (low rates allow right-sizing to cheaper hardware), and SLO tightness (strict latency requires fast GPUs regardless of cost). It formulates GPU allocation as cost-aware bin packing — GPUs are bins, workload slices are items — and uses an ILP to find the minimum-cost multi-GPU-type allocation. Mélange achieves up to 77% cost reduction versus serving on a single GPU type.
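The bin-packing formulation above can be sketched with a brute-force search in place of the ILP solver. All numbers here are hypothetical illustrations; Mélange derives the real per-GPU cost and throughput figures from empirical profiles per (GPU, request-size bucket, SLO).

```python
from itertools import product

# Hypothetical $/hr and sustained requests/s for one workload slice.
# Melange would measure these empirically; they are NOT real figures.
GPUS = {
    "A10G": (1.0, 4.0),
    "A100": (3.0, 15.0),
    "H100": (5.0, 28.0),
}

def cheapest_mix(demand_rps: float, max_per_type: int = 8):
    """Exhaustively search GPU-count combinations for the cheapest
    allocation covering demand; an ILP does the same minimisation
    efficiently at fleet scale."""
    best = None
    names = list(GPUS)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        capacity = sum(c * GPUS[n][1] for c, n in zip(counts, names))
        if capacity < demand_rps:
            continue  # this mix cannot serve the workload slice
        cost = sum(c * GPUS[n][0] for c, n in zip(counts, names))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best

print(cheapest_mix(30.0))
```

With these toy numbers the search lands on a mid-tier mix rather than the flagship GPU, which is exactly the effect Mélange exploits: the cheapest type per request/s depends on the slice, not on raw GPU speed.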
Key differences from vllm-sr-sim:
| Aspect | Mélange | vllm-sr-sim |
|---|---|---|
| Input | Empirical throughput profiles per (GPU, request-size bucket, SLO) | Physics-derived W/H from HardwareSpec + ModelSpec; no GPU required |
| Output | Optimal mix of GPU types (how many A10G, A100, H100 …) | Optimal number of GPU instances per pool + routing topology |
| Routing | None — bins requests by size, assigns bins to GPU types | Explicit routing policies: length, semantic, C+R, model |
| Serving model | Single pool per GPU type, no pool routing | Multi-pool with inter-pool routing and SLO verification |
| SLO metric | Average TPOT | P99 TTFT (also supports TPOT via profile) |
| Validation | Benchmark runs on real hardware | Analytical Erlang-C + discrete-event simulation |
When to use Mélange: You have a homogeneous workload and want to know which cloud
GPU SKU to rent. Mélange selects the type; vllm-sr-sim then tells you how
many of that type you need, given your length distribution and routing strategy.
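The "how many of that type" question reduces to the analytical Erlang-C check mentioned in the table: find the smallest instance count whose P99 queueing delay stays under the TTFT target. A minimal M/M/c sketch follows; the arrival rate, service rate, and SLO values are illustrative, and the real simulator's physics-derived service times and routing policies are not modelled here.

```python
import math

def erlang_c(c: int, a: float) -> float:
    """Probability an arriving request must queue in an M/M/c system
    with offered load a = lambda / mu."""
    if a >= c:
        return 1.0  # unstable: every request queues
    top = a**c / math.factorial(c)
    bottom = (1 - a / c) * sum(a**k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def min_instances(lam: float, mu: float, slo_s: float, quantile: float = 0.99) -> int:
    """Smallest instance count c with P(queueing delay > slo_s) <= 1 - quantile."""
    a = lam / mu
    c = max(1, math.ceil(a))
    while True:
        # Waiting-time tail for M/M/c: P(W > t) = P_wait * exp(-(c*mu - lam) * t)
        if a < c and erlang_c(c, a) * math.exp(-(c * mu - lam) * slo_s) <= 1 - quantile:
            return c
        c += 1

# 20 req/s arriving, each instance serves 2.5 req/s, 500 ms P99 target
print(min_instances(lam=20.0, mu=2.5, slo_s=0.5))
```

Note the answer sits well above the naive utilisation bound of ceil(20 / 2.5) = 8 instances: tail-latency SLOs demand headroom that average-throughput sizing misses.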
SageServe — forecast-aware runtime auto-scaling
Jaiswal et al., Microsoft O365, 2025 · arXiv:2502.14617
SageServe is a runtime controller for an existing fleet. It characterises production O365 workloads (10 M+ requests/day across 3 US regions, 4 models), observes strong diurnal periodicity in interactive (IW) traffic and opportunistic non-interactive (NIW) batch jobs, and proposes: (1) a unified GPU VM pool shared across IW and NIW instead of siloed pools; (2) ARIMA-based hourly traffic forecasting; (3) an ILP to compute optimal instance count changes (δ) that minimise VM cold-start overhead; (4) a reactive heuristic that fine-tunes based on live memory utilisation. SageServe saves 25% of GPU-hours and reduces cold-start waste by 80%.
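The δ computation in step (3) can be sketched as a one-dimensional search over the next hour's instance-count change, trading VM-hours against cold-start overhead. The constants and function name below are hypothetical stand-ins; SageServe derives its inputs from O365 traces plus an ARIMA forecast and solves a richer ILP across pools and regions.

```python
# Hypothetical illustrative constants -- NOT SageServe's real parameters.
VM_COST_PER_HOUR = 3.0    # $ per instance-hour
COLD_START_COST = 1.5     # $ amortised per newly started instance
CAPACITY_PER_VM = 120.0   # requests/min one instance sustains

def scale_delta(current: int, forecast_rpm: float, max_delta: int = 20) -> int:
    """Pick the instance-count change (delta) for the next hour that
    covers the forecast demand at minimum VM-hour + cold-start cost.
    Scale-downs incur no cold-start penalty; scale-ups do."""
    best_delta, best_cost = 0, float("inf")
    for delta in range(-current, max_delta + 1):
        n = current + delta
        if n * CAPACITY_PER_VM < forecast_rpm:
            continue  # would violate the capacity constraint
        cost = n * VM_COST_PER_HOUR + max(delta, 0) * COLD_START_COST
        if cost < best_cost:
            best_cost, best_delta = cost, delta
    return best_delta

print(scale_delta(current=10, forecast_rpm=600.0))   # diurnal trough: scale down
print(scale_delta(current=3, forecast_rpm=600.0))    # ramp-up: pay some cold starts
```

The cold-start term is what separates this from naive right-sizing: when the forecast dips briefly, the penalty on future scale-ups can make holding instances cheaper than churning them.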
Key differences from vllm-sr-sim:
| Aspect | SageServe | vllm-sr-sim |
|---|---|---|
| Problem | How many instances to run right now given current traffic | How many GPUs to provision in total for a target traffic level |
| Time horizon | Minutes to hours (dynamic scaling loop) | Static capacity plan (peak-hour sizing) |
| Traffic model | Production traces + ARIMA forecast | Poisson arrivals / CDF workload / trace replay |
| Multi-tier workloads | IW-Fast, IW-Normal, NIW with different SLAs | Single SLO per pool (multi-SLO via multi-pool config) |
| Routing | Memory-utilisation-based cross-region routing | Length / semantic / model / C+R content-based routing |
| Performance model | Empirical TPS profiles per (model, GPU) | Physics-based roofline from specs |
| Hardware requirement | Real production traces from O365 GPT models | Self-contained; works without any hardware or traces |
When to use SageServe: You already have a deployed fleet and need to scale it
up/down through a 24-hour demand cycle. Use vllm-sr-sim first to size the
peak-hour fleet; then apply SageServe-style policies to scale down during off-peak
hours, saving 20–30% of GPU-hours.