Research Context
This page is optional. It exists to position vllm-sr-sim relative to adjacent
research systems and planning tools; most users can skip it and start with
Getting started or Capacity planning scenarios.
vllm-sr-sim sits at the intersection of several active research threads.
Each related system answers a different question from the one this simulator addresses.
Mélange — heterogeneous GPU type selection
Griggs et al., UC Berkeley, 2024 · arXiv:2404.14527
Mélange shows that the optimal GPU type is determined by three interacting factors: request size (short requests favour cheap GPUs; long ones favour high-end GPUs), arrival rate (low rates allow right-sizing to cheaper hardware), and SLO tightness (strict latency requires fast GPUs regardless of cost). It formulates GPU allocation as cost-aware bin packing — GPUs are bins, workload slices are items — and uses an ILP to find the minimum-cost multi-GPU-type allocation. Mélange achieves up to 77% cost reduction versus serving on a single GPU type.
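The bin-packing formulation above can be sketched with a brute-force search in place of the ILP solver. All numbers here are hypothetical illustrations; Mélange derives the real per-GPU cost and throughput figures from empirical profiles per (GPU, request-size bucket, SLO).

```python
from itertools import product

# Hypothetical $/hr and sustained requests/s for one workload slice.
# Melange would measure these empirically; they are NOT real figures.
GPUS = {
    "A10G": (1.0, 4.0),
    "A100": (3.0, 15.0),
    "H100": (5.0, 28.0),
}

def cheapest_mix(demand_rps: float, max_per_type: int = 8):
    """Exhaustively search GPU-count combinations for the cheapest
    allocation covering demand; an ILP does the same minimisation
    efficiently at fleet scale."""
    best = None
    names = list(GPUS)
    for counts in product(range(max_per_type + 1), repeat=len(names)):
        capacity = sum(c * GPUS[n][1] for c, n in zip(counts, names))
        if capacity < demand_rps:
            continue  # this mix cannot serve the workload slice
        cost = sum(c * GPUS[n][0] for c, n in zip(counts, names))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, counts)))
    return best

print(cheapest_mix(30.0))
```

With these toy numbers the search lands on a mid-tier mix rather than the flagship GPU, which is exactly the effect Mélange exploits: the cheapest type per request/s depends on the slice, not on raw GPU speed.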
Key differences from vllm-sr-sim:
| Aspect | Mélange | vllm-sr-sim |
|---|---|---|
| Input | Empirical throughput profiles per (GPU, request-size bucket, SLO) | Physics-derived W/H from HardwareSpec + ModelSpec; no GPU required |
| Output | Optimal mix of GPU types (how many A10G, A100, H100 …) | Optimal number of GPU instances per pool + routing topology |
| Routing | None — bins requests by size, assigns bins to GPU types | Explicit routing policies: length, semantic, C+R, model |
| Serving model | Single pool per GPU type, no pool routing | Multi-pool with inter-pool routing and SLO verification |
| SLO metric | Average TPOT | P99 TTFT (also supports TPOT via profile) |
| Validation | Benchmark runs on real hardware | Analytical Erlang-C + discrete-event simulation |
When to use Mélange: You have a homogeneous workload and want to know which cloud
GPU SKU to rent. Mélange selects the type; vllm-sr-sim then tells you how
many of that type you need, given your length distribution and routing strategy.
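The "how many of that type" question reduces to the analytical Erlang-C check mentioned in the table: find the smallest instance count whose P99 queueing delay stays under the TTFT target. A minimal M/M/c sketch follows; the arrival rate, service rate, and SLO values are illustrative, and the real simulator's physics-derived service times and routing policies are not modelled here.

```python
import math

def erlang_c(c: int, a: float) -> float:
    """Probability an arriving request must queue in an M/M/c system
    with offered load a = lambda / mu."""
    if a >= c:
        return 1.0  # unstable: every request queues
    top = a**c / math.factorial(c)
    bottom = (1 - a / c) * sum(a**k / math.factorial(k) for k in range(c)) + top
    return top / bottom

def min_instances(lam: float, mu: float, slo_s: float, quantile: float = 0.99) -> int:
    """Smallest instance count c with P(queueing delay > slo_s) <= 1 - quantile."""
    a = lam / mu
    c = max(1, math.ceil(a))
    while True:
        # Waiting-time tail for M/M/c: P(W > t) = P_wait * exp(-(c*mu - lam) * t)
        if a < c and erlang_c(c, a) * math.exp(-(c * mu - lam) * slo_s) <= 1 - quantile:
            return c
        c += 1

# 20 req/s arriving, each instance serves 2.5 req/s, 500 ms P99 target
print(min_instances(lam=20.0, mu=2.5, slo_s=0.5))
```

Note the answer sits well above the naive utilisation bound of ceil(20 / 2.5) = 8 instances: tail-latency SLOs demand headroom that average-throughput sizing misses.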
SageServe — forecast-aware runtime auto-scaling
Jaiswal et al., Microsoft O365, 2025 · arXiv:2502.14617
SageServe is a runtime controller for an existing fleet. It characterises production O365 workloads (10 M+ requests/day across 3 US regions, 4 models), observes strong diurnal periodicity in interactive (IW) traffic and opportunistic non-interactive (NIW) batch jobs, and proposes: (1) a unified GPU VM pool shared across IW and NIW instead of siloed pools; (2) ARIMA-based hourly traffic forecasting; (3) an ILP to compute optimal instance count changes (δ) that minimise VM cold-start overhead; (4) a reactive heuristic that fine-tunes based on live memory utilisation. SageServe saves 25% of GPU-hours and reduces cold-start waste by 80%.
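The δ computation in step (3) can be sketched as a one-dimensional search over the next hour's instance-count change, trading VM-hours against cold-start overhead. The constants and function name below are hypothetical stand-ins; SageServe derives its inputs from O365 traces plus an ARIMA forecast and solves a richer ILP across pools and regions.

```python
# Hypothetical illustrative constants -- NOT SageServe's real parameters.
VM_COST_PER_HOUR = 3.0    # $ per instance-hour
COLD_START_COST = 1.5     # $ amortised per newly started instance
CAPACITY_PER_VM = 120.0   # requests/min one instance sustains

def scale_delta(current: int, forecast_rpm: float, max_delta: int = 20) -> int:
    """Pick the instance-count change (delta) for the next hour that
    covers the forecast demand at minimum VM-hour + cold-start cost.
    Scale-downs incur no cold-start penalty; scale-ups do."""
    best_delta, best_cost = 0, float("inf")
    for delta in range(-current, max_delta + 1):
        n = current + delta
        if n * CAPACITY_PER_VM < forecast_rpm:
            continue  # would violate the capacity constraint
        cost = n * VM_COST_PER_HOUR + max(delta, 0) * COLD_START_COST
        if cost < best_cost:
            best_cost, best_delta = cost, delta
    return best_delta

print(scale_delta(current=10, forecast_rpm=600.0))   # diurnal trough: scale down
print(scale_delta(current=3, forecast_rpm=600.0))    # ramp-up: pay some cold starts
```

The cold-start term is what separates this from naive right-sizing: when the forecast dips briefly, the penalty on future scale-ups can make holding instances cheaper than churning them.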
Key differences from vllm-sr-sim:
| Aspect | SageServe | vllm-sr-sim |
|---|---|---|
| Problem | How many instances to run right now given current traffic | How many GPUs to provision in total for a target traffic level |
| Time horizon | Minutes to hours (dynamic scaling loop) | Static capacity plan (peak-hour sizing) |
| Traffic model | Production traces + ARIMA forecast | Poisson arrivals / CDF workload / trace replay |
| Multi-tier workloads | IW-Fast, IW-Normal, NIW with different SLAs | Single SLO per pool (multi-SLO via multi-pool config) |
| Routing | Memory-utilisation-based cross-region routing | Length / semantic / model / C+R content-based routing |
| Performance model | Empirical TPS profiles per (model, GPU) | Physics-based roofline from specs |
| Hardware requirement | Real production traces from O365 GPT models | Self-contained; works without any hardware or traces |
When to use SageServe: You already have a deployed fleet and need to scale it
up/down through a 24-hour demand cycle. Use vllm-sr-sim first to size the
peak-hour fleet; then apply SageServe-style policies to scale down during off-peak
hours, saving 20–30% of GPU-hours.