Version: v0.2

Model Selection Overview

Model selection is an advanced feature of vLLM Semantic Router that automatically chooses the best LLM from multiple candidates based on learned preferences, query similarity, and cost-quality optimization.

The semantic router supports 9 selection algorithms across two categories:

Core algorithms: Static, Latency-Aware, Elo, RouterDC, AutoMix, Hybrid
RL-driven algorithms: Thompson Sampling, GMTRouter, Router-R1

What Problem Does It Solve?

When you have multiple LLM backends (e.g., GPT-4, Claude, Llama, Mistral), you face a challenge: which model should handle each request?

Traditional approaches:

Static routing: Always use the same model (simple but suboptimal)
Round-robin: Distribute evenly (ignores model strengths)
Random: No intelligence (wastes resources)

Model selection solves this by intelligently matching queries to models based on:

Learned quality preferences (Elo ratings from user feedback)
Query-model similarity (RouterDC embeddings)
Cost-quality tradeoffs (AutoMix optimization)
Combined signals (Hybrid approach)

Available Algorithms

Core Algorithms

Algorithm	Best For	Key Benefit
Static	Simple deployments	Predictable, zero overhead
Latency-Aware	Latency-sensitive routing	Selects by TPOT/TTFT percentiles
Elo	Learning from feedback	Adapts to user preferences
RouterDC	Query-model matching	Matches specialties to queries
AutoMix	Cost optimization	Balances quality and cost
Hybrid	Complex requirements	Combines Elo, RouterDC, AutoMix, and cost signals

RL-Driven Algorithms

Algorithm	Best For	Key Benefit
Thompson Sampling	Exploration/exploitation	Bayesian adaptive learning
GMTRouter	Personalization	Per-user preference learning
Router-R1	Complex reasoning	LLM-powered routing decisions
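
To make the exploration/exploitation idea behind Thompson Sampling concrete, here is a minimal Beta-Bernoulli sketch in Python. It is the classical bandit formulation, not the router's actual implementation; the `thompson_pick` helper, the success/failure counts, and the uniform Beta(1, 1) prior are assumptions made for illustration.

```python
import random

def thompson_pick(stats, rng=random):
    """Pick a model by sampling from each model's Beta posterior.

    `stats` maps model name -> (successes, failures), e.g. counts of
    positive and negative feedback. A Beta(1, 1) prior is assumed.
    """
    draws = {
        model: rng.betavariate(successes + 1, failures + 1)
        for model, (successes, failures) in stats.items()
    }
    # The model with the highest sampled success rate wins this round.
    return max(draws, key=draws.get)

# Hypothetical feedback counts
feedback = {"phi4": (3, 2), "gemma3:27b": (5, 1)}
choice = thompson_pick(feedback, random.Random(0))
```

Models with little feedback have wide posteriors, so they are still sampled occasionally (exploration), while models with strong track records win most rounds (exploitation).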

Quick Start

Basic Configuration (Per-Decision)

Model selection is configured per-decision, allowing different strategies for different query types:

decisions:
  - name: tech
    description: "Technical queries"
    priority: 10
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "tech"
    modelRefs:
      - model: "llama3.2:3b"
      - model: "phi4"
      - model: "gemma3:27b"
    algorithm:
      type: "elo"  # Use Elo rating for this decision
      elo:
        k_factor: 32
        category_weighted: true

Algorithm Types

Static (Default)

Uses the first model in modelRefs. No learning, fully deterministic.

algorithm:
  type: "static"

Latency-Aware

Selects the fastest model using percentiles of TPOT (time per output token) and TTFT (time to first token).

algorithm:
  type: "latency_aware"
  latency_aware:
    tpot_percentile: 10
    ttft_percentile: 10
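
One way to read these settings, sketched in Python: compute the configured percentile of each model's recent TPOT and TTFT samples, then pick the model with the lowest combined value. The sample data, the `percentile` and `pick_fastest` helpers, and the plain sum of the two values are assumptions for illustration, not the router's actual logic.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def pick_fastest(latencies, tpot_percentile=10, ttft_percentile=10):
    """Return the model with the lowest TPOT + TTFT at the given percentiles.

    `latencies` maps model name -> {"tpot": [...], "ttft": [...]} in seconds.
    Summing the two values is a simplification chosen for this sketch.
    """
    def score(model):
        stats = latencies[model]
        return (percentile(stats["tpot"], tpot_percentile)
                + percentile(stats["ttft"], ttft_percentile))
    return min(latencies, key=score)

# Hypothetical measurements
observed = {
    "llama3.2:3b": {"tpot": [0.010, 0.012, 0.015], "ttft": [0.10, 0.12, 0.20]},
    "gemma3:27b": {"tpot": [0.030, 0.035, 0.050], "ttft": [0.30, 0.35, 0.60]},
}
fastest = pick_fastest(observed)
```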

Elo Rating

Learns from user feedback to rank models by quality.

algorithm:
  type: "elo"
  elo:
    k_factor: 32
    storage_path: "/var/lib/vsr/elo.json"
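
For intuition about `k_factor`, the following Python sketch applies the standard Elo update after one pairwise comparison. It assumes the router follows the textbook formula with the conventional 400-point scale; the function names are hypothetical and the scale constant is not taken from the configuration above.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(winner, loser, k_factor=32):
    """Return (new_winner_rating, new_loser_rating) after one comparison."""
    exp_win = expected_score(winner, loser)
    delta = k_factor * (1.0 - exp_win)  # larger when the win was unexpected
    return winner + delta, loser - delta

# Two models that start level: the winner gains exactly k_factor / 2.
new_winner, new_loser = update_elo(1500, 1500)
```

A larger `k_factor` makes ratings react faster to each piece of feedback, at the cost of noisier rankings.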

RouterDC

Matches query embeddings to model descriptions.

algorithm:
  type: "router_dc"
  router_dc:
    temperature: 0.07
    require_descriptions: true
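
The role of `temperature` can be illustrated with a softmax over cosine similarities between a query embedding and one embedding per model description. This is a simplified sketch: the toy two-dimensional vectors, the model names, and the `route_probabilities` helper are assumptions, and the embeddings are taken as already computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def route_probabilities(query_vec, model_vecs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities."""
    logits = {name: cosine(query_vec, vec) / temperature
              for name, vec in model_vecs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Toy embeddings (hypothetical)
query = [1.0, 0.0]
models = {"code-model": [0.9, 0.1], "chat-model": [0.1, 0.9]}
probs = route_probabilities(query, models)
best = max(probs, key=probs.get)
```

A low temperature such as 0.07 sharpens the distribution, so small similarity differences translate into a strong preference for one model.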

AutoMix

Optimizes the cost-quality tradeoff using a POMDP (partially observable Markov decision process).

algorithm:
  type: "automix"
  automix:
    cost_quality_tradeoff: 0.4
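
To show how a single tradeoff parameter can shift the choice between a cheap and an expensive model, here is a deliberately simplified linear score in Python. It is not the POMDP formulation; the `tradeoff_score` helper, the normalized quality and cost values, and the reading that a higher `cost_quality_tradeoff` puts more weight on cost are all assumptions.

```python
def tradeoff_score(quality, cost, cost_quality_tradeoff=0.4):
    """Higher is better; both inputs are assumed normalized to [0, 1]."""
    lam = cost_quality_tradeoff
    return (1.0 - lam) * quality - lam * cost

# Hypothetical normalized estimates
models = {
    "llama3.2:3b": {"quality": 0.60, "cost": 0.10},
    "gemma3:27b": {"quality": 0.85, "cost": 0.70},
}
best = max(models, key=lambda m: tradeoff_score(**models[m]))
```

With the tradeoff at 0.4, the cheaper model wins in this toy data; at 0, only quality counts and the larger model would be chosen.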

Hybrid

Combines the Elo, RouterDC, AutoMix, and cost signals using configurable weights.

algorithm:
  type: "hybrid"
  hybrid:
    elo_weight: 0.3
    router_dc_weight: 0.3
    automix_weight: 0.2
    cost_weight: 0.2
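
A weighted combination like this can be sketched as a weighted sum of per-model signals. The Python below assumes each signal is already normalized to [0, 1] with higher meaning better (so the cost signal is an inverted cost); the signal values and helper names are hypothetical, and the router's real normalization may differ.

```python
WEIGHTS = {"elo": 0.3, "router_dc": 0.3, "automix": 0.2, "cost": 0.2}

def hybrid_score(signals, weights=WEIGHTS):
    """Weighted sum of normalized signals for one model."""
    return sum(weights[name] * signals[name] for name in weights)

def select_model(candidates, weights=WEIGHTS):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda m: hybrid_score(candidates[m], weights))

# Hypothetical normalized signals per model
candidates = {
    "phi4": {"elo": 0.5, "router_dc": 0.6, "automix": 0.7, "cost": 0.8},
    "gemma3:27b": {"elo": 0.9, "router_dc": 0.8, "automix": 0.5, "cost": 0.3},
}
selected = select_model(candidates)
```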

How It Works

┌─────────────────────────────────────────────────────────────────────┐
│                         User Query                                   │
│                    "Explain quantum computing"                       │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Decision Matching                               │
│                 Decision "tech" matches → 3 models                   │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Selection Algorithm                               │
│                                                                      │
│  algorithm.type: "elo"                                              │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────┐        │
│  │ EloSelector.Select()                                    │        │
│  │                                                         │        │
│  │ Model Ratings:                                          │        │
│  │   llama3.2:3b  → 1468 (0 wins, 2 losses)               │        │
│  │   phi4         → 1501 (3 wins, 2 losses)               │        │
│  │   gemma3:27b   → 1531 (5 wins, 1 loss) ← HIGHEST       │        │
│  └─────────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Selected: gemma3:27b                            │
│                   (highest Elo rating: 1531)                         │
└─────────────────────────────────────────────────────────────────────┘
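
The final step in the diagram reduces to picking the candidate with the highest rating. A minimal Python sketch using the ratings shown above; the `select_by_elo` helper and the default rating of 1500 for models without feedback are assumptions, not the actual `EloSelector.Select()` code.

```python
DEFAULT_RATING = 1500.0  # assumed starting rating for models with no feedback

def select_by_elo(candidates, ratings):
    """Return the candidate model with the highest Elo rating."""
    return max(candidates, key=lambda m: ratings.get(m, DEFAULT_RATING))

ratings = {"llama3.2:3b": 1468, "phi4": 1501, "gemma3:27b": 1531}
candidates = ["llama3.2:3b", "phi4", "gemma3:27b"]  # modelRefs of the "tech" decision
selected = select_by_elo(candidates, ratings)
```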

Choosing an Algorithm

See Choosing the Right Algorithm for detailed guidance.

Quick Decision Tree:

Just getting started? → Use static (default)
Need latency-based routing? → Use latency_aware
Have user feedback? → Use elo
Have model descriptions? → Use router_dc
Want cost optimization? → Use automix
Need everything? → Use hybrid

Related Features

User Feedback Routing - Collect feedback signals via the /api/v1/feedback endpoint
Preference Routing - Route based on user preferences in the system
Domain Routing - Route by topic category using embedding classification

Reference Papers

The selection algorithms are based on these research papers:

Core Algorithms

Elo: Inspired by preference-based routing concepts; see RouteLLM (Ong et al., ICLR 2025), which trains static routers that achieve ~50% cost reduction (2x savings)
RouterDC: Query-Based Router by Dual Contrastive Learning (NeurIPS 2024) - +2.76% accuracy improvement
AutoMix: Automatically Mixing Language Models (NeurIPS 2024) - >50% cost reduction
Hybrid: Cost-Efficient Quality-Aware Query Routing (ICLR 2024) - 40% fewer expensive calls

RL-Driven Algorithms

Thompson Sampling: Classical multi-armed bandit approach; see A Tutorial on Thompson Sampling (Russo, Van Roy et al.)
GMTRouter: GMTRouter: Personalized LLM Router over Multi-turn User Interactions (Wang et al.) - 0.9-21.6% accuracy improvement
Router-R1: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via RL (Hu et al., NeurIPS 2025)
