Version: v0.1

Model Selection Overview

Model selection is an advanced feature of vLLM Semantic Router that automatically chooses the best LLM from multiple candidates based on learned preferences, query similarity, and cost-quality optimization.

The semantic router supports 8 selection algorithms across two categories:

Core algorithms: Static, Elo, RouterDC, AutoMix, Hybrid
RL-driven algorithms: Thompson Sampling, GMTRouter, Router-R1

What Problem Does It Solve?

When you have multiple LLM backends (e.g., GPT-4, Claude, Llama, Mistral), you face a challenge: which model should handle each request?

Traditional approaches:

Static routing: Always use the same model (simple but suboptimal)
Round-robin: Distribute evenly (ignores model strengths)
Random: No intelligence (wastes resources)

Model selection solves this by intelligently matching queries to models based on:

Learned quality preferences (Elo ratings from user feedback)
Query-model similarity (RouterDC embeddings)
Cost-quality tradeoffs (AutoMix optimization)
Combined signals (Hybrid approach)

Available Algorithms

Core Algorithms

Algorithm	Best For	Key Benefit
Static	Simple deployments	Predictable, zero overhead
Elo	Learning from feedback	Adapts to user preferences
RouterDC	Query-model matching	Matches specialties to queries
AutoMix	Cost optimization	Balances quality and cost
Hybrid	Complex requirements	Combines Elo, RouterDC, AutoMix, and cost signals

RL-Driven Algorithms

Algorithm	Best For	Key Benefit
Thompson Sampling	Exploration/exploitation	Bayesian adaptive learning
GMTRouter	Personalization	Per-user preference learning
Router-R1	Complex reasoning	LLM-powered routing decisions
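
To make the exploration/exploitation idea behind Thompson Sampling concrete, here is a minimal Beta-Bernoulli bandit sketch. It is not the router's implementation: the class name `ThompsonSelector`, the Beta(1, 1) prior, and the binary thumbs-up/thumbs-down reward are all assumptions chosen for illustration.

```python
import random


class ThompsonSelector:
    """Hypothetical Beta-Bernoulli Thompson Sampling over model names."""

    def __init__(self, models, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure per model.
        self.successes = {m: 1 for m in models}
        self.failures = {m: 1 for m in models}

    def select(self):
        # Sample one plausible success rate per model from its posterior,
        # then route to the model with the largest sampled value.
        draws = {
            m: self.rng.betavariate(self.successes[m], self.failures[m])
            for m in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, model, positive):
        # Positive feedback sharpens the posterior upward, negative downward.
        if positive:
            self.successes[model] += 1
        else:
            self.failures[model] += 1


selector = ThompsonSelector(["llama3.2:3b", "phi4", "gemma3:27b"], seed=0)
choice = selector.select()
selector.update(choice, positive=True)
```

Models with little feedback keep wide posteriors, so they are still sampled occasionally (exploration), while models with consistently positive feedback are chosen more and more often (exploitation).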

Quick Start

Basic Configuration (Per-Decision)

Model selection is configured per decision, so different query types can use different strategies:

decisions:
  - name: tech
    description: "Technical queries"
    priority: 10
    rules:
      operator: "OR"
      conditions:
        - type: "domain"
          name: "tech"
    modelRefs:
      - model: "llama3.2:3b"
      - model: "phi4"
      - model: "gemma3:27b"
    algorithm:
      type: "elo"  # Use Elo rating for this decision
      elo:
        k_factor: 32
        category_weighted: true

Algorithm Types

Static (Default)

Uses the first model in modelRefs. No learning, fully deterministic.

algorithm:
  type: "static"

Elo Rating

Learns from user feedback to rank models by quality.

algorithm:
  type: "elo"
  elo:
    k_factor: 32
    storage_path: "/var/lib/vsr/elo.json"
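
For intuition about what `k_factor` controls, the sketch below applies the standard Elo update to a single pairwise comparison. It assumes the textbook formula with a 400-point scale; the function names are hypothetical and the router's own update rule may differ in detail.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(winner, loser, k_factor=32.0):
    """Return the new (winner, loser) ratings after one comparison."""
    # The winner gains more when the win was unexpected; k_factor caps the step.
    delta = k_factor * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


print(elo_update(1500.0, 1500.0))  # -> (1516.0, 1484.0)
```

A larger `k_factor` makes ratings react faster to new feedback but also makes them noisier; a smaller value gives steadier rankings that adapt more slowly.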

RouterDC

Matches the query embedding against embeddings of the model descriptions.

algorithm:
  type: "router_dc"
  router_dc:
    temperature: 0.07
    require_descriptions: true
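
The role of `temperature` can be illustrated with a temperature-scaled softmax over cosine similarities. This is only a sketch of the similarity-matching idea: the two-dimensional vectors and model names are hypothetical stand-ins for real embeddings, and the actual RouterDC method learns its encoder and model embeddings with contrastive losses, which is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def routing_probabilities(query_vec, model_vecs, temperature=0.07):
    """Softmax over query-model similarities, sharpened by a low temperature."""
    sims = {name: cosine(query_vec, vec) for name, vec in model_vecs.items()}
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - top) / temperature) for name, s in sims.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Toy embeddings (assumed): the query points almost exactly at "code-model".
probs = routing_probabilities(
    [1.0, 0.0],
    {"code-model": [0.9, 0.1], "chat-model": [0.1, 0.9]},
)
best = max(probs, key=probs.get)
```

With a temperature as low as 0.07, small differences in similarity turn into a near-deterministic choice; raising the temperature spreads probability more evenly across models.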

AutoMix

Optimizes the cost-quality tradeoff using a POMDP (partially observable Markov decision process).

algorithm:
  type: "automix"
  automix:
    cost_quality_tradeoff: 0.4

Hybrid

Combines the Elo, RouterDC, AutoMix, and cost signals with configurable weights.

algorithm:
  type: "hybrid"
  hybrid:
    elo_weight: 0.3
    router_dc_weight: 0.3
    automix_weight: 0.2
    cost_weight: 0.2
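
One plausible reading of these weights is a weighted sum of per-model signals, sketched below. The assumptions are that each signal is normalized to [0, 1], that the `cost` signal means cost efficiency (higher is cheaper), and that the per-model numbers are invented for illustration; how the router actually normalizes and combines signals is not specified here.

```python
# Weights copied from the configuration above; they sum to 1.0.
WEIGHTS = {"elo": 0.3, "router_dc": 0.3, "automix": 0.2, "cost": 0.2}


def hybrid_score(signals, weights=WEIGHTS):
    """Weighted sum of normalized signals for one model."""
    return sum(weights[key] * signals[key] for key in weights)


# Hypothetical normalized signals per model.
models = {
    "phi4": {"elo": 0.5, "router_dc": 0.8, "automix": 0.6, "cost": 0.9},
    "gemma3:27b": {"elo": 0.9, "router_dc": 0.6, "automix": 0.7, "cost": 0.3},
}
scores = {name: hybrid_score(sig) for name, sig in models.items()}
best = max(scores, key=scores.get)
```

Here the cheaper model edges out the higher-rated one because the similarity and cost signals outweigh its lower Elo score; shifting weight toward `elo_weight` would reverse the outcome.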

How It Works

┌─────────────────────────────────────────────────────────────────────┐
│                         User Query                                   │
│                    "Explain quantum computing"                       │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Decision Matching                               │
│                 Decision "tech" matches → 3 models                   │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Selection Algorithm                               │
│                                                                      │
│  algorithm.type: "elo"                                              │
│                                                                      │
│  ┌─────────────────────────────────────────────────────────┐        │
│  │ EloSelector.Select()                                    │        │
│  │                                                         │        │
│  │ Model Ratings:                                          │        │
│  │   llama3.2:3b  → 1468 (0 wins, 2 losses)               │        │
│  │   phi4         → 1501 (3 wins, 2 losses)               │        │
│  │   gemma3:27b   → 1531 (5 wins, 1 loss) ← HIGHEST       │        │
│  └─────────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Selected: gemma3:27b                            │
│                   (highest Elo rating: 1531)                         │
└─────────────────────────────────────────────────────────────────────┘

Choosing an Algorithm

See Choosing the Right Algorithm for detailed guidance.

Quick Decision Tree:

Just getting started? → Use static (default)
Have user feedback? → Use elo
Have model descriptions? → Use router_dc
Want cost optimization? → Use automix
Need everything? → Use hybrid

Related Features

User Feedback Routing - Collect feedback signals via the /api/v1/feedback endpoint
Preference Routing - Route based on user preferences in the system
Domain Routing - Route by topic category using embedding classification

Reference Papers

The selection algorithms are based on these research papers:

Core Algorithms

Elo: Inspired by preference-based routing concepts; see RouteLLM (Ong et al., ICLR 2025), which trains static routers achieving ~50% cost reduction (2x savings)
RouterDC: Query-Based Router by Dual Contrastive Learning (NeurIPS 2024) - +2.76% accuracy improvement
AutoMix: Automatically Mixing Language Models (NeurIPS 2024) - >50% cost reduction
Hybrid: Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing (ICLR 2024) - 40% fewer expensive calls

RL-Driven Algorithms

Thompson Sampling: Classical multi-armed bandit approach; see A Tutorial on Thompson Sampling (Russo, Van Roy et al.)
GMTRouter: GMTRouter: Personalized LLM Router over Multi-turn User Interactions (Wang et al.) - 0.9-21.6% accuracy improvement
Router-R1: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via RL (Hu et al., NeurIPS 2025)
