Version: v0.2

Router-R1 Selection

Router-R1 uses an LLM as the router itself, performing multi-round "think" and "route" actions to make intelligent routing decisions. The router can reason about query requirements, model capabilities, and cost trade-offs before making selections.

Reference: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning by Zhang et al., NeurIPS 2025. Our implementation is inspired by the paper's think/route action pattern.

Paper vs Implementation

The original Router-R1 paper introduces multi-round, multi-model routing and aggregation: the router calls multiple models sequentially, integrates each response into its context, and synthesizes a final answer. This is the paper's core contribution.

Our implementation provides a simplified single-model selection variant that uses the think/route action pattern for deliberative routing. For full multi-model aggregation, see the advanced configuration below.

Algorithm Flow

Think/Route Protocol

The Router LLM uses a structured output format with two action types:

Action	Description
`<think>...</think>`	Reasoning step - analyzes the query (can repeat)
`<route>model</route>`	Final routing decision

Example Output

<think>
Analyzing query: "Debug this Python code"
- Query type: coding task
- Requires: code understanding, debugging
- Best model: code-llama (specialized for code)
</think>
<route>code-llama</route>
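
As an illustration of how output in this format could be consumed, here is a minimal parsing sketch. The `Action` type, the `ParseAction` name, and the regex-based approach are assumptions made for this example, not the project's actual parser. The sketch treats a response without a non-empty `<route>` tag as "still thinking".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ActionType distinguishes the two actions in the think/route protocol.
type ActionType int

const (
	ActionThink ActionType = iota // only reasoning found, no decision yet
	ActionRoute                   // a non-empty <route> tag was found
)

// Action is a hypothetical parsed form of one router response.
type Action struct {
	Type      ActionType
	Model     string // set only when Type == ActionRoute
	Reasoning string // contents of all <think> blocks, joined by newlines
}

var (
	// (?s) lets "." match newlines, since think blocks span several lines.
	thinkRe = regexp.MustCompile(`(?s)<think>(.*?)</think>`)
	routeRe = regexp.MustCompile(`(?s)<route>(.*?)</route>`)
)

// ParseAction extracts every think block and the first route tag.
func ParseAction(raw string) Action {
	var thoughts []string
	for _, m := range thinkRe.FindAllStringSubmatch(raw, -1) {
		thoughts = append(thoughts, strings.TrimSpace(m[1]))
	}
	a := Action{Type: ActionThink, Reasoning: strings.Join(thoughts, "\n")}
	if m := routeRe.FindStringSubmatch(raw); m != nil {
		if model := strings.TrimSpace(m[1]); model != "" {
			a.Type = ActionRoute
			a.Model = model
		}
	}
	return a
}

func main() {
	out := "<think>\nQuery type: coding task\n</think>\n<route>code-llama</route>"
	a := ParseAction(out)
	fmt.Println(a.Type == ActionRoute, a.Model)
}
```

Trimming whitespace inside the tags makes the parser tolerant of models that pad their output; whether the real implementation is this lenient is not stated here.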

How It Works

Single-Model Selection (Default)

1. The router LLM receives the user query and the model descriptions.
2. The LLM performs one or more THINK actions to analyze the query requirements.
3. The LLM performs a ROUTE action to select the best model.
4. If the routed model is not a valid candidate, the router retries, up to max_iterations attempts in total.
5. The selected model processes the request.
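
The first step, handing the query and model descriptions to the router LLM, could look roughly like the sketch below. The prompt wording and the `BuildRouterPrompt` helper are hypothetical; the actual prompt template is not documented here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// BuildRouterPrompt renders a hypothetical router prompt from the query and
// a name -> description map of candidate models.
func BuildRouterPrompt(query string, descriptions map[string]string) string {
	names := make([]string, 0, len(descriptions))
	for name := range descriptions {
		names = append(names, name)
	}
	sort.Strings(names) // map order is random; sorting keeps prompts reproducible

	var b strings.Builder
	b.WriteString("Reason inside <think>...</think>, then answer with <route>model</route>.\n")
	b.WriteString("Available models:\n")
	for _, name := range names {
		fmt.Fprintf(&b, "- %s: %s\n", name, descriptions[name])
	}
	fmt.Fprintf(&b, "Query: %s\n", query)
	return b.String()
}

func main() {
	fmt.Print(BuildRouterPrompt("Debug this Python code", map[string]string{
		"gpt-4":      "Complex reasoning and analysis",
		"code-llama": "Code generation and debugging",
	}))
}
```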

Multi-Model Aggregation (Advanced)

When enable_aggregation is set to true, Router-R1 can call multiple models:

1. The router LLM reasons about which models to consult.
2. The router calls Model A, receives its response, and integrates it into its context.
3. The router decides whether to call additional models.
4. The router synthesizes a final answer from all responses.

This matches the paper's multi-round aggregation approach.
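
The aggregation loop can be sketched as follows. Everything here is an assumption made for illustration: the `Decision`, `RouterFunc`, and `ModelFunc` types stand in for the router LLM and the model backends, and the budget parameter plays the role of max_models_per_query.

```go
package main

import (
	"errors"
	"fmt"
)

// Decision is a hypothetical router step: consult another model, or finish.
type Decision struct {
	Model string // model to consult next; empty means the router is done
	Final string // synthesized answer, expected when Model is empty
}

// RouterFunc and ModelFunc are stand-ins for the router LLM and the backends.
type RouterFunc func(query string, gathered []string) Decision
type ModelFunc func(model, query string) (string, error)

// Aggregate lets the router consult at most maxModels models, feeding each
// response back into the router's context before the next decision.
func Aggregate(query string, maxModels int, route RouterFunc, call ModelFunc) (string, error) {
	var gathered []string
	for len(gathered) < maxModels {
		d := route(query, gathered)
		if d.Model == "" {
			return d.Final, nil
		}
		resp, err := call(d.Model, query)
		if err != nil {
			return "", fmt.Errorf("calling %s: %w", d.Model, err)
		}
		gathered = append(gathered, d.Model+": "+resp)
	}
	// Budget exhausted: give the router one last chance to synthesize.
	d := route(query, gathered)
	if d.Final == "" {
		return "", errors.New("router produced no final answer within the model budget")
	}
	return d.Final, nil
}

func main() {
	order := []string{"code-llama", "gpt-4"}
	route := func(q string, gathered []string) Decision {
		if len(gathered) < len(order) {
			return Decision{Model: order[len(gathered)]}
		}
		return Decision{Final: fmt.Sprintf("synthesized from %d responses", len(gathered))}
	}
	call := func(model, q string) (string, error) { return "answer from " + model, nil }
	answer, err := Aggregate("Debug this Python code", 3, route, call)
	fmt.Println(answer, err)
}
```

Capping the number of consulted models bounds both latency and cost, which is presumably why the configuration exposes a per-query limit.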

RL-Based Training

The router is trained using reinforcement learning with rewards for:

Format correctness: Proper use of think/route tags
Outcome quality: Quality of the final response
Cost efficiency: Balance between performance and cost
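
To make the three reward signals concrete, here is one way they could be combined. This is a sketch under stated assumptions, not the paper's or this project's reward function: the linear weighting, the `alpha` trade-off parameter, the normalization of quality and cost to [0, 1], and the fixed penalty of -1 for malformed output are all hypothetical choices.

```go
package main

import "fmt"

// RewardInputs holds the three signals for one rollout (hypothetical shape).
type RewardInputs struct {
	FormatOK bool    // think/route tags are well-formed
	Quality  float64 // outcome score in [0, 1]
	Cost     float64 // normalized cost in [0, 1]; higher is more expensive
}

// Reward combines the signals. alpha in [0, 1] trades quality against cost:
// alpha = 0 ignores cost, alpha = 1 ignores quality.
func Reward(in RewardInputs, alpha float64) float64 {
	if !in.FormatOK {
		return -1 // malformed output is penalized regardless of the other terms
	}
	return (1-alpha)*in.Quality + alpha*(1-in.Cost)
}

func main() {
	cheapAndGood := RewardInputs{FormatOK: true, Quality: 0.9, Cost: 0.1}
	malformed := RewardInputs{FormatOK: false, Quality: 0.9, Cost: 0.1}
	fmt.Println(Reward(cheapAndGood, 0.5), Reward(malformed, 0.5))
}
```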

Core Algorithm (Go)

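// Select picks a model by asking the router LLM for think/route actions.
func (s *RouterR1Selector) Select(ctx context.Context, selCtx *SelectionContext) (*SelectionResult, error) {
    // No router configured: fall back to static selection if allowed,
    // otherwise fail fast instead of calling an empty endpoint.
    if s.routerEndpoint == "" {
        if s.fallbackToStatic {
            return s.staticSelector.Select(ctx, selCtx)
        }
        return nil, fmt.Errorf("router_r1: router_endpoint is not configured")
    }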
    
    history := []string{}
    
    for i := 0; i < s.maxIterations; i++ {
        response, err := s.callRouterLLM(ctx, selCtx.Query, selCtx.ModelDescriptions, history)
        if err != nil {
            if s.fallbackToStatic {
                return s.staticSelector.Select(ctx, selCtx)
            }
            return nil, err
        }
        
        action := s.parseAction(response)
        
        if action.Type == ActionRoute {
            if s.isValidModel(action.Model, selCtx.CandidateModels) {
                return &SelectionResult{
                    SelectedModel: action.Model,
                    Method:        MethodRouterR1,
                    Reason:        action.Reasoning,
                }, nil
            }
        }
        
        history = append(history, response)
    }
    
    // Max iterations reached
    if s.fallbackToStatic {
        return s.staticSelector.Select(ctx, selCtx)
    }
    return nil, fmt.Errorf("router failed after %d iterations", s.maxIterations)
}
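
The listing above relies on a validity check before accepting a route. A minimal sketch of such a check is shown below; the exported `IsValidModel` name and the exact-match-after-trimming rule are assumptions, and the real `isValidModel` may behave differently (for example, it might match case-insensitively).

```go
package main

import (
	"fmt"
	"strings"
)

// IsValidModel reports whether the routed name exactly matches one of the
// candidate models after trimming surrounding whitespace.
func IsValidModel(routed string, candidates []string) bool {
	routed = strings.TrimSpace(routed)
	if routed == "" {
		return false
	}
	for _, c := range candidates {
		if c == routed {
			return true
		}
	}
	return false
}

func main() {
	candidates := []string{"gpt-4", "gpt-3.5-turbo", "code-llama"}
	fmt.Println(IsValidModel("code-llama", candidates), IsValidModel("llama-3", candidates))
}
```

Rejecting names outside the candidate list guards against the router LLM hallucinating a model that is not deployed.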

Configuration

Single-Model Selection

decision:
  algorithm:
    type: router_r1
    router_r1:
      router_endpoint: http://localhost:8001  # Router LLM server
      max_iterations: 3        # Max think/route cycles
      temperature: 0.7         # Router LLM temperature
      use_cot: true           # Chain-of-thought reasoning
      fallback_to_static: true # Fallback if router unavailable

models:
  - name: gpt-4
    backend: openai
    description: "Complex reasoning and analysis"
  - name: gpt-3.5-turbo
    backend: openai
    description: "Fast general responses"
  - name: code-llama
    backend: local
    description: "Code generation and debugging"

Multi-Model Aggregation (Advanced)

decision:
  algorithm:
    type: router_r1
    router_r1:
      router_endpoint: http://localhost:8001
      enable_aggregation: true  # Enable multi-model calling
      max_models_per_query: 3   # Max models to consult
      aggregation_strategy: "synthesize"  # or "best_of"

Key Parameters

Parameter	Default	Description
`router_endpoint`	null	URL of Router-R1 server
`max_iterations`	3	Maximum think/route iterations
`temperature`	0.7	Temperature for router LLM
`use_cot`	true	Enable chain-of-thought reasoning
`fallback_to_static`	true	Use static selection if router unavailable
`enable_aggregation`	false	Enable multi-model aggregation
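
For readers wiring these parameters into Go code, the defaults in the table could be mirrored as below. The `RouterR1Config` struct, its field names, and the validation rules are assumptions for illustration; only the default values come from the table.

```go
package main

import (
	"errors"
	"fmt"
)

// RouterR1Config is a hypothetical in-memory form of the router_r1 block.
type RouterR1Config struct {
	RouterEndpoint    string // empty corresponds to null in the table
	MaxIterations     int
	Temperature       float64
	UseCoT            bool
	FallbackToStatic  bool
	EnableAggregation bool
}

// DefaultRouterR1Config returns the defaults listed under Key Parameters.
func DefaultRouterR1Config() RouterR1Config {
	return RouterR1Config{
		MaxIterations:    3,
		Temperature:      0.7,
		UseCoT:           true,
		FallbackToStatic: true,
	}
}

// Validate applies sanity checks that are assumptions of this sketch.
func (c RouterR1Config) Validate() error {
	if c.RouterEndpoint == "" && !c.FallbackToStatic {
		return errors.New("router_endpoint is required when fallback_to_static is false")
	}
	if c.MaxIterations < 1 {
		return errors.New("max_iterations must be at least 1")
	}
	if c.Temperature < 0 {
		return errors.New("temperature must be non-negative")
	}
	return nil
}

func main() {
	cfg := DefaultRouterR1Config()
	fmt.Println(cfg.Validate()) // defaults pass because static fallback is enabled

	cfg.FallbackToStatic = false
	fmt.Println(cfg.Validate()) // no endpoint and no fallback is rejected
}
```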

Router-R1 Server

Router-R1 requires a separate server running the router LLM:

cd src/training/rl_model_selection
python router_r1_server.py --port 8001 --model Qwen/Qwen2.5-3B-Instruct

The server exposes:

POST /route - Route a query to a model
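
A client for this endpoint might look like the sketch below. The JSON request and response shapes (`query` and `output` fields) are assumptions, since the server's schema is not documented here; check router_r1_server.py for the real contract. `StartFakeRouter` is a hypothetical in-process stand-in so the client can be tried without a running server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// routeRequest and routeResponse are assumed JSON shapes, not the real schema.
type routeRequest struct {
	Query string `json:"query"`
}

type routeResponse struct {
	Output string `json:"output"` // raw think/route text from the router LLM
}

// A bounded timeout keeps a slow router from stalling the request path.
var routerClient = &http.Client{Timeout: 5 * time.Second}

// RouteQuery posts the query to <endpoint>/route and returns the raw output.
func RouteQuery(endpoint, query string) (string, error) {
	body, err := json.Marshal(routeRequest{Query: query})
	if err != nil {
		return "", err
	}
	resp, err := routerClient.Post(endpoint+"/route", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("router returned status %d", resp.StatusCode)
	}
	var out routeResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Output, nil
}

// StartFakeRouter serves a canned reply on POST /route for local experiments.
func StartFakeRouter(output string) (url string, stop func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/route" {
			http.NotFound(w, r)
			return
		}
		json.NewEncoder(w).Encode(routeResponse{Output: output})
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := StartFakeRouter("<route>code-llama</route>")
	defer stop()
	fmt.Println(RouteQuery(url, "Debug this Python code"))
}
```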

Without Router Server

If router_endpoint is null and fallback_to_static is true, Router-R1 falls back to static selection. This allows gradual adoption:

1. Deploy with fallback_to_static: true.
2. Start the Router-R1 server when ready.
3. Configure router_endpoint.

When to Use Router-R1

Good for:

Complex routing logic that's hard to encode in rules
Queries requiring semantic understanding
Systems with diverse, specialized models
Multi-model synthesis (with aggregation enabled)

Consider alternatives when:

Latency is critical (LLM routing adds 100-500ms)
Simple routing rules suffice
No GPU available for router LLM

Best Practices

Use a small router model: 3B-7B is sufficient for routing
Enable fallback: Graceful degradation if router fails
Limit iterations: 3 is usually enough
Provide good model descriptions: Router uses these for decisions
Monitor router latency: Track router_r1_decision_latency_seconds
