RL Driven
Overview
rl_driven is a selection algorithm for online exploration and personalization. It supports multiple sub-modes: Thompson Sampling, Router-R1 (LLM-as-router), and Concurrent (arena mode).
It corresponds to config/algorithm/selection/rl-driven.yaml.
Papers:
- Router-R1: Routing with Reinforcement Learning — multi-round RL routing with reward structure
- GMTRouter: Personalized LLM Router — GNN-based personalized routing (related)
Key Advantages
- Supports exploration instead of always exploiting the current best model.
- Thompson Sampling provides a principled exploration/exploitation balance.
- Router-R1 mode uses an LLM to reason about routing decisions.
- Per-user personalization adapts routing over time.
- Supports implicit feedback through auto-detected satisfaction signals.
Sub-Modes
1. Thompson Sampling (default)
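Uses Bayesian posterior sampling for exploration/exploitation. The success probability $\theta_m$ of each model $m$ is modeled as a Beta distribution: $\theta_m \sim \mathrm{Beta}(\alpha_m, \beta_m)$
At each request, a sample $\hat{\theta}_m \sim \mathrm{Beta}(\alpha_m, \beta_m)$ is drawn for each candidate model, and the model with the highest sample is selected: $m^* = \arg\max_m \hat{\theta}_m$
After feedback, the distribution is updated:
- Win: $\alpha_m \leftarrow \alpha_m + 1$
- Loss: $\beta_m \leftarrow \beta_m + 1$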
- Tie:
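The sampling and update steps above can be sketched in a few lines of Python. This is a minimal illustration, not the router's actual implementation: the names `BetaArm` and `ThompsonSelector` are hypothetical, the uniform `Beta(1, 1)` prior is an assumption, and the tie rule (adding 0.5 to both parameters) is one common convention assumed here rather than taken from this document.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaArm:
    """Beta(alpha, beta) posterior over one model's success probability."""
    alpha: float = 1.0  # assumed prior: Beta(1, 1), i.e. uniform
    beta: float = 1.0

    def sample(self, rng: random.Random) -> float:
        # Draw one plausible success probability from the posterior.
        return rng.betavariate(self.alpha, self.beta)


class ThompsonSelector:
    """Hypothetical selector: one Beta posterior per candidate model."""

    def __init__(self, models, seed=None):
        self.arms = {name: BetaArm() for name in models}
        self.rng = random.Random(seed)

    def select(self) -> str:
        # Sample once per model and pick the model with the highest sample.
        samples = {name: arm.sample(self.rng) for name, arm in self.arms.items()}
        return max(samples, key=samples.get)

    def update(self, model: str, outcome: str) -> None:
        arm = self.arms[model]
        if outcome == "win":
            arm.alpha += 1.0
        elif outcome == "loss":
            arm.beta += 1.0
        elif outcome == "tie":
            # Assumption: a tie counts as half a win and half a loss.
            arm.alpha += 0.5
            arm.beta += 0.5
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")


selector = ThompsonSelector(["model-a", "model-b"], seed=0)
chosen = selector.select()
selector.update(chosen, "win")
```

Because selection is driven by random posterior samples rather than posterior means, a model with little feedback still has a wide distribution and is occasionally chosen. That is where the exploration comes from. As feedback accumulates, the posteriors narrow and the selector increasingly favors the model with the higher observed success rate.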