跳到主要内容
Documentation

Router Learning: Memory and Adaptations

Status

版本:最新版

Router Learning: Memory and Adaptations

Status

Accepted. The first implementation covers session-aware Router Learning with conversation/session scopes, decision-level apply / observe / bypass, generic learning headers, replay diagnostics, and the AMD agentic routing recipe. Broader Router Memory features such as public memory configuration, replay-derived priors, multi-adaptation composition, and migration of other learning-style algorithms remain future work.

Summary

The router needs one product concept for cross-request routing intelligence: Router Learning.

Router Learning has two parts:

  • Memory records and learns from routing traffic.
  • Learning adaptations consume memory to adjust a base routing decision.

Router Learning memory should unify three existing concepts that are currently described separately:

  • replay records as the event log
  • session and conversation memory as online state
  • lookup tables as replay-derived materialized priors

The first learning adaptation is session-aware routing stability. It decides when to keep the current model and when to switch based on conversation continuity, session continuity, prefix-cache evidence, handoff cost, and model-switch history.

This proposal moves session-aware behavior out of decision-local algorithm.type: session_aware and into a global router learning layer:

global:
services:
router_replay:
enabled: true
store_backend: postgres
ttl_seconds: 2592000

router:
learning:
enabled: true
adaptations:
session_aware:
enabled: true
scope: conversation
identity:
headers:
session: x-session-id
conversation: x-conversation-id
tuning:
idle_timeout_seconds: 300
min_turns_before_switch: 1
switch_margin: 0.05
cache_weight: 0.20
handoff_penalty: 0.05
handoff_penalty_weight: 1.0
switch_history_weight: 0.04
max_cache_cost_multiplier: 2.5

Decisions remain semantic. A decision can opt out of a learning adaptation when it represents a hard policy boundary:

adaptations:
session_aware:
mode: bypass

The default decision behavior is mode: apply.

Implementation Status

AreaStatusNotes
Global router.learning.adaptations.session_aware APIImplementedSession-aware routing moved out of algorithm.type=session_aware.
Decision-level adaptations.session_awareImplementedSupports apply, observe, bypass, sparse scope overrides, and tuning overrides.
Conversation and session identity scopesImplementedconversation protects one run; session protects across runs until idle release.
Online session/conversation stateImplementedFirst implementation uses low-latency in-process router state and single-replica assumptions.
Generic learning response headersImplementedUses x-vsr-learning-methods, x-vsr-learning-actions, x-vsr-learning-scopes, x-vsr-learning-reasons, and x-vsr-learning-modes.
Replay learning diagnosticsImplemented, partialRecords method/action/reason/scope, base/final model evidence, cache/cost evidence, and hashed identity diagnostics. The exact replay schema can evolve.
Clean break from old public session-aware algorithm configImplementedOld algorithm.type=session_aware, algorithm.session_aware, model_switch_gate, and public lookup_tables config are rejected instead of rewritten.
AMD agentic routing recipe and guideImplementedRecipe uses session-aware Router Learning plus decision bypasses for hard privacy/security boundaries.
Public router.learning.memory configFuture workOnline state is internal today; no public memory backend, TTL, or storage controls are exposed.
Replay-derived materialized priorsFuture workLookup tables remain a roadmap concept under Router Learning memory, not a supported public API.
Additional adaptations and compositionFuture workOnly session_aware is implemented; bandit, personalization, provider health, and cost optimizer need separate designs.
Elo / RL / GMT migrationFuture workThese remain decision algorithms in this implementation.
Distributed memory consistencyFuture workMulti-replica memory semantics, storage hot-path policy, and sticky routing are out of scope.
Full eval harnessFuture workUnit and recipe validation exist; broader cost/cache/latency/replay-quality eval belongs in the external eval harness.

Background

The AMD agentic routing recipe demonstrates a common production pattern:

  • simple requests should use simple, low-cost models
  • complex requests should use stronger models
  • private or sensitive requests should stay on local models
  • domain requests should use domain-specialized models
  • agent conversations should remain stable during tool loops and cache-heavy continuations

An earlier recipe draft used a synthetic agentic_session_route decision and algorithm.type: session_aware to attach this stability behavior. That worked for a narrow demonstration, but it had the wrong abstraction boundary.

An agentic conversation can move through multiple semantic decisions. A first turn may match complex_code, a follow-up tool result may match agentic_workflow, and a later clarification may match simple_general. If session-aware behavior is tied to a single decision-local selector, the protection breaks whenever the next request matches a different decision with a different modelRefs set.

Session-aware behavior is not itself a semantic route. It is a runtime learning algorithm over the selected route.

Goals

  • Make session-aware routing a global learning adaptation under global.router.learning.adaptations.session_aware.
  • Keep algorithm focused on base model selection only.
  • Let learning adaptations apply by default to all decisions.
  • Let individual decisions opt out when they are hard policy boundaries.
  • Distinguish session and conversation identity:
    • x-session-id identifies the long-lived agent session or workspace.
    • x-conversation-id identifies one user-initiated agent conversation.
  • Support both common stability modes:
    • conversation-level protection
    • session-level protection
  • Define Router Learning memory as one product surface with event-log, online-state, and materialized-prior layers.
  • Keep the user-facing API small while preserving advanced tuning knobs.
  • Provide a common extension point for future router-memory learning adaptations.

Non-Goals

  • This proposal does not redesign semantic signals or projections.
  • This proposal does not introduce prompt-visible user memory.
  • The first implementation does not require a distributed memory backend or multi-replica online-state consistency.
  • This proposal does not require every future learning adaptation to use session and conversation identity.
  • This proposal does not let learning silently override decisions that opt out with mode: bypass.

Mental Model

The final routing pipeline should be:

request
-> signals and projections
-> decision rules
-> base selection algorithm
-> router learning memory read
-> router learning adaptations
-> final model
-> router learning memory write

The ownership boundary is:

LayerResponsibility
decision.rulesDecide which semantic scenario matched.
decision.modelRefsDefine the base candidate set for the matched scenario.
decision.algorithmProduce the proposed model for the matched scenario.
router.learning.memoryStore event logs, online state, and materialized priors.
router.learning.adaptationsAdjust the proposed model using router learning memory.
decision.adaptationsDecide whether this decision allows each learning adaptation to affect the final result.

This keeps the recipe explainable:

decision = complex_code
base_model = frontier-model
learning.session_aware.action = stay
learning.session_aware.reason = active_tool_loop
final_model = previous-frontier-model

Global API

Router Learning is configured under global.router.learning. Memory and learning adaptations live under the same product entry because adaptations are defined by how they use cross-request memory.

global:
services:
router_replay:
enabled: true
store_backend: postgres
ttl_seconds: 2592000

router:
learning:
enabled: true
adaptations:
session_aware:
enabled: true
scope: conversation
identity:
headers:
session: x-session-id
conversation: x-conversation-id
tuning:
idle_timeout_seconds: 300
min_turns_before_switch: 1
switch_margin: 0.05
cache_weight: 0.20
handoff_penalty: 0.05
handoff_penalty_weight: 1.0
switch_history_weight: 0.04
max_cache_cost_multiplier: 2.5

adaptations is intentionally plural because it is a collection of cross-request learning adaptations. This avoids confusing router.learning.adaptations.session_aware with decision-local decision.algorithm, which remains the base selector for one request.

Defaults and Memory Dependency

router.learning.enabled is the global gate for Router Learning.

If router.learning.enabled: false, the router ignores learning memory and learning adaptations. Decisions route using their semantic rules and base algorithm only.

If any router.learning.adaptations.<name>.enabled: true, Router Learning needs memory. The router creates the low-latency online state required by enabled adaptations. That online state is runtime state, not a new public storage backend knob.

Online state is not a public toggle. If an enabled adaptation needs session, conversation, provider, or feedback state, the router owns that state internally.

Router Learning event-log data should reuse the existing Router Replay service: global.services.router_replay. Its store_backend, TTL, and async-write settings remain the deployment controls for durable replay records. The route-local router_replay plugin remains the capture policy surface for a decision.

Decision-level adaptations.<name>.mode defaults to apply.

First Implementation Scope

The first implementation should focus on moving session-aware routing into Router Learning:

  • single router replica only
  • no required external storage reads on the request path
  • session-aware online state stored in process memory
  • Router Replay used only as the event log and eval/debug source
  • replay-derived priors documented as the roadmap, not required for the first session-aware migration
  • future adaptation ordering left as an extension point, since the first implementation has only session_aware

If the required identity is missing, session-aware learning should no-op rather than fail the request. The base routing decision remains valid, and diagnostics can record identity_missing when replay/debug evidence is available.

session_aware.enabled

Turns the session-aware learning adaptation on or off.

enabled: true

When disabled, routing behaves like the base decision and base selection algorithm, though other enabled learning adaptations may still run.

scope

Controls the protection lifecycle.

scope: conversation

Allowed values:

ValueMeaning
conversationProtect strongly within one conversation. Across conversations in the same session, re-evaluate routing but account for cache, history, and handoff cost.
sessionProtect strongly across the whole session. The first selected model becomes the session model and later conversations prefer to keep it until the session idles out or a decision bypasses learning.

The recommended default for agentic routing is conversation.

conversation mode gives the desired default behavior:

  • a tool loop does not switch models mid-conversation
  • a new conversation in the same session can route to a simpler or more specialized model
  • cross-conversation cache and handoff cost still influence switching

session mode gives the stricter behavior:

  • the session becomes the stability boundary
  • the router strongly prefers the established session model across conversations
  • policy boundaries can still bypass the learning layer

identity

Defines how session-aware learning reads request identity.

identity:
headers:
session: x-session-id
conversation: x-conversation-id

Defaults:

FieldDefault
headers.sessionx-session-id
headers.conversationx-conversation-id

The map form keeps the API extensible. session_aware needs session and conversation; future adaptations can add keys such as tenant, workspace, user, or provider without adding a new top-level header field each time.

If the conversation header is missing, the router may use explicit missing-header resilience such as request-shape inference, but diagnostics should mark the identity source as inferred. Production agent clients should send both headers.

tuning

The tuning block preserves the useful parts of the existing session_aware selector without making them required in every recipe.

tuning:
idle_timeout_seconds: 300
min_turns_before_switch: 1
switch_margin: 0.05
cache_weight: 0.20
handoff_penalty: 0.05
handoff_penalty_weight: 1.0
switch_history_weight: 0.04
max_cache_cost_multiplier: 2.5
FieldMeaning
idle_timeout_secondsExpire protection after inactivity.
min_turns_before_switchRequire a short warm-up period before switching.
switch_marginRequire the proposed model to be better by this margin before switching.
cache_weightIncrease stay preference when prefix-cache evidence is warm.
handoff_penaltyDefault model-to-model switch cost when no learned penalty exists.
handoff_penalty_weightWeight applied to handoff cost.
switch_history_weightPenalize repeated model switching.
max_cache_cost_multiplierPrevent cache preservation from justifying unbounded cost increases.

The following existing session_aware knobs should not be part of the primary user-facing API:

Existing FieldProposed Handling
stay_biasFold into switch_margin and internal defaults.
tool_loop_stay_biasKeep internal; tool-loop protection should be explicit behavior, not routine tuning.
quality_gap_multiplierKeep internal; it is tied to selector score normalization.
remaining_turn_prior_weightKeep internal or future advanced setting.
remaining_turn_prior_horizonKeep internal or future advanced setting.
min_remaining_turn_prior_samplesKeep internal or future advanced setting.

Two protections should default to enabled and can later be exposed as advanced settings if operators need them:

  • tool-loop protection
  • context-portability protection

Decision API

Learning adaptation behavior can be controlled per decision.

Decision-level config uses adaptations directly. router.learning remains the global management namespace for memory, adaptation registration, and future learning-wide settings. A decision is narrower: it only declares how the matched semantic route interacts with globally registered adaptations.

The default is:

adaptations:
session_aware:
mode: apply

Most decisions do not need to write anything.

Apply

adaptations:
session_aware:
mode: apply

apply allows session-aware learning to adjust the base selection.

In apply mode, the learning layer may keep the current protected model even if the current matched decision has a different modelRefs set. This is intentional. It is what allows a conversation to remain stable when follow-up turns match different semantic decisions.

The protected carry-over model must still be a configured backend model and must not be blocked by the current decision's adaptation setting.

Bypass

adaptations:
session_aware:
mode: bypass

bypass makes the base decision result final. Session-aware learning can record diagnostics, but it cannot change the selected model.

Use bypass for hard boundaries:

  • privacy containment
  • security containment
  • explicit local-only policy
  • compliance routes
  • probes or operational routes where model stability should not interfere

Observe

adaptations:
session_aware:
mode: observe

observe computes what session-aware learning would have done, records it in diagnostics and the event log, but does not change the final model.

Use observe for rollout, debugging, and evaluation.

Local Overrides

Decision-level overrides should be sparse. They exist for exceptional cases:

adaptations:
session_aware:
mode: apply
scope: session
tuning:
switch_margin: 0.10

Local overrides merge with the global configuration. Unset fields inherit from global.router.learning.adaptations.session_aware.

Base Algorithm vs Learning Adaptation Boundary

The clean API should draw a hard line between request-time selection algorithms and cross-request learning adaptations.

decision.algorithm         = score or select using the current request
router.learning.adaptations = use cross-request memory, feedback, cache, cost,
health, or history to adjust the proposed model

The practical rule is:

If it persists state across requests and that state can influence later routing, it belongs under router.learning, not under decision-local algorithm.

Algorithms can still consume read-only learning evidence. They should not own the durable memory themselves.

Existing Algorithm Classification

Current CapabilityFinal HomeReason
staticalgorithmRequest-local deterministic selection.
hybridalgorithmRequest-local score composition. It may read learning evidence later, but should not own memory.
multi_factoralgorithmRequest-local quality, latency, cost, and load scoring.
latency_awarealgorithmRequest-local runtime metric scoring, unless it starts owning cross-request health state.
router_dcalgorithmQuery/model contrastive matching is a request-time selector. Learned affinity state, if added later, should be learning evidence.
automixmostly algorithmEscalation and verification are request-time selection behavior. Durable verifier success priors should move to Router Memory.
knn, kmeans, svm, mlpalgorithmTrained model artifacts are offline selector assets, not online router memory.
session_awarerouter.learning.adaptations.session_awareIt is cross-request continuity and stay/switch behavior.
model_switch_gaterouter.learning.adaptations.session_awareIt is also stay/switch behavior and overlaps with session-aware routing.
lookup_tablesrouter.learning.memory.priorsThey are replay-derived memory views, not selector configuration.
elosplit; ratings in Router Memory, scoring policy optionalElo ratings are cross-request learned state. A future Elo scorer can read Router Memory.
rl_driven / Thompsonsplit; posterior in Router Learning memory, scoring policy optionalBandit posterior and feedback updates are cross-request learning.
gmtroutersplit; interaction graph in Router Learning memory, scoring policy optionalUser/model/task interaction history is cross-request personalization memory.

This means router_dc can remain an algorithm, as long as it is doing request-time semantic matching. If it later learns query/model affinity from traffic, that learned affinity becomes Router Memory that router_dc may read.

The final public algorithm surface should keep request-time selectors and remove learning-owned selectors:

keep as algorithm:
static, hybrid, multi_factor, latency_aware, router_dc, automix,
knn, kmeans, svm, mlp

move out of algorithm:
session_aware, elo, rl_driven, gmtrouter

hybrid can still combine memory-backed evidence, such as Elo ratings or bandit priors, but those facts should be read from Router Memory. The hybrid algorithm should not own the rating store, posterior, interaction graph, or feedback update loop.

Split Pattern for Learning Adaptations

Learning-capable selectors should be decomposed into:

durable state/update loop -> router.learning.memory
read-time learning policy -> router.learning.adaptations.<name>
base request-time selector -> decision.algorithm

For example, Thompson Sampling should not keep posterior state inside algorithm.rl_driven. The posterior belongs to Router Learning memory:

global:
router:
learning:
adaptations:
bandit:
enabled: true
method: thompson
reward: quality_cost_latency
memory_scope: decision

Most decisions can omit algorithm. The bandit learning adaptation then adjusts the model proposed by the router's normal default selector. A decision can still use a normal base selector when it has a specific selection policy, but the adaptation is not a reason to force every decision to spell out a base algorithm.

Elo follows the same model:

global:
router:
learning:
adaptations:
elo:
enabled: true
scope: decision
initial_rating: 1200
k_factor: 32

The ratings are router memory. A read-time Elo scoring policy can consume those ratings, but config should not imply that a decision-local algorithm owns the rating store.

Router Memory Product Model

Router Memory should be the product-level umbrella for replay, live state, and derived priors. Users should not need to reason about three independent systems.

Router Memory
event_log
replay records
immutable or append-oriented history for audit, debug, eval, and replay

online_state
session state, conversation state, provider health, bandit posterior,
Elo ratings, personalization graph

priors
materialized views derived from event_log and optional overrides:
quality_gap, handoff_penalty, remaining_turn_prior

The product relationship is:

requests and responses
-> Router Memory event_log
-> aggregation / outcome scoring
-> Router Memory online_state and priors
-> Router Learning consumers

Replay is the event-log layer. Session and conversation memory are online-state views. The old lookup tables become materialized priors. They are all Router Memory.

Latency Model

Router Learning must not turn every routed request into an external storage round trip. The memory layers have different latency contracts:

LayerHot Request PathStorage PathExamples
online_stateRead and update locally during routing. This is latency-sensitive.Optional shared backend or replay-derived rebuild, but not required for every request.current model, tool-loop state, turn count, cache warmth, switch history
priorsRead from an in-process or locally cached snapshot.Recomputed from replay by background jobs or offline commands.quality gaps, handoff penalties, remaining-turn priors
event_logAppend after the routing decision, preferably async or fail-open. Reads are not required for the current decision.Durable Router Replay backend.replay records, diagnostics, eval traces

The hot path should therefore be:

request
-> read online_state from local memory
-> read priors from local snapshot
-> compute adaptations
-> update online_state locally
-> append replay event asynchronously or fail-open

External storage calls are appropriate for:

  • replay writes and replay APIs
  • background prior materialization
  • cold-start seeding of online state or prior snapshots
  • optional shared online-state synchronization when an operator chooses that deployment mode

External storage calls should not be required to decide every request. If a shared online-state backend is introduced, it needs a small timeout, fail-open behavior, and local fallback so storage latency or transient outages do not become routing latency spikes.

Memory API

Router Memory is the product model, but the public API should reuse existing storage seams instead of introducing a second replay backend.

The existing Router Replay implementation already provides the event-log building blocks:

  • global.services.router_replay.enabled is the router-wide replay gate.
  • store_backend supports postgres, redis, milvus, qdrant, and memory.
  • Shared backends use one shared recorder/store; memory is local development storage and is lost on restart.
  • The route-local router_replay plugin can disable capture or tune capture policy for one decision.
  • Replay APIs already expose list, aggregate, record lookup, and trajectory views.

Router Learning should therefore attach learning diagnostics to Router Replay records instead of inventing a parallel event-log service. Replay remains the append-oriented history used for audit, eval, and prior materialization; it is not the low-latency online state that decides the next request.

The event-log layer is Router Replay:

global:
services:
router_replay:
enabled: true
store_backend: postgres
ttl_seconds: 2592000

router:
learning:
enabled: true
memory:
priors:
source: router_replay
populate_interval: 15m
overrides:
quality_gaps: []
handoff_penalties: []
remaining_turn_priors: []

The default recipe can keep Router Learning itself simple:

global:
services:
router_replay:
enabled: true
store_backend: postgres

router:
learning:
enabled: true

global.services.router_replay already owns the replay storage backend. It supports postgres, redis, milvus, qdrant, and memory. Production recipes should use a shared durable backend such as postgres or redis; memory remains a local development backend and loses records on restart.

Router Learning should attach structured learning diagnostics to replay records when Router Replay captures the request. It should not expose a second router.learning.memory.replay.enabled flag, because that would duplicate the existing router-wide service and route-local plugin.

The existing router_replay plugin remains the per-decision capture policy. Learning does not need raw payload fields, but Router Learning should not add a second privacy policy on top of Router Replay. If operators enable payload capture, payloads may be persisted according to the existing Router Replay configuration. Learning evals should ignore those payload fields and use structured routing metadata instead.

Privacy-sensitive recipes can still use the existing plugin flags when they want metadata-only replay records:

plugins:
- type: router_replay
configuration:
enabled: true
capture_request_body: false
capture_response_body: false
max_tool_trace_steps: 100

Learning diagnostics, selected model, decision, signal names, token counts, cache counts, and cost estimates are enough for the first eval path.

online_state is intentionally not configurable. It is required runtime state for enabled adaptations, so exposing online_state.enabled would let users create invalid or confusing configs.

priors is optional and advanced. It exists for replay-derived materialized views and manual overrides. source: router_replay means the priors are built from the same event log exposed by /v1/router_replay. If omitted, adaptations still work from live online state and built-in defaults.

When memory is configured, learning adaptations can consume the available layers by default. For example, session_aware reads conversation state, session state, cache accounting, and materialized priors when those views exist.

Memory Views

Initial Router Memory should expose at least these views:

ViewLayerSourceConsumers
replay_recordevent_logrequest/response recordsaudit, eval, replay, priors
session_stateonline_staterequest/response telemetrysession_aware
conversation_stateonline_staterequest/response telemetrysession_aware
provider_healthonline_statetransport outcomesprovider_health, multi_factor, hybrid
elo_ratingonline_statefeedback/outcome updateselo, hybrid
bandit_posterioronline_statefeedback/outcome updatesbandit, hybrid
interaction_graphonline_statefeedback, user/session historypersonalization, future GMTRouter policy
quality_gappriorsreplay aggregation, outcomes, overridessession_aware, hybrid, bandit
handoff_penaltypriorsreplay aggregation, overridessession_aware
remaining_turn_priorpriorsreplay aggregation, overridessession_aware

The important part is that these are shared memory views. They should not be private fields inside individual algorithms.

Lookup Table Migration

The current lookup_tables block is useful conceptually but sits in the wrong place:

global:
router:
model_selection:
lookup_tables:
enabled: true

It is not a model-selection algorithm. It is the old name for materialized Router Memory priors.

Old FieldNew Field
model_selection.lookup_tables.enabledrouter.learning.memory.priors.enabled
model_selection.lookup_tables.storage_pathImplementation detail removed from public config.
model_selection.lookup_tables.auto_save_intervalImplementation detail removed from public config.
model_selection.lookup_tables.populate_from_replayrouter.learning.memory.priors.source: router_replay
model_selection.lookup_tables.populate_intervalrouter.learning.memory.priors.populate_interval
model_selection.lookup_tables.quality_gapsrouter.learning.memory.priors.overrides.quality_gaps
model_selection.lookup_tables.handoff_penaltiesrouter.learning.memory.priors.overrides.handoff_penalties
model_selection.lookup_tables.remaining_turn_priorsrouter.learning.memory.priors.overrides.remaining_turn_priors

Manual overrides remain useful, but they become overrides for Router Memory priors, not selector-local tables.

Online State vs Priors

Online state and priors are different layers of Router Memory.

Memory LayerScopePurpose
online_statelive session, conversation, provider, user, or decisionProtect current continuity, cache, tool loops, health, preference, and switch history.
priorsaggregate traffic historyProvide learned quality gaps, handoff penalties, and expected remaining turns.

Session-aware learning should read both:

conversation_state says: active tool loop, keep current model
session_state says: cache is warm, switching has been frequent
priors say: switching from local to frontier has handoff penalty 0.05

This lets the runtime make an auditable stay/switch decision without putting memory ownership inside algorithm.type: session_aware.

Example Recipe Shape

Global Learning

global:
services:
router_replay:
enabled: true
store_backend: postgres

router:
learning:
enabled: true
adaptations:
session_aware:
enabled: true
scope: conversation
identity:
headers:
session: x-session-id
conversation: x-conversation-id

Simple Decision

- name: simple_general
description: Simple public requests route to the low-cost model by default.
rules: ...
modelRefs:
- model: simple-model
- model: frontier-model

This decision inherits mode: apply and uses the router's normal default selector before adaptation.

Complex Decision

- name: complex_code
description: Complex coding requests route to stronger models.
rules: ...
modelRefs:
- model: frontier-model
- model: local-rocm-model

This decision also inherits mode: apply and does not need an explicit algorithm block.

If the current conversation is in a tool loop, the learning layer can keep the current model even if the new request matches simple_general.

Privacy Decision

- name: privacy_sensitive
description: Sensitive content stays on the local model.
rules: ...
modelRefs:
- model: local-rocm-model
algorithm:
type: static
adaptations:
session_aware:
mode: bypass

This decision cannot be overridden by session-aware stability. If privacy matches, the final model remains local.

Runtime Semantics

Conversation Scope

With:

scope: conversation

the router maintains two related memories:

conversation key = session_id + conversation_id
session key = session_id

Strong protections use the conversation key:

  • active tool loop
  • non-portable provider state
  • current conversation model
  • short-turn continuity

Soft trade-offs can use the session key:

  • previous session model
  • cross-conversation cache evidence
  • switch history
  • handoff cost

This means a new x-conversation-id releases the previous conversation's hard locks, while still letting session-level evidence influence the next selection.

Session Scope

With:

scope: session

the session key becomes the strong protection boundary. A model established in the session is preferred across conversations. The session model is the first selected model after the session state is created or after the session has expired.

The router can reselect the session model when one of these happens:

  • the session is idle past idle_timeout_seconds
  • the current decision uses mode: bypass
  • an explicit future reset mechanism clears the session state

This mode is for users who want a stable model throughout an IDE session, workspace session, or long agent session.

Policy Boundaries

mode: bypass always wins over session-aware learning.

For example, if the current session model is a cloud frontier model and a later request matches privacy containment, the privacy decision routes to the local model and session-aware learning cannot keep the cloud model.

Diagnostics and Event Log

Response headers should stay compact. Detailed evidence belongs in Router Memory's event-log layer. The current replay record is the concrete event-log implementation.

Recommended response headers:

x-vsr-learning-methods: session_aware
x-vsr-learning-actions: session_aware=stay
x-vsr-learning-scopes: session_aware=conversation
x-vsr-learning-reasons: session_aware=active_tool_loop
x-vsr-learning-modes: session_aware=apply
x-vsr-replay-id: replay_...

The x-vsr-learning-* header family is method-keyed so future adaptations can share the same contract without adaptation-specific header names. Detailed scoring evidence, cache math, identity source, multi-adaptation traces, and alternative-model candidates should remain in the replay record pointed to by x-vsr-replay-id.

The event log should include structured diagnostics:

{
"learning": {
"adaptations": {
"session_aware": {
"enabled": true,
"mode": "apply",
"scope": "conversation",
"identity": {
"scope": "conversation",
"headers": {
"session": "x-session-id",
"conversation": "x-conversation-id"
},
"session": {
"source": "header:x-session-id",
"required": true,
"status": "present",
"hash": "4f2a8c0e9b7d3411"
},
"conversation": {
"source": "header:x-conversation-id",
"required": true,
"status": "present",
"hash": "0bb97f4a3c812efe"
}
},
"base_model": "simple-model",
"final_model": "frontier-model",
"action": "stay",
"reason": "active_tool_loop",
"cache": {
"prompt_tokens": 12000,
"cached_tokens": 8200,
"cache_weight": 0.2
},
"cost": {
"handoff_penalty": 0.05,
"handoff_penalty_weight": 1.0
}
}
}
}
}

Raw session and conversation identifiers are operationally sensitive. Learning diagnostics should store source and status plus a bounded hash, not the raw identity values.

Breaking Change and Migration

This proposal intentionally uses a clean breaking change. The old session-aware configuration should not remain as a compatibility alias in the final API. Keeping both spellings would make it unclear whether session-aware behavior is a selector, a post-selection gate, or a Router Learning adaptation.

Removed API

Old recipes may currently use:

algorithm:
type: session_aware
session_aware:
base_method: hybrid

The final API should reject this shape.

These old blocks should also be rejected:

global:
router:
model_selection:
session_aware: ...
model_switch_gate: ...
lookup_tables: ...
elo: ...

model_selection.session_aware and model_switch_gate are older ways to express stay-vs-switch behavior. In the final API, that behavior belongs under global.router.learning.adaptations.session_aware.

model_selection.lookup_tables is a replay-derived memory view. It belongs under global.router.learning.memory.priors.

Learning-owned algorithm types should also be rejected as final public adaptations:

algorithm:
type: elo
algorithm:
type: rl_driven
algorithm:
type: gmtrouter

Their durable state and feedback loops should move to router.learning.adaptations.elo, router.learning.adaptations.bandit, and router.learning.adaptations.personalization respectively.

Replacement API

The target shape is a normal decision selector plus global learning. Most decisions can omit algorithm and use the router's normal default selector:

- name: simple_general
rules: ...
modelRefs:
- model: simple-model
- model: frontier-model

A decision can keep an explicit base algorithm only when it needs a specific selector. The important part is that session_aware is no longer an algorithm wrapper.

Session-aware behavior is configured globally:

global:
services:
router_replay:
enabled: true
store_backend: postgres

router:
learning:
enabled: true
adaptations:
session_aware:
enabled: true
scope: conversation

If an old session_aware block had base_method, that value can become the decision's explicit base algorithm.type only when the recipe wants to preserve that explicit selector. If the decision can rely on the default selector, it can omit algorithm entirely.

Migration Map

Old FieldNew Field
global.router.model_selection.session_awareglobal.router.learning.adaptations.session_aware
algorithm.type: session_awareremove; use a normal base algorithm only if needed
algorithm.session_aware.base_methodoptional explicit base algorithm.type; otherwise omit algorithm
algorithm.session_aware.idle_timeout_secondsrouter.learning.adaptations.session_aware.tuning.idle_timeout_seconds
algorithm.session_aware.min_turns_before_switchrouter.learning.adaptations.session_aware.tuning.min_turns_before_switch
algorithm.session_aware.switch_marginrouter.learning.adaptations.session_aware.tuning.switch_margin
algorithm.session_aware.prefix_cache_weightrouter.learning.adaptations.session_aware.tuning.cache_weight
algorithm.session_aware.default_handoff_penaltyrouter.learning.adaptations.session_aware.tuning.handoff_penalty
algorithm.session_aware.handoff_penalty_weightrouter.learning.adaptations.session_aware.tuning.handoff_penalty_weight
algorithm.session_aware.switch_history_weightrouter.learning.adaptations.session_aware.tuning.switch_history_weight
algorithm.session_aware.max_cache_cost_multiplierrouter.learning.adaptations.session_aware.tuning.max_cache_cost_multiplier
model_switch_gate.min_switch_advantagerouter.learning.adaptations.session_aware.tuning.switch_margin
model_switch_gate.cache_warmth_weightrouter.learning.adaptations.session_aware.tuning.cache_weight
model_switch_gate.default_handoff_penaltyrouter.learning.adaptations.session_aware.tuning.handoff_penalty
model_switch_gate.mode: shadowdecision.adaptations.session_aware.mode: observe for rollout decisions
model_switch_gate.mode: enforcedecision.adaptations.session_aware.mode: apply
model_selection.lookup_tablesrouter.learning.memory.priors
algorithm.type: elorouter.learning.adaptations.elo; add a base algorithm only if the decision needs one
model_selection.elorouter.learning.adaptations.elo
algorithm.type: rl_drivenrouter.learning.adaptations.bandit; add a base algorithm only if the decision needs one
algorithm.rl_driven.use_thompson_samplingrouter.learning.adaptations.bandit.method: thompson
algorithm.type: gmtrouterrouter.learning.adaptations.personalization; add a base algorithm only if the decision needs one
algorithm.gmtrouter.storage_pathrouter.learning.memory backend configuration

These old fields should not have public replacements in the clean API:

Old FieldHandling
stay_biasFold into switch_margin and internal defaults.
tool_loop_stay_biasInternal behavior; tool-loop protection should not require score tuning.
tool_loop_hard_lockDefault enabled in session-aware learning.
context_portability_hard_lockDefault enabled in session-aware learning.
decision_drift_resetReplaced by explicit conversation identity.
quality_gap_multiplierInternal selector-normalization detail.
remaining_turn_prior_weightInternal or future advanced setting.
remaining_turn_prior_horizonInternal or future advanced setting.
min_remaining_turn_prior_samplesInternal or future advanced setting.

Validation Behavior

The router should fail fast with actionable validation errors when it sees old configuration:

algorithm.type=session_aware has moved to global.router.learning.adaptations.session_aware.
Remove algorithm.type=session_aware. If the recipe needs an explicit base
selector, set algorithm.type to the old session_aware.base_method; otherwise
omit algorithm and enable global router.learning.adaptations.session_aware.
global.router.model_selection.session_aware has moved to
global.router.learning.adaptations.session_aware.
global.router.model_selection.model_switch_gate has been folded into
global.router.learning.adaptations.session_aware.tuning.
algorithm.type=elo has moved to router learning. Enable
global.router.learning.adaptations.elo and choose a request-time base algorithm.
algorithm.type=rl_driven has moved to router learning. Enable
global.router.learning.adaptations.bandit and choose a request-time base algorithm.
algorithm.type=gmtrouter has moved to router learning. Enable
global.router.learning.adaptations.personalization and choose a request-time base
algorithm.
global.router.model_selection.lookup_tables has moved to
global.router.learning.memory.priors.

Runtime config loading should not silently rewrite old config, and the first implementation should not provide an automatic config rewrite command. Migration is a deliberate manual edit from the old API shape to the new one. Silent or automatic rewrites would preserve ambiguity in the API.

Implementation Plan

1. Config Contract

  • Add global.router.learning.
  • Add global.router.learning.adaptations.session_aware.
  • Add decision-level adaptations.session_aware.
  • Remove global.router.model_selection.session_aware from the final public config contract.
  • Remove algorithm.type: session_aware from the final public config contract.
  • Fold model_switch_gate into session-aware learning and reject the old standalone block.
  • Reject lookup_tables from global.router.model_selection; replay-derived priors move to the Router Learning roadmap.
  • Reuse global.services.router_replay as the Router Memory event-log implementation.
  • Add global.router.learning.memory.priors only for replay-derived materialized priors and manual overrides in the roadmap; the first implementation can omit prior materialization.
  • Implement required online state internally for enabled adaptations.
  • Validate scope values: conversation, session.
  • Validate mode values: apply, bypass, observe.
  • Emit hard validation errors for old memory, learning, and session-aware config shapes with explicit migration messages.
  • Add defaults:
    • router.learning.enabled: false unless a recipe enables it
    • enabled learning adaptations create required online state automatically
    • Router Learning diagnostics attach to Router Replay records when global.services.router_replay.enabled: true
    • Router Replay storage defaults and backends remain owned by global.services.router_replay
    • scope: conversation
    • identity.headers.session: x-session-id
    • identity.headers.conversation: x-conversation-id
    • decision mode defaults to apply

The final architecture still classifies elo, rl_driven, and gmtrouter as learning-owned capabilities. Moving them requires new feedback, posterior, and personalization adaptation APIs, so that migration is future Router Learning work rather than part of this session-aware implementation.

2. Request Identity

  • Parse identity values from identity.headers.
  • For session_aware, require the session key and use the conversation key when present.
  • Treat configured identity header names as the source of truth.
  • If identity is missing, use only explicit fallback behavior and mark the identity source as missing in diagnostics when diagnostics are captured.
  • If the required identity is missing, session-aware learning should no-op and allow the base routing decision to proceed.
  • Store identity source in diagnostics.
  • Treat conversation identity as distinct from session identity throughout request context, selection context, learning trace, and event-log records.

3. Learning Runtime

  • Run decision matching as today.
  • Run the decision's base algorithm to produce a proposed model.
  • Apply enabled Router Learning adaptations after base selection.
  • For session-aware learning:
    • read Router Memory online_state
    • read priors when a local snapshot exists
    • read conversation state and session state from online_state
    • evaluate the decision's adaptation mode
    • compute keep/switch/observe
    • produce a final model and structured trace
  • Allow protected carry-over of the previous model across decision modelRef boundaries only when mode is apply.
  • Never override a decision with mode: bypass.

4. Router Memory

  • Keep the request hot path free of required external storage reads.
  • Read online_state from in-process state or a local fast cache.
  • Read priors from an in-process snapshot refreshed outside the request path.
  • Treat shared online-state backends as optional synchronization layers with timeout and fail-open behavior, not mandatory per-request dependencies.
  • Use existing Router Replay records as the durable event stream for audit, debugging, eval, and prior generation.
  • Extend replay records with generic learning diagnostics instead of adding a parallel learning event store.
  • Reuse existing Router Replay storage backends and API surfaces:
    • store_backend: postgres|redis|milvus|qdrant|memory
    • /v1/router_replay
    • /v1/router_replay/aggregate
    • /v1/router_replay/trajectory
    • route-local router_replay plugin capture policy
  • Do not make learning depend on replay payload fields. Request bodies, response bodies, prompts, tool schemas, tool arguments, and tool outputs are controlled by the existing Router Replay capture policy.
  • Keep replay writes off the critical path where possible. Async writes should have bounded queues, backpressure metrics, and safe drop behavior for diagnostics when the replay backend is unhealthy.
  • Implement first-version online_state as single-replica in-process state:
    • conversation-scoped state keyed by session ID and conversation ID
    • session-scoped state keyed by session ID
  • Record session-aware online state:
    • previous model
    • turn count
    • active tool-loop state
    • non-portable context state
    • switch history
    • cache accounting
    • last decision and last learning reason
  • Leave priors as roadmap materialized views derived from event_log:
    • quality gaps
    • handoff penalties
    • remaining-turn priors
  • Future implementations can refresh prior snapshots through a background materializer or offline command.
  • Expire online state by idle timeout or configured TTL.

5. Event Log and Headers

  • Add structured learning diagnostics to Router Replay records.
  • Add compact generic response headers for bounded adaptation summaries: x-vsr-learning-methods, x-vsr-learning-actions, x-vsr-learning-scopes, x-vsr-learning-reasons, and x-vsr-learning-modes.
  • Keep x-vsr-replay-id as the stable pointer to full diagnostics.
  • Include cache details needed to debug cache protection:
    • prompt tokens
    • cached tokens
    • cache source
    • estimated cache gap where exact backend accounting is unavailable

6. Recipe Migration

  • Remove the synthetic agentic_session_route.
  • Configure global.services.router_replay when eval/debug event logs are required.
  • Configure global router.learning.adaptations.session_aware.
  • Keep simple, complex, privacy, and domain decisions as semantic decisions.
  • Mark privacy, security, and local-only policy decisions with mode: bypass.
  • Leave normal simple, complex, and domain decisions on the default mode: apply.

7. Evaluation

  • Update the agentic routing eval profile to send configured identity headers for session and conversation.
  • Capture x-vsr-learning-* and x-vsr-replay-id from router responses.
  • Validate semantic routing:
    • simple requests route to the simple model
    • complex requests route to the stronger model
    • privacy-sensitive requests route to the local model
    • domain requests route to the domain model
  • Validate conversation scope:
    • same session and same conversation protects tool-loop continuity
    • same session and new conversation releases tool-loop hard locks
    • same session and new conversation still accounts for cache and handoff cost
  • Validate session scope:
    • same session strongly preserves the established model across conversations
    • bypass decisions still override session protection
  • Validate Router Memory event log:
    • every learning adaptation has action, scope, reason, base model, and final model
    • cache accounting appears when available
    • observe mode records the hypothetical result without changing the final model
  • Report cache and cost evidence:
    • cached tokens
    • cache gap
    • estimated cost delta from auto routing and cache preservation
  • Report learning latency overhead at p50 and p95.

8. Documentation

  • Add a dedicated tutorial section under docs/tutorials/learning/.
  • Use the section as the user-facing home for Router Learning, Router Memory, and learning adaptations.
  • Start with these pages:
    • docs/tutorials/learning/overview.md
    • docs/tutorials/learning/session-aware.md
    • docs/tutorials/learning/memory-and-replay.md
    • docs/tutorials/learning/decision-adaptations.md
    • docs/tutorials/learning/priors.md
  • Keep future pages in the same section as new adaptations graduate:
    • docs/tutorials/learning/bandit.md
    • docs/tutorials/learning/personalization.md
    • docs/tutorials/learning/provider-health.md
  • Cover these concepts in the tutorials:
    • why Router Learning is separate from decision.algorithm
    • how router.learning.memory works at a product level
    • how router.learning.adaptations.<name> registers global behavior
    • how decision.adaptations.<name>.mode controls hard boundaries
    • session vs conversation identity and headers
    • apply, bypass, and observe
    • replay records, diagnostics, and cache evidence
    • how to evaluate a learning adaptation before enabling it broadly
  • Cross-link the tutorials from:
    • the AMD agentic routing recipe
    • the router config reference
    • any agentic routing blog post
    • future adaptation-specific proposal or reference pages

Future Learning Extensions

router.learning.adaptations is intentionally broader than session-aware routing. Future router-memory adaptations should use the same Router Memory plus global learning plus decision override model.

Example:

global:
services:
router_replay:
enabled: true
store_backend: postgres

router:
learning:
enabled: true
memory:
priors:
source: router_replay
populate_interval: 15m

adaptations:
session_aware:
enabled: true
scope: conversation

bandit:
enabled: false
exploration_rate: 0.05

personalization:
enabled: false

provider_health:
enabled: false

Decision-level controls can remain consistent:

adaptations:
session_aware:
mode: apply
bandit:
mode: observe
personalization:
mode: bypass

The common contract is:

FieldMeaning
enabledWhether the learning adaptation exists globally.
modeWhether a decision allows that learning adaptation to affect the final result.
observeSafe rollout mode that records evidence without changing routing.
diagnosticsEvery learning adaptation must explain base result, final result, action, and reason.

Potential future learning adaptations:

Learning AdaptationPurpose
banditOnline exploration and exploitation from feedback or judged outcomes.
personalizationUser or workspace preference learning.
provider_healthAvoid unhealthy model endpoints or providers.
cost_optimizerLearn cost/performance trade-offs from actual traffic.
quality_feedbackAdjust routing from explicit feedback or automatic judges.

These should not require new top-level decision fields. They should plug into global.router.learning.adaptations.<name> and decision.adaptations.<name>.

Confirmed Decisions

  • The public namespace is global.router.learning.memory and global.router.learning.adaptations.<name>.
  • Session-aware identity lives under global.router.learning.adaptations.session_aware.identity.
  • Session-aware identity uses a generic map: identity.headers.session and identity.headers.conversation, rather than one hard-coded config field per header.
  • Decision-level control uses top-level decision.adaptations.<name> and mirrors only the adaptation names registered under global.router.learning.adaptations.
  • The default decision mode is apply.
  • Hard policy boundaries explicitly use mode: bypass.
  • Unknown decision.adaptations.<name> entries should fail validation unless the global adaptation exists.
  • scope: conversation protects the active conversation and uses session memory only for soft trade-offs between conversations.
  • scope: session preserves the first selected model for the session until the session idles out, a bypass decision wins, or an explicit future reset clears the session state.
  • In mode: apply, a protected carry-over model may cross the current decision's modelRefs boundary. This is required for stable multi-decision agent conversations.
  • The migration is a one-time breaking change. Old API shapes should fail validation with actionable errors rather than silently rewrite.
  • Router Memory event-log records should reuse the existing Router Replay service, storage backends, route-local plugin, and replay API.
  • Response diagnostics should use a small generic x-vsr-learning-* header family plus the existing x-vsr-replay-id pointer. Adaptation-specific header names should not become the stable API.
  • The first implementation targets a single router replica.
  • Missing identity should no-op session-aware learning without failing the request.
  • Router Learning should not depend on raw replay payload fields. Payload persistence is governed by existing Router Replay capture configuration.
  • New replay diagnostics use learning.adaptations.<name> only; the new implementation should not compatibility-fill older session-aware-specific replay fields.
  • Replay-derived priors are roadmap work. The first implementation focuses on moving session-aware routing into Router Learning.

Closed Decisions For First Implementation

1. Hot Path

Request-time routing must not require synchronous external storage reads. Session-aware learning reads and writes in-process online state. Replay writes are off the critical path and should fail open.

2. Deployment Scope

The first implementation targets a single router replica. Multi-replica online state, shared Redis state, sticky routing requirements, and cross-replica consistency are roadmap items.

3. Missing Identity

Missing session or conversation identity should not fail the request. If the identity required for a session-aware protection is absent, the adaptation no-ops and the base routing decision proceeds. Diagnostics may record the missing identity when replay/debug evidence is captured.

4. Replay Payloads

Router Learning should not add a second payload persistence policy. Payload storage is governed by the existing Router Replay capture configuration. If an operator enables payload capture, payloads may be stored; Router Learning and its evals simply do not use those payload fields.

5. Replay Schema

Use the new generic diagnostics shape: learning.adaptations.<name>. Do not compatibility-fill older session-aware-specific replay fields for the new implementation.

6. Priors Roadmap

Replay-derived priors remain in the design and roadmap, but they are not part of the first implementation. The first implementation focuses on moving session-aware routing into Router Learning.

7. Future Adaptation Ordering

The first implementation has only session_aware, so it does not need public ordering controls. The design should leave room for future weighting, priorities, or phases when multiple adaptations can modify the same decision.

8. Eval Contract

The agentic routing eval should cover semantic routing, session-aware stability, new-conversation release, bypass behavior, cache evidence, learning latency overhead, and replay explainability.

Migration Tooling Decision

Do not provide an automatic config rewrite command in the first implementation. This is a clean new API, not a compatibility layer. Old config shapes should fail fast with actionable validation errors and documentation should show the new shape.

Success Criteria

  • A recipe enables session-aware behavior once globally.
  • Normal decisions do not need per-decision configuration.
  • Hard policy decisions can bypass learning with one small block.
  • Conversation scope and session scope are both expressible.
  • A new conversation in the same session does not inherit tool-loop hard locks from the previous conversation.
  • Session scope can express "keep the session's established model."
  • Event-log records and headers explain every keep, switch, bypass, and observe result.
  • Request-time routing does not require synchronous external storage reads.
  • Replay backend slowness or outage does not fail the routing request.
  • Missing session or conversation identity does not fail the request.
  • The first implementation is valid for a single router replica.
  • Future learning adaptations can be added without adding new decision-level concepts.