Multi Factor

Overview

multi_factor is a selection algorithm that composes four raw runtime signals (quality, latency, cost, and load) into a single weighted score per candidate. Optional SLO hard ceilings prune candidates before scoring.

It aligns to config/algorithm/selection/multi-factor.yaml and addresses issue #37.

Key Advantages

  • Single-decision SLO-aware routing without orchestrating multiple selectors.
  • Each signal is a live source: quality from quality_score config, latency from pkg/latency percentiles, cost from pricing, load from pkg/inflight.
  • Min-max normalization across the candidate set means weights have intuitive meaning regardless of absolute signal scales.
  • No model state to train. No external service required.
  • Hard SLO ceilings (TPOT, TTFT, cost, in-flight) prune unsafe candidates before scoring.

What Problem Does It Solve?

Real routes care about more than one dimension at once: a faster cheaper model and a slower better model both exist in the same candidate pool, and the "right" answer depends on current load and SLO targets, not just the static config. Existing single-signal selectors (latency_aware, cost-only routing, quality-only routing) force a hard choice. multi_factor lets one decision rule express a smooth tradeoff across all four dimensions, with optional hard SLO ceilings to fence off unsafe candidates.

When to Use

  • A decision has 2+ candidate models that differ along multiple dimensions (e.g. a faster cheaper model and a slower better model) and you want a smooth tradeoff knob.
  • You want SLO enforcement (e.g. "never route to a model with p95 TPOT > 200ms") without writing a separate decision rule.
  • Quality, latency, cost, and load all matter and no single one dominates.

Sibling Algorithms

  • latency_aware is a special case of multi_factor: latency-only scoring. Use it when the other dimensions truly do not matter.
  • hybrid composes other selectors (Elo + RouterDC + AutoMix) into one. multi_factor composes raw signals directly. Both are useful and complementary.
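
As a worked special case of the score defined under Algorithm Principle, and assuming latency_aware ranks candidates by the same normalized latency signal (an assumption; its internals are not described here), setting $w_L = 1$ and $w_Q = w_C = w_{\text{load}} = 0$ collapses the weighted sum to:

$$
\text{score}(m) = 1 - \hat{T}(m)
$$

so the candidate with the lowest normalized latency receives the highest score.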

Algorithm Principle

For each candidate model $m$ in the candidate set, after SLO filtering:

$$
\text{score}(m) = w_Q \cdot \hat{Q}(m) + w_L \cdot (1 - \hat{T}(m)) + w_C \cdot (1 - \hat{C}(m)) + w_{\text{load}} \cdot (1 - \hat{N}(m))
$$

Where:

  • $\hat{Q}(m)$, $\hat{T}(m)$, $\hat{C}(m)$, $\hat{N}(m)$ are the quality, latency, cost, and load values, min-max normalized to [0, 1] across the surviving candidate set.
  • Latency, cost, and load are inverted (1 - ...) because lower-is-better.
  • Quality is direct because higher-is-better.
  • Weights are normalized to sum to 1, with negative weights clamped to zero first. Equal weights are the default.
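
The scoring rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the router's implementation: the `Candidate` container and the function names are hypothetical, and falling back to equal weights when every weight clamps to zero is an assumption. The neutral 0.5 for a dimension with no spread follows the note under Known Limitations.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Raw signals for one model (hypothetical container, not the router's type)."""
    name: str
    quality: float      # higher is better
    latency_ms: float   # lower is better
    cost_per_1m: float  # lower is better
    inflight: int       # lower is better


def min_max(values):
    """Scale values to [0, 1]; a dimension with no spread maps to a neutral 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalize_weights(weights):
    """Clamp negative weights to zero, then rescale so the weights sum to 1."""
    clamped = {k: max(0.0, w) for k, w in weights.items()}
    total = sum(clamped.values())
    if total == 0:
        # Assumption: fall back to equal weights when nothing is positive.
        return {k: 1.0 / len(clamped) for k in clamped}
    return {k: w / total for k, w in clamped.items()}


def score_candidates(candidates, weights):
    """Return {name: score} using the weighted sum of normalized signals."""
    w = normalize_weights(weights)
    q = min_max([c.quality for c in candidates])
    t = min_max([c.latency_ms for c in candidates])
    cost = min_max([c.cost_per_1m for c in candidates])
    n = min_max([c.inflight for c in candidates])
    return {
        cand.name: w["quality"] * q[i]
        + w["latency"] * (1 - t[i])   # inverted: lower latency is better
        + w["cost"] * (1 - cost[i])   # inverted: lower cost is better
        + w["load"] * (1 - n[i])      # inverted: lower load is better
        for i, cand in enumerate(candidates)
    }
```

With two candidates, `Candidate("fast", 0.6, 50, 1.0, 10)` and `Candidate("strong", 0.9, 200, 5.0, 2)`, and weights 0.4 / 0.2 / 0.2 / 0.2, the scores come out at roughly 0.4 for `fast` and 0.6 for `strong`: the stronger model wins on quality and load, the faster one on latency and cost.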

SLO Filtering

Before scoring, any candidate that exceeds a non-zero ceiling is removed:

  • max_tpot_ms: TPOT at the configured percentile (p95 by default), observed via pkg/latency
  • max_ttft_ms: TTFT at the configured percentile (p95 by default), observed via pkg/latency
  • max_cost_per_1m: configured prompt price per 1M tokens
  • max_inflight: current in-flight request count from pkg/inflight
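
A minimal sketch of the filtering step, assuming the observed signals have already been collected into a plain record. The `Observed` and `SLO` types and both function names are hypothetical; only the ceiling names and the "zero means off" rule come from this page.

```python
from dataclasses import dataclass


@dataclass
class Observed:
    """Signals observed for one candidate (hypothetical record)."""
    name: str
    tpot_ms: float
    ttft_ms: float
    cost_per_1m: float
    inflight: int


@dataclass
class SLO:
    """Hard ceilings; 0 disables the corresponding check."""
    max_tpot_ms: float = 0.0
    max_ttft_ms: float = 0.0
    max_cost_per_1m: float = 0.0
    max_inflight: int = 0


def within_slo(c, slo):
    """True if the candidate exceeds none of the non-zero ceilings."""
    checks = [
        (c.tpot_ms, slo.max_tpot_ms),
        (c.ttft_ms, slo.max_ttft_ms),
        (c.cost_per_1m, slo.max_cost_per_1m),
        (c.inflight, slo.max_inflight),
    ]
    return all(ceiling == 0 or value <= ceiling for value, ceiling in checks)


def filter_by_slo(candidates, slo):
    """Keep only candidates that satisfy every active ceiling, preserving order."""
    return [c for c in candidates if within_slo(c, slo)]
```

For example, with `SLO(max_tpot_ms=200)` a candidate observed at 150 ms TPOT survives and one at 250 ms is dropped, while the default `SLO()` keeps every candidate.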

If all candidates are filtered out, behavior is controlled by on_no_candidates:

| Value | Behavior |
| --- | --- |
| cheapest (default) | Return the candidate with the lowest configured prompt_per_1m |
| first | Return the first candidate as listed |
| fail | Return an error to the caller |
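
The three fallback policies can be illustrated as follows. This sketch assumes the fallback chooses from the original, unfiltered candidate list, represented here as `(name, prompt_per_1m)` pairs in configured order; the function name and the exception types are hypothetical.

```python
def resolve_no_candidates(original, policy="cheapest"):
    """Pick a fallback model name when SLO filtering removed every candidate.

    `original` is the unfiltered list of (name, prompt_per_1m) pairs,
    in the order the candidates were configured.
    """
    if policy == "cheapest":
        # Lowest configured prompt price wins; ties keep the earlier entry.
        return min(original, key=lambda pair: pair[1])[0]
    if policy == "first":
        return original[0][0]
    if policy == "fail":
        raise RuntimeError("no candidate satisfies the SLO ceilings")
    raise ValueError(f"unknown on_no_candidates policy: {policy!r}")
```

Given `[("large", 5.0), ("small", 0.5)]`, `cheapest` returns `"small"`, `first` returns `"large"`, and `fail` raises so the caller can surface the error.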

Configuration

algorithm:
  type: multi_factor
  multi_factor:
    weights:
      quality: 0.4
      latency: 0.2
      cost: 0.2
      load: 0.2
    slo:
      max_tpot_ms: 200       # optional, omit for no ceiling
      max_ttft_ms: 800       # optional
      max_cost_per_1m: 5.0   # optional, USD per 1M prompt tokens
      max_inflight: 50       # optional
    latency_percentile: 95   # which percentile to read (default 95)
    on_no_candidates: cheapest

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| weights.quality | float | 0.25 | Weight for quality_score configured per model |
| weights.latency | float | 0.25 | Weight for percentile latency (lower-is-better, inverted) |
| weights.cost | float | 0.25 | Weight for prompt pricing (lower-is-better, inverted) |
| weights.load | float | 0.25 | Weight for in-flight request count (lower-is-better, inverted) |
| slo.max_tpot_ms | float | 0 (off) | Hard ceiling for p95 TPOT in milliseconds |
| slo.max_ttft_ms | float | 0 (off) | Hard ceiling for p95 TTFT in milliseconds |
| slo.max_cost_per_1m | float | 0 (off) | Hard ceiling for prompt cost per 1M tokens |
| slo.max_inflight | int | 0 (off) | Hard ceiling for concurrent in-flight requests |
| latency_percentile | int | 95 | Percentile read from pkg/latency (1-100) |
| on_no_candidates | string | cheapest | Fallback policy when SLO filters everything: cheapest, first, fail |
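
The defaults and ranges listed above could be applied and checked at load time roughly as follows. This is a hedged sketch: `MultiFactorConfig`, `load_config`, and `validate` are hypothetical names, the input is assumed to be the already-parsed `multi_factor` mapping, and both the per-key overlay of partial `weights` and the rejection of negative ceilings are assumptions rather than documented behavior.

```python
from dataclasses import dataclass, field

VALID_FALLBACKS = ("cheapest", "first", "fail")


def _default_weights():
    return {"quality": 0.25, "latency": 0.25, "cost": 0.25, "load": 0.25}


def _default_slo():
    # 0 means "no ceiling" for every SLO field.
    return {"max_tpot_ms": 0.0, "max_ttft_ms": 0.0,
            "max_cost_per_1m": 0.0, "max_inflight": 0}


@dataclass
class MultiFactorConfig:
    weights: dict = field(default_factory=_default_weights)
    slo: dict = field(default_factory=_default_slo)
    latency_percentile: int = 95
    on_no_candidates: str = "cheapest"


def load_config(raw):
    """Overlay a parsed `multi_factor` mapping on the documented defaults."""
    cfg = MultiFactorConfig()
    cfg.weights.update(raw.get("weights", {}))
    cfg.slo.update(raw.get("slo", {}))
    cfg.latency_percentile = raw.get("latency_percentile", cfg.latency_percentile)
    cfg.on_no_candidates = raw.get("on_no_candidates", cfg.on_no_candidates)
    return cfg


def validate(cfg):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not 1 <= cfg.latency_percentile <= 100:
        problems.append("latency_percentile must be in 1-100")
    if cfg.on_no_candidates not in VALID_FALLBACKS:
        problems.append("on_no_candidates must be one of cheapest, first, fail")
    # Assumption: a negative ceiling is treated as a mistake, not as "off".
    for name, ceiling in cfg.slo.items():
        if ceiling < 0:
            problems.append(f"slo.{name} must be >= 0")
    return problems
```

Calling `validate(load_config({}))` returns an empty list, since the documented defaults are all in range.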

Known Limitations

  • Quality scoring depends on quality_score being configured per model. Models without it contribute zero to the quality signal.
  • Min-max normalization is per-request across the candidate set, so the absolute scale of any signal does not matter. However, if all candidates have the same value on a dimension, that dimension contributes 0.5 (neutral) and does not affect the ranking.
  • Load uses an in-process tracker (pkg/inflight), so in multi-replica deployments each replica sees only its own load, not cluster-wide load. This is acceptable for the typical sidecar deployment; an external state store could be wired in later for true cluster-wide load awareness.
  • The in-flight tracker self-heals via TTL eviction (default 10 minutes) to recover from missed End calls. It cannot tell a missed End call apart from a request that is still running past that window, so such long requests will appear "free" to the selector. Tune via pkg/inflight.SetMaxAge if your workloads routinely exceed 10 minutes per request.
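
The TTL limitation is easier to see in a toy model. The class below is a hypothetical Python stand-in written for illustration, not the pkg/inflight implementation or its API; it only mirrors the behavior described above (entries older than a maximum age are evicted when the count is read).

```python
import time


class InflightTracker:
    """Toy in-flight counter with TTL eviction (illustrative only)."""

    def __init__(self, max_age_s=600.0, clock=time.monotonic):
        self._max_age_s = max_age_s
        self._clock = clock
        self._started = {}  # model -> {request_id: start time}

    def begin(self, model, request_id):
        self._started.setdefault(model, {})[request_id] = self._clock()

    def end(self, model, request_id):
        # A missed end() leaves the entry behind until the TTL evicts it.
        self._started.get(model, {}).pop(request_id, None)

    def count(self, model):
        """Evict entries older than max_age_s, then return what remains."""
        now = self._clock()
        entries = self._started.get(model, {})
        stale = [rid for rid, t0 in entries.items() if now - t0 > self._max_age_s]
        for rid in stale:
            del entries[rid]
        return len(entries)
```

In this model a request that began 601 seconds ago is dropped from the count whether its end call was lost or it is genuinely still running, which is exactly why long-running requests look "free" once they outlive the window.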