Blog

Journal

Release notes, field reports, and research commentary from the vLLM Semantic Router project.

Giving AgentGateway a Semantic Brain with vLLM Semantic Router

· 10 min read
Aayush Saini
SDE, Data and AI @ Red Hat
Anup Sharma
AI & Distributed System @ Nutanix

vLLM Agent Architecture Workflow: Custom Semantic Routing with AgentGateway and Semantic Router

Agent systems that span multiple models — a local endpoint for coding, a frontier cloud model for deep reasoning, and a fast general-purpose model for everyday tasks — all face the same routing question: how should each request be directed to the right backend?

Many deployments start with a lightweight Python proxy or keyword matcher in front of the gateway. That approach works at small scale, but misroutes grow quickly as traffic, languages, and task types diversify. This post shows how vLLM Semantic Router, running as an Envoy ExtProc sidecar inside AgentGateway, replaces that pattern with semantic, config-driven routing.
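To make the misrouting problem concrete, here is a minimal sketch of the kind of keyword matcher described above. The backend names and keyword lists are illustrative assumptions, not taken from AgentGateway or vLLM Semantic Router.

```python
# Hypothetical keyword matcher; backend names and keywords are
# illustrative assumptions, not part of any real gateway config.
KEYWORD_ROUTES = {
    "local-coder": ("code", "function", "bug", "python"),
    "frontier-reasoner": ("prove", "analyze", "strategy"),
}
DEFAULT_BACKEND = "fast-general"


def keyword_route(prompt: str) -> str:
    """Return the first backend whose keyword appears in the prompt."""
    text = prompt.lower()
    for backend, keywords in KEYWORD_ROUTES.items():
        # Plain substring matching: cheap, but blind to meaning.
        if any(word in text for word in keywords):
            return backend
    return DEFAULT_BACKEND


if __name__ == "__main__":
    for prompt in (
        "Fix the bug in this function",
        "What is the dress code for the gala?",
        "Write a sorting routine in Rust",
    ):
        print(f"{prompt!r} -> {keyword_route(prompt)}")
```

The first prompt routes as intended. The second lands on the coding backend only because it contains the word "code", and the third, a genuine coding task, falls through to the default backend because none of its words are in the list. Failures of this kind are what tend to multiply as languages and task types diversify.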

Agentic Routing on AMD ROCm

· 14 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Haichen Zhang
Sr. AI Engineer @AMD
Andy Luo
Sr. Director @AMD

Most agent systems start with a simple idea: call model: auto and let the inference layer pick the right model. That is useful, but it is not enough for long-running agents.

A coding agent can begin with architecture work, call tools, receive short tool outputs, continue with "fix that", then ask a privacy-sensitive question in the same user session. The latest message may look simple, but the route cannot be chosen from the latest message alone. The router also has to know whether this is a safe moment to switch models.
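The idea of routing on session state rather than on the latest message alone can be sketched as follows. This is a simplified illustration under stated assumptions: the `SessionState` fields, the model names, and the rule "switch only when no tool calls are pending" are all hypothetical and do not describe the router's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical per-session memory kept by a router."""
    current_model: Optional[str] = None
    pending_tool_calls: int = 0  # tool calls still awaiting results


def choose_model(state: SessionState, candidate: str) -> str:
    """Pick the model for the next turn.

    `candidate` is what a per-message classifier would choose on its own.
    The session keeps its current model while tool calls are in flight
    and switches only at a turn boundary (an assumed "safe moment").
    """
    if state.current_model is None:
        state.current_model = candidate
    elif state.pending_tool_calls == 0 and candidate != state.current_model:
        state.current_model = candidate
    return state.current_model


if __name__ == "__main__":
    state = SessionState()
    print(choose_model(state, "reasoning-model"))  # first turn: adopt candidate
    state.pending_tool_calls = 1
    print(choose_model(state, "fast-model"))       # mid tool loop: keep model
    state.pending_tool_calls = 0
    print(choose_model(state, "fast-model"))       # turn boundary: switch
```

In this sketch a short follow-up such as "fix that" would make a per-message classifier propose a small model, but the session stays on the model that holds the working context until the tool loop has finished.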

This guide shows how to deploy that pattern on AMD ROCm with vLLM Semantic Router. You will start one ROCm vLLM backend, serve the agentic routing recipe, open the dashboard, validate the OpenAI-compatible API, and use Inferoa to experience route decisions and Router Learning behavior from an agent client.

Agent session routed through router memory to model paths
Agentic routing is not only choosing a model. It is choosing when to keep one.

Deploying vLLM Semantic Router on AMD Developer Cloud

· 12 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Haichen Zhang
Sr. AI Engineer @AMD
Andy Luo
Sr. Director @AMD

AMD Developer Cloud and vLLM Semantic Router overview

Running vLLM Semantic Router on AMD Developer Cloud is not just about bringing up one more inference endpoint. It is about turning that endpoint into a routed multi-tier system that can classify requests, choose a semantic lane, and make replay and Insights immediately useful.

This post walks through the practical path: start the ROCm backend on an AMD Developer Cloud instance, install vLLM-SR, import the reference profile, and validate the deployment end to end.

v0.3 Themis Roadmap: Stability at Scale

· 10 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat

v0.3, codename Themis, is our production-readiness release for Semantic Router. The theme is simple: Stability at Scale. After Athena expanded the system brain, Themis is the release where we make that intelligence dependable across real environments, clearer to operate, and safer to ship into production.

This roadmap is not just about adding more capability. It is about making the full system coherent: one stable contract across Docker and Kubernetes, one cleaner deployment path, one real version story for images and packages, stronger performance validation on both NVIDIA and AMD, and a research track that directly improves the product instead of sitting outside it.


vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

· 1 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat


Athena is the first major hardening step after Iris. It refreshes the model stack, extends routing into safety and semantic control, and starts shaping the system brain needed to make Semantic Router easier to govern, operate, and scale in real deployments.

vLLM Semantic Router v0.1 Iris: The First Major Release

· 1 min read
Xunzhuo Liu
Intelligent Routing @vLLM

We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide.

In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.

Synced from official vLLM Blog: vLLM Semantic Router v0.1 Iris: The First Major Release



AMD × vLLM Semantic Router: Building the System Intelligence Together

· 1 min read
Xunzhuo Liu
Intelligent Routing @vLLM

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.

AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: intelligent routing and governance for Mixture-of-Models (MoM) systems.

Synced from official vLLM Blog: AMD × vLLM Semantic Router: Building the System Intelligence Together



Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

· 1 min read
Xunzhuo Liu
Intelligent Routing @vLLM
Huamin Chen
Distinguished Engineer @ Red Hat

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.

Building on our Signal-Decision Architecture, we introduce HaluGate—a conditional, token-level hallucination detection pipeline that catches unsupported claims before they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery.

Synced from official vLLM Blog: Token-Level Truth: Real-Time Hallucination Detection for Production LLMs



Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale

· 1 min read
Xunzhuo Liu
Intelligent Routing @vLLM

Earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach in which each user query is classified into one of 14 MMLU domain categories and then routed to the corresponding model. While this worked for basic scenarios, we quickly discovered its limitations when building production AI systems for enterprises.
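Classification-based routing of this kind reduces to a label lookup. The sketch below is a hedged illustration, not the project's code: the category-to-model table lists only three example categories, the model names are invented, and `toy_classifier` is a keyword stand-in for a trained domain classifier.

```python
from typing import Callable

# Illustrative subset of a category-to-model table; the category
# labels and model names are assumptions for this sketch.
CATEGORY_TO_MODEL = {
    "math": "math-model",
    "law": "legal-model",
    "computer science": "code-model",
}
FALLBACK_MODEL = "general-model"


def route_by_category(query: str, classify: Callable[[str], str]) -> str:
    """Classify the query into one category, then look up its model."""
    category = classify(query)
    return CATEGORY_TO_MODEL.get(category, FALLBACK_MODEL)


def toy_classifier(query: str) -> str:
    """Keyword stand-in for a trained domain classifier (hypothetical)."""
    text = query.lower()
    if "integral" in text or "equation" in text:
        return "math"
    if "contract" in text:
        return "law"
    return "other"


if __name__ == "__main__":
    print(route_by_category("Solve this equation for x", toy_classifier))
    print(route_by_category("Summarize this contract", toy_classifier))
    print(route_by_category("Plan my weekend", toy_classifier))
```

Because each query receives exactly one label, a request that mixes domains or carries other relevant signals still collapses to a single category. That is one plausible reading of the limitations the post mentions.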

Synced from official vLLM Blog: Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale
