Giving AgentGateway a Semantic Brain with vLLM Semantic Router

June 28, 2026 · 10 min read

Aayush Saini

SDE, Data and AI @ Red Hat

Anup Sharma

AI & Distributed System @ Nutanix

vLLM Agent Architecture Workflow: Custom Semantic Routing with AgentGateway and Semantic Router

Agent systems that span multiple models — a local endpoint for coding, a frontier cloud model for deep reasoning, and a fast general-purpose model for everyday tasks — all face the same routing question: how should each request be directed to the right backend?

Many deployments start with a lightweight Python proxy or keyword matcher in front of the gateway. That approach works at small scale, but misroutes grow quickly as traffic, languages, and task types diversify. This post shows how vLLM Semantic Router running as an Envoy ExtProc sidecar inside AgentGateway replaces that pattern with semantic, config-driven routing.
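To make the failure mode concrete, here is a minimal sketch of the kind of keyword matcher described above. The backend names and keyword lists are hypothetical, chosen only for illustration; this is not code from AgentGateway or vLLM Semantic Router.

```python
# Hypothetical backend names and keyword lists, chosen only for illustration.
KEYWORD_ROUTES = {
    "local-coder": ("code", "bug", "function", "refactor"),
    "cloud-reasoner": ("prove", "analyze", "derive"),
}
DEFAULT_BACKEND = "fast-general"


def route_by_keyword(prompt: str) -> str:
    """Return the first backend whose keyword appears in the prompt."""
    text = prompt.lower()
    for backend, keywords in KEYWORD_ROUTES.items():
        if any(word in text for word in keywords):
            return backend
    return DEFAULT_BACKEND


if __name__ == "__main__":
    # Matches the intended backend.
    print(route_by_keyword("Fix the bug in this function"))
    # Misroute: a question about a dress code hits the "code" keyword.
    print(route_by_keyword("What is the dress code for the offsite?"))
    # Misroute: a coding request in Spanish matches no keyword at all.
    print(route_by_keyword("Arregla el error en este script"))
```

Substring matching has no notion of meaning, so the second prompt lands on the coding backend and the third falls through to the default. Each new language or task type needs another hand-written rule, which is the maintenance burden the post sets out to remove.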

Agentic Routing on AMD ROCm

June 18, 2026 · 14 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Haichen Zhang

Sr. AI Engineer @AMD

Andy Luo

Sr. Director @AMD

Most agent systems start with a simple idea: call model: auto and let the inference layer pick the right model. That is useful, but it is not enough for long-running agents.

A coding agent can begin with architecture work, call tools, receive short tool outputs, continue with "fix that", then ask a privacy-sensitive question in the same user session. The latest message may look simple, but the route cannot be chosen from the latest message alone. The router also has to know whether this is a safe moment to switch models.

This guide shows how to deploy that pattern on AMD ROCm with vLLM Semantic Router. You will start one ROCm vLLM backend, serve the agentic routing recipe, open the dashboard, validate the OpenAI-compatible API, and use Inferoa to experience route decisions and Router Learning behavior from an agent client.

Agent session routed through router memory to model paths
Agentic routing is not only choosing a model. It is choosing when to keep one.
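The idea of keeping a model until it is safe to switch can be sketched as a small piece of session state. Everything here is an assumption made for illustration: the `SessionState` fields, the `choose_model` helper, and the rule that a switch is only safe when no tool calls are in flight. The actual router may use different signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical per-session memory kept by the router."""
    current_model: Optional[str] = None
    pending_tool_calls: int = 0  # tool calls issued but not yet answered


def choose_model(state: SessionState, proposed_model: str) -> str:
    """Keep the session's model unless this turn is a safe point to switch."""
    if state.current_model is None:
        state.current_model = proposed_model  # first turn: accept the proposal
    elif state.pending_tool_calls == 0:
        state.current_model = proposed_model  # assumed safe: no tool loop in flight
    # Otherwise stay on the current model until the tool loop finishes.
    return state.current_model


if __name__ == "__main__":
    state = SessionState()
    print(choose_model(state, "cloud-reasoner"))  # first turn
    state.pending_tool_calls = 1
    print(choose_model(state, "fast-general"))    # mid tool loop: model is kept
    state.pending_tool_calls = 0
    print(choose_model(state, "local-private"))   # loop finished: switch allowed
```

In this sketch a short follow-up such as "fix that" arriving mid tool loop does not move the session to a cheaper model, even if a per-message classifier would propose one.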

Deploying vLLM Semantic Router on AMD Developer Cloud

March 25, 2026 · 12 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Haichen Zhang

Sr. AI Engineer @AMD

Andy Luo

Sr. Director @AMD

AMD Developer Cloud and vLLM Semantic Router overview

Running vLLM Semantic Router on AMD Developer Cloud is not just about bringing up one more inference endpoint. It is about turning it into a routed multi-tier system that can classify requests, choose a semantic lane, and make replay and Insights immediately useful.

This post walks through the practical path: start the ROCm backend on an AMD Developer Cloud instance, install vLLM-SR, import the reference profile, and validate the deployment end to end.

v0.3 Themis Roadmap: Stability at Scale

March 12, 2026 · 10 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Huamin Chen

Distinguished Engineer @ Red Hat

v0.3, codename Themis, is our production-readiness release for Semantic Router. The theme is simple: Stability at Scale. After Athena expanded the system brain, Themis is the release where we make that intelligence dependable across real environments, clearer to operate, and safer to ship into production.

This roadmap is not just about adding more capability. It is about making the full system coherent: one stable contract across Docker and Kubernetes, one cleaner deployment path, one real version story for images and packages, stronger performance validation on both NVIDIA and AMD, and a research track that directly improves the product instead of sitting outside it.

vLLM Semantic Router v0.2 Athena: ClawOS, Model Refresh, and the System Brain

March 10, 2026 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Huamin Chen

Distinguished Engineer @ Red Hat


Athena is the first major hardening step after Iris. It refreshes the model stack, extends routing into safety and semantic control, and starts shaping the system brain needed to make Semantic Router easier to govern, operate, and scale in real deployments.

Building Mixture-of-Models on AMD GPUs with vLLM-SR

January 23, 2026 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM


Building Mixture-of-Models on AMD GPUs is not just about serving one more model on one more device. It is about turning routing, governance, and inference into a coordinated system so MoM workloads can run efficiently on AMD hardware at production scale.

vLLM Semantic Router v0.1 Iris: The First Major Release

January 5, 2026 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM

We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we've witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide.

In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.

Synced from official vLLM Blog: vLLM Semantic Router v0.1 Iris: The First Major Release


AMD × vLLM Semantic Router: Building the System Intelligence Together

December 16, 2025 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.

AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: intelligent routing and governance for Mixture-of-Models (MoM) systems.

Synced from official vLLM Blog: AMD × vLLM Semantic Router: Building the System Intelligence Together


Token-Level Truth: Real-Time Hallucination Detection for Production LLMs

December 14, 2025 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM

Huamin Chen

Distinguished Engineer @ Red Hat

Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.

Building on our Signal-Decision Architecture, we introduce HaluGate—a conditional, token-level hallucination detection pipeline that catches unsupported claims before they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery.

Synced from official vLLM Blog: Token-Level Truth: Real-Time Hallucination Detection for Production LLMs


Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale

November 19, 2025 · 1 min read

Xunzhuo Liu

Intelligent Routing @vLLM

The earlier versions of vLLM Semantic Router relied on classification-based routing, a straightforward approach where user queries are classified into one of 14 MMLU domain categories, and then routed to corresponding models. While this worked for basic scenarios, we quickly discovered its limitations when building production AI systems for enterprises.
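Classification-based routing of the kind described here can be sketched as a single label followed by a table lookup. The category subset, the model names, and the `toy_classifier` stand-in are all hypothetical; the real router used a trained classifier over 14 categories, which is not reproduced here.

```python
from typing import Callable, Dict

# Illustrative subset of domain categories mapped to hypothetical model names.
CATEGORY_TO_MODEL: Dict[str, str] = {
    "math": "math-tuned-model",
    "law": "legal-tuned-model",
    "computer science": "code-tuned-model",
}
FALLBACK_MODEL = "general-model"


def route_by_category(query: str, classify: Callable[[str], str]) -> str:
    """Classify the query into one category, then look up that category's model."""
    category = classify(query)
    return CATEGORY_TO_MODEL.get(category, FALLBACK_MODEL)


def toy_classifier(query: str) -> str:
    """Stand-in for a trained domain classifier; not a real model."""
    return "math" if "integral" in query.lower() else "other"


if __name__ == "__main__":
    print(route_by_category("Evaluate the integral of x^2", toy_classifier))
    print(route_by_category("Draft a contract and estimate its cost", toy_classifier))
```

In this design every query resolves to exactly one category and therefore one model, so the routing decision can only be as expressive as the category list.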

Synced from official vLLM Blog: Signal-Decision Driven Architecture: Reshaping Semantic Routing at Scale
