Install with AgentGateway

This guide provides step-by-step instructions for integrating the vLLM Semantic Router with AgentGateway on Kubernetes. AgentGateway acts as the Gateway API data plane for OpenAI-compatible traffic, and vLLM Semantic Router runs as an Envoy ExtProc server that classifies each request and mutates the request body before AgentGateway forwards it to vLLM.

Architecture Overview

The deployment consists of:

  • vLLM Semantic Router: Provides prompt classification, model selection, request mutation, and response processing through ExtProc
  • AgentGateway: Provides the Kubernetes Gateway API proxy, AgentgatewayBackend, HTTPRoute, and AgentgatewayPolicy resources
  • Demo vLLM-compatible backend: Serves a base model and LoRA adapters through an OpenAI-compatible API

Prerequisites

Before starting, ensure you have the following tools installed:

  • kind - Kubernetes in Docker (Optional)
  • kubectl - Kubernetes CLI
  • Helm - Package manager for Kubernetes

This guide requires AgentGateway v1.3.0-alpha.1 or newer because it uses the ExtProc processingOptions and allowModeOverride fields that were added after v1.2.1.
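If you script the installation, a small guard can catch an older tag before any resources are applied. The following is a minimal POSIX shell sketch. The function name agw_version_ok is hypothetical. It approximates the requirement by comparing only the major and minor numbers, so every v1.3.x tag, pre-releases included, is treated as new enough.

```shell
# Hypothetical helper: succeed (exit 0) if an AgentGateway tag is on the v1.3 line or newer.
# Only major.minor are compared, so this approximates "v1.3.0-alpha.1 or newer".
# Returns 1 for older tags and 2 for tags it cannot parse.
agw_version_ok() {
  local ver major rest minor
  ver=${1#v}          # strip a leading "v"
  ver=${ver%%-*}      # drop a pre-release suffix such as "-alpha.1"
  major=${ver%%.*}
  rest=${ver#*.}
  minor=${rest%%.*}
  # Reject tags whose major or minor component is not a plain number.
  case "$major" in ''|*[!0-9]*) return 2 ;; esac
  case "$minor" in ''|*[!0-9]*) return 2 ;; esac
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 3 ]; }
}
```

For example, after exporting AGENTGATEWAY_VERSION in Step 2, run agw_version_ok "$AGENTGATEWAY_VERSION" || echo "AgentGateway version is too old" and stop if it prints the warning.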

Step 1: Create Kind Cluster (Optional)

Create a local Kubernetes cluster for testing:

kind create cluster --name semantic-router-agentgateway

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

Step 2: Install AgentGateway

Install the Kubernetes Gateway API CRDs and the AgentGateway control plane:

export AGENTGATEWAY_VERSION=v1.3.0-alpha.1

kubectl apply --server-side --force-conflicts \
-f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml

helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
--create-namespace \
--namespace agentgateway-system \
--version "${AGENTGATEWAY_VERSION}" \
--set controller.image.pullPolicy=Always

helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
--namespace agentgateway-system \
--version "${AGENTGATEWAY_VERSION}" \
--set controller.image.pullPolicy=Always \
--set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
--wait

kubectl get pods -n agentgateway-system
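Applying custom resources immediately after installing their CRDs can fail while the API server is still registering them. One way to avoid that race is to wait for each CRD to report the Established condition. This is a sketch, and the wait_for_crds function is hypothetical. The two agentgateway.dev CRD names are assumed from the plural forms of the AgentgatewayBackend and AgentgatewayPolicy kinds, so confirm them with kubectl get crd | grep agentgateway before relying on it.

```shell
# CRDs used by the manifests later in this guide.
# The agentgateway.dev names are assumptions derived from the resource kinds.
AGW_REQUIRED_CRDS="gateways.gateway.networking.k8s.io
httproutes.gateway.networking.k8s.io
agentgatewaybackends.agentgateway.dev
agentgatewaypolicies.agentgateway.dev"

# Hypothetical helper: block until every CRD above is Established, or fail on the first timeout.
wait_for_crds() {
  local crd
  for crd in $AGW_REQUIRED_CRDS; do
    kubectl wait --for=condition=Established "crd/$crd" --timeout=120s || return 1
  done
}
```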

Step 3: Create an AgentGateway Proxy

Create a Gateway that uses the AgentGateway GatewayClass:

kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - protocol: HTTP
    port: 80
    name: http
    allowedRoutes:
      namespaces:
        from: All
EOF

kubectl wait --for=condition=Available deployment/agentgateway-proxy \
-n agentgateway-system \
--timeout=300s
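The Deployment check confirms that the proxy pods are running. Gateway API implementations also report a Programmed condition on the Gateway resource once the listener configuration is in place. The sketch below waits on that condition, assuming AgentGateway sets this standard condition. The function names are hypothetical.

```shell
# Hypothetical helper: wait for the standard Gateway API "Programmed" condition on the Gateway.
gateway_programmed() {
  kubectl wait --for=condition=Programmed gateway/agentgateway-proxy \
    -n agentgateway-system --timeout=300s
}

# Hypothetical helper: print the first address reported in the Gateway status.
# On kind clusters without a LoadBalancer implementation this may be empty;
# the port-forward used later in this guide works either way.
gateway_address() {
  kubectl get gateway agentgateway-proxy -n agentgateway-system \
    -o jsonpath='{.status.addresses[0].value}'
}
```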

Step 4: Deploy Demo LLM

Deploy a lightweight OpenAI-compatible simulator that serves a base model named base-model, plus the LoRA adapter names that the Semantic Router demo configuration selects:

kubectl apply -f- <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
        imagePullPolicy: IfNotPresent
        args:
        - --model
        - base-model
        - --port
        - "8000"
        - --max-loras
        - "6"
        - --lora-modules
        - '{"name": "math-expert"}'
        - '{"name": "science-expert"}'
        - '{"name": "social-expert"}'
        - '{"name": "humanities-expert"}'
        - '{"name": "law-expert"}'
        - '{"name": "general-expert"}'
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health
            port: http
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct
spec:
  type: ClusterIP
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
  selector:
    app: vllm-llama3-8b-instruct
EOF

kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct \
-n default \
--timeout=300s
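Before wiring up Semantic Router, you can confirm that the simulator serves every model name the demo configuration may select. The sketch below reads an OpenAI-style /v1/models response on stdin and prints any expected ID that is missing. It assumes the simulator exposes the standard /v1/models endpoint and lists each --lora-modules entry under its own "id". missing_models and EXPECTED_MODELS are hypothetical names.

```shell
# Model IDs the Deployment above is expected to serve (base model plus LoRA adapters).
EXPECTED_MODELS="base-model math-expert science-expert social-expert humanities-expert law-expert general-expert"

# Hypothetical helper: read a /v1/models JSON response on stdin and print each
# expected ID that does not appear as an "id" value. Empty output means all are present.
# Assumes IDs contain no escaped quotes, which holds for the names in this guide.
missing_models() {
  local ids m
  # Extract the value of every "id" field, one per line.
  ids=$(grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed 's/.*"\([^"]*\)"$/\1/')
  for m in $EXPECTED_MODELS; do
    printf '%s\n' "$ids" | grep -qxF -- "$m" || printf '%s\n' "$m"
  done
}
```

For example, with kubectl port-forward -n default svc/vllm-llama3-8b-instruct 8000:8000 running in another terminal, curl -s http://localhost:8000/v1/models | missing_models should print nothing.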

Step 5: Deploy vLLM Semantic Router

Install the Semantic Router in the agentgateway-system namespace so the AgentGateway ExtProc policy can reference the semantic-router service directly:

helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace agentgateway-system \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/agentgateway/semantic-router-values/values.yaml

kubectl wait --for=condition=Available deployment/semantic-router \
-n agentgateway-system \
--timeout=600s

The values file configures Semantic Router to send traffic to vllm-llama3-8b-instruct.default.svc.cluster.local:8000 and to select adapter names such as math-expert, science-expert, and general-expert.
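The ExtProc policy in Step 7 points at port 50051 on the semantic-router Service. A port mismatch is easier to catch at this stage than after requests start failing. The hypothetical helper below checks that a Service declares a given port. It inspects only the Service spec, not whether the pods behind it are ready.

```shell
# Hypothetical helper: succeed if Service $1 in namespace $2 declares port $3.
svc_has_port() {
  local ports p
  ports=$(kubectl get svc "$1" -n "$2" -o jsonpath='{.spec.ports[*].port}') || return 1
  for p in $ports; do
    if [ "$p" = "$3" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, run svc_has_port semantic-router agentgateway-system 50051 once the Helm release reports Available.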

Step 6: Create AgentGateway Routing Resources

Create an AgentgatewayBackend for the vLLM-compatible backend and route OpenAI-compatible requests to it:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {}
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: semantic-router-vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF

The openai.model field is intentionally omitted so AgentGateway uses the model name from the request body after Semantic Router selects the target model or LoRA adapter.
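You can confirm that AgentGateway attached the route and resolved the AgentgatewayBackend reference by reading the standard Gateway API route conditions from the HTTPRoute status. The sketch below uses a kubectl JSONPath filter. route_condition and route_ready are hypothetical names, and the check assumes a single parent Gateway, as in this guide.

```shell
# Hypothetical helper: print the status ("True"/"False"/"Unknown") of one route condition.
route_condition() {
  kubectl get httproute semantic-router-vllm -n agentgateway-system \
    -o "jsonpath={.status.parents[*].conditions[?(@.type==\"$1\")].status}"
}

# Accepted=True: the Gateway attached the route.
# ResolvedRefs=True: the AgentgatewayBackend reference was resolved.
route_ready() {
  [ "$(route_condition Accepted)" = "True" ] && [ "$(route_condition ResolvedRefs)" = "True" ]
}
```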

Step 7: Attach Semantic Router as ExtProc

Create an AgentgatewayPolicy that sends request and response processing phases to the Semantic Router ExtProc service:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
      allowModeOverride: true
EOF

The demo buffers request bodies to match the existing Envoy AI Gateway and Istio examples. For large prompts, set requestBodyMode: FullDuplexStreamed and use a Semantic Router configuration that enables streamed body handling. AgentGateway does not support the Streamed mode, so FullDuplexStreamed is the only streaming option.
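If you later switch to streamed bodies, you can apply the gateway side of that change in place with a JSON merge patch. The sketch below assumes the field layout of the policy in Step 7. It only changes AgentGateway, so Semantic Router still needs its own streamed-body configuration. The function name is hypothetical, and it deliberately accepts only the two modes this guide mentions.

```shell
# Hypothetical helper: set requestBodyMode on the ExtProc policy via a JSON merge patch.
# Assumes the spec.traffic.extProc.processingOptions layout used in Step 7.
set_request_body_mode() {
  case "$1" in
    Buffered|FullDuplexStreamed) ;;
    Streamed)
      echo "AgentGateway does not support Streamed; use FullDuplexStreamed" >&2
      return 1 ;;
    *)
      echo "unsupported requestBodyMode in this sketch: $1" >&2
      return 1 ;;
  esac
  kubectl patch agentgatewaypolicy semantic-router-extproc -n agentgateway-system \
    --type=merge \
    -p "{\"spec\":{\"traffic\":{\"extProc\":{\"processingOptions\":{\"requestBodyMode\":\"$1\"}}}}}"
}
```

A merge patch is used because strategic merge patches are not supported for custom resources.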

Testing the Deployment

Start a port-forward to the AgentGateway proxy:

kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80

In another terminal, send an OpenAI-compatible request with "model": "auto":

curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'

Semantic Router should classify the math prompt, select the configured math route, and rewrite the model field in the request body before AgentGateway forwards the request to the vLLM-compatible backend. The -i flag prints the response headers, so you can inspect the headers that Semantic Router adds, such as the selected model metadata.
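The -i output mixes headers and the JSON body. The sketch below reduces the header block to lines with a given name prefix. The default prefix x-vsr- is an assumption about how Semantic Router names its headers, so pass a different prefix if your version uses another one. router_headers is a hypothetical name.

```shell
# Hypothetical helper: from `curl -i` output on stdin, print header lines whose
# name starts with the given prefix (default "x-vsr-", an assumption).
# Matching is case-insensitive, as HTTP header names are.
router_headers() {
  awk -v p="$(printf '%s' "${1:-x-vsr-}" | tr '[:upper:]' '[:lower:]')" '
    { sub(/\r$/, "") }                   # curl -i keeps the CRLF line endings
    $0 == "" { exit }                    # a blank line ends the header block
    index(tolower($0), p) == 1 { print }
  '
}
```

For example, add -s to the curl command above to hide the progress meter and pipe its output into router_headers.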

Troubleshooting

AgentGateway proxy not ready:

kubectl get gateway agentgateway-proxy -n agentgateway-system
kubectl get deployment agentgateway-proxy -n agentgateway-system
kubectl logs -n agentgateway-system deployment/agentgateway

HTTPRoute or AgentGateway backend not accepted:

kubectl describe httproute semantic-router-vllm -n agentgateway-system
kubectl describe agentgatewaybackend semantic-router-vllm -n agentgateway-system

Semantic Router not responding to ExtProc:

kubectl get pods -n agentgateway-system
kubectl get svc semantic-router -n agentgateway-system
kubectl logs -n agentgateway-system deployment/semantic-router
kubectl describe agentgatewaypolicy semantic-router-extproc -n agentgateway-system

Demo LLM not responding:

kubectl get pods -n default -l app=vllm-llama3-8b-instruct
kubectl logs -n default deployment/vllm-llama3-8b-instruct
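When filing an issue, it helps to capture the output of the commands above in one place. The sketch below runs them and writes each result to a separate file. It records failures instead of stopping at the first one. The function names and file names are hypothetical.

```shell
# Hypothetical helper: run a command, saving stdout and stderr to a file;
# on failure, append a note instead of aborting the whole collection.
diag_run() {
  local diag_file=$1
  shift
  "$@" >"$diag_file" 2>&1 || echo "command failed: $*" >>"$diag_file"
}

# Hypothetical helper: collect the troubleshooting output into one directory.
collect_diagnostics() {
  local diag_dir=${1:-./agentgateway-diagnostics}
  mkdir -p "$diag_dir" || return 1
  diag_run "$diag_dir/gateway.txt" kubectl get gateway agentgateway-proxy -n agentgateway-system -o yaml
  diag_run "$diag_dir/proxy-deployment.txt" kubectl get deployment agentgateway-proxy -n agentgateway-system
  diag_run "$diag_dir/httproute.txt" kubectl describe httproute semantic-router-vllm -n agentgateway-system
  diag_run "$diag_dir/backend.txt" kubectl describe agentgatewaybackend semantic-router-vllm -n agentgateway-system
  diag_run "$diag_dir/policy.txt" kubectl describe agentgatewaypolicy semantic-router-extproc -n agentgateway-system
  diag_run "$diag_dir/agentgateway.log" kubectl logs -n agentgateway-system deployment/agentgateway
  diag_run "$diag_dir/semantic-router.log" kubectl logs -n agentgateway-system deployment/semantic-router
  diag_run "$diag_dir/vllm-sim.log" kubectl logs -n default deployment/vllm-llama3-8b-instruct
  printf '%s\n' "$diag_dir"
}
```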

Cleanup

To remove the entire deployment:

kubectl delete agentgatewaypolicy semantic-router-extproc -n agentgateway-system
kubectl delete httproute semantic-router-vllm -n agentgateway-system
kubectl delete agentgatewaybackend semantic-router-vllm -n agentgateway-system
kubectl delete gateway agentgateway-proxy -n agentgateway-system
kubectl delete deployment vllm-llama3-8b-instruct -n default
kubectl delete service vllm-llama3-8b-instruct -n default

helm uninstall semantic-router -n agentgateway-system
helm uninstall agentgateway -n agentgateway-system
helm uninstall agentgateway-crds -n agentgateway-system

kind delete cluster --name semantic-router-agentgateway