Install with AgentGateway

This guide walks through integrating the vLLM Semantic Router with AgentGateway on Kubernetes. AgentGateway acts as the Gateway API data plane for OpenAI-compatible traffic, and vLLM Semantic Router runs as an Envoy ExtProc server that classifies each request and mutates the request body before AgentGateway forwards it to the vLLM-compatible backend.

Architecture Overview

The deployment consists of:

vLLM Semantic Router: Provides prompt classification, model selection, request mutation, and response processing through ExtProc
AgentGateway: Provides the Kubernetes Gateway API proxy, AgentgatewayBackend, HTTPRoute, and AgentgatewayPolicy resources
Demo vLLM-compatible backend: Serves a base model and LoRA adapters through an OpenAI-compatible API

Prerequisites

Before starting, ensure you have the following tools installed:

kind - Kubernetes in Docker (Optional)
kubectl - Kubernetes CLI
Helm - Package manager for Kubernetes

This guide requires AgentGateway v1.3.0-alpha.1 or newer because it uses the ExtProc processingOptions and allowModeOverride fields that were added after v1.2.1.

Step 1: Create Kind Cluster (Optional)

Create a local Kubernetes cluster for testing:

kind create cluster --name semantic-router-agentgateway

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

Step 2: Install AgentGateway

Install the Kubernetes Gateway API CRDs and the AgentGateway control plane:

export AGENTGATEWAY_VERSION=v1.3.0-alpha.1

kubectl apply --server-side --force-conflicts \
  -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.5.0/standard-install.yaml

helm upgrade -i agentgateway-crds oci://cr.agentgateway.dev/charts/agentgateway-crds \
  --create-namespace \
  --namespace agentgateway-system \
  --version "${AGENTGATEWAY_VERSION}" \
  --set controller.image.pullPolicy=Always

helm upgrade -i agentgateway oci://cr.agentgateway.dev/charts/agentgateway \
  --namespace agentgateway-system \
  --version "${AGENTGATEWAY_VERSION}" \
  --set controller.image.pullPolicy=Always \
  --set controller.extraEnv.KGW_ENABLE_GATEWAY_API_EXPERIMENTAL_FEATURES=true \
  --wait

kubectl get pods -n agentgateway-system

Step 3: Create an AgentGateway Proxy

Create a Gateway that uses the AgentGateway GatewayClass:

kubectl apply -f- <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agentgateway-proxy
  namespace: agentgateway-system
spec:
  gatewayClassName: agentgateway
  listeners:
  - protocol: HTTP
    port: 80
    name: http
    allowedRoutes:
      namespaces:
        from: All
EOF

kubectl wait --for=condition=Available deployment/agentgateway-proxy \
  -n agentgateway-system \
  --timeout=300s
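
The Deployment becoming Available only shows that the proxy pod is running. As an optional extra check, the sketch below waits on the Gateway resource itself. It assumes the AgentGateway controller reports the standard Gateway API Programmed condition on the Gateway; the function name wait_gateway_programmed is illustrative and not part of any tool.

```shell
# Hypothetical helper: block until the Gateway reports Programmed=True.
# Assumes the controller sets the standard Gateway API "Programmed" condition.
# Optional first argument: timeout (defaults to 300s).
wait_gateway_programmed() {
  kubectl wait --for=condition=Programmed gateway/agentgateway-proxy \
    -n agentgateway-system \
    --timeout="${1:-300s}"
}
```

Run wait_gateway_programmed (or, for a shorter timeout, wait_gateway_programmed 60s) after applying the Gateway. If it times out, kubectl describe gateway agentgateway-proxy -n agentgateway-system shows the reported conditions.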

Step 4: Deploy Demo LLM

Deploy a lightweight OpenAI-compatible simulator that serves a model named base-model together with the LoRA adapters that the Semantic Router demo configuration selects:

kubectl apply -f- <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
      - name: vllm-sim
        image: ghcr.io/llm-d/llm-d-inference-sim:v0.5.0
        imagePullPolicy: IfNotPresent
        args:
        - --model
        - base-model
        - --port
        - "8000"
        - --max-loras
        - "6"
        - --lora-modules
        - '{"name": "math-expert"}'
        - '{"name": "science-expert"}'
        - '{"name": "social-expert"}'
        - '{"name": "humanities-expert"}'
        - '{"name": "law-expert"}'
        - '{"name": "general-expert"}'
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health
            port: http
          periodSeconds: 5
          timeoutSeconds: 5
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b-instruct
  namespace: default
  labels:
    app: vllm-llama3-8b-instruct
spec:
  type: ClusterIP
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
  selector:
    app: vllm-llama3-8b-instruct
EOF

kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct \
  -n default \
  --timeout=300s
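
Before adding the router, it can help to confirm which model names the simulator actually advertises. The following is a minimal sketch, assuming the simulator exposes the OpenAI-style /v1/models endpoint and returns the usual list of objects with an id field. The helper name model_ids is hypothetical, and the text matching is deliberately simple; it is not a general JSON parser.

```shell
# Hypothetical helper: read an OpenAI-style /v1/models JSON body on stdin
# and print one model id per line. Uses plain text matching, so it assumes
# ids contain no escaped quotes.
model_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*"\([^"]*\)"$/\1/'
}
```

With kubectl port-forward -n default svc/vllm-llama3-8b-instruct 8000:8000 running in another terminal, curl -s http://localhost:8000/v1/models | model_ids prints the advertised names. If the simulator lists LoRA adapters the way vLLM does, base-model and the six adapter names from the Deployment args should appear.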

Step 5: Deploy vLLM Semantic Router

Install the Semantic Router in the agentgateway-system namespace so the AgentGateway ExtProc policy can reference the semantic-router service directly:

helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
  --version v0.0.0-latest \
  --namespace agentgateway-system \
  -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/agentgateway/semantic-router-values/values.yaml

kubectl wait --for=condition=Available deployment/semantic-router \
  -n agentgateway-system \
  --timeout=600s

The values file configures Semantic Router to send traffic to vllm-llama3-8b-instruct.default.svc.cluster.local:8000 and to select adapter names such as math-expert, science-expert, and general-expert.

Step 6: Create AgentGateway Routing Resources

Create an AgentgatewayBackend for the vLLM-compatible backend and route OpenAI-compatible requests to it:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai: {}
      host: vllm-llama3-8b-instruct.default.svc.cluster.local
      port: 8000
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: semantic-router-vllm
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - backendRefs:
    - name: semantic-router-vllm
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF

The openai.model field is intentionally omitted so AgentGateway uses the model name from the request body after Semantic Router selects the target model or LoRA adapter.
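
A route that references a custom backend kind can be created without error and still fail to attach. The sketch below reads the standard Gateway API route conditions from the HTTPRoute status. The helper name route_condition is hypothetical; Accepted and ResolvedRefs are the standard condition types, and this assumes the AgentGateway controller populates them.

```shell
# Hypothetical helper: print the status ("True"/"False") of one HTTPRoute
# condition for each parent Gateway. Argument: condition type, for example
# Accepted or ResolvedRefs (defaults to Accepted).
route_condition() {
  kubectl get httproute semantic-router-vllm -n agentgateway-system \
    -o "jsonpath={.status.parents[*].conditions[?(@.type==\"${1:-Accepted}\")].status}"
}
```

Calling route_condition Accepted and route_condition ResolvedRefs should each print True once the route is attached and the AgentgatewayBackend reference resolves. Empty output usually means the controller has not written any status for the route yet.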

Step 7: Attach Semantic Router as ExtProc

Create an AgentgatewayPolicy that sends request and response processing phases to the Semantic Router ExtProc service:

kubectl apply -f- <<'EOF'
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: semantic-router-extproc
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    extProc:
      backendRef:
        name: semantic-router
        namespace: agentgateway-system
        port: 50051
      processingOptions:
        requestHeaderMode: Send
        requestBodyMode: Buffered
        responseHeaderMode: Send
        responseBodyMode: Buffered
        allowModeOverride: true
EOF

The demo uses buffered request bodies to match the existing Envoy AI Gateway and Istio examples. For large prompts, use requestBodyMode: FullDuplexStreamed together with a Semantic Router configuration that enables streamed body handling. AgentGateway does not support Streamed mode; FullDuplexStreamed is the only streaming option.
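
As a sketch of the AgentGateway side of that change, the helper below patches only requestBodyMode on the policy created above. The function name set_request_body_mode is hypothetical. The matching Semantic Router setting is not shown because it depends on the router configuration, and whether the response body mode should change as well is not covered here.

```shell
# Hypothetical helper: change only the request body mode of the existing
# AgentgatewayPolicy with a JSON merge patch. Argument: the mode name,
# for example FullDuplexStreamed or Buffered.
# The Semantic Router side must separately enable streamed body handling.
set_request_body_mode() {
  kubectl patch agentgatewaypolicy semantic-router-extproc \
    -n agentgateway-system --type merge \
    -p "{\"spec\":{\"traffic\":{\"extProc\":{\"processingOptions\":{\"requestBodyMode\":\"$1\"}}}}}"
}
```

Run set_request_body_mode FullDuplexStreamed to switch, and set_request_body_mode Buffered to return to the demo default.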

Testing the Deployment

Start a port-forward to the AgentGateway proxy:

kubectl port-forward -n agentgateway-system svc/agentgateway-proxy 8080:80

In another terminal, send an OpenAI-compatible request with "model": "auto":

curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      {"role": "user", "content": "What is the derivative of f(x) = x^3?"}
    ],
    "max_tokens": 64,
    "temperature": 0
  }'

Semantic Router should classify the math prompt, select the configured math route, and rewrite the request's model field before AgentGateway forwards the request to the vLLM-compatible backend. The -i flag makes curl print the response headers, so you can inspect the headers that Semantic Router adds, such as the selected model metadata.
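
To check the routing decision from the response body instead of the headers, the following sketch pulls the model field out of the JSON. It assumes the backend echoes the name of the model it served in the response's model field; the helper name response_model is hypothetical, and the text matching is a convenience, not a JSON parser.

```shell
# Hypothetical helper: read a chat-completion JSON body on stdin and print
# the value of its first "model" field. Assumes the value has no escaped quotes.
response_model() {
  grep -o '"model"[[:space:]]*:[[:space:]]*"[^"]*"' | head -n 1 |
    sed 's/.*"\([^"]*\)"$/\1/'
}
```

Pipe the same request through it without -i, for example curl -s -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "auto", "messages": [{"role": "user", "content": "What is the derivative of f(x) = x^3?"}]}' | response_model. If the mutation took effect, the printed name should be one of the configured adapters and not auto.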

Troubleshooting

AgentGateway proxy not ready:

kubectl get gateway agentgateway-proxy -n agentgateway-system
kubectl get deployment agentgateway-proxy -n agentgateway-system
kubectl logs -n agentgateway-system deployment/agentgateway

HTTPRoute or AgentGateway backend not accepted:

kubectl describe httproute semantic-router-vllm -n agentgateway-system
kubectl describe agentgatewaybackend semantic-router-vllm -n agentgateway-system

Semantic Router not responding to ExtProc:

kubectl get pods -n agentgateway-system
kubectl get svc semantic-router -n agentgateway-system
kubectl logs -n agentgateway-system deployment/semantic-router
kubectl describe agentgatewaypolicy semantic-router-extproc -n agentgateway-system
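
One quick thing to rule out is a port mismatch between the policy's backendRef and the Service. The sketch below checks that the semantic-router Service exposes port 50051, the port used in the AgentgatewayPolicy above. The helper name extproc_port_exposed is hypothetical.

```shell
# Hypothetical helper: succeed only if the semantic-router Service exposes
# the port that the AgentgatewayPolicy backendRef points at.
# Optional first argument: port number (defaults to 50051).
extproc_port_exposed() {
  ports="$(kubectl get svc semantic-router -n agentgateway-system \
    -o jsonpath='{.spec.ports[*].port}')"
  case " ${ports} " in
    *" ${1:-50051} "*) return 0 ;;
    *) echo "port ${1:-50051} not found; service ports: ${ports}" >&2
       return 1 ;;
  esac
}
```

Running extproc_port_exposed && echo ok prints ok when the port is present; otherwise the helper lists the ports the Service does expose.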

Demo LLM not responding:

kubectl get pods -n default -l app=vllm-llama3-8b-instruct
kubectl logs -n default deployment/vllm-llama3-8b-instruct

Cleanup

To remove the entire deployment:

kubectl delete agentgatewaypolicy semantic-router-extproc -n agentgateway-system
kubectl delete httproute semantic-router-vllm -n agentgateway-system
kubectl delete agentgatewaybackend semantic-router-vllm -n agentgateway-system
kubectl delete gateway agentgateway-proxy -n agentgateway-system
kubectl delete deployment vllm-llama3-8b-instruct -n default
kubectl delete service vllm-llama3-8b-instruct -n default

helm uninstall semantic-router -n agentgateway-system
helm uninstall agentgateway -n agentgateway-system
helm uninstall agentgateway-crds -n agentgateway-system

kind delete cluster --name semantic-router-agentgateway
