
Install with NVIDIA Dynamo

This guide provides step-by-step instructions for integrating vLLM Semantic Router with NVIDIA Dynamo.

About NVIDIA Dynamo​

NVIDIA Dynamo is a high-performance distributed inference platform designed for large language model serving. Dynamo provides advanced features for optimizing GPU utilization and reducing inference latency through intelligent routing and caching mechanisms.

Key Features​

  • Disaggregated Serving: Separate Prefill and Decode workers for optimal GPU utilization
  • KV-Aware Routing: Routes requests to workers with relevant KV cache for prefix cache optimization
  • Dynamic Scaling: Planner component handles auto-scaling based on workload
  • Multi-Tier KV Cache: GPU HBM → System Memory → NVMe for efficient cache management
  • Worker Coordination: etcd and NATS for distributed worker registration and message queuing
  • Backend Agnostic: Supports vLLM, SGLang, and TensorRT-LLM backends

Integration Benefits​

Integrating vLLM Semantic Router with NVIDIA Dynamo provides several advantages:

  1. Dual-Layer Intelligence: Semantic Router provides request-level intelligence (model selection, classification) while Dynamo optimizes infrastructure-level efficiency (worker selection, KV cache reuse)

  2. Intelligent Model Selection: Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while Dynamo's KV-aware router efficiently selects optimal workers

  3. Dual-Layer Caching: Semantic cache (request-level, Milvus-backed) combined with KV cache (token-level, Dynamo-managed) for maximum latency reduction

  4. Enhanced Security: PII detection and jailbreak prevention filter requests before reaching inference workers

  5. Disaggregated Architecture: Separate prefill and decode workers with KV-aware routing for reduced latency and better throughput

Architecture​

┌────────────────────────────────────────────────────────────────┐
│                             CLIENT                             │
│  curl -X POST http://localhost:8080/v1/chat/completions        │
│       -d '{"model": "MoM", "messages": [...]}'                 │
└────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                         ENVOY GATEWAY                          │
│  • Routes traffic, applies ExtProc filter                      │
└────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│                SEMANTIC ROUTER (ExtProc Filter)                │
│  • Classifies query → selects category (e.g., "math")          │
│  • Selects model → rewrites request                            │
│  • Injects domain-specific system prompt                       │
│  • PII/Jailbreak detection                                     │
└────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌────────────────────────────────────────────────────────────────┐
│               DYNAMO FRONTEND (KV-Aware Routing)               │
│  • Receives enriched request with selected model               │
│  • Routes to optimal worker based on KV cache state            │
│  • Coordinates workers via etcd/NATS                           │
└────────────────────────────────────────────────────────────────┘
              │                                  │
              ▼                                  ▼
┌───────────────────────────┐      ┌───────────────────────────┐
│  PREFILL WORKER (GPU 1)   │      │  DECODE WORKER (GPU 2)    │
│  Processes input tokens   │─────▶│  Generates output tokens  │
│  --is-prefill-worker      │      │                           │
└───────────────────────────┘      └───────────────────────────┘

Prerequisites​

GPU Requirements​

This deployment requires a machine with at least 3 GPUs:

Component            GPU     Description
Frontend             GPU 0   Dynamo Frontend with KV-aware routing (--router-mode kv)
VLLMPrefillWorker    GPU 1   Handles prefill phase of inference (--is-prefill-worker)
VLLMDecodeWorker     GPU 2   Handles decode phase of inference

Required Tools​

Before starting, ensure you have the following tools installed:

  • kind - Kubernetes in Docker
  • kubectl - Kubernetes CLI
  • Helm - Package manager for Kubernetes

NVIDIA Runtime Configuration (One-Time Setup)​

Configure Docker to use the NVIDIA runtime as the default:

# Configure NVIDIA runtime as default
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

# Restart Docker
sudo systemctl restart docker

# Verify configuration
docker info | grep -i "default runtime"
# Expected output: Default Runtime: nvidia
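
Optionally, confirm the runtime change end to end by running nvidia-smi inside a container. This is a minimal smoke-test sketch; the CUDA image tag is only an example and any recent nvidia/cuda base image should work:

# Optional smoke test: run nvidia-smi through Docker (image tag is an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi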

Step 1: Create Kind Cluster with GPU Support​

Create a local Kubernetes cluster with GPU support. Choose one of the following options:

Option 1: Quick Setup (External Documentation)​

For a quick setup, follow the official Kind GPU documentation:

kind create cluster --name semantic-router-dynamo

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

For GPU support, see the Kind GPU documentation for details on configuring extra mounts and deploying the NVIDIA device plugin.
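
Alternatively, the device plugin can be installed from NVIDIA's Helm chart instead of a hand-written manifest. A minimal sketch, assuming the chart defaults suit your nodes (on a Kind node you may still need the driver-library setup shown in Option 2 below before the plugin can discover GPUs):

# Alternative: install the NVIDIA device plugin via Helm (sketch; chart defaults assumed)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system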

Option 2: Full GPU Setup (E2E Procedure)​

This is the procedure used in our E2E tests. It includes all the steps needed to set up GPU support in Kind.

2.1 Create Kind Cluster with GPU Configuration​

Create a Kind config file with GPU mount support:

# Create Kind config for GPU support
cat > kind-gpu-config.yaml << 'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: semantic-router-dynamo
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /mnt
    containerPath: /mnt
- role: worker
  extraMounts:
  - hostPath: /mnt
    containerPath: /mnt
  - hostPath: /dev/null
    containerPath: /var/run/nvidia-container-devices/all
EOF

# Create cluster with GPU config
kind create cluster --name semantic-router-dynamo --config kind-gpu-config.yaml --wait 5m

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s

2.2 Set Up NVIDIA Libraries in Kind Worker​

Copy NVIDIA libraries from the host to the Kind worker node:

# Set worker name
WORKER_NAME="semantic-router-dynamo-worker"

# Detect NVIDIA driver version
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
echo "Detected NVIDIA driver version: $DRIVER_VERSION"

# Verify GPU devices exist in the Kind worker
docker exec $WORKER_NAME ls /dev/nvidia0
echo "✅ GPU devices found in Kind worker"

# Create directory for NVIDIA libraries
docker exec $WORKER_NAME mkdir -p /nvidia-driver-libs

# Copy nvidia-smi binary
tar -cf - -C /usr/bin nvidia-smi | docker exec -i $WORKER_NAME tar -xf - -C /nvidia-driver-libs/

# Copy NVIDIA libraries from host
tar -cf - -C /usr/lib64 libnvidia-ml.so.$DRIVER_VERSION libcuda.so.$DRIVER_VERSION | \
docker exec -i $WORKER_NAME tar -xf - -C /nvidia-driver-libs/

# Create symlinks
docker exec $WORKER_NAME bash -c "cd /nvidia-driver-libs && \
ln -sf libnvidia-ml.so.$DRIVER_VERSION libnvidia-ml.so.1 && \
ln -sf libcuda.so.$DRIVER_VERSION libcuda.so.1 && \
chmod +x nvidia-smi"

# Verify nvidia-smi works inside the Kind worker
docker exec $WORKER_NAME bash -c "LD_LIBRARY_PATH=/nvidia-driver-libs /nvidia-driver-libs/nvidia-smi"
echo "✅ nvidia-smi verified in Kind worker"

2.3 Deploy NVIDIA Device Plugin​

Deploy the NVIDIA device plugin to make GPUs allocatable in Kubernetes:

# Create device plugin manifest
cat > nvidia-device-plugin.yaml << 'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        env:
        - name: LD_LIBRARY_PATH
          value: "/nvidia-driver-libs"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
        - name: nvidia-driver-libs
          mountPath: /nvidia-driver-libs
          readOnly: true
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
      - name: nvidia-driver-libs
        hostPath:
          path: /nvidia-driver-libs
EOF

# Apply device plugin
kubectl apply -f nvidia-device-plugin.yaml

# Wait for device plugin to be ready
sleep 20

# Verify GPUs are allocatable
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
echo "✅ GPU setup complete"
E2E Testing

The Semantic Router project includes E2E tests that handle all of this GPU setup automatically. You can run:

make e2e-test E2E_PROFILE=dynamo E2E_VERBOSE=true

This will create a Kind cluster with GPU support, deploy all components, and run the test suite.

Step 2: Install Dynamo Platform​

Deploy the Dynamo platform components (etcd, NATS, Dynamo Operator):

# Add the Dynamo Helm repository
helm repo add dynamo https://nvidia.github.io/dynamo
helm repo update

# Install Dynamo CRDs
helm install dynamo-crds dynamo/dynamo-crds \
--namespace dynamo-system \
--create-namespace

# Install Dynamo Platform (etcd, NATS, Operator)
helm install dynamo-platform dynamo/dynamo-platform \
--namespace dynamo-system \
--wait

# Wait for platform components to be ready
kubectl wait --for=condition=Available deployment -l app.kubernetes.io/instance=dynamo-platform -n dynamo-system --timeout=300s
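
Before moving on, it is worth a quick sanity check that the platform pods (etcd, NATS, operator) are running and the Dynamo CRDs are registered. Exact pod names vary by chart version, so treat this as a rough check:

# Sanity check: platform pods and CRDs (pod names vary by chart version)
kubectl get pods -n dynamo-system
kubectl get crds | grep -i dynamo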

Step 3: Install Envoy Gateway​

Deploy Envoy Gateway with ExtensionAPIs enabled for Semantic Router integration:

# Install Envoy Gateway with custom values
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
--version v1.3.0 \
--namespace envoy-gateway-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/envoy-gateway-values.yaml

# Wait for Envoy Gateway to be ready
kubectl wait --for=condition=Available deployment/envoy-gateway -n envoy-gateway-system --timeout=300s

Important: The values file enables extensionApis.enableEnvoyPatchPolicy: true, which is required for the Semantic Router ExtProc integration.
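
If you are unsure whether the flag was applied, inspect the values recorded for the Helm release. The grep below is just a convenience; the exact nesting of the key depends on the chart version:

# Confirm the ExtensionAPIs setting made it into the release values
helm get values envoy-gateway -n envoy-gateway-system -a | grep -i -A 3 "extensionApis"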

Step 4: Deploy vLLM Semantic Router​

Deploy the Semantic Router with Dynamo-specific configuration:

# Install Semantic Router from GHCR OCI registry
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/semantic-router-values/values.yaml

# Wait for deployment to be ready
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system

Note: The values file configures Semantic Router to route to the TinyLlama model served by Dynamo workers.
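
To see what the chart actually rendered, you can list the ConfigMaps in the namespace and search the release values for the model reference. Names vary by chart version, so treat this as a rough check:

# Inspect the rendered configuration (ConfigMap names vary by chart version)
kubectl get configmap -n vllm-semantic-router-system
helm get values semantic-router -n vllm-semantic-router-system -a | grep -i -B 2 -A 2 "TinyLlama"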

Step 5: Deploy RBAC Resources​

Apply RBAC permissions for Semantic Router to access Dynamo CRDs:

kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/rbac.yaml
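
A quick way to confirm the RBAC objects exist is to query the same manifest you just applied:

# Verify the RBAC resources were created
kubectl get -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/rbac.yaml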

Step 6: Deploy DynamoGraphDeployment​

Deploy the Dynamo workers using the DynamoGraphDeployment CRD:

kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/dynamo-graph-deployment.yaml

# Wait for the Dynamo operator to create the deployments
kubectl wait --for=condition=Available deployment/vllm-frontend -n dynamo-system --timeout=600s
kubectl wait --for=condition=Available deployment/vllm-vllmprefillworker -n dynamo-system --timeout=600s
kubectl wait --for=condition=Available deployment/vllm-vllmdecodeworker -n dynamo-system --timeout=600s

The DynamoGraphDeployment creates:

  • Frontend: HTTP API server with KV-aware routing
  • VLLMPrefillWorker: Specialized worker for prefill phase
  • VLLMDecodeWorker: Specialized worker for decode phase
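
You can also confirm that the operator reconciled the graph and that all three workloads are running. The resource name below assumes the plural form dynamographdeployments; adjust it if your installed CRD differs:

# Check the custom resource and the pods the operator created
kubectl get dynamographdeployments -n dynamo-system
kubectl get pods -n dynamo-system -o wide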

Step 7: Create Gateway API Resources​

Deploy the Gateway API resources to connect everything:

kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/gwapi-resources.yaml

# Verify EnvoyPatchPolicy is accepted
kubectl get envoypatchpolicy -n default

Important: The EnvoyPatchPolicy status must show Accepted: True. If it shows Accepted: False, verify that Envoy Gateway was installed with the correct values file.

Testing the Deployment​

Set up port forwarding to access the gateway locally:

# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=semantic-router \
-o jsonpath='{.items[0].metadata.name}')

kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80

Send Test Requests​

Test the inference endpoint with a math query:

curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ]
  }'

Expected Response​

HTTP/1.1 200 OK
server: fasthttp
date: Thu, 06 Nov 2025 06:38:08 GMT
content-type: application/json
x-vsr-selected-category: math
x-vsr-selected-reasoning: on
x-vsr-selected-model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
x-vsr-injected-system-prompt: true
transfer-encoding: chunked

{"id":"chatcmpl-...","model":"TinyLlama/TinyLlama-1.1B-Chat-v1.0","choices":[{"message":{"role":"assistant","content":"..."}}],"usage":{"prompt_tokens":15,"completion_tokens":54,"total_tokens":69}}

Success indicators:

  • ✅ Request sent with model="MoM"
  • ✅ Response shows model="TinyLlama/TinyLlama-1.1B-Chat-v1.0" (rewritten by Semantic Router)
  • ✅ Headers show category classification and system prompt injection

Check Semantic Router Logs​

# View classification and routing decisions
kubectl logs -n vllm-semantic-router-system deployment/semantic-router -f | grep -E "category|routing_decision"

Expected output:

Classified as category: math (confidence=0.933)
Selected model TinyLlama/TinyLlama-1.1B-Chat-v1.0 for category math with score 1.0000
routing_decision: original_model="MoM", selected_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"
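
To confirm the rewritten request actually reached Dynamo, you can also tail the frontend logs (the deployment name matches the one waited on in Step 6):

# Follow the Dynamo frontend logs while sending a test request
kubectl logs -n dynamo-system deployment/vllm-frontend -f --tail=50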

Verify EnvoyPatchPolicy Status​

kubectl get envoypatchpolicy -n default -o yaml | grep -A 5 "status:"

Expected status:

status:
  conditions:
  - type: Accepted
    status: "True"
  - type: Programmed
    status: "True"

Cleanup​

To remove the entire deployment:

# Remove Gateway API resources
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/gwapi-resources.yaml

# Remove DynamoGraphDeployment
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/dynamo-graph-deployment.yaml

# Remove RBAC
kubectl delete -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/rbac.yaml

# Remove Semantic Router
helm uninstall semantic-router -n vllm-semantic-router-system

# Remove Envoy Gateway
helm uninstall envoy-gateway -n envoy-gateway-system

# Remove Dynamo Platform
helm uninstall dynamo-platform -n dynamo-system
helm uninstall dynamo-crds -n dynamo-system

# Delete namespaces
kubectl delete namespace vllm-semantic-router-system
kubectl delete namespace envoy-gateway-system
kubectl delete namespace dynamo-system

# Delete Kind cluster (optional)
kind delete cluster --name semantic-router-dynamo

Production Configuration​

For production deployments with larger models, modify the DynamoGraphDeployment:

# Example: Using Llama-3-8B instead of TinyLlama
args:
- "python3 -m dynamo.vllm --model meta-llama/Llama-3-8b-hf --tensor-parallel-size 2 --enforce-eager"
resources:
requests:
nvidia.com/gpu: 2 # Increase for tensor parallelism

Considerations for production:

  • Use larger models appropriate for your use case
  • Configure tensor parallelism for multi-GPU inference
  • Enable distributed KV cache for multi-node deployments
  • Set up monitoring and observability
  • Configure autoscaling based on GPU utilization
