
Kubernetes Operator

The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.

Features

  • 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
  • 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
  • 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
  • 🔍 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
  • 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
  • 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
  • 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation

Prerequisites

  • Kubernetes 1.24+ or OpenShift 4.12+
  • kubectl or oc CLI configured
  • Cluster admin access (for CRD installation)
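The version floor can be checked mechanically. A minimal POSIX-shell sketch (the `version_ok` helper is illustrative, not part of the operator tooling) that gates on a MAJOR.MINOR string:

```shell
# Gate on the documented Kubernetes version floor (1.24+).
# version_ok is an illustrative helper, not part of the operator.
version_ok() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 24 ]; }
}

# On a real cluster, feed it the server version, e.g.:
#   version_ok "$(kubectl version -o json | jq -r '.serverVersion.major + "." + .serverVersion.minor')"
# (some providers report minor as "24+"; strip the suffix first)
version_ok "1.24" && echo "1.24: supported"
version_ok "1.23" || echo "1.23: too old"
```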

Installation

Option 1: Using Kustomize (Standard Kubernetes)

# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest

Verify the operator is running:

kubectl get pods -n semantic-router-operator-system

Option 2: Using OLM (OpenShift)

For OpenShift deployments using Operator Lifecycle Manager:

cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy

Deploy Your First Router

Create a my-router.yaml file:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2

  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest

  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"

  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"

  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600

    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2

    prompt_guard:
      enabled: true
      threshold: 0.7

  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]

Apply the configuration:

kubectl apply -f my-router.yaml

Verify Deployment

# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router

Expected output:

NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m

Architecture

The operator manages a complete stack of resources for each SemanticRouter:

┌─────────────────────────────────────────────────────┐
│                 SemanticRouter CR                   │
│           apiVersion: vllm.ai/v1alpha1              │
│             kind: SemanticRouter                    │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │ Operator Controller  │
                │  - Watches CR        │
                │  - Reconciles state  │
                │  - Platform detection│
                └──────────┬───────────┘
                           │
      ┌──────────────┬─────┴────────┬──────────────┐
      ▼              ▼              ▼              ▼
┌──────────┐   ┌──────────┐  ┌───────────┐  ┌──────────┐
│Deployment│   │ Service  │  │ ConfigMap │  │   PVC    │
│          │   │ - gRPC   │  │ - config  │  │ - models │
│          │   │ - API    │  │ - tools   │  │          │
│          │   │ - metrics│  │           │  │          │
└──────────┘   └──────────┘  └───────────┘  └──────────┘

Managed Resources:

  • Deployment: Runs semantic router pods with configurable replicas
  • Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
  • ConfigMap: Contains semantic router configuration and tools database
  • ServiceAccount: For RBAC (optional, created when specified)
  • PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
  • HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
  • Ingress: For external access (optional, when ingress enabled)

Platform Detection and Security

The operator automatically detects the platform and configures security contexts appropriately.

OpenShift Platform

When running on OpenShift, the operator:

  • Detects: Checks for route.openshift.io API resources
  • Security Context: Does NOT set runAsUser, runAsGroup, or fsGroup
  • Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
  • Compatible with: restricted SCC (default) and custom SCCs

Standard Kubernetes

When running on standard Kubernetes, the operator:

  • Security Context: Sets runAsUser: 1000, fsGroup: 1000, runAsNonRoot: true
  • Rationale: Provides secure defaults for pod security policies/standards

Both Platforms

Regardless of platform:

  • Drops ALL capabilities (drop: [ALL])
  • Prevents privilege escalation (allowPrivilegeEscalation: false)
  • No special permissions or SCCs required beyond defaults
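Taken together, on standard Kubernetes the rendered pod spec is roughly equivalent to the following (an illustrative sketch assembled from the defaults above; the container name is an assumption):

```yaml
# Illustrative rendering of the defaults described above (standard Kubernetes).
securityContext:            # pod-level
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
containers:
  - name: semantic-router   # name is an assumption for illustration
    securityContext:        # container-level, applied on both platforms
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
```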

Override Security Context

You can override automatic security contexts in your CR:

spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000

OpenShift Note

When running on OpenShift, it's recommended to omit runAsUser and fsGroup and let SCCs handle UID/GID assignment automatically.

Configuration Reference

Image Configuration

spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: "" # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret

Service Configuration

spec:
  service:
    type: ClusterIP # or NodePort, LoadBalancer

    grpc:
      port: 50051
      targetPort: 50051

    api:
      port: 8080
      targetPort: 8080

    metrics:
      enabled: true
      port: 9190
      targetPort: 9190

Persistence Configuration

spec:
  persistence:
    enabled: true
    storageClassName: "standard" # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi

    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"

    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"

Storage Class Examples:

  • AWS EKS: gp3-csi, gp2
  • GKE: standard, premium-rwo
  • Azure AKS: managed, managed-premium
  • OpenShift: gp3-csi, thin, ocs-storagecluster-ceph-rbd
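For example, on AWS EKS the persistence stanza above might point at gp3-csi (illustrative; substitute whatever `kubectl get storageclass` reports on your cluster):

```yaml
spec:
  persistence:
    enabled: true
    storageClassName: "gp3-csi" # from the EKS list above
    accessMode: ReadWriteOnce
    size: 10Gi
```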

Semantic Router Configuration

Full semantic router configuration is embedded in the CR. See the complete example in deploy/operator/config/samples/vllm_v1alpha1_semanticrouter.yaml.

Key configuration sections:

spec:
  config:
    # BERT model for embeddings
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory" # or "milvus"
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"

Tools Database

Define available tools for auto-selection:

spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]

    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]

Autoscaling (HPA)

spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80

Ingress Configuration

spec:
  ingress:
    enabled: true
    className: "nginx" # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com

Production Deployment

High Availability Setup

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3

  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3

Pod Disruption Budget

Create a PDB to ensure availability during updates:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router

Resource Allocation Guidelines

Workload Type   Memory Request   CPU Request   Memory Limit   CPU Limit
Development     1Gi              500m          2Gi            1
Staging         3Gi              1             7Gi            2
Production      5Gi              2             10Gi           4

Monitoring and Observability

Metrics

Prometheus metrics are exposed on port 9190:

# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics

Key Metrics:

  • semantic_router_request_duration_seconds - Request latency
  • semantic_router_cache_hit_total - Cache hit rate
  • semantic_router_classification_duration_seconds - Classification latency
  • semantic_router_tokens_total - Token usage
  • semantic_router_reasoning_requests_total - Reasoning mode usage
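Assuming the metrics above follow the usual Prometheus conventions (counters ending in _total, histograms exposing _bucket and _count series), typical queries look like the following sketches; adjust names and labels to what /metrics actually exposes:

```
# Approximate cache hit ratio over 5m (series names are assumptions)
sum(rate(semantic_router_cache_hit_total[5m]))
  / sum(rate(semantic_router_request_duration_seconds_count[5m]))

# p95 request latency over 5m
histogram_quantile(0.95,
  sum by (le) (rate(semantic_router_request_duration_seconds_bucket[5m])))
```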

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Distributed Tracing

Enable OpenTelemetry tracing:

spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0

Troubleshooting

Common Issues

Pod stuck in ImagePullBackOff

# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=<username> \
--docker-password=<personal-access-token>

# Add to SemanticRouter CR
spec:
  imagePullSecrets:
    - name: ghcr-secret

PVC stuck in Pending

# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models

# Update storage class in CR
spec:
  persistence:
    storageClassName: "your-available-storage-class"

Models not downloading

# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
--from-literal=token=hf_xxxxxxxxxxxxx

# Add to SemanticRouter CR
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token

Operator not detecting platform correctly

# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"

Development

Local Development

cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run

Testing with kind

# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml

Next Steps