# Kubernetes Operator
The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.
## Features

- Declarative Deployment: Define semantic router instances using Kubernetes CRDs
- Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
- Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
- Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
- Built-in Observability: Metrics, tracing, and monitoring support out of the box
- Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
- Secure by Default: Drops all capabilities, prevents privilege escalation
## Prerequisites

- Kubernetes 1.24+ or OpenShift 4.12+
- `kubectl` or `oc` CLI configured
- Cluster admin access (for CRD installation)
## Installation

### Option 1: Using Kustomize (Standard Kubernetes)

```bash
# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
```

Verify the operator is running:

```bash
kubectl get pods -n semantic-router-operator-system
```
### Option 2: Using OLM (OpenShift)

For OpenShift deployments using Operator Lifecycle Manager:

```bash
cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy
```
## Deploy Your First Router

Create a `my-router.yaml` file:

```yaml
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true
    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
    prompt_guard:
      enabled: true
      threshold: 0.7
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]
```

Apply the configuration:

```bash
kubectl apply -f my-router.yaml
```
### Verify Deployment

```bash
# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router
```

Expected output:

```
NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m
```
## Architecture

The operator manages a complete stack of resources for each SemanticRouter:

```
┌─────────────────────────────────────┐
│          SemanticRouter CR          │
│   apiVersion: vllm.ai/v1alpha1      │
│   kind: SemanticRouter              │
└──────────────────┬──────────────────┘
                   │
                   ▼
        ┌──────────────────────┐
        │  Operator Controller │
        │  - Watches CR        │
        │  - Reconciles state  │
        │  - Platform detection│
        └──────────┬───────────┘
                   │
     ┌─────────────┼───────────┬────────────┐
     ▼             ▼           ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Deployment│ │ Service  │ │ConfigMap │ │   PVC    │
│          │ │ - gRPC   │ │ - config │ │ - models │
│          │ │ - API    │ │ - tools  │ │          │
│          │ │ - metrics│ │          │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
```
Managed Resources:
- Deployment: Runs semantic router pods with configurable replicas
- Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
- ConfigMap: Contains semantic router configuration and tools database
- ServiceAccount: For RBAC (optional, created when specified)
- PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
- HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
- Ingress: For external access (optional, when ingress enabled)
## Platform Detection and Security

The operator automatically detects the platform and configures security contexts appropriately.

### OpenShift Platform

When running on OpenShift, the operator:

- Detects: Checks for `route.openshift.io` API resources
- Security Context: Does NOT set `runAsUser`, `runAsGroup`, or `fsGroup`
- Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
- Compatible with: `restricted` SCC (default) and custom SCCs
### Standard Kubernetes

When running on standard Kubernetes, the operator:

- Security Context: Sets `runAsUser: 1000`, `fsGroup: 1000`, `runAsNonRoot: true`
- Rationale: Provides secure defaults for pod security policies/standards
### Both Platforms

Regardless of platform:

- Drops ALL capabilities (`drop: [ALL]`)
- Prevents privilege escalation (`allowPrivilegeEscalation: false`)
- No special permissions or SCCs required beyond defaults
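A sketch of what these defaults can look like in a rendered Pod spec (illustrative only, not the operator's literal output; the UID/GID fields apply to standard Kubernetes and are omitted on OpenShift so SCCs can assign them):

```yaml
# Illustrative sketch of the defaults described above.
securityContext:              # pod-level
  runAsNonRoot: true
  runAsUser: 1000             # standard Kubernetes only
  fsGroup: 1000               # standard Kubernetes only
containers:
  - name: semantic-router
    securityContext:          # container-level
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
```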
### Override Security Context

You can override the automatic security contexts in your CR:

```yaml
spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000
```

When running on OpenShift, it's recommended to omit `runAsUser` and `fsGroup` and let SCCs handle UID/GID assignment automatically.
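For example, an OpenShift-friendly override might keep the hardening fields while leaving UID/GID assignment to the SCC (an illustrative sketch, not a required configuration):

```yaml
# Illustrative: override that omits runAsUser/fsGroup so OpenShift SCCs
# can assign UIDs/GIDs from the namespace's allowed range.
spec:
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
```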
## Configuration Reference

### Image Configuration

```yaml
spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: ""  # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret
```
### Service Configuration

```yaml
spec:
  service:
    type: ClusterIP  # or NodePort, LoadBalancer
    grpc:
      port: 50051
      targetPort: 50051
    api:
      port: 8080
      targetPort: 8080
    metrics:
      enabled: true
      port: 9190
      targetPort: 9190
```
### Persistence Configuration

```yaml
spec:
  persistence:
    enabled: true
    storageClassName: "standard"  # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi
    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"
    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"
```

Storage Class Examples:

- AWS EKS: `gp3-csi`, `gp2`
- GKE: `standard`, `premium-rwo`
- Azure AKS: `managed`, `managed-premium`
- OpenShift: `gp3-csi`, `thin`, `ocs-storagecluster-ceph-rbd`
### Semantic Router Configuration

The full semantic router configuration is embedded in the CR. See the complete example in `deploy/operator/config/samples/vllm_v1alpha1_semanticrouter.yaml`.

Key configuration sections:

```yaml
spec:
  config:
    # BERT model for embeddings
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory"  # or "milvus"
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"
```
### Tools Database

Define available tools for auto-selection:

```yaml
spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]
    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]
```
### Autoscaling (HPA)

```yaml
spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
```

Note that CPU- and memory-based autoscaling requires a metrics source in the cluster (such as metrics-server) and resource requests on the router pods.
### Ingress Configuration

```yaml
spec:
  ingress:
    enabled: true
    className: "nginx"  # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com
```
## Production Deployment

### High Availability Setup

```yaml
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
```
### Pod Disruption Budget

Create a PDB to ensure availability during updates:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router
```
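For a 3-replica deployment, the same guarantee can also be stated positively with `minAvailable` (standard `policy/v1` PDB semantics, not operator-specific):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router
```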
### Resource Allocation Guidelines
| Workload Type | Memory Request | CPU Request | Memory Limit | CPU Limit |
|---|---|---|---|---|
| Development | 1Gi | 500m | 2Gi | 1 |
| Staging | 3Gi | 1 | 7Gi | 2 |
| Production | 5Gi | 2 | 10Gi | 4 |
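As a concrete example, the development tier from the table above expressed as a CR `resources` block:

```yaml
# Development-tier sizing from the guidelines table.
spec:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1"
```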
## Monitoring and Observability

### Metrics

Prometheus metrics are exposed on port 9190:

```bash
# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics
```

Key Metrics:

- `semantic_router_request_duration_seconds` - Request latency
- `semantic_router_cache_hit_total` - Cache hit rate
- `semantic_router_classification_duration_seconds` - Classification latency
- `semantic_router_tokens_total` - Token usage
- `semantic_router_reasoning_requests_total` - Reasoning mode usage
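One way to put these metrics to work is a PrometheusRule alert on tail latency. The sketch below is illustrative: the alert name, threshold, and the assumption that the duration metric is a Prometheus histogram (with `_bucket` series) are not guaranteed by the operator.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: semantic-router-alerts   # illustrative name
spec:
  groups:
    - name: semantic-router
      rules:
        - alert: SemanticRouterHighLatency   # illustrative alert
          # p99 latency over 5m windows; assumes a histogram metric
          expr: histogram_quantile(0.99, sum(rate(semantic_router_request_duration_seconds_bucket[5m])) by (le)) > 2
          for: 10m
          labels:
            severity: warning
```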
### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
### Distributed Tracing

Enable OpenTelemetry tracing:

```yaml
spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0
```
## Troubleshooting

### Common Issues

#### Pod stuck in ImagePullBackOff

```bash
# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<personal-access-token>
```

Then reference the secret in the SemanticRouter CR:

```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
#### PVC stuck in Pending

```bash
# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models
```

Update the storage class in the CR:

```yaml
spec:
  persistence:
    storageClassName: "your-available-storage-class"
```
#### Models not downloading

```bash
# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=token=hf_xxxxxxxxxxxxx
```

Then add the token to the SemanticRouter CR:

```yaml
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token
```
#### Operator not detecting platform correctly

```bash
# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
  deployment/semantic-router-operator-controller-manager \
  | grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"
```
## Development

### Local Development

```bash
cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run
```
### Testing with kind

```bash
# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml
```