
Kubernetes Operator

The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.

Features

  • 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
  • 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
  • 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
  • 🔐 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
  • 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
  • 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
  • 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation

Quick Start

Prerequisites

  • Kubernetes 1.24+ or OpenShift 4.12+
  • kubectl or oc CLI configured
  • Cluster admin access (for CRD installation)

Installation

Option 1: Using Kustomize (Standard Kubernetes)

# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest

Verify the operator is running:

kubectl get pods -n semantic-router-operator-system

Option 2: Using OLM (OpenShift)

For OpenShift deployments using Operator Lifecycle Manager:

cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy

See the OpenShift Quick Start Guide for detailed instructions.

Deploy Your First Router

Create a my-router.yaml file:

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2

  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest

  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"

  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"

  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600

    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2

    prompt_guard:
      enabled: true
      threshold: 0.7

  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]

Apply the configuration:

kubectl apply -f my-router.yaml

Verify Deployment

# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router

Expected output:

NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m

Architecture

The operator manages a complete stack of resources for each SemanticRouter:

┌─────────────────────────────────────────────────────┐
│                 SemanticRouter CR                   │
│            apiVersion: vllm.ai/v1alpha1             │
│            kind: SemanticRouter                     │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
               ┌──────────────────────┐
               │  Operator Controller │
               │  - Watches CR        │
               │  - Reconciles state  │
               │  - Platform detection│
               └───────────┬──────────┘
                           │
      ┌─────────────┬──────┴──────┬─────────────┐
      ▼             ▼             ▼             ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Deployment│ │ Service  │ │ConfigMap │ │   PVC    │
│          │ │ - gRPC   │ │ - config │ │ - models │
│          │ │ - API    │ │ - tools  │ │          │
│          │ │ - metrics│ │          │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘

Managed Resources:

  • Deployment: Runs semantic router pods with configurable replicas
  • Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
  • ConfigMap: Contains semantic router configuration and tools database
  • ServiceAccount: For RBAC (optional, created when specified)
  • PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
  • HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
  • Ingress: For external access (optional, when ingress enabled)
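
To see everything the operator created for one instance, the per-kind queries from the verification step above can be combined into a single command (optional kinds only appear when enabled). The instance label is the one shown earlier in this guide:

# List all resource kinds the operator may manage for an instance
kubectl get deployment,service,configmap,serviceaccount,pvc,hpa,ingress \
  -l app.kubernetes.io/instance=my-router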

Platform Detection and Security

The operator automatically detects the platform and configures security contexts appropriately.

OpenShift Platform

When running on OpenShift, the operator:

  • Detects: Checks for route.openshift.io API resources
  • Security Context: Does NOT set runAsUser, runAsGroup, or fsGroup
  • Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
  • Compatible with: restricted SCC (default) and custom SCCs
  • Log Message: "Detected OpenShift platform - will use OpenShift-compatible security contexts"

Standard Kubernetes

When running on standard Kubernetes, the operator:

  • Security Context: Sets runAsUser: 1000, fsGroup: 1000, runAsNonRoot: true
  • Rationale: Provides secure defaults for pod security policies/standards
  • Log Message: "Detected standard Kubernetes platform - will use standard security contexts"

Both Platforms

Regardless of platform:

  • Drops ALL capabilities (drop: [ALL])
  • Prevents privilege escalation (allowPrivilegeEscalation: false)
  • No special permissions or SCCs required beyond defaults
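
Putting these pieces together, here is an illustrative sketch (not actual operator output) of how the defaults described above would appear in a rendered pod spec on standard Kubernetes; the container name is an assumption:

# Pod-level defaults (standard Kubernetes only)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000
containers:
  - name: semantic-router
    # Container-level defaults (applied on both platforms)
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL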

Override Security Context

You can override automatic security contexts in your CR:

spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL

  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000

OpenShift Note

When running on OpenShift, it's recommended to omit runAsUser and fsGroup and let SCCs handle UID/GID assignment automatically.
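
A minimal sketch of an override that follows this advice, leaving UID/GID assignment to the SCC:

spec:
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  podSecurityContext:
    runAsNonRoot: true
    # runAsUser and fsGroup intentionally omitted; the SCC assigns them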

Configuration Reference

Image Configuration

spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: "" # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret

Service Configuration

spec:
  service:
    type: ClusterIP # or NodePort, LoadBalancer

    grpc:
      port: 50051
      targetPort: 50051

    api:
      port: 8080
      targetPort: 8080

    metrics:
      enabled: true
      port: 9190
      targetPort: 9190

Persistence Configuration

spec:
  persistence:
    enabled: true
    storageClassName: "standard" # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi

    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"

    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"

Storage Class Examples:

  • AWS EKS: gp3-csi, gp2
  • GKE: standard, premium-rwo
  • Azure AKS: managed, managed-premium
  • OpenShift: gp3-csi, thin, ocs-storagecluster-ceph-rbd
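
To check which classes your cluster actually provides (the default class is marked in the output):

kubectl get storageclass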

Semantic Router Configuration

Full semantic router configuration is embedded in the CR. See the complete example in deploy/operator/config/samples/vllm_v1alpha1_semanticrouter.yaml.

Key configuration sections:

spec:
  config:
    # BERT model for embeddings
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory" # or "milvus"
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"

Tools Database

Define available tools for auto-selection:

spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]

    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]

Autoscaling (HPA)

spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
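
CPU and memory utilization targets rely on the Kubernetes resource-metrics API (typically provided by metrics-server). A quick pre-flight check, using the instance label from earlier in this guide as an assumption:

# The HPA needs the resource-metrics API to read CPU/memory utilization
kubectl get apiservices v1beta1.metrics.k8s.io

# Once enabled, inspect the HPA the operator creates for the instance
kubectl get hpa -l app.kubernetes.io/instance=my-router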

Ingress Configuration

spec:
  ingress:
    enabled: true
    className: "nginx" # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com

Production Deployment

High Availability Setup

apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3

  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3

Pod Disruption Budget

Create a PDB to keep a minimum number of pods available during voluntary disruptions such as node drains and rolling upgrades:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router

Resource Allocation Guidelines

Workload Type   Memory Request   CPU Request   Memory Limit   CPU Limit
Development     1Gi              500m          2Gi            1
Staging         3Gi              1             7Gi            2
Production      5Gi              2             10Gi           4
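
For example, the Staging row expressed as CR fields:

spec:
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "7Gi"
      cpu: "2"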

Monitoring and Observability

Metrics

Prometheus metrics are exposed on port 9190:

# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics

Key Metrics:

  • semantic_router_request_duration_seconds - Request latency
  • semantic_router_cache_hit_total - Cache hit rate
  • semantic_router_classification_duration_seconds - Classification latency
  • semantic_router_tokens_total - Token usage
  • semantic_router_reasoning_requests_total - Reasoning mode usage
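
With the port-forward from the previous step still running, a quick check that these series are being exported:

curl -s http://localhost:9190/metrics \
  | grep -E "semantic_router_(request_duration_seconds|cache_hit_total)"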

ServiceMonitor (Prometheus Operator)

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Distributed Tracing

Enable OpenTelemetry tracing:

spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0

Troubleshooting

Common Issues

Pod stuck in ImagePullBackOff

# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
--docker-server=ghcr.io \
--docker-username=<username> \
--docker-password=<personal-access-token>

# Add to SemanticRouter
spec:
  imagePullSecrets:
    - name: ghcr-secret

PVC stuck in Pending

# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models

# Update storage class in CR
spec:
  persistence:
    storageClassName: "your-available-storage-class"

Models not downloading

# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
--from-literal=token=hf_xxxxxxxxxxxxx

# Add to SemanticRouter CR
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token

Operator not detecting platform correctly

# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
deployment/semantic-router-operator-controller-manager \
| grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"

Migration from Helm

If you're currently using Helm to deploy semantic router:

1. Export Current Configuration

# Get current Helm values
helm get values my-router -n semantic-router > current-values.yaml

2. Convert to SemanticRouter CR

Map Helm values to CR format (most fields map directly):

# Helm: replicas
# CR: spec.replicas

# Helm: image.repository + image.tag
# CR: spec.image.repository + spec.image.tag

# Helm: config.bert_model
# CR: spec.config.bert_model
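
As a concrete sketch (hypothetical values), a minimal values file and its CR equivalent:

# --- current-values.yaml (Helm) ---
# replicas: 2
# image:
#   repository: ghcr.io/vllm-project/semantic-router/extproc
#   tag: latest

# --- semantic-router-cr.yaml (equivalent CR) ---
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: semantic-router
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest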

3. Apply CR and Verify

# Apply the SemanticRouter CR
kubectl apply -f semantic-router-cr.yaml

# Wait for resources to be created
kubectl wait --for=condition=Available semanticrouter/my-router --timeout=5m

# Verify
kubectl get semanticrouter,deployment,service

4. Delete Helm Release

Once verified:

helm uninstall my-router -n semantic-router

Benefits of Operator vs Helm:

  • ✅ Better lifecycle management and automatic updates
  • ✅ Platform-aware security contexts (OpenShift/Kubernetes)
  • ✅ Easier configuration updates (just edit CR)
  • ✅ Status conditions and health reporting
  • ✅ Integrated with Kubernetes ecosystem (kubectl, GitOps)

Development and Contributing

Local Development

cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run

Testing with kind

# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml

API Reference

For the complete CRD API reference, see CRD Reference.

Next Steps