# Kubernetes Operator
The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.
## Features

- Declarative Deployment: Define semantic router instances using Kubernetes CRDs
- Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
- Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
- Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
- Built-in Observability: Metrics, tracing, and monitoring support out of the box
- Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
- Secure by Default: Drops all capabilities, prevents privilege escalation
## Prerequisites

- Kubernetes 1.24+ or OpenShift 4.12+
- `kubectl` or `oc` CLI configured
- Cluster admin access (for CRD installation)
## Installation

### Option 1: Using Kustomize (Standard Kubernetes)

```bash
# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator

# Install CRDs
make install

# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
```

Verify the operator is running:

```bash
kubectl get pods -n semantic-router-operator-system
```
### Option 2: Using OLM (OpenShift)

For OpenShift deployments using Operator Lifecycle Manager:

```bash
cd semantic-router/deploy/operator

# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest

# Deploy using OLM
make openshift-deploy
```
## Deploy Your First Router

Create a `my-router.yaml` file:

```yaml
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: my-router
  namespace: default
spec:
  replicas: 2
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
  resources:
    limits:
      memory: "7Gi"
      cpu: "2"
    requests:
      memory: "3Gi"
      cpu: "1"
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: "standard"
  config:
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true
    semantic_cache:
      enabled: true
      backend_type: "memory"
      max_entries: 1000
      ttl_seconds: 3600
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
    prompt_guard:
      enabled: true
      threshold: 0.7
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "get_weather"
          description: "Get weather information for a location"
          parameters:
            type: "object"
            properties:
              location:
                type: "string"
                description: "City and state, e.g. San Francisco, CA"
            required: ["location"]
      description: "Weather information tool"
      category: "weather"
      tags: ["weather", "temperature"]
```

Apply the configuration:

```bash
kubectl apply -f my-router.yaml
```
### Verify Deployment

```bash
# Check the SemanticRouter resource
kubectl get semanticrouter my-router

# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router

# View status
kubectl describe semanticrouter my-router

# View logs
kubectl logs -f deployment/my-router
```

Expected output:

```
NAME                               PHASE     REPLICAS   READY   AGE
semanticrouter.vllm.ai/my-router   Running   2          2       5m
```
## Architecture

The operator manages a complete stack of resources for each SemanticRouter:

```
┌─────────────────────────────────────┐
│          SemanticRouter CR          │
│   apiVersion: vllm.ai/v1alpha1      │
│   kind: SemanticRouter              │
└──────────────────┬──────────────────┘
                   │
                   ▼
        ┌──────────────────────┐
        │  Operator Controller │
        │  - Watches CR        │
        │  - Reconciles state  │
        │  - Platform detection│
        └──────────┬───────────┘
                   │
     ┌─────────────┼───────────┬────────────┐
     ▼             ▼           ▼            ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│Deployment│ │ Service  │ │ConfigMap │ │   PVC    │
│          │ │ - gRPC   │ │ - config │ │ - models │
│          │ │ - API    │ │ - tools  │ │          │
│          │ │ - metrics│ │          │ │          │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
```
Managed Resources:
- Deployment: Runs semantic router pods with configurable replicas
- Service: Exposes gRPC (50051), HTTP API (8080), and metrics (9190)
- ConfigMap: Contains semantic router configuration and tools database
- ServiceAccount: For RBAC (optional, created when specified)
- PersistentVolumeClaim: For ML model storage (optional, when persistence enabled)
- HorizontalPodAutoscaler: For auto-scaling (optional, when autoscaling enabled)
- Ingress: For external access (optional, when ingress enabled)
## Platform Detection and Security

The operator automatically detects the platform and configures security contexts appropriately.

### OpenShift Platform

When running on OpenShift, the operator:

- Detects: Checks for `route.openshift.io` API resources
- Security Context: Does NOT set `runAsUser`, `runAsGroup`, or `fsGroup`
- Rationale: Lets OpenShift SCCs assign UIDs/GIDs from the namespace's allowed range
- Compatible with: `restricted` SCC (default) and custom SCCs
### Standard Kubernetes

When running on standard Kubernetes, the operator:

- Security Context: Sets `runAsUser: 1000`, `fsGroup: 1000`, `runAsNonRoot: true`
- Rationale: Provides secure defaults for pod security policies/standards
### Both Platforms

Regardless of platform:

- Drops ALL capabilities (`drop: [ALL]`)
- Prevents privilege escalation (`allowPrivilegeEscalation: false`)
- No special permissions or SCCs required beyond defaults
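A sketch of what these defaults can look like in a rendered Pod spec (illustrative only, not the operator's literal output; the UID/GID fields apply to standard Kubernetes and are omitted on OpenShift so SCCs can assign them):

```yaml
# Illustrative sketch of the defaults described above.
securityContext:              # pod-level
  runAsNonRoot: true
  runAsUser: 1000             # standard Kubernetes only
  fsGroup: 1000               # standard Kubernetes only
containers:
  - name: semantic-router
    securityContext:          # container-level
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
```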
### Override Security Context

You can override the automatic security contexts in your CR:

```yaml
spec:
  # Container security context
  securityContext:
    runAsNonRoot: true
    runAsUser: 2000
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
  # Pod security context
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 2000
    fsGroup: 2000
```

When running on OpenShift, it's recommended to omit `runAsUser` and `fsGroup` and let SCCs handle UID/GID assignment automatically.
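For example, an OpenShift-friendly override might keep the hardening fields while leaving UID/GID assignment to the SCC (an illustrative sketch, not a required configuration):

```yaml
# Illustrative: override that omits runAsUser/fsGroup so OpenShift SCCs
# can assign UIDs/GIDs from the namespace's allowed range.
spec:
  securityContext:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
```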
## Configuration Reference

### Image Configuration

```yaml
spec:
  image:
    repository: ghcr.io/vllm-project/semantic-router/extproc
    tag: latest
    pullPolicy: IfNotPresent
    imageRegistry: ""  # Optional: custom registry prefix

  # Optional: Image pull secrets
  imagePullSecrets:
    - name: ghcr-secret
```
### Service Configuration

```yaml
spec:
  service:
    type: ClusterIP  # or NodePort, LoadBalancer
    grpc:
      port: 50051
      targetPort: 50051
    api:
      port: 8080
      targetPort: 8080
    metrics:
      enabled: true
      port: 9190
      targetPort: 9190
```
### Persistence Configuration

```yaml
spec:
  persistence:
    enabled: true
    storageClassName: "standard"  # Adjust for your cluster
    accessMode: ReadWriteOnce
    size: 10Gi
    # Optional: Use existing PVC
    existingClaim: "my-existing-pvc"
    # Optional: PVC annotations
    annotations:
      backup.velero.io/backup-volumes: "models"
```

Storage Class Examples:

- AWS EKS: `gp3-csi`, `gp2`
- GKE: `standard`, `premium-rwo`
- Azure AKS: `managed`, `managed-premium`
- OpenShift: `gp3-csi`, `thin`, `ocs-storagecluster-ceph-rbd`
### Semantic Router Configuration

The full semantic router configuration is embedded in the CR. See the complete example in `deploy/operator/config/samples/vllm_v1alpha1_semanticrouter.yaml`.

Key configuration sections:

```yaml
spec:
  config:
    # BERT model for embeddings
    bert_model:
      model_id: "models/mom-embedding-light"
      threshold: 0.6
      use_cpu: true

    # Semantic cache
    semantic_cache:
      enabled: true
      backend_type: "memory"  # or "milvus"
      similarity_threshold: 0.8
      max_entries: 1000
      ttl_seconds: 3600
      eviction_policy: "fifo"

    # Tools auto-selection
    tools:
      enabled: true
      top_k: 3
      similarity_threshold: 0.2
      tools_db_path: "config/tools_db.json"
      fallback_to_empty: true

    # Prompt guard (jailbreak detection)
    prompt_guard:
      enabled: true
      model_id: "models/mom-jailbreak-classifier"
      threshold: 0.7
      use_cpu: true

    # Classifiers
    classifier:
      category_model:
        model_id: "models/lora_intent_classifier_bert-base-uncased_model"
        threshold: 0.6
        use_cpu: true
      pii_model:
        model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
        threshold: 0.7
        use_cpu: true

    # Reasoning configuration per model family
    reasoning_families:
      deepseek:
        type: "chat_template_kwargs"
        parameter: "thinking"
      qwen3:
        type: "chat_template_kwargs"
        parameter: "enable_thinking"
      gpt:
        type: "reasoning_effort"
        parameter: "reasoning_effort"

    # API batch classification
    api:
      batch_classification:
        max_batch_size: 100
        concurrency_threshold: 5
        max_concurrency: 8
        metrics:
          enabled: true
          detailed_goroutine_tracking: true
          sample_rate: 1.0

    # Observability
    observability:
      tracing:
        enabled: false
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger:4317"
```
### Tools Database

Define available tools for auto-selection:

```yaml
spec:
  toolsDb:
    - tool:
        type: "function"
        function:
          name: "search_web"
          description: "Search the web for information"
          parameters:
            type: "object"
            properties:
              query:
                type: "string"
                description: "Search query"
            required: ["query"]
      description: "Search the internet, web search, find information online"
      category: "search"
      tags: ["search", "web", "internet"]
    - tool:
        type: "function"
        function:
          name: "calculate"
          description: "Perform mathematical calculations"
          parameters:
            type: "object"
            properties:
              expression:
                type: "string"
            required: ["expression"]
      description: "Calculate mathematical expressions"
      category: "math"
      tags: ["math", "calculation"]
```
### Autoscaling (HPA)

```yaml
spec:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
```

Note that CPU- and memory-based autoscaling requires a metrics source in the cluster (such as metrics-server) and resource requests on the router pods.
### Ingress Configuration

```yaml
spec:
  ingress:
    enabled: true
    className: "nginx"  # or "haproxy", "traefik", etc.
    annotations:
      cert-manager.io/cluster-issuer: "letsencrypt-prod"
    hosts:
      - host: router.example.com
        paths:
          - path: /
            pathType: Prefix
            servicePort: 8080
    tls:
      - secretName: router-tls
        hosts:
          - router.example.com
```
## Production Deployment

### High Availability Setup

```yaml
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
  name: prod-router
spec:
  replicas: 3

  # Anti-affinity for spreading across nodes
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/instance: prod-router
          topologyKey: kubernetes.io/hostname

  # Autoscaling
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 20
    targetCPUUtilizationPercentage: 70

  # Production resources
  resources:
    limits:
      memory: "10Gi"
      cpu: "4"
    requests:
      memory: "5Gi"
      cpu: "2"

  # Strict probes
  livenessProbe:
    enabled: true
    initialDelaySeconds: 60
    periodSeconds: 30
    failureThreshold: 3
  readinessProbe:
    enabled: true
    initialDelaySeconds: 30
    periodSeconds: 10
    failureThreshold: 3
```
### Pod Disruption Budget

Create a PDB to ensure availability during updates:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router
```
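For a 3-replica deployment, the same guarantee can also be stated positively with `minAvailable` (standard `policy/v1` PDB semantics, not operator-specific):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: prod-router-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app.kubernetes.io/instance: prod-router
```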
### Resource Allocation Guidelines
| Workload Type | Memory Request | CPU Request | Memory Limit | CPU Limit |
|---|---|---|---|---|
| Development | 1Gi | 500m | 2Gi | 1 |
| Staging | 3Gi | 1 | 7Gi | 2 |
| Production | 5Gi | 2 | 10Gi | 4 |
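As a concrete example, the development tier from the table above expressed as a CR `resources` block:

```yaml
# Development-tier sizing from the guidelines table.
spec:
  resources:
    requests:
      memory: "1Gi"
      cpu: "500m"
    limits:
      memory: "2Gi"
      cpu: "1"
```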
## Monitoring and Observability

### Metrics

Prometheus metrics are exposed on port 9190:

```bash
# Port-forward to access metrics locally
kubectl port-forward svc/my-router 9190:9190

# View metrics
curl http://localhost:9190/metrics
```

Key Metrics:

- `semantic_router_request_duration_seconds` - Request latency
- `semantic_router_cache_hit_total` - Cache hit rate
- `semantic_router_classification_duration_seconds` - Classification latency
- `semantic_router_tokens_total` - Token usage
- `semantic_router_reasoning_requests_total` - Reasoning mode usage
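One way to put these metrics to work is a PrometheusRule alert on tail latency. The sketch below is illustrative: the alert name, threshold, and the assumption that the duration metric is a Prometheus histogram (with `_bucket` series) are not guaranteed by the operator.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: semantic-router-alerts   # illustrative name
spec:
  groups:
    - name: semantic-router
      rules:
        - alert: SemanticRouterHighLatency   # illustrative alert
          # p99 latency over 5m windows; assumes a histogram metric
          expr: histogram_quantile(0.99, sum(rate(semantic_router_request_duration_seconds_bucket[5m])) by (le)) > 2
          for: 10m
          labels:
            severity: warning
```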
### ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: semantic-router-metrics
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-router
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
### Distributed Tracing

Enable OpenTelemetry tracing:

```yaml
spec:
  config:
    observability:
      tracing:
        enabled: true
        provider: "opentelemetry"
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector:4317"
          insecure: true
        sampling:
          type: "always_on"
          rate: 1.0
```
## Troubleshooting

### Common Issues

#### Pod stuck in ImagePullBackOff

```bash
# Check image pull secrets
kubectl describe pod <pod-name>

# Create image pull secret
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=<username> \
  --docker-password=<personal-access-token>
```

Then reference the secret in the SemanticRouter CR:

```yaml
spec:
  imagePullSecrets:
    - name: ghcr-secret
```
#### PVC stuck in Pending

```bash
# Check storage class exists
kubectl get storageclass

# Check PVC events
kubectl describe pvc my-router-models
```

Update the storage class in the CR:

```yaml
spec:
  persistence:
    storageClassName: "your-available-storage-class"
```
#### Models not downloading

```bash
# Check if HF token secret exists
kubectl get secret hf-token-secret

# Create HF token secret
kubectl create secret generic hf-token-secret \
  --from-literal=token=hf_xxxxxxxxxxxxx
```

Then add the token to the SemanticRouter CR:

```yaml
spec:
  env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token-secret
          key: token
```
#### Operator not detecting platform correctly

```bash
# Check operator logs for platform detection
kubectl logs -n semantic-router-operator-system \
  deployment/semantic-router-operator-controller-manager \
  | grep -i "platform\|openshift"

# Should see one of:
# "Detected OpenShift platform - will use OpenShift-compatible security contexts"
# "Detected standard Kubernetes platform - will use standard security contexts"
```
## Development

### Local Development

```bash
cd deploy/operator

# Run tests
make test

# Generate CRDs and code
make generate
make manifests

# Build operator binary
make build

# Run locally against your kubeconfig
make run
```
### Testing with kind

```bash
# Create kind cluster
kind create cluster --name operator-test

# Build and load image
make docker-build IMG=semantic-router-operator:dev
kind load docker-image semantic-router-operator:dev --name operator-test

# Deploy
make install
make deploy IMG=semantic-router-operator:dev

# Create test instance
kubectl apply -f config/samples/vllm_v1alpha1_semanticrouter.yaml
```