Distributed Tracing with OpenTelemetry

This guide explains how to configure and use distributed tracing in vLLM Semantic Router for enhanced observability and debugging capabilities.

Overview

vLLM Semantic Router implements comprehensive distributed tracing using OpenTelemetry, providing fine-grained visibility into the request processing pipeline. Tracing helps you:

  • Debug Production Issues: Trace individual requests through the entire routing pipeline
  • Optimize Performance: Identify bottlenecks in classification, caching, and routing
  • Monitor Security: Track PII detection and jailbreak prevention operations
  • Analyze Decisions: Understand routing logic and reasoning mode selection
  • Correlate Services: Connect traces across the router and vLLM backends

Architecture

Trace Hierarchy

A typical request trace follows this structure:

semantic_router.request.received [root span]
├─ semantic_router.classification
├─ semantic_router.security.pii_detection
├─ semantic_router.security.jailbreak_detection
├─ semantic_router.cache.lookup
├─ semantic_router.routing.decision
├─ semantic_router.backend.selection
├─ semantic_router.system_prompt.injection
└─ semantic_router.upstream.request

Span Attributes

Each span includes rich attributes following OpenInference conventions for LLM observability; a Go sketch showing how such attributes are set follows the lists below:

Request Metadata:

  • request.id - Unique request identifier
  • user.id - User identifier (if available)
  • http.method - HTTP method
  • http.path - Request path

Model Information:

  • model.name - Selected model name
  • routing.original_model - Original requested model
  • routing.selected_model - Model selected by router

Classification:

  • category.name - Classified category
  • classifier.type - Classifier implementation
  • classification.time_ms - Classification duration

Security:

  • pii.detected - Whether PII was found
  • pii.types - Types of PII detected
  • jailbreak.detected - Whether a jailbreak attempt was detected
  • security.action - Action taken (blocked, allowed)

Routing:

  • routing.strategy - Routing strategy (auto, specified)
  • routing.reason - Reason for routing decision
  • reasoning.enabled - Whether reasoning mode is enabled
  • reasoning.effort - Reasoning effort level

Performance:

  • cache.hit - Cache hit/miss status
  • cache.lookup_time_ms - Cache lookup duration
  • processing.time_ms - Total processing time
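For illustration, here is a minimal Go sketch of how a span carrying these attributes might be created with the OpenTelemetry Go SDK. The attribute keys come from the lists above; the tracer name, classifier stub, and attribute values are hypothetical placeholders, not the router's actual code.

package router

import (
    "context"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// runClassifier is a stub standing in for the real classifier call.
func runClassifier(ctx context.Context, query string) string { return "math" }

func classify(ctx context.Context, query string) {
    // Start a child span under the request's root span carried in ctx.
    tracer := otel.Tracer("semantic-router") // hypothetical instrumentation name
    ctx, span := tracer.Start(ctx, "semantic_router.classification")
    defer span.End()

    start := time.Now()
    category := runClassifier(ctx, query)

    // Record the attributes documented above.
    span.SetAttributes(
        attribute.String("category.name", category),
        attribute.String("classifier.type", "bert"), // hypothetical value
        attribute.Int64("classification.time_ms", time.Since(start).Milliseconds()),
    )
}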

Configuration

Basic Configuration

Add the observability.tracing section to your config.yaml:

observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout" # or "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "always_on" # or "probabilistic"
      rate: 1.0
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
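As a rough illustration of what this configuration maps to inside the process, here is a sketch of tracer-provider initialization with the OpenTelemetry Go SDK (assuming the router uses it; the function below is illustrative, not the router's actual startup code):

package router

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // exporter: type "otlp" with endpoint and insecure settings
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(), // insecure: true
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),                     // spans export in background batches
        sdktrace.WithSampler(sdktrace.AlwaysSample()), // sampling: type "always_on"
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("vllm-semantic-router"),
            semconv.ServiceVersion("v0.1.0"),
            semconv.DeploymentEnvironment("production"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}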

Configuration Options

Exporter Types

stdout - Print traces to console (development)

exporter:
  type: "stdout"

otlp - Export to OTLP-compatible backend (production)

exporter:
  type: "otlp"
  endpoint: "jaeger:4317" # Jaeger, Tempo, Datadog, etc.
  insecure: true          # Use false with TLS in production

Sampling Strategies

always_on - Sample all requests (development/debugging)

sampling:
  type: "always_on"

always_off - Sample no requests (an emergency performance switch)

sampling:
  type: "always_off"

probabilistic - Sample a percentage of requests (production)

sampling:
  type: "probabilistic"
  rate: 0.1 # Sample 10% of requests
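In OpenTelemetry SDK terms, these three strategies correspond roughly to the SDK's built-in samplers. A minimal Go sketch of the mapping (an assumption about the implementation, not the router's actual code):

package router

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// samplerFor maps the sampling config above to an SDK sampler.
func samplerFor(typ string, rate float64) sdktrace.Sampler {
    switch typ {
    case "always_on":
        return sdktrace.AlwaysSample()
    case "always_off":
        return sdktrace.NeverSample()
    case "probabilistic":
        // ParentBased keeps decisions consistent within a trace:
        // child spans inherit the root span's sampling decision.
        return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(rate))
    default:
        return sdktrace.AlwaysSample()
    }
}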

Environment-Specific Configurations

Development

observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"
    sampling:
      type: "always_on"
    resource:
      service_name: "vllm-semantic-router-dev"
      deployment_environment: "development"

Production

observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "otlp"
      endpoint: "tempo:4317"
      insecure: false # Use TLS
    sampling:
      type: "probabilistic"
      rate: 0.1 # 10% sampling
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"

Deployment

With Jaeger

  1. Start Jaeger (all-in-one, for testing):

     docker run -d --name jaeger \
       -p 4317:4317 \
       -p 16686:16686 \
       jaegertracing/all-in-one:latest

  2. Configure the router:

     observability:
       tracing:
         enabled: true
         exporter:
           type: "otlp"
           endpoint: "localhost:4317"
           insecure: true
         sampling:
           type: "probabilistic"
           rate: 0.1

  3. Access the Jaeger UI at http://localhost:16686

With Grafana Tempo

  1. Configure Tempo (tempo.yaml):

     server:
       http_listen_port: 3200

     distributor:
       receivers:
         otlp:
           protocols:
             grpc:
               endpoint: 0.0.0.0:4317

     storage:
       trace:
         backend: local
         local:
           path: /tmp/tempo/traces

  2. Start Tempo:

     docker run -d --name tempo \
       -p 4317:4317 \
       -p 3200:3200 \
       -v $(pwd)/tempo.yaml:/etc/tempo.yaml \
       grafana/tempo:latest \
       -config.file=/etc/tempo.yaml

  3. Configure the router:

     observability:
       tracing:
         enabled: true
         exporter:
           type: "otlp"
           endpoint: "tempo:4317"
           insecure: true

Kubernetes Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: router-config
data:
  config.yaml: |
    observability:
      tracing:
        enabled: true
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector.observability.svc:4317"
          insecure: false
        sampling:
          type: "probabilistic"
          rate: 0.1
        resource:
          service_name: "vllm-semantic-router"
          deployment_environment: "production"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-router
spec:
  selector:
    matchLabels:
      app: semantic-router
  template:
    metadata:
      labels:
        app: semantic-router
    spec:
      containers:
        - name: router
          image: vllm-semantic-router:latest
          env:
            - name: CONFIG_PATH
              value: /config/config.yaml
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: router-config

Usage Examples

Viewing Traces

Console Output (stdout exporter)

{
  "Name": "semantic_router.classification",
  "SpanContext": {
    "TraceID": "abc123...",
    "SpanID": "def456..."
  },
  "Attributes": [
    {
      "Key": "category.name",
      "Value": "math"
    },
    {
      "Key": "classification.time_ms",
      "Value": 45
    }
  ],
  "Duration": 45000000
}

Note that Duration is reported in nanoseconds: 45000000 ns matches the 45 ms recorded in classification.time_ms.

Jaeger UI

  1. Navigate to http://localhost:16686
  2. Select service: vllm-semantic-router
  3. Click "Find Traces"
  4. View trace details and timeline

Analyzing Performance

Find slow requests:

Service: vllm-semantic-router
Min Duration: 1s
Limit: 20

Analyze classification bottlenecks:

Operation: semantic_router.classification
Sort: duration (descending)

Track cache effectiveness:

Tag: cache.hit = true
Compare durations against cache misses

Debugging Issues

Find failed requests:

Tag: error = true

Trace a specific request:

Tag: request.id = req-abc-123

Find PII violations:

Tag: security.action = blocked

Trace Context Propagation

The router automatically propagates trace context using W3C Trace Context headers:

Request headers (extracted by router):

traceparent: 00-abc123-def456-01
tracestate: vendor=value

Upstream headers (injected by router):

traceparent: 00-abc123-ghi789-01
x-gateway-destination-endpoint: endpoint1
x-selected-model: gpt-4

This enables end-to-end tracing from client → router → vLLM backend.
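This behavior matches the standard OpenTelemetry propagation API. A Go sketch of the extract/inject flow (the handler and upstream URL are hypothetical placeholders, not the router's source):

package router

import (
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func init() {
    // Register the W3C Trace Context propagator globally.
    otel.SetTextMapPropagator(propagation.TraceContext{})
}

func proxyHandler(w http.ResponseWriter, r *http.Request) {
    // Extract traceparent/tracestate from the incoming request.
    ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))

    // ... classification, routing, and span creation happen with ctx ...

    // Inject the current span context into the upstream request headers.
    upstream, err := http.NewRequestWithContext(ctx, http.MethodPost,
        "http://vllm-backend/v1/chat/completions", nil) // hypothetical backend URL
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(upstream.Header))
    // upstream is now ready to be sent with an http.Client.
}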

Performance Considerations

Overhead

Tracing adds minimal overhead when properly configured:

  • Always-on sampling: ~1-2% latency increase
  • 10% probabilistic: ~0.1-0.2% latency increase
  • Async export: No blocking on span export

Optimization Tips

  1. Use probabilistic sampling in production:

     sampling:
       type: "probabilistic"
       rate: 0.1 # Adjust based on traffic

  2. Adjust the sampling rate to match traffic volume:

    • High traffic: 0.01-0.1 (1-10%)
    • Medium traffic: 0.1-0.5 (10-50%)
    • Low traffic: 0.5-1.0 (50-100%)
  3. Use batch exporters (the default; see the Go sketch after this list):

    • Spans are batched before export
    • Reduces network overhead
  4. Monitor exporter health:

    • Watch for export failures in logs
    • Configure retry policies
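To make tips 3 and 4 concrete, here is a hedged Go sketch of explicit batch-processor tuning in the OpenTelemetry Go SDK; the values shown are illustrative, not the router's actual settings:

package router

import (
    "time"

    "go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newBatchingProvider() (*sdktrace.TracerProvider, error) {
    exp, err := stdouttrace.New()
    if err != nil {
        return nil, err
    }
    // Batch processor: spans are queued in memory and flushed
    // periodically, rather than exported one at a time.
    bsp := sdktrace.NewBatchSpanProcessor(exp,
        sdktrace.WithBatchTimeout(5*time.Second), // flush interval
        sdktrace.WithMaxQueueSize(2048),          // spans dropped beyond this queue length
        sdktrace.WithMaxExportBatchSize(512),     // max spans per export call
    )
    return sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(bsp)), nil
}

If the queue fills faster than the exporter drains it, spans are dropped and show up as missing traces; lowering the sampling rate or raising the queue size addresses this.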

Troubleshooting

Traces Not Appearing

  1. Check that tracing is enabled:

     observability:
       tracing:
         enabled: true

  2. Verify exporter endpoint connectivity:

     # Test OTLP endpoint connectivity
     telnet jaeger 4317

  3. Check the logs for export errors such as:

     Failed to export spans: connection refused

Missing Spans

  1. Check the sampling rate:

     sampling:
       type: "probabilistic"
       rate: 1.0 # Increase to see more traces

  2. Verify span creation in code:

    • Spans are created at key processing points
    • Check for nil context

High Memory Usage

  1. Reduce the sampling rate:

     sampling:
       rate: 0.01 # 1% sampling

  2. Verify the batch exporter is working:

    • Check the export interval
    • Monitor queue length

Best Practices

  1. Start with stdout in development

    • Easy to verify tracing works
    • No external dependencies
  2. Use probabilistic sampling in production

    • Balances visibility and performance
    • Start with 10% and adjust
  3. Set meaningful service names

    • Use environment-specific names
    • Include version information
  4. Add custom attributes for your use case

    • Customer IDs
    • Deployment region
    • Feature flags
  5. Monitor exporter health

    • Track export success rate
    • Alert on high failure rates
  6. Correlate with metrics

    • Use same service name
    • Cross-reference trace IDs in logs

Integration with vLLM Stack

Future Enhancements

The tracing implementation is designed to support future integration with vLLM backends:

  1. Trace context propagation to vLLM
  2. Correlated spans across router and engine
  3. End-to-end latency analysis
  4. Token-level timing from vLLM

Stay tuned for updates on vLLM integration!
