# Distributed Tracing with OpenTelemetry

This guide explains how to configure and use distributed tracing in vLLM Semantic Router for enhanced observability and debugging capabilities.

## Overview
vLLM Semantic Router implements comprehensive distributed tracing using OpenTelemetry, providing fine-grained visibility into the request processing pipeline. Tracing helps you:
- Debug Production Issues: Trace individual requests through the entire routing pipeline
- Optimize Performance: Identify bottlenecks in classification, caching, and routing
- Monitor Security: Track PII detection and jailbreak prevention operations
- Analyze Decisions: Understand routing logic and reasoning mode selection
- Correlate Services: Connect traces across the router and vLLM backends
## Architecture

### Trace Hierarchy

A typical request trace follows this structure:

```text
semantic_router.request.received [root span]
├─ semantic_router.classification
├─ semantic_router.security.pii_detection
├─ semantic_router.security.jailbreak_detection
├─ semantic_router.cache.lookup
├─ semantic_router.routing.decision
├─ semantic_router.backend.selection
├─ semantic_router.system_prompt.injection
└─ semantic_router.upstream.request
```

### Span Attributes
Each span includes rich attributes following OpenInference conventions for LLM observability:
Request Metadata:
- `request.id` - Unique request identifier
- `user.id` - User identifier (if available)
- `http.method` - HTTP method
- `http.path` - Request path
Model Information:
- `model.name` - Selected model name
- `routing.original_model` - Original requested model
- `routing.selected_model` - Model selected by the router
Classification:
- `category.name` - Classified category
- `classifier.type` - Classifier implementation
- `classification.time_ms` - Classification duration
Security:
- `pii.detected` - Whether PII was found
- `pii.types` - Types of PII detected
- `jailbreak.detected` - Whether a jailbreak attempt was detected
- `security.action` - Action taken (blocked, allowed)
Routing:
- `routing.strategy` - Routing strategy (auto, specified)
- `routing.reason` - Reason for the routing decision
- `reasoning.enabled` - Whether reasoning mode is enabled
- `reasoning.effort` - Reasoning effort level
Performance:
- `cache.hit` - Cache hit/miss status
- `cache.lookup_time_ms` - Cache lookup duration
- `processing.time_ms` - Total processing time
## Configuration

### Basic Configuration

Add the `observability.tracing` section to your `config.yaml`:

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"  # or "otlp"
      endpoint: "localhost:4317"
      insecure: true
    sampling:
      type: "always_on"  # or "probabilistic"
      rate: 1.0
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
### Configuration Options

#### Exporter Types

`stdout` - Print traces to the console (development):

```yaml
exporter:
  type: "stdout"
```

`otlp` - Export to an OTLP-compatible backend (production):

```yaml
exporter:
  type: "otlp"
  endpoint: "jaeger:4317"  # Jaeger, Tempo, Datadog, etc.
  insecure: true           # Use false with TLS in production
```

#### Sampling Strategies

`always_on` - Sample all requests (development/debugging):

```yaml
sampling:
  type: "always_on"
```

`always_off` - Disable sampling entirely (an emergency switch when tracing overhead must be eliminated):

```yaml
sampling:
  type: "always_off"
```

`probabilistic` - Sample a percentage of requests (production):

```yaml
sampling:
  type: "probabilistic"
  rate: 0.1  # Sample 10% of requests
```
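A quick sanity check for the rate is to estimate the export volume it produces. For example, at roughly 500 requests per second (an illustrative figure, not a measurement), a 10% rate exports about 50 traces per second:

```yaml
sampling:
  type: "probabilistic"
  rate: 0.1  # ~500 req/s x 0.1 ≈ 50 traces/s sent to the exporter
```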
### Environment-Specific Configurations

#### Development

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "stdout"
    sampling:
      type: "always_on"
    resource:
      service_name: "vllm-semantic-router-dev"
      deployment_environment: "development"
```

#### Production

```yaml
observability:
  tracing:
    enabled: true
    provider: "opentelemetry"
    exporter:
      type: "otlp"
      endpoint: "tempo:4317"
      insecure: false  # Use TLS
    sampling:
      type: "probabilistic"
      rate: 0.1  # 10% sampling
    resource:
      service_name: "vllm-semantic-router"
      service_version: "v0.1.0"
      deployment_environment: "production"
```
## Deployment

### With Jaeger

1. Start Jaeger (all-in-one, for testing):

   ```bash
   docker run -d --name jaeger \
     -p 4317:4317 \
     -p 16686:16686 \
     jaegertracing/all-in-one:latest
   ```

2. Configure the router:

   ```yaml
   observability:
     tracing:
       enabled: true
       exporter:
         type: "otlp"
         endpoint: "localhost:4317"
         insecure: true
       sampling:
         type: "probabilistic"
         rate: 0.1
   ```

3. Access the Jaeger UI at http://localhost:16686 (a quick API check is shown below).
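To confirm that spans are arriving without opening the UI, you can query the HTTP API that backs the Jaeger frontend (unofficial but stable); the service name must match the `service_name` in your resource configuration:

```bash
# The router should be listed once at least one sampled request has been traced
curl -s http://localhost:16686/api/services
```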
### With Grafana Tempo

1. Configure Tempo (`tempo.yaml`):

   ```yaml
   server:
     http_listen_port: 3200

   distributor:
     receivers:
       otlp:
         protocols:
           grpc:
             endpoint: 0.0.0.0:4317

   storage:
     trace:
       backend: local
       local:
         path: /tmp/tempo/traces
   ```

2. Start Tempo:

   ```bash
   docker run -d --name tempo \
     -p 4317:4317 \
     -p 3200:3200 \
     -v $(pwd)/tempo.yaml:/etc/tempo.yaml \
     grafana/tempo:latest \
     -config.file=/etc/tempo.yaml
   ```

3. Configure the router:

   ```yaml
   observability:
     tracing:
       enabled: true
       exporter:
         type: "otlp"
         endpoint: "tempo:4317"
         insecure: true
   ```
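Tempo has no UI of its own, so traces are typically viewed through Grafana. The following is a minimal datasource provisioning sketch, assuming Grafana runs alongside Tempo and can reach it at `http://tempo:3200` (the file path is illustrative):

```yaml
# e.g. grafana/provisioning/datasources/tempo.yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
```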
### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: router-config
data:
  config.yaml: |
    observability:
      tracing:
        enabled: true
        exporter:
          type: "otlp"
          endpoint: "jaeger-collector.observability.svc:4317"
          insecure: false
        sampling:
          type: "probabilistic"
          rate: 0.1
        resource:
          service_name: "vllm-semantic-router"
          deployment_environment: "production"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-router
spec:
  selector:
    matchLabels:
      app: semantic-router
  template:
    metadata:
      labels:
        app: semantic-router
    spec:
      containers:
        - name: router
          image: vllm-semantic-router:latest
          env:
            - name: CONFIG_PATH
              value: /config/config.yaml
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: router-config
```
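After applying the manifests, a couple of quick checks confirm that the config is mounted and the exporter is healthy (the manifest file name is illustrative, and the exact log text depends on your build):

```bash
kubectl apply -f router-tracing.yaml

# Verify the rendered config inside the pod
kubectl exec deploy/semantic-router -- cat /config/config.yaml

# Watch for span export errors
kubectl logs deploy/semantic-router | grep -i "export"
```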
## Usage Examples

### Viewing Traces

#### Console Output (stdout exporter)

```json
{
  "Name": "semantic_router.classification",
  "SpanContext": {
    "TraceID": "abc123...",
    "SpanID": "def456..."
  },
  "Attributes": [
    {
      "Key": "category.name",
      "Value": "math"
    },
    {
      "Key": "classification.time_ms",
      "Value": 45
    }
  ],
  "Duration": 45000000
}
```
#### Jaeger UI

1. Navigate to http://localhost:16686
2. Select the `vllm-semantic-router` service
3. Click "Find Traces"
4. View trace details and the timeline
### Analyzing Performance

Find slow requests:

```text
Service: vllm-semantic-router
Min Duration: 1s
Limit: 20
```

Analyze classification bottlenecks:

```text
Filter by operation: semantic_router.classification
Sort by duration (descending)
```

Track cache effectiveness:

```text
Filter by tag: cache.hit = true
Compare durations with cache misses
```
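The same searches can be scripted against the Jaeger HTTP API (the one backing the UI, so its parameters mirror the search form):

```bash
# The 20 most recent router traces slower than 1 second
curl -s "http://localhost:16686/api/traces?service=vllm-semantic-router&minDuration=1s&limit=20"
```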
### Debugging Issues

Find failed requests:

```text
Filter by tag: error = true
```

Trace a specific request:

```text
Filter by tag: request.id = req-abc-123
```

Find PII violations:

```text
Filter by tag: security.action = blocked
```
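Tag filters can also be passed to the Jaeger API; it expects the `tags` parameter as a URL-encoded JSON object:

```bash
# Traces where the router blocked a request
curl -s -G "http://localhost:16686/api/traces" \
  --data-urlencode "service=vllm-semantic-router" \
  --data-urlencode 'tags={"security.action":"blocked"}' \
  --data-urlencode "limit=20"
```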
## Trace Context Propagation

The router automatically propagates trace context using W3C Trace Context headers.

Request headers (extracted by the router):

```text
traceparent: 00-abc123-def456-01
tracestate: vendor=value
```

Upstream headers (injected by the router):

```text
traceparent: 00-abc123-ghi789-01
x-gateway-destination-endpoint: endpoint1
x-selected-model: gpt-4
```
This enables end-to-end tracing from client → router → vLLM backend.
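If your client is already part of a trace, include its `traceparent` header and the router will continue that trace rather than starting a new root span. A minimal sketch, assuming the router's OpenAI-compatible endpoint is exposed at `http://localhost:8801/v1/chat/completions` (adjust the host, port, and model name for your deployment; the trace ID below is a placeholder):

```bash
curl -s http://localhost:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" \
  -d '{
        "model": "auto",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}]
      }'
```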
## Performance Considerations

### Overhead
Tracing adds minimal overhead when properly configured:
- Always-on sampling: ~1-2% latency increase
- 10% probabilistic: ~0.1-0.2% latency increase
- Async export: No blocking on span export
### Optimization Tips

1. Use probabilistic sampling in production:

   ```yaml
   sampling:
     type: "probabilistic"
     rate: 0.1  # Adjust based on traffic
   ```

2. Adjust the sampling rate dynamically based on traffic:

   - High traffic: 0.01-0.1 (1-10%)
   - Medium traffic: 0.1-0.5 (10-50%)
   - Low traffic: 0.5-1.0 (50-100%)

3. Use batch exporters (the default):

   - Spans are batched before export
   - Reduces network overhead

4. Monitor exporter health:

   - Watch for export failures in logs
   - Configure retry policies
## Troubleshooting

### Traces Not Appearing

1. Check that tracing is enabled:

   ```yaml
   observability:
     tracing:
       enabled: true
   ```

2. Verify the exporter endpoint:

   ```bash
   # Test OTLP endpoint connectivity
   telnet jaeger 4317
   ```

3. Check the logs for errors:

   ```text
   Failed to export spans: connection refused
   ```

### Missing Spans

1. Check the sampling rate:

   ```yaml
   sampling:
     type: "probabilistic"
     rate: 1.0  # Increase to see more traces
   ```

2. Verify span creation in code:

   - Spans are created at key processing points
   - Check for nil context

### High Memory Usage

1. Reduce the sampling rate:

   ```yaml
   sampling:
     rate: 0.01  # 1% sampling
   ```

2. Verify the batch exporter is working (a tuning sketch follows this list):

   - Check the export interval
   - Monitor queue length
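If the router uses the OpenTelemetry SDK's standard batch span processor (an assumption; verify against your build), the spec-defined `OTEL_BSP_*` environment variables can bound its queue and batch sizes without code changes:

```bash
export OTEL_BSP_MAX_QUEUE_SIZE=1024        # spans buffered in memory (SDK default: 2048)
export OTEL_BSP_MAX_EXPORT_BATCH_SIZE=256  # spans per export call (SDK default: 512)
export OTEL_BSP_SCHEDULE_DELAY=5000        # export interval in milliseconds
```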
## Best Practices

1. Start with stdout in development:

   - Easy to verify tracing works
   - No external dependencies

2. Use probabilistic sampling in production:

   - Balances visibility and performance
   - Start with 10% and adjust

3. Set meaningful service names:

   - Use environment-specific names
   - Include version information

4. Add custom attributes for your use case (see the sketch after this list):

   - Customer IDs
   - Deployment region
   - Feature flags

5. Monitor exporter health:

   - Track export success rate
   - Alert on high failure rates

6. Correlate with metrics:

   - Use the same service name
   - Cross-reference trace IDs in logs
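For practice 4, here is a minimal sketch using the OpenTelemetry Go SDK, assuming you are extending the router's Go code and have the request's `context.Context` in hand; the function name and attribute keys are illustrative, not part of the router's API:

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// AnnotateSpan attaches deployment-specific attributes to whatever span is
// active on the context (it is a no-op if the context carries no span).
func AnnotateSpan(ctx context.Context, customerID, region string, flags []string) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(
		attribute.String("customer.id", customerID),
		attribute.String("deployment.region", region),
		attribute.StringSlice("feature.flags", flags),
	)
}
```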
## Integration with vLLM Stack

### Future Enhancements
The tracing implementation is designed to support future integration with vLLM backends:
- Trace context propagation to vLLM
- Correlated spans across router and engine
- End-to-end latency analysis
- Token-level timing from vLLM
Stay tuned for updates on vLLM integration!