# Router API Reference
The Semantic Router provides a gRPC-based API that integrates seamlessly with Envoy's External Processing (ExtProc) protocol. This document covers the API endpoints, request/response formats, and integration patterns.
## API Overview
The Semantic Router operates as an ExtProc server that processes HTTP requests through Envoy Proxy. It doesn't expose direct REST endpoints but rather processes OpenAI-compatible API requests routed through Envoy.
### Request Flow

Client → Envoy Proxy → ExtProc gRPC call to the Semantic Router → routing decision → selected backend → response back through Envoy to the client. For chat traffic the router does not serve HTTP itself; it inspects and annotates requests in flight as Envoy's external processor.
## OpenAI API Compatibility
The router processes standard OpenAI API requests:
### Chat Completions Endpoint

**Endpoint:** `POST /v1/chat/completions`
#### Request Format

```json
{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "What is the derivative of x^2?"
    }
  ],
  "max_tokens": 150,
  "temperature": 0.7,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "calculator",
        "description": "Perform mathematical calculations"
      }
    }
  ]
}
```
#### Response Format

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677858242,
  "model": "gpt-3.5-turbo",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The derivative of x^2 is 2x."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 8,
    "total_tokens": 20
  },
  "routing_metadata": {
    "selected_model": "mathematics",
    "confidence": 0.96,
    "processing_time_ms": 15,
    "cache_hit": false,
    "security_checks": {
      "pii_detected": false,
      "jailbreak_detected": false
    }
  }
}
```
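Note that `routing_metadata` is a router-specific extension, not part of the standard OpenAI response schema, so clients should read it defensively. A minimal sketch, using the field names from the example above:

```python
def routing_summary(response: dict) -> str:
    """Build a one-line summary of the router's extension fields.

    Returns an empty string when the metadata is absent, so the same
    code path also works against a plain OpenAI-compatible server.
    """
    meta = response.get("routing_metadata")
    if not meta:
        return ""
    checks = meta.get("security_checks", {})
    return (
        f"model={meta.get('selected_model')} "
        f"confidence={meta.get('confidence')} "
        f"cache_hit={meta.get('cache_hit')} "
        f"pii={checks.get('pii_detected')}"
    )
```

For the response shown above this yields `model=mathematics confidence=0.96 cache_hit=False pii=False`.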
## Routing Headers

The router adds metadata headers to both requests and responses:

### Request Headers (Added by Router)

| Header | Description | Example |
|---|---|---|
| `x-semantic-destination-endpoint` | Backend endpoint selected | `endpoint1` |
| `x-selected-model` | Model category determined | `mathematics` |
| `x-routing-confidence` | Classification confidence | `0.956` |
| `x-request-id` | Unique request identifier | `req-abc123` |
| `x-cache-status` | Cache hit/miss status | `miss` |
### Response Headers (Added by Router)

| Header | Description | Example |
|---|---|---|
| `x-processing-time` | Total processing time (ms) | `45` |
| `x-classification-time` | Classification time (ms) | `12` |
| `x-security-checks` | Security check results | `pii:false,jailbreak:false` |
| `x-tools-selected` | Number of tools selected | `2` |
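The `x-security-checks` value packs several results into one header. A sketch of splitting it back into booleans; the `key:value` comma-separated format is taken from the example above, so treat it as an assumption if your deployment encodes the header differently:

```python
def parse_security_checks(header: str) -> dict:
    """Parse a value like 'pii:false,jailbreak:false' into booleans."""
    result = {}
    for part in header.split(","):
        key, sep, value = part.partition(":")
        if not sep:
            continue  # skip malformed fragments
        result[key.strip()] = value.strip().lower() == "true"
    return result
```

For example, `parse_security_checks("pii:false,jailbreak:true")` returns `{"pii": False, "jailbreak": True}`.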
## Health Check API

The router provides health check endpoints for monitoring:

### Router Health

**Endpoint:** `GET http://localhost:8080/health`
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 3600,
  "models": {
    "category_classifier": "loaded",
    "pii_detector": "loaded",
    "jailbreak_guard": "loaded"
  },
  "cache": {
    "status": "healthy",
    "entries": 1247,
    "hit_rate": 0.73
  },
  "endpoints": {
    "endpoint1": "healthy",
    "endpoint2": "healthy",
    "endpoint3": "degraded"
  }
}
```
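When wiring this endpoint into a load balancer or readiness probe, it is often useful to collapse the structured payload into a single ready/not-ready decision. A sketch based on the field names in the example above; the readiness policy (all models loaded, at least one healthy backend) is an illustrative choice, not something the API mandates:

```python
def is_ready(health: dict) -> bool:
    """True when the router reports healthy, every model is loaded,
    and at least one backend endpoint is healthy."""
    if health.get("status") != "healthy":
        return False
    models = health.get("models", {})
    if not all(state == "loaded" for state in models.values()):
        return False
    endpoints = health.get("endpoints", {})
    return any(state == "healthy" for state in endpoints.values())
```

Under this policy the example payload above still counts as ready even though `endpoint3` is degraded, because two other backends remain healthy.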
## Metrics API

Prometheus-compatible metrics are available:

**Endpoint:** `GET http://localhost:9090/metrics`

### Key Metrics
```text
# Request metrics
semantic_router_requests_total{endpoint="endpoint1",category="mathematics",status="success"} 1247
semantic_router_request_duration_seconds{endpoint="endpoint1"} 0.045

# Classification metrics
semantic_router_classification_accuracy{category="mathematics"} 0.94
semantic_router_classification_duration_seconds 0.012

# Cache metrics
semantic_router_cache_hit_ratio 0.73
semantic_router_cache_size 1247

# Security metrics
semantic_router_pii_detections_total{action="block"} 23
semantic_router_jailbreak_attempts_total{action="block"} 5
```
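Beyond scraping with Prometheus, the plain-text exposition format is easy to inspect ad hoc. A minimal sketch that extracts `name{labels} value` samples from output like the block above; comment lines are skipped, and the full exposition-format grammar (timestamps, escapes) is deliberately ignored:

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$"
)

def parse_metrics(text: str) -> dict:
    """Map 'name{labels}' -> float value for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE/comment lines
        m = SAMPLE_RE.match(line)
        if m:
            key = m.group("name")
            if m.group("labels"):
                key += "{" + m.group("labels") + "}"
            samples[key] = float(m.group("value"))
    return samples
```

For anything beyond a quick check, prefer a real client such as `prometheus_client`'s text parser.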
## gRPC ExtProc API

For direct integration with the ExtProc protocol:

### Service Definition
```protobuf
syntax = "proto3";

package envoy.service.ext_proc.v3;

service ExternalProcessor {
  rpc Process(stream ProcessingRequest) returns (stream ProcessingResponse);
}

message ProcessingRequest {
  oneof request {
    RequestHeaders request_headers = 1;
    RequestBody request_body = 2;
    ResponseHeaders response_headers = 3;
    ResponseBody response_body = 4;
  }
}

message ProcessingResponse {
  oneof response {
    RequestHeadersResponse request_headers = 1;
    RequestBodyResponse request_body = 2;
    ResponseHeadersResponse response_headers = 3;
    ResponseBodyResponse response_body = 4;
  }
}
```

Message and field names are simplified here for readability; the authoritative definition ships with Envoy as `envoy/service/ext_proc/v3/external_processor.proto` and includes additional message types and fields.
### Processing Methods

#### Request Headers Processing

```go
func (r *Router) handleRequestHeaders(headers *ProcessingRequest_RequestHeaders) *ProcessingResponse {
	// Extract request metadata and set up the per-request context,
	// then tell Envoy to continue so the body is streamed next.
	return &ProcessingResponse{
		Response: &ProcessingResponse_RequestHeaders{},
	}
}
```
#### Request Body Processing

```go
func (r *Router) handleRequestBody(body *ProcessingRequest_RequestBody) *ProcessingResponse {
	// Parse the OpenAI request, classify query intent, run security
	// checks, and select the optimal model; the returned response
	// carries the routing headers that steer Envoy to the chosen backend.
	return &ProcessingResponse{
		Response: &ProcessingResponse_RequestBody{},
	}
}
```
## Error Handling

### Error Response Format

```json
{
  "error": {
    "message": "PII detected in request",
    "type": "security_violation",
    "code": "pii_detected",
    "details": {
      "entities_found": ["EMAIL", "PERSON"],
      "action_taken": "block"
    }
  }
}
```
### HTTP Status Codes

| Status | Description |
|---|---|
| 200 | Success |
| 400 | Bad Request (malformed input) |
| 403 | Forbidden (security violation) |
| 429 | Too Many Requests (rate limited) |
| 500 | Internal Server Error |
| 503 | Service Unavailable (backend down) |
## Configuration API

### Runtime Configuration Updates

**Endpoint:** `POST /admin/config/update`

```json
{
  "classification": {
    "confidence_threshold": 0.8
  },
  "security": {
    "enable_pii_detection": true
  },
  "cache": {
    "ttl_seconds": 7200
  }
}
```
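A sketch of pushing an update from Python, assuming the admin endpoint accepts the JSON schema shown above; the port (reusing the health port 8080) and the client-side range checks are illustrative assumptions, not part of the documented API:

```python
import json
import urllib.request

def validate_config(update: dict) -> list:
    """Return a list of client-side validation errors (empty = OK)."""
    errors = []
    threshold = update.get("classification", {}).get("confidence_threshold")
    if threshold is not None and not (0.0 <= threshold <= 1.0):
        errors.append("confidence_threshold must be between 0 and 1")
    ttl = update.get("cache", {}).get("ttl_seconds")
    if ttl is not None and ttl <= 0:
        errors.append("ttl_seconds must be positive")
    return errors

def update_config(update: dict, base_url: str = "http://localhost:8080") -> dict:
    """Validate locally, then POST the update to the admin endpoint."""
    problems = validate_config(update)
    if problems:
        raise ValueError("; ".join(problems))
    req = urllib.request.Request(
        f"{base_url}/admin/config/update",
        data=json.dumps(update).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```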
## WebSocket API (Optional)

For real-time streaming responses:

**Endpoint:** `ws://localhost:8801/v1/chat/stream`

```javascript
const ws = new WebSocket('ws://localhost:8801/v1/chat/stream');

// Send only once the connection is open; calling send() right after
// the constructor can throw before the handshake completes.
ws.onopen = () => {
  ws.send(JSON.stringify({
    model: 'gpt-3.5-turbo',
    messages: [{ role: 'user', content: 'Tell me a story' }],
    stream: true
  }));
};

ws.onmessage = (event) => {
  const chunk = JSON.parse(event.data);
  console.log(chunk.choices[0].delta.content);
};
```
## Client Libraries

### Python Client

```python
import requests


class SemanticRouterClient:
    def __init__(self, base_url="http://localhost:8801"):
        self.base_url = base_url

    def chat_completion(self, messages, model="gpt-3.5-turbo", **kwargs):
        response = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={
                "model": model,
                "messages": messages,
                **kwargs,
            },
            timeout=30,
        )
        return response.json()

    def get_health(self):
        response = requests.get(f"{self.base_url}/health", timeout=10)
        return response.json()


# Usage
client = SemanticRouterClient()
result = client.chat_completion([
    {"role": "user", "content": "What is 2 + 2?"}
])
```
### JavaScript Client

```javascript
class SemanticRouterClient {
  constructor(baseUrl = 'http://localhost:8801') {
    this.baseUrl = baseUrl;
  }

  async chatCompletion(messages, model = 'gpt-3.5-turbo', options = {}) {
    const response = await fetch(`${this.baseUrl}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model,
        messages,
        ...options
      })
    });
    return response.json();
  }

  async getHealth() {
    const response = await fetch(`${this.baseUrl}/health`);
    return response.json();
  }
}

// Usage
const client = new SemanticRouterClient();
const result = await client.chatCompletion([
  { role: 'user', content: 'Solve x^2 + 5x + 6 = 0' }
]);
```
## Rate Limiting

The router implements rate limiting with the following headers:

### Rate Limit Headers

```text
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 999
X-RateLimit-Reset: 1640995200
X-RateLimit-Retry-After: 60
```
### Rate Limit Response

```json
{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error",
    "code": "too_many_requests",
    "details": {
      "limit": 1000,
      "window": "1h",
      "retry_after": 60
    }
  }
}
```
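A small helper for honoring these headers on a 429: prefer the explicit `X-RateLimit-Retry-After` hint, fall back to the reset timestamp, and use a conservative default when neither is present. The header names are taken from the table above; the one-second default is an illustrative choice:

```python
def retry_wait_seconds(headers: dict, now: float) -> float:
    """Seconds to sleep before retrying, based on rate-limit headers.

    `now` is the current Unix time, compared against X-RateLimit-Reset.
    """
    retry_after = headers.get("X-RateLimit-Retry-After")
    if retry_after is not None:
        return max(0.0, float(retry_after))
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        return max(0.0, float(reset) - now)
    return 1.0  # conservative default when no hint is present
```

Call it with `time.time()` as `now` and sleep for the returned duration before re-issuing the request.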
## Best Practices

### 1. Request Optimization

```python
# Include relevant context
messages = [
    {
        "role": "system",
        "content": "You are a mathematics tutor."
    },
    {
        "role": "user",
        "content": "Explain derivatives in simple terms"
    }
]

# Use appropriate tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "For mathematical calculations"
        }
    }
]
```
### 2. Error Handling

```python
try:
    response = client.chat_completion(messages)
    if 'error' in response:
        handle_router_error(response['error'])
    else:
        process_response(response)
except requests.exceptions.Timeout:
    handle_timeout_error()
except requests.exceptions.ConnectionError:
    handle_connection_error()
```
### 3. Monitoring Integration

```python
import time

start_time = time.time()
response = client.chat_completion(messages)
duration = time.time() - start_time

# Log routing metadata
routing_info = response.get('routing_metadata', {})
logger.info(f"Request routed to {routing_info.get('selected_model')} "
            f"with confidence {routing_info.get('confidence')} "
            f"in {duration:.2f}s")
```
## Next Steps
- Classification API: Detailed classification endpoints
- System Architecture: System monitoring and observability
- Quick Start Guide: Real-world integration examples
- Configuration Guide: Production configuration
For more advanced API usage and custom integrations, refer to the examples directory or join our community discussions.