System Architecture

The Semantic Router implements a Mixture-of-Models (MoM) architecture built on Envoy Proxy, with an External Processor (ExtProc) service that makes the routing decisions. Separating traffic management (Envoy) from routing intelligence (ExtProc) keeps the design performant, scalable, and maintainable for production LLM deployments.

High-Level Architecture Overview

Core Components

1. Envoy Proxy - Traffic Management Layer

Role: Acts as the entry point and traffic director for all LLM requests.

Key Responsibilities:

  • Load Balancing: Distributes requests across backend model endpoints
  • Health Checking: Monitors backend model availability and health
  • Request/Response Processing: Handles HTTP protocol management
  • Header Management: Manages routing headers set by the ExtProc service
  • Timeout Management: Configures appropriate timeouts for different model types

Configuration Highlights:

# Envoy listener configuration
listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8801  # Main entry point

http_filters:
  - name: envoy.filters.http.ext_proc
    typed_config:
      grpc_service:
        envoy_grpc:
          cluster_name: extproc_service
      processing_mode:
        request_header_mode: "SEND"      # Send headers for routing decisions
        response_header_mode: "SEND"     # Process response headers
        request_body_mode: "BUFFERED"    # Analyze request content
        response_body_mode: "BUFFERED"   # Process response content

2. Semantic Router ExtProc Service - Intelligence Layer

Role: The brain of the system that makes intelligent routing decisions.

Architecture:

type OpenAIRouter struct {
    Config               *config.RouterConfig
    CategoryDescriptions []string
    Classifier           *classification.Classifier // ModernBERT-based
    PIIChecker           *pii.PolicyChecker         // Privacy protection
    Cache                *cache.SemanticCache       // Performance optimization
    ToolsDatabase        *tools.ToolsDatabase       // Tool selection

    pendingRequests     map[string][]byte // Request tracking
    pendingRequestsLock sync.Mutex        // Thread safety
}

Processing Pipeline:
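
In outline, the request-body phase runs the following stages (exact ordering is implementation-defined): jailbreak screening, PII detection against the configured policy, a semantic-cache lookup for similar prior queries, category classification, tool selection, and finally setting the routing headers that Envoy uses to choose the upstream endpoint.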

3. Classification System - Decision Engine

The classification system uses ModernBERT models for multiple classification tasks:

Category Classification

Multi-Task Architecture

# Conceptual model architecture
class SemanticRouter:
    def __init__(self):
        self.category_classifier = ModernBERTForSequenceClassification(
            num_labels=10  # Math, Creative, Code, etc.
        )
        self.pii_detector = ModernBERTForTokenClassification(
            num_labels=6  # PERSON, EMAIL, PHONE, SSN, LOCATION, NO_PII
        )
        self.jailbreak_guard = ModernBERTForSequenceClassification(
            num_labels=2  # Benign, Jailbreak
        )

    def route_request(self, query):
        # Multi-task inference
        category = self.category_classifier(query)
        pii_entities = self.pii_detector(query)
        safety_score = self.jailbreak_guard(query)

        return self.make_routing_decision(category, pii_entities, safety_score)

Data Flow Architecture

Request Processing Flow
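
In outline: a client request enters Envoy on port 8801; Envoy streams the request headers and buffered body to the ExtProc service over gRPC; the router runs the classification pipeline and sets routing headers; Envoy then forwards the request to the backend endpoint those headers select.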

Response Processing Flow
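
In outline: the backend response returns through Envoy, which again passes the response headers and buffered body to the ExtProc service; the router can record metrics and populate the semantic cache from the response before it reaches the client.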

Threading and Concurrency Model

Go ExtProc Server Concurrency

// Server handles multiple concurrent connections
func (s *Server) Start() error {
    lis, err := net.Listen("tcp", fmt.Sprintf(":%d", s.port))
    if err != nil {
        return fmt.Errorf("failed to listen on port %d: %w", s.port, err)
    }

    s.server = grpc.NewServer()
    ext_proc.RegisterExternalProcessorServer(s.server, s.router)

    // gRPC handles concurrency automatically:
    // each stream is served on its own goroutine.
    return s.server.Serve(lis)
}

// Process handles an individual request stream
func (r *OpenAIRouter) Process(stream ext_proc.ExternalProcessor_ProcessServer) error {
    // Each stream runs in its own goroutine and keeps
    // per-request state in its own context struct.
    ctx := &RequestContext{
        Headers: make(map[string]string),
    }

    for {
        req, err := stream.Recv()
        if err == io.EOF {
            return nil // client closed the stream
        }
        if err != nil {
            return err
        }

        // Process each message with thread-safe operations;
        // the handler bodies (which use ctx) are elided here.
        switch req.Request.(type) {
        case *ext_proc.ProcessingRequest_RequestHeaders:
            // Handle request headers
        case *ext_proc.ProcessingRequest_RequestBody:
            // Handle request body - where classification happens
        case *ext_proc.ProcessingRequest_ResponseHeaders:
            // Handle response headers
        }
    }
}

Thread Safety Considerations

type OpenAIRouter struct {
    // Thread-safe components
    Classifier *classification.Classifier // Read-only after init
    PIIChecker *pii.PolicyChecker         // Read-only after init
    Cache      *cache.SemanticCache       // Internally synchronized

    // Mutable state with protection
    pendingRequests     map[string][]byte
    pendingRequestsLock sync.Mutex // Protects pendingRequests
}

// Thread-safe request tracking
func (r *OpenAIRouter) trackRequest(id string, body []byte) {
    r.pendingRequestsLock.Lock()
    defer r.pendingRequestsLock.Unlock()
    r.pendingRequests[id] = body
}
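
Reads and deletes on the map take the same lock. A minimal counterpart sketch; popRequest is a hypothetical name for whatever the response path uses to claim the stored body:

// Hypothetical counterpart to trackRequest: fetch and remove a tracked
// request body under the same lock, e.g. when its response arrives.
func (r *OpenAIRouter) popRequest(id string) ([]byte, bool) {
    r.pendingRequestsLock.Lock()
    defer r.pendingRequestsLock.Unlock()
    body, ok := r.pendingRequests[id]
    if ok {
        delete(r.pendingRequests, id)
    }
    return body, ok
}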

Performance Characteristics

Latency Analysis

| Component | Typical Latency | Optimization |
| --- | --- | --- |
| Envoy Routing | 0.5-2ms | Optimized configuration |
| ExtProc gRPC | 1-3ms | Local network communication |
| PII Detection | 5-15ms | ModernBERT token classification |
| Jailbreak Guard | 3-8ms | ModernBERT binary classification |
| Category Classification | 8-20ms | ModernBERT sequence classification |
| Cache Lookup | 0.1-0.5ms | Redis/in-memory cache |
| Total Overhead | 15-50ms | Acceptable for most use cases |
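
The total is roughly the sum of the per-request components when they run sequentially: about 0.5 + 1 + 5 + 3 + 8 + 0.1 ≈ 18ms at the low end and 2 + 3 + 15 + 8 + 20 + 0.5 ≈ 49ms at the high end; a semantic-cache hit can avoid most of the classification cost.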

Throughput Optimization

// Batch processing for efficiency
type BatchProcessor struct {
    batchSize    int
    batchTimeout time.Duration
    classifier   *classification.Classifier
}

func (bp *BatchProcessor) processBatch(queries []string) []Classification {
    // Process multiple queries together for better GPU utilization
    return bp.classifier.ClassifyBatch(queries)
}
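
The snippet shows only the batched classification call; how individual queries accumulate into a batch is not shown. One plausible shape, using the batchSize and batchTimeout fields already present (the run method and its channel plumbing are illustrative assumptions, not the actual implementation):

// Illustrative batching loop (hypothetical): collect queries from a
// channel until the batch is full or the timeout fires, then classify
// the whole batch at once.
func (bp *BatchProcessor) run(in <-chan string, out chan<- []Classification) {
    batch := make([]string, 0, bp.batchSize)
    timer := time.NewTimer(bp.batchTimeout)
    defer timer.Stop()

    flush := func() {
        if len(batch) > 0 {
            out <- bp.processBatch(batch)
            batch = batch[:0]
        }
        timer.Reset(bp.batchTimeout)
    }

    for {
        select {
        case q, ok := <-in:
            if !ok {
                flush() // input closed: flush any partial batch
                return
            }
            batch = append(batch, q)
            if len(batch) >= bp.batchSize {
                flush()
            }
        case <-timer.C:
            flush() // timeout: flush a partial batch
        }
    }
}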

Memory Usage

| Component | Memory Usage | Notes |
| --- | --- | --- |
| ModernBERT Models | ~400MB each | Loaded once, shared across requests |
| Envoy Process | ~100-200MB | Depends on configuration |
| Go ExtProc Server | ~50-100MB | Scales with concurrent requests |
| Semantic Cache | ~500MB-2GB | Configurable, depends on cache size |
| Total System | ~1.5-3GB | Reasonable for production deployment |

Configuration Management

Router Configuration Structure

# config/config.yaml
router:
  # Model endpoints configuration
  endpoints:
    endpoint1:
      url: "http://127.0.0.1:11434"
      model_type: "math"
      cost_per_token: 0.002
      max_tokens: 4096

    endpoint2:
      url: "http://127.0.0.1:11434"
      model_type: "creative"
      cost_per_token: 0.003
      max_tokens: 8192

    endpoint3:
      url: "http://127.0.0.1:11434"
      model_type: "general"
      cost_per_token: 0.01
      max_tokens: 4096

  # Classification thresholds
  classification:
    confidence_threshold: 0.7
    fallback_model: "general"

  # Security settings
  security:
    enable_pii_detection: true
    enable_jailbreak_guard: true
    pii_action: "block"  # block, mask, or allow

  # Caching configuration
  cache:
    enabled: true
    similarity_threshold: 0.85
    ttl_seconds: 3600
    max_entries: 10000

  # Tools configuration
  tools:
    auto_selection: true
    max_tools: 5
    relevance_threshold: 0.6
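
For orientation, this YAML would deserialize into a Go struct along these lines; the field and tag names are inferred from the config keys and are an assumption, not the actual RouterConfig definition:

// Partial, assumed shape of RouterConfig mirroring the YAML keys above.
type RouterConfig struct {
    Endpoints map[string]EndpointConfig `yaml:"endpoints"`

    Classification struct {
        ConfidenceThreshold float64 `yaml:"confidence_threshold"`
        FallbackModel       string  `yaml:"fallback_model"`
    } `yaml:"classification"`
}

// EndpointConfig describes one backend model endpoint.
type EndpointConfig struct {
    URL          string  `yaml:"url"`
    ModelType    string  `yaml:"model_type"`
    CostPerToken float64 `yaml:"cost_per_token"`
    MaxTokens    int     `yaml:"max_tokens"`
}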

Dynamic Configuration Updates

// Configuration hot-reloading
type ConfigManager struct {
    config     *RouterConfig
    configLock sync.RWMutex
    watchers   []ConfigWatcher
}

func (cm *ConfigManager) UpdateConfig(newConfig *RouterConfig) error {
    cm.configLock.Lock()
    defer cm.configLock.Unlock()

    // Validate new configuration
    if err := newConfig.Validate(); err != nil {
        return err
    }

    // Apply configuration
    cm.config = newConfig

    // Notify all watchers
    for _, watcher := range cm.watchers {
        watcher.OnConfigUpdate(newConfig)
    }

    return nil
}
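
The RWMutex earns its keep on the read side, which is the hot path; a minimal sketch (GetConfig is an assumed accessor name):

// Hypothetical read-path accessor: many concurrent readers can hold
// the read lock simultaneously; only UpdateConfig takes the write lock.
func (cm *ConfigManager) GetConfig() *RouterConfig {
    cm.configLock.RLock()
    defer cm.configLock.RUnlock()
    return cm.config
}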

Error Handling and Resilience

Circuit Breaker Pattern

type CircuitState int

const (
    StateClosed CircuitState = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    maxFailures  int
    resetTimeout time.Duration
    state        CircuitState
    failures     int
    lastFailTime time.Time
    mutex        sync.Mutex
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    if cb.state == StateOpen {
        if time.Since(cb.lastFailTime) > cb.resetTimeout {
            cb.state = StateHalfOpen
        } else {
            return errors.New("circuit breaker is open")
        }
    }

    err := operation()
    if err != nil {
        cb.onFailure()
    } else {
        cb.onSuccess()
    }

    return err
}

// onFailure and onSuccess assume the caller already holds cb.mutex.
func (cb *CircuitBreaker) onFailure() {
    cb.failures++
    cb.lastFailTime = time.Now()
    if cb.failures >= cb.maxFailures {
        cb.state = StateOpen
    }
}

func (cb *CircuitBreaker) onSuccess() {
    cb.failures = 0
    cb.state = StateClosed
}

Fallback Strategies
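
When classification confidence falls below the configured confidence_threshold, or the preferred endpoint is unhealthy (for example, its circuit breaker is open), requests fall back to the configured fallback_model ("general" in the sample configuration above). A minimal sketch of that decision; selectModel, ConfidenceThreshold, and FallbackModel are illustrative names, not the actual API:

// Illustrative fallback decision (hypothetical names): prefer the
// classified category's model only when the classifier is confident
// and the endpoint is healthy; otherwise use the fallback model.
func (r *OpenAIRouter) selectModel(category string, confidence float64, healthy func(string) bool) string {
    if confidence < r.Config.ConfidenceThreshold || !healthy(category) {
        return r.Config.FallbackModel
    }
    return category
}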

Monitoring and Observability

Metrics Collection

// Prometheus metrics
var (
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "semantic_router_requests_total",
            Help: "Total number of requests processed",
        },
        []string{"endpoint", "category", "status"},
    )

    routingLatency = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "semantic_router_routing_duration_seconds",
            Help:    "Time spent on routing decisions",
            Buckets: prometheus.DefBuckets,
        },
        []string{"component"},
    )

    cacheHitRatio = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "semantic_router_cache_hit_ratio",
            Help: "Cache hit ratio for semantic cache",
        },
        []string{"cache_type"},
    )
)
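
Collectors created with the New*Vec constructors still need to be registered before they are exported; with client_golang that is typically done once at startup:

func init() {
    // Register the collectors with the default registry so they
    // appear on the /metrics endpoint.
    prometheus.MustRegister(requestsTotal, routingLatency, cacheHitRatio)
}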

Structured Logging

type RequestLogger struct {
    logger *logrus.Logger
}

func (rl *RequestLogger) LogRouting(ctx context.Context, decision *RoutingDecision) {
    rl.logger.WithFields(logrus.Fields{
        "request_id":      ctx.Value("request_id"),
        "category":        decision.Category,
        "confidence":      decision.Confidence,
        "selected_model":  decision.SelectedModel,
        "routing_time_ms": decision.ProcessingTime.Milliseconds(),
        "pii_detected":    decision.PIIDetected,
        "jailbreak_risk":  decision.JailbreakRisk,
        "cache_hit":       decision.CacheHit,
        "tools_selected":  len(decision.SelectedTools),
    }).Info("Request routed")
}

This architecture provides a robust, scalable, and maintainable foundation for intelligent LLM routing. The next section covers the Envoy ExtProc Integration in detail, explaining how the ExtProc protocol works and how our router implements it.