# Configuration Guide
This guide covers the configuration options for the Semantic Router. The system uses a single YAML configuration file that controls all aspects of routing, classification, and security.
## Configuration File

The configuration file is located at `config/config.yaml`. Here's the structure based on the actual implementation:
```yaml
# config/config.yaml - Actual configuration structure

# BERT model for semantic similarity
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6
  use_cpu: true

# Semantic caching
semantic_cache:
  enabled: false
  similarity_threshold: 0.8
  max_entries: 1000
  ttl_seconds: 3600

# Tool auto-selection
tools:
  enabled: false
  top_k: 3
  similarity_threshold: 0.2
  tools_db_path: "config/tools_db.json"
  fallback_to_empty: true

# Jailbreak protection
prompt_guard:
  enabled: false
  use_modernbert: true
  model_id: "models/jailbreak_classifier_modernbert-base_model"
  threshold: 0.7
  use_cpu: true

# vLLM endpoints - your backend models
vllm_endpoints:
  - name: "endpoint1"
    address: "your-server.com" # Replace with your server
    port: 11434
    models:
      - "your-model" # Replace with your model
    weight: 1

# Model configuration
model_config:
  "your-model":
    param_count: 7000000000 # Model parameters
    batch_size: 512.0
    context_size: 4096.0
    pii_policy:
      allow_by_default: true
      pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
    preferred_endpoints: ["endpoint1"]

# Classification models
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6
    use_cpu: true
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    use_modernbert: true
    threshold: 0.7
    use_cpu: true

# Categories and routing rules
categories:
  - name: math
    use_reasoning: true # Enable reasoning for math
    model_scores:
      - model: your-model
        score: 1.0
  - name: computer science
    use_reasoning: true # Enable reasoning for code
    model_scores:
      - model: your-model
        score: 1.0
  - name: other
    use_reasoning: false # No reasoning for general queries
    model_scores:
      - model: your-model
        score: 0.8

default_model: your-model
```
## Key Configuration Sections
### Backend Endpoints

Configure your LLM servers:

```yaml
vllm_endpoints:
  - name: "my_endpoint"
    address: "127.0.0.1" # Your server IP
    port: 8000           # Your server port
    models:
      - "llama2-7b"      # Model name
    weight: 1            # Load balancing weight
```
### Model Settings

Configure model-specific settings:

```yaml
model_config:
  "llama2-7b":
    param_count: 7000000000 # Model size in parameters
    batch_size: 512.0       # Batch size
    context_size: 4096.0    # Context window
    pii_policy:
      allow_by_default: true # Allow PII by default
      pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"]
    preferred_endpoints: ["my_endpoint"]
```
### Classification Models

Configure the BERT classification models:

```yaml
classifier:
  category_model:
    model_id: "models/category_classifier_modernbert-base_model"
    use_modernbert: true
    threshold: 0.6 # Classification confidence threshold
    use_cpu: true  # Use CPU (no GPU required)
  pii_model:
    model_id: "models/pii_classifier_modernbert-base_presidio_token_model"
    threshold: 0.7 # PII detection threshold
    use_cpu: true
```
### Categories and Routing

Define how different query types are handled:

```yaml
categories:
  - name: math
    use_reasoning: true # Enable reasoning for math problems
    reasoning_description: "Mathematical problems require step-by-step reasoning"
    model_scores:
      - model: your-model
        score: 1.0 # Preference score for this model
  - name: computer science
    use_reasoning: true # Enable reasoning for code
    model_scores:
      - model: your-model
        score: 1.0
  - name: other
    use_reasoning: false # No reasoning for general queries
    model_scores:
      - model: your-model
        score: 0.8

default_model: your-model # Fallback model
```
### Security Features

Configure PII detection and jailbreak protection:

```yaml
# PII Detection
classifier:
  pii_model:
    threshold: 0.7 # Higher = more strict PII detection

# Jailbreak Protection
prompt_guard:
  enabled: true  # Enable jailbreak detection
  threshold: 0.7 # Detection sensitivity
  use_cpu: true  # Runs on CPU

# Model-level PII policies
model_config:
  "your-model":
    pii_policy:
      allow_by_default: true # Allow most content
      pii_types_allowed: ["EMAIL_ADDRESS", "PERSON"] # Specific allowed types
```
### Optional Features

Configure additional features:

```yaml
# Semantic Caching
semantic_cache:
  enabled: true             # Enable semantic caching
  similarity_threshold: 0.8 # Cache hit threshold
  max_entries: 1000         # Maximum cache entries
  ttl_seconds: 3600         # Cache expiration time

# Tool Auto-Selection
tools:
  enabled: true             # Enable automatic tool selection
  top_k: 3                  # Number of tools to select
  similarity_threshold: 0.2 # Tool relevance threshold
  tools_db_path: "config/tools_db.json"
  fallback_to_empty: true   # Return empty on failure

# BERT Model for Similarity
bert_model:
  model_id: sentence-transformers/all-MiniLM-L12-v2
  threshold: 0.6 # Similarity threshold
  use_cpu: true  # CPU-only inference
```
## Common Configuration Examples

### Enable All Security Features

```yaml
# Enable PII detection
classifier:
  pii_model:
    threshold: 0.8 # Strict PII detection

# Enable jailbreak protection
prompt_guard:
  enabled: true
  threshold: 0.7

# Configure model PII policies
model_config:
  "your-model":
    pii_policy:
      allow_by_default: false # Block all PII by default
      pii_types_allowed: []   # No PII allowed
```
### Performance Optimization

```yaml
# Enable caching
semantic_cache:
  enabled: true
  similarity_threshold: 0.85 # Higher = more cache hits
  max_entries: 5000
  ttl_seconds: 7200 # 2 hour cache

# Enable tool selection
tools:
  enabled: true
  top_k: 5                  # Select more tools
  similarity_threshold: 0.1 # Lower = more tools selected
```
### Development Setup

```yaml
# Disable security for testing
prompt_guard:
  enabled: false

# Disable caching for consistent results
semantic_cache:
  enabled: false

# Lower classification thresholds
classifier:
  category_model:
    threshold: 0.3 # Lower = more specialized routing
```
## Configuration Validation

### Test Your Configuration

Validate your configuration before starting:

```bash
# Test configuration syntax
python -c "import yaml; yaml.safe_load(open('config/config.yaml'))"

# Test the router with your config
make build
make run-router
```
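Beyond syntax, a quick cross-reference check can catch routing mistakes before the router starts. The following standalone script is a minimal sketch, not part of the repository; it assumes the key names shown in the structure above (`model_config`, `vllm_endpoints`, `categories`, `default_model`):

```python
# validate_config.py - minimal cross-reference check (sketch; assumes the
# config structure shown in this guide)
import sys
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

errors = []
models = set(cfg.get("model_config", {}))
endpoints = {e["name"] for e in cfg.get("vllm_endpoints", [])}

# The fallback model must have a model_config entry
if cfg.get("default_model") not in models:
    errors.append(f"default_model {cfg.get('default_model')!r} missing from model_config")

# Every model referenced by a category should be configured
for cat in cfg.get("categories", []):
    for ms in cat.get("model_scores", []):
        if ms["model"] not in models:
            errors.append(f"category {cat['name']!r} references unknown model {ms['model']!r}")

# Preferred endpoints must be defined under vllm_endpoints
for name, mc in cfg.get("model_config", {}).items():
    for ep in mc.get("preferred_endpoints", []):
        if ep not in endpoints:
            errors.append(f"model {name!r} prefers unknown endpoint {ep!r}")

print("\n".join(errors) or "config cross-references look consistent")
sys.exit(1 if errors else 0)
```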
### Common Configuration Patterns

**Multiple Models:**

```yaml
vllm_endpoints:
  - name: "math_endpoint"
    address: "math-server.com"
    port: 8000
    models: ["math-model"]
    weight: 1
  - name: "general_endpoint"
    address: "general-server.com"
    port: 8000
    models: ["general-model"]
    weight: 1

categories:
  - name: math
    model_scores:
      - model: math-model
        score: 1.0
  - name: other
    model_scores:
      - model: general-model
        score: 1.0
```
**Load Balancing:**

```yaml
vllm_endpoints:
  - name: "endpoint1"
    address: "server1.com"
    port: 8000
    models: ["my-model"]
    weight: 2 # Higher weight = more traffic
  - name: "endpoint2"
    address: "server2.com"
    port: 8000
    models: ["my-model"]
    weight: 1
```
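The weights are relative across endpoints serving the same model: assuming weight-proportional distribution, endpoint1 above receives roughly two-thirds of the requests and endpoint2 the remaining third.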
## Best Practices

### Security Configuration

For production environments:

```yaml
# Enable all security features
classifier:
  pii_model:
    threshold: 0.8 # Strict PII detection

prompt_guard:
  enabled: true # Enable jailbreak protection
  threshold: 0.7

model_config:
  "your-model":
    pii_policy:
      allow_by_default: false # Block PII by default
```
### Performance Tuning

For high-traffic scenarios:

```yaml
# Enable caching
semantic_cache:
  enabled: true
  similarity_threshold: 0.85 # Higher = more cache hits
  max_entries: 10000
  ttl_seconds: 3600

# Optimize classification
classifier:
  category_model:
    threshold: 0.7 # Balance accuracy vs speed
```
### Development vs Production

**Development:**

```yaml
# Relaxed settings for testing
classifier:
  category_model:
    threshold: 0.3 # Lower threshold for testing

prompt_guard:
  enabled: false # Disable for development

semantic_cache:
  enabled: false # Disable for consistent results
```

**Production:**

```yaml
# Strict settings for production
classifier:
  category_model:
    threshold: 0.7 # Higher threshold for accuracy

prompt_guard:
  enabled: true # Enable security

semantic_cache:
  enabled: true # Enable for performance
```
## Troubleshooting

### Common Issues

**Invalid YAML syntax:**

```bash
# Validate YAML syntax
python -c "import yaml; yaml.safe_load(open('config/config.yaml'))"
```

**Missing model files:**

```bash
# Check if models are downloaded
ls -la models/

# If missing, run: make download-models
```

**Endpoint connectivity:**

```bash
# Test your backend server
curl -f http://your-server:8000/health
```
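If the backend exposes an OpenAI-compatible API (as vLLM does), you can also list the models it serves; the names should match those in `vllm_endpoints` and `model_config`:

```bash
# List models served by an OpenAI-compatible backend (e.g. vLLM)
curl http://your-server:8000/v1/models
```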
**Configuration not taking effect:**

```bash
# Restart the router after config changes
make run-router
```
### Testing Configuration

```bash
# Test with different queries
make test-auto-prompt-reasoning    # Math query
make test-auto-prompt-no-reasoning # General query
make test-pii                      # PII detection
make test-prompt-guard             # Jailbreak protection
```
## Configuration Generation
The Semantic Router supports automated configuration generation based on model performance benchmarks. This workflow uses MMLU-Pro evaluation results to determine optimal model routing for different categories.
### Benchmarking Workflow

1. **Run MMLU-Pro Evaluation:**

   ```bash
   # Evaluate models using MMLU-Pro benchmark
   python src/training/model_eval/mmlu_pro_vllm_eval.py \
     --endpoint http://localhost:8000/v1 \
     --models phi4,gemma3:27b,mistral-small3.1 \
     --samples-per-category 5 \
     --use-cot \
     --concurrent-requests 4 \
     --output-dir results
   ```

2. **Generate Configuration:**

   ```bash
   # Generate config.yaml from benchmark results
   python src/training/model_eval/result_to_config.py \
     --results-dir results \
     --output-file config/config.yaml \
     --similarity-threshold 0.80
   ```
### Generated Configuration Features

The generated configuration includes:

- **Model Performance Rankings**: Models are ranked by performance for each category
- **Reasoning Settings**: Automatically configures reasoning requirements per category (see the sketch after this list):
  - `use_reasoning`: Whether to use step-by-step reasoning
  - `reasoning_description`: Description of the reasoning approach
  - `reasoning_effort`: Required effort level (low/medium/high)
- **Default Model Selection**: The best overall performing model is set as the default
- **Security and Performance Settings**: Pre-configured optimal values for:
  - PII detection thresholds
  - Semantic cache settings
  - Tool selection parameters
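For illustration, a generated category entry might look like the sketch below. The field names follow the structure used elsewhere in this guide and the list above; the model names, scores, and effort level are placeholders, since the actual values come from your benchmark results and `result_to_config.py` defaults:

```yaml
categories:
  - name: math
    use_reasoning: true
    reasoning_description: "Mathematical problems require step-by-step reasoning"
    reasoning_effort: high     # low/medium/high, as described above
    model_scores:
      - model: phi4            # placeholder; ranking comes from benchmark results
        score: 0.92
      - model: mistral-small3.1
        score: 0.85
```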
### Customizing Generated Config

The generated config.yaml can be customized by:

- Editing category-specific settings in `result_to_config.py`
- Adjusting thresholds and parameters via command line arguments
- Manually modifying the generated config.yaml
### Example Workflow

Here's a complete example workflow for generating and testing a configuration:

```bash
# Run MMLU-Pro evaluation
# Option 1: Specify models manually
python src/training/model_eval/mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --models phi4,gemma3:27b,mistral-small3.1 \
  --samples-per-category 5 \
  --use-cot \
  --concurrent-requests 4 \
  --output-dir results \
  --max-tokens 2048 \
  --temperature 0.0 \
  --seed 42

# Option 2: Auto-discover models from endpoint
python src/training/model_eval/mmlu_pro_vllm_eval.py \
  --endpoint http://localhost:8000/v1 \
  --samples-per-category 5 \
  --use-cot \
  --concurrent-requests 4 \
  --output-dir results \
  --max-tokens 2048 \
  --temperature 0.0 \
  --seed 42

# Generate initial config
python src/training/model_eval/result_to_config.py \
  --results-dir results \
  --output-file config/config.yaml \
  --similarity-threshold 0.80

# Test the generated config
make test
```
This workflow ensures your configuration is:
- Based on actual model performance
- Properly tested before deployment
- Version controlled for tracking changes
- Optimized for your specific use case
## Next Steps
- Installation Guide - Setup instructions
- Quick Start Guide - Basic usage examples
- API Documentation - Complete API reference
The configuration system is designed to be simple yet powerful. Start with the basic configuration and gradually enable advanced features as needed.