Installation
This guide will help you install and run the vLLM Semantic Router. The router runs entirely on CPU and does not require a GPU for inference.
System Requirements
Note: No GPU is required - the router runs efficiently on CPU using optimized BERT models.
Requirements:
- Python: 3.10 or higher
- Docker: Required for running the router container
- Optional: HuggingFace token (only for gated models)
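You can confirm these prerequisites before installing; the commands below are a minimal check (version strings will differ on your system):
# Check Python version (3.10 or higher)
python --version
# Check that Docker is installed and the daemon is running
docker --version
docker info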
Quick Start
1. Install vLLM Semantic Router
# Create a virtual environment (recommended)
python -m venv vsr
source vsr/bin/activate # On Windows: vsr\Scripts\activate
# Install from PyPI
pip install vllm-sr
Verify installation:
vllm-sr --version
2. Initialize Configuration
# Create config.yaml in current directory
vllm-sr init
This creates a config.yaml file with default settings.
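To see what was generated before editing, you can print the file:
# Inspect the default configuration
cat config.yaml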
3. Configure Your Backend
Edit the generated config.yaml to configure your model and backend endpoint:
providers:
  # Model configuration
  models:
    - name: "qwen/qwen3-1.8b"            # Model name
      endpoints:
        - name: "my_vllm"
          weight: 1
          endpoint: "localhost:8000"      # Domain or IP:port
          protocol: "http"                # http or https
          access_key: "your-token-here"   # Optional: for authentication

  # Default model for fallback
  default_model: "qwen/qwen3-1.8b"
Configuration Options:
- endpoint: Domain name or IP address with port (e.g., localhost:8000, api.openai.com)
- protocol: http or https
- access_key: Optional authentication token (Bearer token)
- weight: Load balancing weight (default: 1); see the multi-endpoint sketch after the examples below
Example: Local vLLM
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "local_vllm"
          weight: 1
          endpoint: "localhost:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"
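For this example to work, a vLLM server must already be listening on localhost:8000. A minimal way to start one, assuming a recent vLLM installation with the vllm serve entrypoint (substitute whatever model your backend actually serves for the identifier below), is:
# Start a local OpenAI-compatible vLLM server on port 8000
vllm serve qwen/qwen3-1.8b --port 8000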
Example: External API with HTTPS
providers:
  models:
    - name: "openai/gpt-4"
      endpoints:
        - name: "openai_api"
          weight: 1
          endpoint: "api.openai.com"
          protocol: "https"
          access_key: "sk-xxxxxx"
  default_model: "openai/gpt-4"
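Because every endpoint carries a weight, a single model can also list several endpoints. The sketch below assumes the router splits traffic across them in proportion to those weights (two replicas at a 2:1 ratio; the addresses are placeholders):
providers:
  models:
    - name: "qwen/qwen3-1.8b"
      endpoints:
        - name: "vllm_replica_a"
          weight: 2
          endpoint: "10.0.0.10:8000"
          protocol: "http"
        - name: "vllm_replica_b"
          weight: 1
          endpoint: "10.0.0.11:8000"
          protocol: "http"
  default_model: "qwen/qwen3-1.8b"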
4. Start the Router
vllm-sr serve
The router will:
- Automatically download required ML models (~1.5GB, one-time)
- Start Envoy proxy on port 8888
- Start the semantic router service
- Enable metrics on port 9190
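Once it reports ready, you can check that both listeners are up. The /metrics path below is an assumption that port 9190 exposes a standard Prometheus-style endpoint:
# Confirm the proxy port answers (any HTTP response means Envoy is listening)
curl -i http://localhost:8888/
# Scrape metrics (assumes a Prometheus-style /metrics endpoint on port 9190)
curl -s http://localhost:9190/metrics | head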
5. Test the Router
curl http://localhost:8888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [{"role": "user", "content": "Hello!"}]
}'
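If the response follows the usual OpenAI chat-completions shape (an assumption about the response body, not something this guide guarantees), you can extract which model the request was routed to and the reply text with jq:
# Requires jq; prints the routed model and the assistant reply
curl -s http://localhost:8888/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model": "MoM", "messages": [{"role": "user", "content": "Hello!"}]}' \
 | jq '{model: .model, reply: .choices[0].message.content}'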
Common Commands
# View logs
vllm-sr logs router # Router logs
vllm-sr logs envoy # Envoy logs
vllm-sr logs router -f # Follow logs
# Check status
vllm-sr status
# Stop the router
vllm-sr stop
Advanced Configuration
HuggingFace Settings
Set environment variables before starting:
export HF_ENDPOINT=https://huggingface.co # Or mirror: https://hf-mirror.com
export HF_TOKEN=your_token_here # Only for gated models
export HF_HOME=/path/to/cache # Custom cache directory
vllm-sr serve
Custom Options
# Use custom config file
vllm-sr serve --config my-config.yaml
# Use custom Docker image
vllm-sr serve --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest
# Control image pull policy
vllm-sr serve --image-pull-policy always
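These options can typically be combined in a single invocation; for example (pin whichever image tag you actually want in place of latest):
# Custom config, explicit image, always re-pull
vllm-sr serve --config my-config.yaml --image ghcr.io/vllm-project/semantic-router/vllm-sr:latest --image-pull-policy always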
Next Steps
- Configuration Guide - Advanced routing and signal configuration
- API Documentation - Complete API reference
- Tutorials - Learn by example
Getting Help
- Issues: GitHub Issues
- Community: Join the #semantic-router channel in vLLM Slack
- Documentation: vllm-semantic-router.com