
OpenAI RAG Integration

This guide demonstrates how to use OpenAI's File Store and Vector Store APIs for RAG (Retrieval-Augmented Generation) in Semantic Router, following the OpenAI Responses API cookbook.

Overview

The OpenAI RAG backend integrates with OpenAI's File Store and Vector Store APIs to provide a first-class RAG experience. It supports two workflow modes:

  1. Direct Search Mode (default): Synchronous retrieval using the vector store search API
  2. Tool-Based Mode: Adds a file_search tool to the request (Responses API workflow)

Architecture

┌──────────────┐
│    Client    │
└──────┬───────┘
       │
       ▼
┌─────────────────────────────────────┐
│           Semantic Router           │
│  ┌───────────────────────────────┐  │
│  │          RAG Plugin           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │   OpenAI RAG Backend    │  │  │
│  │  └────────────┬────────────┘  │  │
│  └───────────────┼───────────────┘  │
└──────────────────┼──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│             OpenAI API              │
│  ┌──────────────┐ ┌──────────────┐  │
│  │  File Store  │ │ Vector Store │  │
│  │     API      │ │     API      │  │
│  └──────────────┘ └──────────────┘  │
└─────────────────────────────────────┘

Prerequisites

  1. OpenAI API key with access to File Store and Vector Store APIs
  2. Files uploaded to OpenAI File Store
  3. Vector store created and populated with files (a sketch for items 2 and 3 follows this list)
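
If you still need to satisfy items 2 and 3, the sketch below uses the official openai Python SDK to upload a file and build a vector store. The file name and store name are placeholders, and older SDK releases expose these methods under client.beta.vector_stores rather than client.vector_stores.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Item 2: upload a document to the OpenAI File Store (purpose must be "assistants")
uploaded = client.files.create(
    file=open("deep_research_blog.pdf", "rb"),  # placeholder document
    purpose="assistants",
)

# Item 3: create a vector store and attach the uploaded file to it
store = client.vector_stores.create(name="semantic-router-docs")
client.vector_stores.files.create(vector_store_id=store.id, file_id=uploaded.id)

# Use this ID as vector_store_id in the router configuration below
print("vector_store_id:", store.id)

Indexing is asynchronous, so a file only becomes searchable once the vector store reports it as completed (see Troubleshooting).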

Configuration

Basic Configuration

Add the OpenAI RAG backend to your decision configuration:

decisions:
  - name: rag-openai-decision
    signals:
      - type: keyword
        keywords: ["research", "document", "knowledge"]
    plugins:
      rag:
        enabled: true
        backend: "openai"
        backend_config:
          vector_store_id: "vs_abc123"    # Your vector store ID
          api_key: "${OPENAI_API_KEY}"    # Or use environment variable
          max_num_results: 10
          workflow_mode: "direct_search"  # or "tool_based"

Advanced Configuration

rag:
  enabled: true
  backend: "openai"
  similarity_threshold: 0.7
  top_k: 10
  max_context_length: 5000
  injection_mode: "tool_role"  # or "system_prompt"
  on_failure: "skip"           # or "warn" or "block"
  cache_results: true
  cache_ttl_seconds: 3600
  backend_config:
    vector_store_id: "vs_abc123"
    api_key: "${OPENAI_API_KEY}"
    base_url: "https://api.openai.com/v1"  # Optional, defaults to OpenAI
    max_num_results: 10
    file_ids:  # Optional: restrict search to specific files
      - "file-123"
      - "file-456"
    filter:  # Optional: metadata filter
      category: "research"
      published_date: "2024-01-01"
    workflow_mode: "direct_search"  # or "tool_based"
    timeout_seconds: 30

Workflow Modes

1. Direct Search Mode (Default)

Synchronous retrieval using the vector store search API: context is retrieved and injected before the request is sent to the LLM.

Use Case: When you need immediate context injection and want to control the retrieval process.

Example:

backend_config:
  workflow_mode: "direct_search"
  vector_store_id: "vs_abc123"

Flow:

  1. User sends query
  2. RAG plugin calls the vector store search API (see the sketch after this list)
  3. Retrieved context is injected into the request
  4. Request is sent to the LLM with the context
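
Under the hood, direct search corresponds to OpenAI's vector store search endpoint. The sketch below approximates that retrieval step with the openai Python SDK; the router issues the call itself, so this is only illustrative, and the exact parameters it sends (filters, result limits) depend on your backend_config.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Roughly the retrieval the RAG plugin performs before forwarding the request
results = client.vector_stores.search(
    vector_store_id="vs_abc123",
    query="What is Deep Research?",
    max_num_results=10,
)

for item in results.data:
    # Each hit carries the source file and a relevance score
    print(item.filename, item.score)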

2. Tool-Based Mode (Responses API)

Adds a file_search tool to the request. The LLM calls the tool automatically, and the retrieved results appear in the response annotations.

Use Case: When using the Responses API and you want the LLM to decide when to search.

Example:

backend_config:
  workflow_mode: "tool_based"
  vector_store_id: "vs_abc123"

Flow:

  1. User sends query
  2. RAG plugin adds the file_search tool to the request
  3. Request is sent to the LLM
  4. LLM calls the file_search tool
  5. Results appear in the response annotations (see the sketch after this list)
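
For completeness, here is a sketch of how a client could read those annotations from a Responses API reply returned through the router (the file_search tool itself is added by the RAG plugin in this mode). The annotation field names follow the OpenAI Responses API file_search citations but are illustrative; consult the API reference for the authoritative schema.

import requests

response = requests.post(
    "http://localhost:8080/v1/responses",
    headers={"Content-Type": "application/json"},
    json={"model": "gpt-4o-mini", "input": "What is Deep Research?"},
)
body = response.json()

# Message output items carry cited sources in their content-part annotations
for item in body.get("output", []):
    if item.get("type") != "message":
        continue
    for part in item.get("content", []):
        for annotation in part.get("annotations", []):
            print(annotation.get("type"), annotation.get("file_id"))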

Usage Examples

Example 1: Basic RAG Query

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-VSR-Selected-Decision: rag-openai-decision" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "What is Deep Research?"
      }
    ]
  }'

Example 2: Responses API with file_search Tool

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "What is Deep Research?",
    "tools": [
      {
        "type": "file_search",
        "file_search": {
          "vector_store_ids": ["vs_abc123"],
          "max_num_results": 5
        }
      }
    ]
  }'

Example 3: Python Client

import requests

# Direct search mode
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-VSR-Selected-Decision": "rag-openai-decision",
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "user", "content": "What is Deep Research?"}
        ],
    },
)

result = response.json()
print(result["choices"][0]["message"]["content"])

File Store Operations

The OpenAI RAG backend includes a File Store client for managing files:

Upload File

import "github.com/vllm-project/semantic-router/src/semantic-router/pkg/openai"

client := openai.NewFileStoreClient("https://api.openai.com/v1", apiKey)
file, err := client.UploadFile(ctx, fileReader, "document.pdf", "assistants")

Create Vector Store

vectorStoreClient := openai.NewVectorStoreClient("https://api.openai.com/v1", apiKey)
store, err := vectorStoreClient.CreateVectorStore(ctx, &openai.CreateVectorStoreRequest{
    Name:    "my-vector-store",
    FileIDs: []string{"file-123", "file-456"},
})

Attach File to Vector Store

_, err := vectorStoreClient.CreateVectorStoreFile(ctx, "vs_abc123", "file-123")

Testing

Unit Tests

Run unit tests for OpenAI RAG:

cd src/semantic-router
go test ./pkg/openai/... -v
go test ./pkg/extproc/req_filter_rag_openai_test.go -v

E2E Tests

Run E2E tests based on the OpenAI cookbook:

# Python-based E2E test
python e2e/testing/08-rag-openai-test.py --base-url http://localhost:8080

# Go-based E2E test (requires Kubernetes cluster)
make e2e-test E2E_TESTS=rag-openai

OpenAI API Validation Test Suite

Validation tests ensure that the OpenAI API implementation (Files, Vector Stores, Search) stays compatible with upstream. They are adapted from openai-python/tests and run only when OPENAI_API_KEY is set.

Python E2E (contract validation against real API):

# From repo root; skips all tests if OPENAI_API_KEY is not set
OPENAI_API_KEY=sk-... python e2e/testing/09-openai-api-validation-test.py --verbose

# Optional: override API base URL
OPENAI_BASE_URL=https://api.openai.com/v1 OPENAI_API_KEY=sk-... python e2e/testing/09-openai-api-validation-test.py

Go integration (pkg/openai client against real API):

cd src/semantic-router
# Skips tests if OPENAI_API_KEY is not set
OPENAI_API_KEY=sk-... go test -tags=openai_validation ./pkg/openai -v

Tests cover: Files (list, upload, get, delete), Vector Stores (list, create, get, update, delete), Vector Store Files (list), and Vector Store Search (response schema).

Monitoring and Observability

The OpenAI RAG backend exposes the following metrics:

  • rag_retrieval_attempts_total{backend="openai", decision="...", status="success|error"}
  • rag_retrieval_latency_seconds{backend="openai", decision="..."}
  • rag_similarity_score{backend="openai", decision="..."}
  • rag_context_length_chars{backend="openai", decision="..."}
  • rag_cache_hits_total{backend="openai"}
  • rag_cache_misses_total{backend="openai"}

Tracing

OpenTelemetry spans are created for:

  • semantic_router.rag.retrieval - RAG retrieval operation
  • semantic_router.rag.context_injection - Context injection operation

Error Handling

The RAG plugin supports three failure modes:

  • skip (default): Continue without context and log a warning
  • warn: Continue and add a warning header
  • block: Return an error response (503)

rag:
  on_failure: "skip"  # or "warn" or "block"

Best Practices

  1. Use Direct Search for Synchronous Workflows: When you need immediate context injection
  2. Use Tool-Based for Responses API: When using Responses API and want LLM-controlled search
  3. Cache Results: Enable caching for frequently accessed queries
  4. Set Appropriate Timeouts: Configure timeout_seconds based on your vector store size
  5. Filter Results: Use file_ids or filter to narrow search scope
  6. Monitor Metrics: Track retrieval latency and similarity scores

Troubleshooting

No Results Found

  • Verify vector store ID is correct
  • Check that files are attached to the vector store
  • Ensure files have completed processing (check file_counts.completed; see the sketch below)
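
A quick way to verify the last two items is to inspect the vector store with the openai Python SDK, as sketched below (the store ID is a placeholder; older SDK releases expose these methods under client.beta.vector_stores).

from openai import OpenAI

client = OpenAI()

store = client.vector_stores.retrieve("vs_abc123")
print(store.file_counts)  # completed / in_progress / failed counts

# Confirm each attached file finished processing
for f in client.vector_stores.files.list(vector_store_id="vs_abc123"):
    print(f.id, f.status)  # expect status == "completed"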

High Latency

  • Reduce max_num_results
  • Enable result caching
  • Use file_ids to limit search scope

Authentication Errors

  • Verify API key is correct
  • Check API key has access to File Store and Vector Store APIs
  • Ensure base URL is correct (if using custom endpoint)

References