
OpenAI RAG Integration

This guide demonstrates how to use OpenAI's File Store and Vector Store APIs for RAG (Retrieval-Augmented Generation) in Semantic Router, following the OpenAI Responses API cookbook.

Overview

The OpenAI RAG backend integrates with OpenAI's File Store and Vector Store APIs to provide a first-class RAG experience. It supports two workflow modes:

  1. Direct Search Mode (default): Synchronous retrieval using the vector store search API
  2. Tool-Based Mode: Adds a file_search tool to the request (Responses API workflow)

Architecture

┌──────────────┐
│    Client    │
└──────┬───────┘
       │
       ▼
┌─────────────────────────────────────┐
│           Semantic Router           │
│  ┌───────────────────────────────┐  │
│  │          RAG Plugin           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │   OpenAI RAG Backend    │  │  │
│  │  └────────────┬────────────┘  │  │
│  └───────────────┼───────────────┘  │
└──────────────────┼──────────────────┘
                   │
                   ▼
┌─────────────────────────────────────┐
│             OpenAI API              │
│  ┌──────────────┐ ┌──────────────┐  │
│  │  File Store  │ │ Vector Store │  │
│  │     API      │ │     API      │  │
│  └──────────────┘ └──────────────┘  │
└─────────────────────────────────────┘

Prerequisites

  1. OpenAI API key with access to File Store and Vector Store APIs
  2. Files uploaded to OpenAI File Store
  3. Vector store created and populated with files (a sketch for items 2 and 3 follows this list)
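
If you still need to satisfy items 2 and 3, the sketch below uses the official openai Python SDK to upload a file and build a vector store. The file name and store name are placeholders, and older SDK releases expose these methods under client.beta.vector_stores rather than client.vector_stores.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Item 2: upload a document to the OpenAI File Store (purpose must be "assistants")
uploaded = client.files.create(
    file=open("deep_research_blog.pdf", "rb"),  # placeholder document
    purpose="assistants",
)

# Item 3: create a vector store and attach the uploaded file to it
store = client.vector_stores.create(name="semantic-router-docs")
client.vector_stores.files.create(vector_store_id=store.id, file_id=uploaded.id)

# Use this ID as vector_store_id in the router configuration below
print("vector_store_id:", store.id)

Indexing is asynchronous, so a file only becomes searchable once the vector store reports it as completed (see Troubleshooting).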

Configuration

Basic Configuration

Add the OpenAI RAG backend to your decision configuration:

decisions:
  - name: rag-openai-decision
    signals:
      - type: keyword
        keywords: ["research", "document", "knowledge"]
    plugins:
      rag:
        enabled: true
        backend: "openai"
        backend_config:
          vector_store_id: "vs_abc123"    # Your vector store ID
          api_key: "${OPENAI_API_KEY}"    # Or use environment variable
          max_num_results: 10
          workflow_mode: "direct_search"  # or "tool_based"

Advanced Configuration

rag:
  enabled: true
  backend: "openai"
  similarity_threshold: 0.7
  top_k: 10
  max_context_length: 5000
  injection_mode: "tool_role"  # or "system_prompt"
  on_failure: "skip"           # or "warn" or "block"
  cache_results: true
  cache_ttl_seconds: 3600
  backend_config:
    vector_store_id: "vs_abc123"
    api_key: "${OPENAI_API_KEY}"
    base_url: "https://api.openai.com/v1"  # Optional, defaults to OpenAI
    max_num_results: 10
    file_ids:  # Optional: restrict search to specific files
      - "file-123"
      - "file-456"
    filter:  # Optional: metadata filter
      category: "research"
      published_date: "2024-01-01"
    workflow_mode: "direct_search"  # or "tool_based"
    timeout_seconds: 30

Workflow Modes

1. Direct Search Mode (Default)

Synchronous retrieval using the vector store search API: context is retrieved and injected before the request is sent to the LLM.

Use Case: When you need immediate context injection and want to control the retrieval process.

Example:

backend_config:
  workflow_mode: "direct_search"
  vector_store_id: "vs_abc123"

Flow:

  1. User sends query
  2. RAG plugin calls the vector store search API (see the sketch after this list)
  3. Retrieved context is injected into the request
  4. Request is sent to the LLM with the context
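
Under the hood, direct search corresponds to OpenAI's vector store search endpoint. The sketch below approximates that retrieval step with the openai Python SDK; the router issues the call itself, so this is only illustrative, and the exact parameters it sends (filters, result limits) depend on your backend_config.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Roughly the retrieval the RAG plugin performs before forwarding the request
results = client.vector_stores.search(
    vector_store_id="vs_abc123",
    query="What is Deep Research?",
    max_num_results=10,
)

for item in results.data:
    # Each hit carries the source file and a relevance score
    print(item.filename, item.score)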

2. Tool-Based Mode (Responses API)

Adds a file_search tool to the request. The LLM calls the tool automatically, and the retrieved results appear in the response annotations.

Use Case: When using the Responses API and you want the LLM to decide when to search.

Example:

backend_config:
  workflow_mode: "tool_based"
  vector_store_id: "vs_abc123"

Flow:

  1. User sends query
  2. RAG plugin adds the file_search tool to the request
  3. Request is sent to the LLM
  4. LLM calls the file_search tool
  5. Results appear in the response annotations (see the sketch after this list)
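
For completeness, here is a sketch of how a client could read those annotations from a Responses API reply returned through the router (the file_search tool itself is added by the RAG plugin in this mode). The annotation field names follow the OpenAI Responses API file_search citations but are illustrative; consult the API reference for the authoritative schema.

import requests

response = requests.post(
    "http://localhost:8080/v1/responses",
    headers={"Content-Type": "application/json"},
    json={"model": "gpt-4o-mini", "input": "What is Deep Research?"},
)
body = response.json()

# Message output items carry cited sources in their content-part annotations
for item in body.get("output", []):
    if item.get("type") != "message":
        continue
    for part in item.get("content", []):
        for annotation in part.get("annotations", []):
            print(annotation.get("type"), annotation.get("file_id"))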

Usage Examples

Example 1: Basic RAG Query

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-VSR-Selected-Decision: rag-openai-decision" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {
        "role": "user",
        "content": "What is Deep Research?"
      }
    ]
  }'

Example 2: Responses API with file_search Tool

curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "What is Deep Research?",
    "tools": [
      {
        "type": "file_search",
        "file_search": {
          "vector_store_ids": ["vs_abc123"],
          "max_num_results": 5
        }
      }
    ]
  }'

Example 3: Python Client

import requests

# Direct search mode
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    headers={
        "Content-Type": "application/json",
        "X-VSR-Selected-Decision": "rag-openai-decision",
    },
    json={
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "user", "content": "What is Deep Research?"}
        ],
    },
)

result = response.json()
print(result["choices"][0]["message"]["content"])

File Store Operations

The OpenAI RAG backend includes a File Store client for managing files:

Upload File

import "github.com/vllm-project/semantic-router/src/semantic-router/pkg/openai"

client := openai.NewFileStoreClient("https://api.openai.com/v1", apiKey)
file, err := client.UploadFile(ctx, fileReader, "document.pdf", "assistants")

Create Vector Store

vectorStoreClient := openai.NewVectorStoreClient("https://api.openai.com/v1", apiKey)
store, err := vectorStoreClient.CreateVectorStore(ctx, &openai.CreateVectorStoreRequest{
    Name:    "my-vector-store",
    FileIDs: []string{"file-123", "file-456"},
})

Attach File to Vector Store

_, err := vectorStoreClient.CreateVectorStoreFile(ctx, "vs_abc123", "file-123")

Testing

Unit Tests

Run unit tests for OpenAI RAG:

cd src/semantic-router
go test ./pkg/openai/... -v
go test ./pkg/extproc/req_filter_rag_openai_test.go -v

E2E Tests

Run E2E tests based on the OpenAI cookbook:

# Python-based E2E test
python e2e/testing/08-rag-openai-test.py --base-url http://localhost:8080

# Go-based E2E test (requires Kubernetes cluster)
make e2e-test E2E_TESTS=rag-openai

OpenAI API Validation Test Suite

Validation tests ensure that the OpenAI API implementation (Files, Vector Stores, Search) stays compatible with upstream. They are adapted from openai-python/tests and run only when OPENAI_API_KEY is set.

Python E2E (contract validation against real API):

# From repo root; skips all tests if OPENAI_API_KEY is not set
OPENAI_API_KEY=sk-... python e2e/testing/09-openai-api-validation-test.py --verbose

# Optional: override API base URL
OPENAI_BASE_URL=https://api.openai.com/v1 OPENAI_API_KEY=sk-... python e2e/testing/09-openai-api-validation-test.py

Go integration (pkg/openai client against real API):

cd src/semantic-router
# Skips tests if OPENAI_API_KEY is not set
OPENAI_API_KEY=sk-... go test -tags=openai_validation ./pkg/openai -v

Tests cover: Files (list, upload, get, delete), Vector Stores (list, create, get, update, delete), Vector Store Files (list), and Vector Store Search (response schema).

Monitoring and Observability

The OpenAI RAG backend exposes the following metrics:

  • rag_retrieval_attempts_total{backend="openai", decision="...", status="success|error"}
  • rag_retrieval_latency_seconds{backend="openai", decision="..."}
  • rag_similarity_score{backend="openai", decision="..."}
  • rag_context_length_chars{backend="openai", decision="..."}
  • rag_cache_hits_total{backend="openai"}
  • rag_cache_misses_total{backend="openai"}

Tracing

OpenTelemetry spans are created for:

  • semantic_router.rag.retrieval - RAG retrieval operation
  • semantic_router.rag.context_injection - Context injection operation

Error Handling

The RAG plugin supports three failure modes:

  • skip (default): Continue without context and log a warning
  • warn: Continue and add a warning header
  • block: Return an error response (503)

rag:
  on_failure: "skip"  # or "warn" or "block"

Best Practices

  1. Use Direct Search for Synchronous Workflows: When you need immediate context injection
  2. Use Tool-Based for Responses API: When using Responses API and want LLM-controlled search
  3. Cache Results: Enable caching for frequently accessed queries
  4. Set Appropriate Timeouts: Configure timeout_seconds based on your vector store size
  5. Filter Results: Use file_ids or filter to narrow search scope
  6. Monitor Metrics: Track retrieval latency and similarity scores

Troubleshooting

No Results Found

  • Verify vector store ID is correct
  • Check that files are attached to the vector store
  • Ensure files have completed processing (check file_counts.completed; see the sketch below)
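
A quick way to verify the last two items is to inspect the vector store with the openai Python SDK, as sketched below (the store ID is a placeholder; older SDK releases expose these methods under client.beta.vector_stores).

from openai import OpenAI

client = OpenAI()

store = client.vector_stores.retrieve("vs_abc123")
print(store.file_counts)  # completed / in_progress / failed counts

# Confirm each attached file finished processing
for f in client.vector_stores.files.list(vector_store_id="vs_abc123"):
    print(f.id, f.status)  # expect status == "completed"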

High Latency

  • Reduce max_num_results
  • Enable result caching
  • Use file_ids to limit search scope

Authentication Errors

  • Verify API key is correct
  • Check API key has access to File Store and Vector Store APIs
  • Ensure base URL is correct (if using custom endpoint)

References