Version: v0.1

In-Memory Semantic Cache

The in-memory cache backend stores semantic embeddings and cached responses directly in memory for fast local caching.

Overview

The in-memory cache stores all cache data in the application's memory, providing low-latency access without external dependencies.

Architecture

How It Works

Write Path

When caching a response:

  1. Generate embedding for the query using the configured embedding model
  2. Store the embedding and response in memory
  3. Apply TTL if configured
  4. Evict oldest/least-used entries if max_entries limit is reached
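
The write path above can be sketched in a few lines. This is an illustrative toy, not the router's actual implementation; `cache_put` and the FIFO `OrderedDict` layout are assumptions made for the example:

```python
import time
from collections import OrderedDict

def cache_put(cache, embed, query, response, max_entries=1000, ttl_seconds=3600):
    """Steps 1-4 of the write path: embed, store, stamp TTL, evict oldest (FIFO)."""
    embedding = embed(query)                          # 1. generate embedding
    expires_at = time.monotonic() + ttl_seconds       # 3. apply TTL
    cache[query] = (embedding, response, expires_at)  # 2. store in memory
    while len(cache) > max_entries:                   # 4. evict oldest entries
        cache.popitem(last=False)

cache = OrderedDict()
fake_embed = lambda text: [float(len(text))]  # stand-in for a real embedding model
cache_put(cache, fake_embed, "what is ml?", "ML is...", max_entries=2)
cache_put(cache, fake_embed, "what is ai?", "AI is...", max_entries=2)
cache_put(cache, fake_embed, "what is dl?", "DL is...", max_entries=2)
```

With `max_entries=2`, the third insert evicts the oldest entry, leaving the two most recent queries cached.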

Read Path

When searching for a cached response:

  1. Generate embedding for the incoming query
  2. Search in-memory cache for similar embeddings
  3. If similarity exceeds threshold, return cached response (cache hit)
  4. Otherwise, forward to LLM and cache the new response (cache miss)
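
The read path can be sketched as a linear scan with cosine similarity. Again a minimal illustration under assumed names (`cache_lookup`, hand-made 2-d "embeddings"), not the production code:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cache_lookup(cache, embed, query, threshold=0.8):
    """Steps 1-4 of the read path using linear search over cached embeddings."""
    q = embed(query)                            # 1. embed the incoming query
    best_sim, best_response = -1.0, None
    for embedding, response in cache.values():  # 2. scan for similar entries
        sim = cosine_similarity(q, embedding)
        if sim > best_sim:
            best_sim, best_response = sim, response
    if best_sim >= threshold:                   # 3. cache hit
        return best_response
    return None                                 # 4. miss: caller forwards to the LLM

# Toy cache with hand-made 2-d "embeddings"
cache = {"what is ml?": ([0.9, 0.1], "ML is...")}
embed = lambda q: [0.88, 0.12] if "ml" in q else [0.1, 0.9]
```

A semantically similar query ("explain ml") lands close to the cached embedding and hits; an unrelated query falls below the threshold and misses.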

Search Methods

The cache supports two search methods:

  • Linear Search: Compares query embedding against all cached embeddings
  • HNSW Index: Uses hierarchical graph structure for faster approximate nearest neighbor search

Configuration

Basic Configuration

# config/config.yaml
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8  # Global default threshold
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"

Configuration with HNSW

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"
  # HNSW index for faster search
  use_hnsw: true
  hnsw_m: 16
  hnsw_ef_construction: 200

Category-Level Configuration (New)

Configure cache settings per category for fine-grained control:

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8  # Global default
  max_entries: 1000
  ttl_seconds: 3600
  eviction_policy: "fifo"

categories:
  - name: health
    system_prompt: "You are a health expert..."
    semantic_cache_enabled: true
    semantic_cache_similarity_threshold: 0.95  # Very strict for medical accuracy
    model_scores:
      - model: your-model
        score: 0.5
        use_reasoning: false

  - name: general_chat
    system_prompt: "You are a helpful assistant..."
    semantic_cache_similarity_threshold: 0.75  # Relaxed for better hit rate
    model_scores:
      - model: your-model
        score: 0.7
        use_reasoning: false

  - name: troubleshooting
    # No cache settings - uses global default (0.8)
    model_scores:
      - model: your-model
        score: 0.7
        use_reasoning: false

Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable/disable semantic caching globally |
| backend_type | string | "memory" | Cache backend type (must be "memory") |
| similarity_threshold | float | 0.8 | Global minimum similarity for cache hits (0.0-1.0) |
| max_entries | integer | 1000 | Maximum number of cached entries |
| ttl_seconds | integer | 3600 | Time-to-live for cache entries (seconds, 0 = no expiration) |
| eviction_policy | string | "fifo" | Eviction policy: "fifo", "lru", "lfu" |
| use_hnsw | boolean | false | Enable HNSW index for similarity search |
| hnsw_m | integer | 16 | HNSW M parameter (bi-directional links per node) |
| hnsw_ef_construction | integer | 200 | HNSW efConstruction parameter (build quality) |

HNSW Indexing

The in-memory cache supports HNSW (Hierarchical Navigable Small World) indexing for significantly faster similarity search, especially beneficial with large cache sizes.

When to Use HNSW

  • Large cache sizes (>100 entries): HNSW provides logarithmic search time vs linear
  • High query throughput: Reduces CPU usage for similarity search
  • Production deployments: Better performance under load

HNSW Configuration

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.8
  max_entries: 10000  # Large cache benefits from HNSW
  ttl_seconds: 3600
  eviction_policy: "lru"
  use_hnsw: true  # Enable HNSW index
  hnsw_m: 16  # Default: 16 (higher = better recall, more memory)
  hnsw_ef_construction: 200  # Default: 200 (higher = better quality, slower build)

HNSW Parameters

  • hnsw_m: Number of bi-directional links created for each node in the graph

    • Lower values (8-12): Faster build, less memory, lower recall
    • Default (16): Balanced performance
    • Higher values (32-64): Better recall, more memory, slower build
  • hnsw_ef_construction: Size of dynamic candidate list during index construction

    • Lower values (100-150): Faster index building
    • Default (200): Good balance
    • Higher values (400-800): Better quality, slower build

Performance Comparison

| Cache Size | Linear Search | HNSW Search | Speedup |
| --- | --- | --- | --- |
| 100 entries | ~0.5ms | ~0.4ms | 1.25x |
| 1,000 entries | ~5ms | ~0.8ms | 6.25x |
| 10,000 entries | ~50ms | ~1.2ms | 41.7x |
| 100,000 entries | ~500ms | ~1.5ms | 333x |

Benchmarks on typical hardware with 384-dimensional embeddings

Category-Level Configuration Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| semantic_cache_enabled | boolean | (inherits global) | Enable/disable caching for this category |
| semantic_cache_similarity_threshold | float | (inherits global) | Category-specific similarity threshold (0.0-1.0) |

Category-level settings override global settings. If not specified, the category uses the global cache configuration.

Decision-Level Configuration (Plugin-Based)

Configure semantic cache at the decision level using plugins for fine-grained control:

signals:
  domains:
    - name: "math"
      description: "Mathematical queries"
      mmlu_categories: ["math"]

decisions:
  - name: math_route
    description: "Route math queries with strict caching"
    priority: 100
    rules:
      operator: "AND"
      conditions:
        - type: "domain"
          name: "math"
    modelRefs:
      - model: "openai/gpt-oss-120b"
        use_reasoning: true
    plugins:
      - type: "semantic-cache"
        configuration:
          enabled: true
          similarity_threshold: 0.95  # Very strict for math accuracy

  - name: general_route
    description: "General queries with relaxed caching"
    priority: 50
    rules:
      operator: "AND"
      conditions:
        - type: "domain"
          name: "other"
    modelRefs:
      - model: "openai/gpt-oss-120b"
        use_reasoning: false
    plugins:
      - type: "semantic-cache"
        configuration:
          enabled: true
          similarity_threshold: 0.75  # Relaxed for better hit rate

Plugin Configuration Options:

  • enabled: Enable/disable caching for this decision (boolean)
  • similarity_threshold: Decision-specific similarity threshold (0.0-1.0)

Decision-level plugin settings override both global and category-level settings.
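
The precedence chain (decision plugin > category > global) can be expressed as a small resolver. This is a sketch of the documented inheritance semantics; the function name is illustrative:

```python
from typing import Optional

def resolve_threshold(decision: Optional[float] = None,
                      category: Optional[float] = None,
                      global_default: float = 0.8) -> float:
    """Most specific setting wins: decision plugin, then category, then global."""
    for value in (decision, category):
        if value is not None:
            return value
    return global_default
```

For example, a math decision with a 0.95 plugin threshold wins over any category or global value, while a category with no setting of its own falls through to the global default.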

Environment Examples

Development Environment

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.9  # Strict matching for testing
  max_entries: 500  # Small cache for development
  ttl_seconds: 1800  # 30 minutes
  eviction_policy: "fifo"
  use_hnsw: false  # Optional for small dev cache

Production Environment with HNSW

semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.85
  max_entries: 50000  # Large production cache
  ttl_seconds: 7200  # 2 hours
  eviction_policy: "lru"
  use_hnsw: true  # Enable for production
  hnsw_m: 16
  hnsw_ef_construction: 200

Setup and Testing

Enable In-Memory Cache

Update your configuration file:

# Append to config/config.yaml (remove any existing semantic_cache block first
# to avoid duplicate keys)
cat >> config/config.yaml << EOF
semantic_cache:
  enabled: true
  backend_type: "memory"
  similarity_threshold: 0.85
  max_entries: 1000
  ttl_seconds: 3600
EOF

Start the Router

# Start the semantic router
make run-router

# Or run directly
./bin/router --config config/config.yaml

Test Cache Functionality

Send requests to verify cache behavior:

# First request (cache miss)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Second identical request (cache hit)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }'

# Similar request (semantic cache hit)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Explain machine learning concepts"}]
  }'

Characteristics

Storage

  • Data is stored in application memory
  • Cache is cleared when the application restarts
  • Limited by available system memory

Access Pattern

  • Direct memory access without network overhead
  • No external dependencies required

Eviction Policies

  • FIFO: First In, First Out - removes oldest entries
  • LRU: Least Recently Used - removes least recently accessed entries
  • LFU: Least Frequently Used - removes least frequently accessed entries
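
The three policies can be contrasted in one toy cache. This is a simplified sketch of the semantics above (the `EvictingCache` class is hypothetical, not the router's code):

```python
from collections import Counter, OrderedDict

class EvictingCache:
    """Toy cache illustrating the fifo, lru, and lfu eviction policies."""
    def __init__(self, max_entries, policy="fifo"):
        self.max_entries, self.policy = max_entries, policy
        self.data = OrderedDict()
        self.hits = Counter()  # access counts, used by lfu

    def get(self, key):
        if key in self.data:
            self.hits[key] += 1
            if self.policy == "lru":
                self.data.move_to_end(key)  # mark as most recently used
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value
        self.hits[key] += 0  # ensure a counter entry exists
        if len(self.data) > self.max_entries:
            if self.policy == "lfu":
                victim = min(self.data, key=lambda k: self.hits[k])
            else:  # fifo and lru both evict the front of the ordered dict
                victim = next(iter(self.data))
            del self.data[victim]
            del self.hits[victim]
```

With capacity 2, inserting a third entry evicts: the oldest entry under fifo, the least recently touched under lru, and the least accessed under lfu.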

TTL Management

  • Entries can have a time-to-live (TTL)
  • Expired entries are removed during cleanup operations
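
A cleanup sweep like the one described might look as follows. The entry layout `(response, stored_at, ttl_seconds)` is an assumption for illustration, and a TTL of 0 is treated as "never expires", matching the configuration table:

```python
def purge_expired(cache, now):
    """Remove entries whose TTL has lapsed; a ttl of 0 disables expiry.
    Each value is (response, stored_at, ttl_seconds). Returns count removed."""
    expired = [key for key, (_, stored_at, ttl) in cache.items()
               if ttl > 0 and now - stored_at > ttl]
    for key in expired:
        del cache[key]
    return len(expired)
```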

Next Steps