# Agentic Memory
## Executive Summary
This document describes a proof of concept (POC) for Agentic Memory in the Semantic Router. Agentic Memory enables AI agents to remember information across sessions, providing continuity and personalization.
⚠️ POC Scope: This is a proof of concept, not a production design. The goal is to validate the core memory flow (retrieve → inject → extract → store) with acceptable accuracy. Production hardening (error handling, scaling, monitoring) is out of scope.
## Core Capabilities
| Capability | Description |
|---|---|
| Memory Retrieval | Embedding-based search with simple pre-filtering |
| Memory Saving | LLM-based extraction of facts and procedures |
| Cross-Session Persistence | Memories stored in Milvus (survives restarts; production backup/HA not tested) |
| User Isolation | Memories scoped per user_id (see note below) |
⚠️ User Isolation - Milvus Performance Note:
| Approach | POC | Production (10K+ users) |
|---|---|---|
| Simple filter | ✅ Filter by user_id after search | ❌ Degrades: searches all users, then filters |
| Partition Key | ❌ Overkill | ✅ Physical separation, O(log N) per user |
| Scalar Index | ❌ Overkill | ✅ Index on user_id for fast filtering |

POC: Uses simple metadata filtering (sufficient for testing).
Production: Configure user_id as a Partition Key or scalar indexed field in the Milvus schema.
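For the POC's simple-filter approach, user scoping reduces to building a Milvus boolean filter expression on the user_id field. The sketch below shows one way to do this; buildUserFilter is a hypothetical helper name, not an existing function in the codebase:

```go
package main

import (
	"fmt"
	"strings"
)

// buildUserFilter builds a Milvus boolean filter expression that scopes a
// vector search to one user. In the POC this is applied as a plain metadata
// filter after the ANN search; production would pair it with a Partition Key
// or a scalar index on user_id.
func buildUserFilter(userID string) string {
	// Escape embedded quotes so a crafted user_id cannot break out of the expression.
	escaped := strings.ReplaceAll(userID, `"`, `\"`)
	return fmt.Sprintf(`user_id == "%s"`, escaped)
}

func main() {
	fmt.Println(buildUserFilter("alice")) // user_id == "alice"
}
```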
## Key Design Principles
- Simple pre-filter decides if query should search memory
- Context window from history for query disambiguation
- LLM extracts facts and classifies type when saving
- Threshold-based filtering on search results
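The pre-filter and threshold principles can be sketched as follows. Both function names are illustrative, and the keyword check merely stands in for the existing Fact/Tool classifier signals that the real pipeline reuses:

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSearchMemory decides whether a query warrants a memory lookup.
// The real pre-filter consumes the Fact/Tool classifier signals; the
// keyword scan here is a placeholder for those signals.
func shouldSearchMemory(query string, isFactQuery bool) bool {
	if isFactQuery {
		return true
	}
	for _, kw := range []string{"my ", "remember", "last time"} {
		if strings.Contains(strings.ToLower(query), kw) {
			return true
		}
	}
	return false
}

// filterByThreshold keeps only search hits at or above the similarity
// threshold (0.6 is the POC starting point, tunable from retrieval logs).
func filterByThreshold(scores []float64, threshold float64) []float64 {
	var kept []float64
	for _, s := range scores {
		if s >= threshold {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	fmt.Println(shouldSearchMemory("What's my budget for the trip?", false)) // true
	fmt.Println(filterByThreshold([]float64{0.82, 0.44, 0.61}, 0.6))
}
```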
## Explicit Assumptions (POC)
| Assumption | Implication | Risk if Wrong |
|---|---|---|
| LLM extraction is reasonably accurate | Some incorrect facts may be stored | Memory contamination (fixable via Forget API) |
| 0.6 similarity threshold is a starting point | May need tuning (miss relevant or include irrelevant) | Adjustable based on retrieval quality logs |
| Milvus is available and configured | Feature disabled if down | Graceful degradation (no crash) |
| Embedding model produces 384-dim vectors | Must match Milvus schema | Startup failure (detectable) |
| History available via Response API chain | Required for context | Skip memory if unavailable |
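The "384-dim vectors" assumption is detectable at startup, as the table notes. A minimal sketch of such a startup check (checkEmbeddingDim is a hypothetical name):

```go
package main

import "fmt"

// checkEmbeddingDim validates the embedding-dimension assumption at startup:
// if the model's output dimension does not match the Milvus schema, the
// feature should fail fast rather than corrupt inserts at runtime.
func checkEmbeddingDim(modelDim, schemaDim int) error {
	if modelDim != schemaDim {
		return fmt.Errorf("embedding dim %d does not match Milvus schema dim %d", modelDim, schemaDim)
	}
	return nil
}

func main() {
	fmt.Println(checkEmbeddingDim(384, 384) == nil) // true: dims match
	fmt.Println(checkEmbeddingDim(768, 384) == nil) // false: mismatch caught at startup
}
```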
## Table of Contents
- Problem Statement
- Architecture Overview
- Memory Types
- Pipeline Integration
- Memory Retrieval
- Memory Saving
- Memory Operations
- Data Structures
- API Extension
- Configuration
- Failure Modes and Fallbacks
- Success Criteria
- Implementation Plan
- Future Enhancements
## 1. Problem Statement
### Current State
The Response API provides conversation chaining via previous_response_id, but knowledge is lost across sessions:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Saved in session chain
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ No previous_response_id → Knowledge LOST ❌
### Desired State
With Agentic Memory:
Session A (March 15):
User: "My budget for the Hawaii trip is $10,000"
→ Extracted and saved to Milvus
Session B (March 20) - NEW SESSION:
User: "What's my budget for the trip?"
→ Pre-filter: memory-relevant ✓
→ Search Milvus → Found: "budget for Hawaii is $10K"
→ Inject into LLM context
→ Assistant: "Your budget for the Hawaii trip is $10,000!" ✅
## 2. Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ AGENTIC MEMORY ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ExtProc Pipeline │
│  │  ┌──────────────────────────────────────────────────────────────────┐  │
│ │ │ │
│ │ Request → Fact? → Tool? → Security → Cache → MEMORY → LLM │ │
│ │ │ │ ↑↓ │ │
│ │ └───────┴──── signals used ────────┘ │ │
│ │ │ │
│ │ Response ← [extract & store] ←─────────────────┘ │ │
│ │ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────┴─────────────────────┐ │
│ │ │ │
│ ┌─────────▼─────────┐ ┌────────────▼───┐ │
│ │ Memory Retrieval │ │ Memory Saving │ │
│ │ (request phase) │ │(response phase)│ │
│ ├───────────────────┤ ├────────────────┤ │
│ │ 1. Check signals │ │ 1. LLM extract │ │
│ │ (Fact? Tool?) │ │ 2. Classify │ │
│ │ 2. Build context │ │ 3. Deduplicate │ │
│ │ 3. Milvus search │ │ 4. Store │ │
│ │ 4. Inject to LLM │ │ │ │
│ └─────────┬─────────┘ └────────┬───────┘ │
│ │ │ │
│ │ ┌──────────────┐ │ │
│ └────────►│ Milvus │◄─────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
### Component Responsibilities
| Component | Responsibility | Location |
|---|---|---|
| Memory Filter | Decision + search + inject | pkg/extproc/req_filter_memory.go |
| Memory Extractor | LLM-based fact extraction | pkg/memory/extractor.go (new) |
| Memory Store | Storage interface | pkg/memory/store.go |
| Milvus Store | Vector database backend | pkg/memory/milvus_store.go |
| Existing Classifiers | Fact/Tool signals (reused) | pkg/extproc/processor_req_body.go |
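The saving path's deduplication step (step 3 in the diagram above) is not specified in detail. One lightweight approach is to hash normalized fact text per user, sketched below; dedupKey is a hypothetical helper, and a production design might instead deduplicate by embedding similarity:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// dedupKey normalizes an extracted fact and hashes it together with the
// user ID, so the same fact restated across turns maps to one storage key.
func dedupKey(userID, content string) string {
	norm := strings.Join(strings.Fields(strings.ToLower(content)), " ")
	sum := sha256.Sum256([]byte(userID + "|" + norm))
	return hex.EncodeToString(sum[:8])
}

func main() {
	a := dedupKey("u1", "Budget for Hawaii is $10K")
	b := dedupKey("u1", "  budget for hawaii is $10k ")
	fmt.Println(a == b) // true: same fact after normalization
}
```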
### Storage Architecture
Issue #808 suggests a multi-layer storage architecture. We implement this incrementally:
┌─────────────────────────────────────────────────────────────────────────┐
│ STORAGE ARCHITECTURE (Phased) │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 1 (MVP) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Milvus (Vector Index) │ │ │
│ │ │ • Semantic search over memories │ │ │
│ │ │ • Embedding storage │ │ │
│ │ │ • Content + metadata │ │ │
│  │  │  └─────────────────────────────────────────────────────────┘  │  │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 2 (Performance) │ │
│ │ ┌─────────────────────────────────────────────────────────┐ │ │
│ │ │ Redis (Hot Cache) │ │ │
│ │ │ • Fast metadata lookup │ │ │
│ │ │ • Recently accessed memories │ │ │
│ │ │ • TTL/expiration support │ │ │
│ │ └─────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ PHASE 3+ (If Needed) │ │
│ │ ┌───────────────────────┐ ┌───────────────────────┐ │ │
│ │ │ Graph Store (Neo4j) │ │ Time-Series Index │ │ │
│ │ │ • Memory links │ │ • Temporal queries │ │ │
│ │ │ • Relationships │ │ • Decay scoring │ │ │
│ │ └───────────────────────┘ └───────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
| Layer | Purpose | When Needed | Status |
|---|---|---|---|
| Milvus | Semantic vector search | Core functionality | ✅ MVP |
| Redis | Hot cache, fast access, TTL | Performance optimization | 🔶 Phase 2 |
| Graph (Neo4j) | Memory relationships | Multi-hop reasoning queries | ⚪ If needed |
| Time-Series | Temporal queries, decay | Importance scoring by time | ⚪ If needed |
Design Decision: We start with Milvus only. Additional layers are added based on demonstrated need, not speculation. The Store interface abstracts storage, allowing backends to be added without changing the retrieval/saving logic.
## 3. Memory Types
| Type | Purpose | Example | Status |
|---|---|---|---|
| Semantic | Facts, preferences, knowledge | "User's budget for Hawaii is $10,000" | ✅ MVP |
| Procedural | How-to, steps, processes | "To deploy payment-service: run npm build, then docker push" | ✅ MVP |
| Episodic | Session summaries, past events | "On Dec 29 2024, user planned Hawaii vacation with $10K budget" | ⚠️ MVP (limited) |
| Reflective | Self-analysis, lessons learned | "Previous budget response was incomplete - user prefers detailed breakdowns" | 🔮 Future |
⚠️ Episodic Memory (MVP Limitation): Session-end detection is not implemented. Episodic memories are only created when the LLM extraction explicitly produces a summary-style output. Reliable session-end triggers are deferred to Phase 2.
🔮 Reflective Memory: Self-analysis and lessons learned. Not in scope for this POC. See Appendix A.
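Since type is stored as metadata alongside the vector (see below), it can be modeled as a simple string-backed enum. Names here are illustrative, not the actual pkg/memory definitions:

```go
package main

import "fmt"

// MemoryType labels a stored memory. It travels as metadata on the vector
// record rather than as a separate collection per type.
type MemoryType string

const (
	Semantic   MemoryType = "semantic"   // facts, preferences, knowledge
	Procedural MemoryType = "procedural" // how-to steps, processes
	Episodic   MemoryType = "episodic"   // session summaries (limited in MVP)
	// Reflective is deliberately absent: out of scope for this POC.
)

func main() {
	fmt.Println(Semantic, Procedural, Episodic)
}
```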
### Memory Vector Space
Memories cluster by content/topic, not by type. Type is metadata:
┌────────────────────────────────────────────────────────────────────────┐
│ MEMORY VECTOR SPACE │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ BUDGET/MONEY │ │ DEPLOYMENT │ │
│ │ CLUSTER │ │ CLUSTER │ │
│ │ │ │ │ │
│ │ ● budget=$10K │ │ ● npm build │ │
│ │ (semantic) │ │ (procedural) │ │
│ │ ● cost=$5K │ │ ● docker push │ │
│ │ (semantic) │ │ (procedural) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ ● = memory with type as metadata │
│ Query matches content → type comes from matched memory │
│ │
└────────────────────────────────────────────────────────────────────────┘