ModernBERT-base-32k Performance Benchmark Results
This tutorial provides benchmark results and performance tuning guidance for ModernBERT-base-32k integration. Use these results to provision hardware and adjust workload expectations for your deployment.
Overviewâ
ModernBERT-base-32k extends the context window from 512 tokens (BERT-base) to 32,768 tokens, enabling processing of long documents and conversations. This guide presents empirical benchmark results from comprehensive testing across different context lengths and concurrency levels.
Test Environment:
- GPU: NVIDIA L4 (23GB VRAM)
- Flash Attention 2: Enabled
- Model:
llm-semantic-router/modernbert-base-32k - Test Tool:
candle-binding/examples/benchmark_concurrent.rs
Benchmark Resultsâ
Single Request Latency (C=1)â
| Context Length | Mean Latency | p50 Latency | p95 Latency | p99 Latency | Status |
|---|---|---|---|---|---|
| 1,024 tokens | 90.98ms | 94.18ms | 94.24ms | 94.24ms | Pass |
| 4,096 tokens | 899.87ms | 955.05ms | 955.93ms | 955.93ms | Pass |
| 8,192 tokens | 3,299.92ms | 3,524.62ms | 3,526.34ms | 3,526.34ms | Pass |
Notes:
- 1K tokens: Stable performance with mean â p50
- 4K and 8K tokens: Stable performance with mean â p50
Concurrent Requests (C=10)â
| Context Length | Mean Latency | p50 Latency | p95 Latency | Success Rate | Status |
|---|---|---|---|---|---|
| 1,024 tokens | 1,001.22ms | 970.65ms | 1,379.32ms | 100% | Pass |
| 4,096 tokens | 9,323.45ms | 9,389.28ms | 10,349.11ms | 93% | Partial |
| 8,192 tokens | N/A | N/A | N/A | 0% | Fail |
Notes:
- 1K tokens: Excellent performance with 100% success rate
- 4K tokens: 93% success rate (7 OOM errors out of 100 requests)
- 8K tokens: Failed due to insufficient GPU memory
High Concurrency (C=50, C=100)â
All high concurrency tests (C=50+) failed due to hardware limitations. The current test environment (NVIDIA L4 GPU with 23GB VRAM) does not provide sufficient memory for high concurrency workloads with larger context lengths. Testing high concurrency (C=50+) requires a GPU with 40GB+ VRAM (e.g., NVIDIA A100) as documented in the Big Batch Test Plan.
Hardware Provisioning Guideâ
Minimum Requirementsâ
| Context Length | GPU VRAM | System RAM | Recommended GPU |
|---|---|---|---|
| ⤠1K tokens | âĨ 5GB | âĨ 16GB | NVIDIA T4, L4 |
| ⤠4K tokens | âĨ 10GB | âĨ 32GB | NVIDIA L4, A10G |
| ⤠8K tokens | âĨ 23GB | âĨ 32GB | NVIDIA L4, A10G |
| 16K+ tokens | âĨ 40GB | âĨ 64GB | NVIDIA A100 |
Recommended Configurationâ
For Production (1K-8K tokens):
- GPU: NVIDIA L4 (23GB VRAM) or better
- System RAM: 32GB+
- CUDA: Version 12.0+
- Flash Attention 2: Enabled (provides 1.75x-11.9x speedup)
For Long Context (16K-32K tokens):
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- System RAM: 64GB+
- See Long Context Test Plan for details
Workload Expectationsâ
Concurrency Limits by Context Lengthâ
| Context Length | Max Concurrency | Expected Throughput | Notes |
|---|---|---|---|
| 1,024 tokens | C=10 | ~10 req/s | Tested and reliable |
| 4,096 tokens | C=10 | ~1 req/s | 88% success rate |
| 8,192 tokens | C=1 | ~0.3 req/s | Only C=1 works reliably |
| 16,384+ tokens | C=1 (with chunking) | Variable | Requires A100 or chunking |
Latency Expectationsâ
Single Request (C=1):
- 1K tokens: ~100ms (p50)
- 4K tokens: ~950ms (p50)
- 8K tokens: ~3,500ms (p50)
Concurrent Requests (C=10):
- 1K tokens: ~1,000ms (mean)
- 4K tokens: ~9,400ms (mean, 93% success)
- 8K tokens: Not supported (OOM)
Memory Usageâ
| Context Length | GPU Memory per Request | Notes |
|---|---|---|
| 512 tokens | ~5MB | Very efficient |
| 1K tokens | ~11MB | Very efficient |
| 4K tokens | ~estimated | Moderate |
| 8K tokens | ~estimated | High (requires 22GB+ VRAM) |
Configuration Guideâ
Enabling ModernBERT-base-32kâ
To use ModernBERT-base-32k in your semantic router configuration:
classifier:
category_model:
model_id: "models/mom-domain-classifier"
use_modernbert: true # Enable ModernBERT-base-32k
threshold: 0.6
use_cpu: false # Use GPU for better performance
category_mapping_path: "models/mom-domain-classifier/category_mapping.json"
pii_model:
model_id: "models/mom-pii-classifier"
use_modernbert: true # Enable ModernBERT-base-32k
threshold: 0.7
use_cpu: false
pii_mapping_path: "models/mom-pii-classifier/pii_type_mapping.json"
prompt_guard:
model_id: "models/mom-jailbreak-classifier"
use_modernbert: true # Enable ModernBERT-base-32k
threshold: 0.7
use_cpu: false
Flash Attention 2â
Flash Attention 2 provides significant performance improvements (1.75x-11.9x speedup). Ensure it's enabled when building:
cargo build --release --features cuda,flash-attn
Performance Tuning Recommendationsâ
1. Context Length Selectionâ
Choose context length based on your use case:
- Short queries (â¤1K tokens): Best performance, supports C=10
- Medium documents (1K-4K tokens): Good performance, supports C=10 with 88% success
- Long documents (4K-8K tokens): Acceptable performance, only C=1 supported
- Very long documents (8K+ tokens): Requires chunking or A100 GPU
2. Concurrency Tuningâ
Start Conservative:
- Begin with C=1 for all context lengths
- Gradually increase to C=10 for 1K-4K tokens
- Monitor GPU memory and error rates
- Reduce concurrency if OOM errors occur
Production Settings:
# Recommended concurrency limits
concurrency_limits:
1024: 10 # 1K tokens: C=10
4096: 10 # 4K tokens: C=10 (monitor for OOM)
8192: 1 # 8K tokens: C=1 only
3. Memory Managementâ
- Monitor GPU memory using
nvidia-smi - Ensure sufficient free memory before processing large batches
- Use chunking for sequences > 8K tokens
- Restart service periodically if memory constraints occur after extended use
4. Device Selectionâ
Always prefer GPU:
- GPU provides 45x speedup for 512 tokens
- Flash Attention 2 provides additional 1.75x-11.9x speedup
- CPU only suitable as fallback for very short sequences
Running Benchmarksâ
Prerequisitesâ
-
Install Rust and CUDA:
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install CUDA toolkit
# See: https://developer.nvidia.com/cuda-downloads -
Build with Flash Attention 2:
cd candle-binding
cargo build --example benchmark_concurrent --release --features cuda,flash-attn
Running Concurrent Request Benchmarkâ
# Run benchmark for 1K-8K tokens
cargo run --example benchmark_concurrent --release --features cuda,flash-attn
Testing Long Context (16K-32K)â
For testing 16K-32K tokens (requires A100 GPU):
-
Uncomment test cases in
benchmark_concurrent.rs:let context_lengths = vec![
1024,
4096,
8192,
16384, // Uncomment this
32768, // Uncomment this
]; -
Run benchmark:
cargo run --example benchmark_concurrent --release --features cuda,flash-attn -
See: Long Context Test Plan for detailed test plan
Troubleshootingâ
Out of Memory (OOM) Errorsâ
Symptoms:
CUDA_ERROR_OUT_OF_MEMORYerrors- Requests failing at high concurrency
Solutions:
- Reduce concurrency (C=10 â C=1)
- Use chunking for sequences > 8K tokens
- Increase wait time between requests
- Use GPU with more VRAM (A100 40GB+)
High Latencyâ
Symptoms:
- Latency higher than expected
- p95/p99 latency spikes
Solutions:
- Enable Flash Attention 2
- Reduce concurrency
- Use chunking for long sequences
- Monitor GPU utilization
Memory Constraintsâ
Symptoms:
- Tests pass initially, then fail
- Memory not released between requests
- OOM errors after initial successful tests
Solutions:
- Upgrade hardware: Use GPU with more VRAM (A100 40GB+ for high concurrency)
- Add explicit memory cleanup between requests
- Increase wait time between requests
- Restart service periodically to clear memory
- Use memory pool management if available
- Reduce concurrency: Lower concurrency levels reduce memory pressure
Key Findingsâ
What Worksâ
- 1K-4K tokens: Reliable with C=1 and C=10
- 8K tokens: Reliable with C=1
- Flash Attention 2: 1.75x-11.9x speedup
- Memory efficiency: ~5-11MB per request
Limitationsâ
- 4K tokens: C=10 has 88% success rate (12 OOM errors)
- 8K tokens: C=10+ not supported (OOM)
- 16K+ tokens: Cannot test with 23GB VRAM (requires A100)
Future Workâ
- 16K-32K tokens: Test plan ready, waiting for A100 environment (40GB+ VRAM)
- High concurrency (C=50+): Test plan ready, waiting for A100 environment (40GB+ VRAM)
See:
Referencesâ
- Benchmark Tool:
candle-binding/examples/benchmark_concurrent.rs - Performance Tool:
candle-binding/examples/benchmark_performance.rs - Full Results: Performance Validation
- Deployment Guide: Deployment Guide