End User Deployment Guide
Project: Issue #995 - ModernBERT-base-32k Integration
Based on: Performance & Functionality Validation Results (1K-8K tokens)
Overview
This guide provides deployment recommendations for ModernBERT-base-32k based on empirical testing results. All recommendations are based on validated test results for context lengths from 512 tokens to 8K tokens.
For long context (16K-32K) and big-batch testing, see the separate test plans listed in the References section.
1. Performance Recommendations
1.1 Latency Expectations
| Context Length | GPU Latency (C=1) | GPU Latency (C=10) | CPU Latency |
|---|---|---|---|
| 512 tokens | 163ms | N/A | 7367ms |
| 1K tokens | 785ms | 996ms | 806ms |
| 4K tokens | 896ms | 9066ms (88% success) | N/A |
| 8K tokens | 3294ms | N/A (fails) | N/A |
Recommendations:
- Use GPU for all production deployments (45x faster than CPU at 512 tokens)
- Flash Attention 2 provides a 1.75x-11.9x speedup (highly recommended)
- CPU is only a reasonable fallback around 1K tokens, where its latency is comparable to GPU (806ms vs 785ms); at 512 tokens CPU is 45x slower
1.2 Memory Requirements
| Context Length | GPU Memory per Request | Total Memory (C=10) |
|---|---|---|
| 512 tokens | ~5MB | ~50MB |
| 1K tokens | ~11MB | ~110MB |
| 4K tokens | Not measured (estimate only) | Not measured (estimate only) |
| 8K tokens | Not measured (estimate only) | Not measured (estimate only) |
Recommendations:
- Measured memory usage is very efficient (~5-11MB per request at 512-1K tokens)
- For 8K tokens with C=1, ensure at least 2GB free GPU memory
- For 4K tokens with C=10, ensure at least 1GB free GPU memory
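The per-request figures above can be turned into a rough capacity check. The sketch below is an extrapolation from the two measured points (~5MB at 512 tokens, ~11MB at 1K tokens) assuming roughly linear scaling with context length; the function name and the linear-scaling assumption are illustrative, not validated for 4K/8K.

```rust
/// Rough GPU memory estimate in MB for a batch of requests, extrapolated
/// linearly from the measured points (~5MB at 512 tokens, ~11MB at 1K).
/// 4K/8K values were not measured, so treat results there as guesses.
fn estimated_memory_mb(context_length: usize, concurrency: usize) -> usize {
    // ~11MB per 1024 tokens, truncated; floor of 5MB for short contexts.
    let per_request = ((context_length as f64 / 1024.0) * 11.0) as usize;
    per_request.max(5) * concurrency
}

fn main() {
    // Reproduces the ~110MB total for C=10 at 1K tokens from the table.
    println!("{} MB", estimated_memory_mb(1024, 10));
}
```

Use this only as a sanity check against the free-VRAM guidance above, not as a substitute for monitoring actual GPU memory usage.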
2. Concurrency Limits
2.1 Recommended Concurrency by Context Length
| Context Length | Max Concurrency | Notes |
|---|---|---|
| 1024 tokens | C=10 | Tested and works reliably |
| 4096 tokens | C=10 | 88% success rate (12 OOM errors) |
| 8192 tokens | C=1 | Only C=1 works reliably |
| 16384+ tokens | C=1 (with chunking) | Requires A100 or chunking |
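The table above can be encoded as a simple lookup for admission control. This is a minimal sketch of the tested limits; the function name is illustrative, and note that 4K at C=10 only reached an 88% success rate, so C=10 there is an upper bound rather than a guarantee.

```rust
/// Maximum tested-safe concurrency for a given context length,
/// following the table above (sketch; tune against your own monitoring).
fn max_concurrency(context_length: usize) -> usize {
    match context_length {
        0..=4096 => 10,   // 1K-4K: C=10 tested (4K saw some OOM errors)
        4097..=8192 => 1, // 8K: only C=1 works reliably
        _ => 1,           // 16K+: C=1, and chunking is required (section 4)
    }
}

fn main() {
    println!("max concurrency at 8K tokens: {}", max_concurrency(8192));
}
```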
2.2 Concurrency Best Practices
- Start with C=1 for new deployments
- Gradually increase to C=10 for 1K-4K tokens
- Monitor memory - if OOM errors occur, reduce concurrency
- Use chunking for sequences > 8K tokens
- Avoid C=50+ for 8K+ tokens (memory fragmentation issues)
3. Device Selection
3.1 GPU vs CPU Decision Matrix
| Context Length | Recommended Device | Reason |
|---|---|---|
| 512 tokens | GPU | 45x faster than CPU |
| 1K tokens | GPU | Similar to CPU, but more scalable |
| 4K tokens | GPU | CPU not tested, GPU recommended |
| 8K tokens | GPU | CPU not tested, GPU recommended |
3.2 Device Selection Heuristics
```rust
fn select_device(context_length: usize, available_gpu: bool) -> Device {
    if !available_gpu {
        return Device::Cpu; // fall back to CPU when no GPU is present
    }
    // GPU is recommended for all context lengths;
    // Flash Attention 2 provides a significant additional speedup.
    Device::Cuda(0)
}
```
Recommendations:
- Always use GPU if available
- Flash Attention 2 provides 1.75x-11.9x speedup
- CPU only as a fallback, and only where its latency is acceptable (comparable to GPU around 1K tokens, 45x slower at 512)
4. Chunking Strategy
4.1 When to Use Chunking
| Context Length | Chunking Required | Reason |
|---|---|---|
| ≤ 8K tokens | No | Can process in single pass |
| > 8K tokens | Yes | Memory limitations or latency optimization |
| > 32K tokens | Yes | Model limit is 32K, must chunk |
4.2 Chunking Threshold Recommendations
```rust
fn should_chunk(context_length: usize, concurrency: usize) -> bool {
    if context_length > 32768 {
        return true; // must chunk: 32K is the model limit
    }
    if context_length > 8192 && concurrency > 1 {
        return true; // chunk for 8K+ when running concurrent requests
    }
    if context_length > 16384 {
        return true; // chunk for 16K+ (memory optimization)
    }
    false
}
```
4.3 Chunking Best Practices
- Overlap: Use 10-20% overlap between chunks
- Size: Keep chunks ≤ 8K tokens for optimal performance
- Aggregation:
- Domain classification: Average scores
- PII detection: Union with deduplication
- Jailbreak detection: Maximum score
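The overlap and aggregation rules above can be sketched as follows. This is a minimal illustration, assuming a flat slice of token IDs and a fixed 15% overlap (within the recommended 10-20% range); the function names and the max-score aggregation shown (the jailbreak-detection case) are illustrative.

```rust
/// Split tokens into chunks of at most `chunk_size`, with 15% of each
/// chunk shared with the next one (sketch of the overlap guidance above).
fn chunk_tokens(tokens: &[u32], chunk_size: usize) -> Vec<Vec<u32>> {
    let overlap = chunk_size * 15 / 100;
    let stride = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + chunk_size).min(tokens.len());
        chunks.push(tokens[start..end].to_vec());
        if end == tokens.len() {
            break; // last chunk reached the end of the sequence
        }
        start += stride;
    }
    chunks
}

/// Jailbreak-style aggregation: take the maximum score across chunks.
fn aggregate_max(scores: &[f32]) -> f32 {
    scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max)
}

fn main() {
    let tokens: Vec<u32> = (0..20_000).collect();
    let chunks = chunk_tokens(&tokens, 8192);
    println!("{} chunks of ≤ 8K tokens", chunks.len());
}
```

Averaging (domain classification) and union-with-deduplication (PII) would follow the same per-chunk pattern with a different reducer.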
5. Production Deployment Checklist
5.1 Pre-Deployment
- Verify GPU availability (NVIDIA GPU with CUDA support)
- Enable Flash Attention 2 (recommended)
- Set appropriate concurrency limits based on context length
- Configure chunking for sequences > 8K tokens
- Test with expected production load
5.2 Monitoring
- Monitor GPU memory usage
- Track latency (p50, p95, p99)
- Monitor error rates (OOM errors)
- Track concurrency levels
- Monitor accuracy (signal extraction)
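For the p50/p95/p99 tracking above, a nearest-rank percentile over recorded latencies is enough to get started. This is a minimal sketch (function name illustrative); a production service would typically use a streaming histogram rather than sorting raw samples.

```rust
/// Nearest-rank percentile over recorded request latencies in ms.
/// Sorts in place; assumes a non-empty sample set.
fn percentile(latencies_ms: &mut [u64], p: f64) -> u64 {
    latencies_ms.sort_unstable();
    // nearest-rank: ceil(p/100 * n), converted to a 0-based index
    let rank = ((p / 100.0) * latencies_ms.len() as f64).ceil() as usize;
    latencies_ms[rank.saturating_sub(1)]
}

fn main() {
    let mut lat: Vec<u64> = (1..=100).collect();
    println!("p95 = {} ms", percentile(&mut lat, 95.0));
}
```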
5.3 Scaling
- Start with C=1 for new deployments
- Gradually increase to C=10 for 1K-4K tokens
- Monitor memory and error rates
- Adjust concurrency based on results
- Use chunking for sequences > 8K tokens
6. Troubleshooting
6.1 Common Issues
OOM (Out of Memory) Errors
Symptoms:
- `CUDA_ERROR_OUT_OF_MEMORY` errors
- Requests failing at high concurrency
Solutions:
- Reduce concurrency (C=10 → C=1)
- Use chunking for sequences > 8K tokens
- Increase wait time between requests
- Use GPU with more VRAM (A100 40GB+)
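The "reduce concurrency" fallback above can be automated. A minimal sketch, assuming a hypothetical `run_batch` inference call that returns `Err` on a CUDA OOM: the helper halves concurrency and retries until C=1.

```rust
/// Retry a batch at progressively lower concurrency after OOM failures
/// (sketch of the C=10 → C=1 fallback above). `run_batch` is a
/// hypothetical inference call; returns the concurrency that succeeded.
fn run_with_fallback<F>(mut run_batch: F, start_concurrency: usize) -> Result<usize, String>
where
    F: FnMut(usize) -> Result<(), String>,
{
    let mut c = start_concurrency;
    loop {
        match run_batch(c) {
            Ok(()) => return Ok(c), // succeeded at this concurrency
            Err(e) if c > 1 => {
                eprintln!("failure at C={c} ({e}); retrying lower");
                c = (c / 2).max(1); // halve: C=10 -> C=5 -> C=2 -> C=1
            }
            Err(e) => return Err(e), // already at C=1, give up
        }
    }
}

fn main() {
    // Simulated backend that only survives at C <= 2.
    let result = run_with_fallback(
        |c| if c <= 2 { Ok(()) } else { Err("CUDA_ERROR_OUT_OF_MEMORY".into()) },
        10,
    );
    println!("{result:?}");
}
```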
High Latency
Symptoms:
- Latency higher than expected
- p95/p99 latency spikes
Solutions:
- Enable Flash Attention 2
- Reduce concurrency
- Use chunking for long sequences
- Monitor GPU utilization
Memory Fragmentation
Symptoms:
- Tests pass initially, then fail
- Memory not released between requests
Solutions:
- Add explicit memory cleanup
- Increase wait time between requests
- Restart service periodically
- Use memory pool management
7. Best Practices
7.1 Performance
- Always use GPU with Flash Attention 2
- Monitor concurrency - don't exceed recommended limits
- Use chunking for sequences > 8K tokens
- Cache results when possible
- Batch requests when appropriate
7.2 Reliability
- Start conservative - C=1 for new deployments
- Gradually scale - increase concurrency based on results
- Monitor errors - track OOM and timeout errors
- Have fallbacks - CPU mode for critical paths
- Test thoroughly - validate with production-like load
7.3 Resource Management
- Monitor memory - track GPU memory usage
- Set limits - enforce concurrency limits
- Clean up - explicit memory cleanup between requests
- Use timeouts - prevent hanging requests
- Scale horizontally - add more GPUs if needed
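The "use timeouts" point above can be implemented with nothing beyond the standard library. A minimal sketch: run the work on a worker thread and give up if it does not answer within the deadline. The 5-second deadline shown is illustrative, not a validated production value.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a worker thread; return None if it misses the deadline
/// (prevents a hung request from blocking the caller forever).
fn with_timeout<T, F>(work: F, deadline: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // ignore send errors: the receiver may have timed out and dropped
        let _ = tx.send(work());
    });
    rx.recv_timeout(deadline).ok()
}

fn main() {
    let fast = with_timeout(|| 42, Duration::from_secs(5));
    println!("{fast:?}");
}
```

Note this only abandons the result; the worker thread itself keeps running, so long-running inference should also be cancellable at the backend.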
8. Resource Requirements
8.1 Minimum Requirements
- GPU: NVIDIA GPU with CUDA support (≥23GB VRAM for 8K tokens)
- Memory: ~5-11MB per request
- CPU: Multi-core recommended for preprocessing
- Storage: Model files (~500MB)
8.2 Recommended Requirements
- GPU: NVIDIA L4 or better (23GB+ VRAM)
- Memory: 32GB+ system RAM
- CPU: 8+ cores
- Storage: SSD for model files
8.3 For Long Context (16K-32K)
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- Memory: 64GB+ system RAM
- See: Long Context Test Plan
9. References
- Performance Validation
- Long Context Test Plan
- Big Batch Test Plan
- Benchmark Results: BENCHMARK_RESULTS_ANALYSIS.md