版本：🚧 开发中

ModernBERT-base-32k Performance Benchmark Results

This tutorial provides benchmark results and performance tuning guidance for ModernBERT-base-32k integration. Use these results to provision hardware and adjust workload expectations for your deployment.

Overview

ModernBERT-base-32k extends the context window from 512 tokens (BERT-base) to 32,768 tokens, enabling processing of long documents and conversations. This guide presents empirical benchmark results from comprehensive testing across different context lengths and concurrency levels.

Test Environment:

GPU: NVIDIA L4 (23GB VRAM)
Flash Attention 2: Enabled
Model: llm-semantic-router/modernbert-base-32k
Test Tool: candle-binding/examples/benchmark_concurrent.rs

Benchmark Results

Single Request Latency (C=1)

Context Length	Mean Latency	p50 Latency	p95 Latency	p99 Latency	Status
1,024 tokens	90.98ms	94.18ms	94.24ms	94.24ms	Pass
4,096 tokens	899.87ms	955.05ms	955.93ms	955.93ms	Pass
8,192 tokens	3,299.92ms	3,524.62ms	3,526.34ms	3,526.34ms	Pass

Notes:

1K tokens: Stable performance with mean ≈ p50
4K and 8K tokens: Stable performance with mean ≈ p50

Concurrent Requests (C=10)

Context Length	Mean Latency	p50 Latency	p95 Latency	Success Rate	Status
1,024 tokens	1,001.22ms	970.65ms	1,379.32ms	100%	Pass
4,096 tokens	9,323.45ms	9,389.28ms	10,349.11ms	93%	Partial
8,192 tokens	N/A	N/A	N/A	0%	Fail

Notes:

1K tokens: Excellent performance with 100% success rate
4K tokens: 93% success rate (7 OOM errors out of 100 requests)
8K tokens: Failed due to insufficient GPU memory

High Concurrency (C=50, C=100)

All high concurrency tests (C=50+) failed due to hardware limitations. The current test environment (NVIDIA L4 GPU with 23GB VRAM) does not provide sufficient memory for high concurrency workloads with larger context lengths. Testing high concurrency (C=50+) requires a GPU with 40GB+ VRAM (e.g., NVIDIA A100) as documented in the Big Batch Test Plan.

Hardware Provisioning Guide

Minimum Requirements

Context Length	GPU VRAM	System RAM	Recommended GPU
≤ 1K tokens	≥ 5GB	≥ 16GB	NVIDIA T4, L4
≤ 4K tokens	≥ 10GB	≥ 32GB	NVIDIA L4, A10G
≤ 8K tokens	≥ 23GB	≥ 32GB	NVIDIA L4, A10G
16K+ tokens	≥ 40GB	≥ 64GB	NVIDIA A100

Recommended Configuration

For Production (1K-8K tokens):

GPU: NVIDIA L4 (23GB VRAM) or better
System RAM: 32GB+
CUDA: Version 12.0+
Flash Attention 2: Enabled (provides 1.75x-11.9x speedup)

For Long Context (16K-32K tokens):

GPU: NVIDIA A100 (40GB+ VRAM) - Required
System RAM: 64GB+
See Long Context Test Plan for details

Workload Expectations

Concurrency Limits by Context Length

Context Length	Max Concurrency	Expected Throughput	Notes
1,024 tokens	C=10	~10 req/s	Tested and reliable
4,096 tokens	C=10	~1 req/s	88% success rate
8,192 tokens	C=1	~0.3 req/s	Only C=1 works reliably
16,384+ tokens	C=1 (with chunking)	Variable	Requires A100 or chunking

Latency Expectations

Single Request (C=1):

1K tokens: ~100ms (p50)
4K tokens: ~950ms (p50)
8K tokens: ~3,500ms (p50)

Concurrent Requests (C=10):

1K tokens: ~1,000ms (mean)
4K tokens: ~9,400ms (mean, 93% success)
8K tokens: Not supported (OOM)

Memory Usage

Context Length	GPU Memory per Request	Notes
512 tokens	~5MB	Very efficient
1K tokens	~11MB	Very efficient
4K tokens	~estimated	Moderate
8K tokens	~estimated	High (requires 22GB+ VRAM)

Configuration Guide

Enabling ModernBERT-base-32k

To use ModernBERT-base-32k in your semantic router configuration:

classifier:
  category_model:
    model_id: "models/mom-domain-classifier"
    use_modernbert: true  # Enable ModernBERT-base-32k
    threshold: 0.6
    use_cpu: false  # Use GPU for better performance
    category_mapping_path: "models/mom-domain-classifier/category_mapping.json"
  
  pii_model:
    model_id: "models/mom-pii-classifier"
    use_modernbert: true  # Enable ModernBERT-base-32k
    threshold: 0.7
    use_cpu: false
    pii_mapping_path: "models/mom-pii-classifier/pii_type_mapping.json"

prompt_guard:
  model_id: "models/mom-jailbreak-classifier"
  use_modernbert: true  # Enable ModernBERT-base-32k
  threshold: 0.7
  use_cpu: false

Flash Attention 2

Flash Attention 2 provides significant performance improvements (1.75x-11.9x speedup). Ensure it's enabled when building:

cargo build --release --features cuda,flash-attn

Performance Tuning Recommendations

1. Context Length Selection

Choose context length based on your use case:

Short queries (≤1K tokens): Best performance, supports C=10
Medium documents (1K-4K tokens): Good performance, supports C=10 with 88% success
Long documents (4K-8K tokens): Acceptable performance, only C=1 supported
Very long documents (8K+ tokens): Requires chunking or A100 GPU

2. Concurrency Tuning

Start Conservative:

Begin with C=1 for all context lengths
Gradually increase to C=10 for 1K-4K tokens
Monitor GPU memory and error rates
Reduce concurrency if OOM errors occur

Production Settings:

# Recommended concurrency limits
concurrency_limits:
  1024: 10   # 1K tokens: C=10
  4096: 10   # 4K tokens: C=10 (monitor for OOM)
  8192: 1    # 8K tokens: C=1 only

3. Memory Management

Monitor GPU memory using nvidia-smi
Ensure sufficient free memory before processing large batches
Use chunking for sequences > 8K tokens
Restart service periodically if memory constraints occur after extended use

4. Device Selection

Always prefer GPU:

GPU provides 45x speedup for 512 tokens
Flash Attention 2 provides additional 1.75x-11.9x speedup
CPU only suitable as fallback for very short sequences

Running Benchmarks

Prerequisites

Install Rust and CUDA:

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install CUDA toolkit
# See: https://developer.nvidia.com/cuda-downloads

Build with Flash Attention 2:

cd candle-binding
cargo build --example benchmark_concurrent --release --features cuda,flash-attn

Running Concurrent Request Benchmark

# Run benchmark for 1K-8K tokens
cargo run --example benchmark_concurrent --release --features cuda,flash-attn

Testing Long Context (16K-32K)

For testing 16K-32K tokens (requires A100 GPU):

Uncomment test cases in benchmark_concurrent.rs:

let context_lengths = vec![
    1024,
    4096,
    8192,
    16384,  // Uncomment this
    32768,  // Uncomment this
];

Run benchmark:

cargo run --example benchmark_concurrent --release --features cuda,flash-attn

See: Long Context Test Plan for detailed test plan

Troubleshooting

Out of Memory (OOM) Errors

Symptoms:

CUDA_ERROR_OUT_OF_MEMORY errors
Requests failing at high concurrency

Solutions:

Reduce concurrency (C=10 → C=1)
Use chunking for sequences > 8K tokens
Increase wait time between requests
Use GPU with more VRAM (A100 40GB+)

High Latency

Symptoms:

Latency higher than expected
p95/p99 latency spikes

Solutions:

Enable Flash Attention 2
Reduce concurrency
Use chunking for long sequences
Monitor GPU utilization

Memory Constraints

Symptoms:

Tests pass initially, then fail
Memory not released between requests
OOM errors after initial successful tests

Solutions:

Upgrade hardware: Use GPU with more VRAM (A100 40GB+ for high concurrency)
Add explicit memory cleanup between requests
Increase wait time between requests
Restart service periodically to clear memory
Use memory pool management if available
Reduce concurrency: Lower concurrency levels reduce memory pressure

Key Findings

What Works

1K-4K tokens: Reliable with C=1 and C=10
8K tokens: Reliable with C=1
Flash Attention 2: 1.75x-11.9x speedup
Memory efficiency: ~5-11MB per request

Limitations

4K tokens: C=10 has 88% success rate (12 OOM errors)
8K tokens: C=10+ not supported (OOM)
16K+ tokens: Cannot test with 23GB VRAM (requires A100)

Future Work

16K-32K tokens: Test plan ready, waiting for A100 environment (40GB+ VRAM)
High concurrency (C=50+): Test plan ready, waiting for A100 environment (40GB+ VRAM)

See:

References

Benchmark Tool: candle-binding/examples/benchmark_concurrent.rs
Performance Tool: candle-binding/examples/benchmark_performance.rs
Full Results: Performance Validation
Deployment Guide: Deployment Guide

Overview​

Benchmark Results​

Single Request Latency (C=1)​

Concurrent Requests (C=10)​

High Concurrency (C=50, C=100)​

Hardware Provisioning Guide​

Minimum Requirements​

Recommended Configuration​

Workload Expectations​

Concurrency Limits by Context Length​

Latency Expectations​

Memory Usage​

Configuration Guide​

Enabling ModernBERT-base-32k​

Flash Attention 2​

Performance Tuning Recommendations​

1. Context Length Selection​

2. Concurrency Tuning​

3. Memory Management​

4. Device Selection​

Running Benchmarks​

Prerequisites​

Running Concurrent Request Benchmark​

Testing Long Context (16K-32K)​

Troubleshooting​

Out of Memory (OOM) Errors​

High Latency​

Memory Constraints​

Key Findings​

What Works​

Limitations​

Future Work​

References​

Overview

Benchmark Results

Single Request Latency (C=1)

Concurrent Requests (C=10)

High Concurrency (C=50, C=100)

Hardware Provisioning Guide

Minimum Requirements

Recommended Configuration

Workload Expectations

Concurrency Limits by Context Length

Latency Expectations

Memory Usage

Configuration Guide

Enabling ModernBERT-base-32k

Flash Attention 2

Performance Tuning Recommendations

1. Context Length Selection

2. Concurrency Tuning

3. Memory Management

4. Device Selection

Running Benchmarks

Prerequisites

Running Concurrent Request Benchmark

Testing Long Context (16K-32K)

Troubleshooting

Out of Memory (OOM) Errors

High Latency

Memory Constraints

Key Findings

What Works

Limitations

Future Work

References