# Big Batch Test Plan (High Concurrency C=50+)

**Project:** Issue #995 - ModernBERT-base-32k Integration
**Required:** NVIDIA A100 GPU (40GB+ VRAM)
## Overview

This test plan covers validation of ModernBERT-base-32k under high-concurrency (big batch) scenarios. These tests cannot be completed in the current environment (NVIDIA L4 GPU, 23GB VRAM) due to memory fragmentation issues; they require an A100 GPU with 40GB+ VRAM.

**Infrastructure Status:** Ready - all tools and test frameworks are prepared
## Test Requirements

### Hardware Requirements
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- System RAM: 64GB+ recommended
- CUDA: Version 12.0+
- Driver: Latest NVIDIA driver
### Software Requirements

- `benchmark_concurrent.rs` - supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - performance profiling tool
- Flash Attention 2 enabled
- All dependencies installed
## Test Cases

### 1. High Concurrency Testing (C=50, C=100)

#### 1.1 Low Context Length (1K-4K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 1024 tokens | C=50 | ≥ 90% | < 2000ms (p95) |
| 1024 tokens | C=100 | ≥ 80% | < 3000ms (p95) |
| 4096 tokens | C=50 | ≥ 80% | < 15000ms (p95) |
| 4096 tokens | C=100 | ≥ 70% | < 20000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50 and C=100
- Test with 1K and 4K token contexts
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage
Deliverables:
- Latency statistics for each concurrency level
- Success/error rates
- Memory usage profiles
- Throughput measurements
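The fan-out in the steps above can be sketched with `std::thread` from the Rust standard library. This is a minimal sketch, not the actual harness: `send_request` is a hypothetical stub standing in for the real model call that `benchmark_concurrent.rs` makes, and the sleep is a placeholder for inference time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stub standing in for a real embedding request;
/// the actual harness would call the model instead.
fn send_request(_context_len: usize) -> Result<Duration, String> {
    let start = Instant::now();
    thread::sleep(Duration::from_millis(1)); // placeholder for inference
    Ok(start.elapsed())
}

/// Fire `concurrency` requests at once; return (success count, latencies).
fn run_batch(concurrency: usize, context_len: usize) -> (usize, Vec<Duration>) {
    let successes = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..concurrency)
        .map(|_| {
            let successes = Arc::clone(&successes);
            thread::spawn(move || match send_request(context_len) {
                Ok(lat) => {
                    successes.fetch_add(1, Ordering::Relaxed);
                    Some(lat)
                }
                Err(_) => None,
            })
        })
        .collect();
    // Collect per-request latencies from the worker threads.
    let latencies: Vec<Duration> = handles
        .into_iter()
        .filter_map(|h| h.join().ok().flatten())
        .collect();
    (successes.load(Ordering::Relaxed), latencies)
}
```

A real run would replace the stub with the model call and feed the collected latencies into the percentile analysis of section 4.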
#### 1.2 Medium Context Length (8K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 8192 tokens | C=50 | ≥ 70% | < 25000ms (p95) |
| 8192 tokens | C=100 | ≥ 60% | < 35000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50 and C=100
- Test with 8K token contexts
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage and fragmentation
Deliverables:
- Latency statistics
- Success/error rates
- Memory fragmentation analysis
- Recommendations for production
### 2. Throughput Analysis

#### 2.1 Requests Per Second (RPS)
Test Steps:
- Measure throughput at different concurrency levels
- Calculate RPS for C=50, C=100
- Compare with C=1, C=10 baseline
- Document throughput scaling
Deliverables:
- RPS measurements
- Throughput scaling analysis
- Bottleneck identification
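The throughput and scaling numbers above reduce to two small formulas, sketched here in Rust. The function names are illustrative, not from the benchmark code; a scaling efficiency of 1.0 means throughput grows linearly with concurrency.

```rust
/// Requests per second over a run: completed requests / elapsed seconds.
fn rps(completed: usize, elapsed_secs: f64) -> f64 {
    completed as f64 / elapsed_secs
}

/// Scaling efficiency of a high-concurrency run against a baseline run:
/// (RPS ratio) / (concurrency ratio). 1.0 = linear scaling, < 1.0 = a
/// bottleneck is eating the extra concurrency.
fn scaling_efficiency(rps_high: f64, rps_base: f64, c_high: f64, c_base: f64) -> f64 {
    (rps_high / rps_base) / (c_high / c_base)
}
```

For example, 4x the throughput at 5x the concurrency (C=10 to C=50) gives an efficiency of 0.8, pointing at a shared bottleneck worth identifying.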
### 3. Memory Fragmentation Analysis

#### 3.1 Memory Management Under High Concurrency
Test Steps:
- Monitor GPU memory usage during high concurrency tests
- Track memory allocation/deallocation patterns
- Identify memory fragmentation issues
- Test memory cleanup strategies
- Document recommendations
Deliverables:
- Memory usage profiles
- Fragmentation analysis
- Cleanup strategy recommendations
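One hedged way to quantify the fragmentation the steps above look for: compare memory the allocator has reserved from the GPU against memory backing live tensors. This helper is a sketch under that assumption; the two byte counts would come from whatever allocator statistics the runtime exposes, which this plan does not specify.

```rust
/// Rough fragmentation indicator: the fraction of reserved GPU memory not
/// backing live allocations. Near 0 is healthy; a ratio that creeps upward
/// across batches suggests freed blocks are not being reused.
fn fragmentation_ratio(reserved_bytes: u64, allocated_bytes: u64) -> f64 {
    if reserved_bytes == 0 {
        return 0.0;
    }
    reserved_bytes.saturating_sub(allocated_bytes) as f64 / reserved_bytes as f64
}
```

Logging this ratio after each batch gives the allocation/deallocation pattern the deliverables ask for.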
### 4. Latency Distribution Analysis

#### 4.1 P95/P99 Latency Under High Concurrency
Test Steps:
- Measure latency distribution at C=50, C=100
- Analyze p50, p95, p99 percentiles
- Identify latency spikes
- Document tail latency behavior
Deliverables:
- Latency distribution charts
- P95/P99 analysis
- Tail latency recommendations
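The p50/p95/p99 figures above can be computed with a simple nearest-rank percentile over the per-request latency samples; a sketch, assuming latencies are collected in milliseconds:

```rust
/// Nearest-rank percentile over latency samples (milliseconds).
/// Sorts in place; p is in [0, 100].
fn percentile(samples: &mut [f64], p: f64) -> f64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank: ceil(p/100 * n), clamped to at least the first sample.
    let rank = ((p / 100.0) * samples.len() as f64).ceil().max(1.0) as usize;
    samples[rank - 1]
}
```

Nearest-rank is chosen here for simplicity; an interpolating percentile would give slightly smoother tails on small sample counts.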
### 5. Error Rate Analysis

#### 5.1 OOM Error Patterns
Test Steps:
- Track OOM errors at different concurrency levels
- Analyze error patterns
- Identify failure modes
- Document error recovery strategies
Deliverables:
- Error rate analysis
- Failure mode documentation
- Recovery strategy recommendations
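Tracking OOM patterns separately from other failures needs an error classifier. A sketch follows; the match strings are guesses at typical CUDA-stack error text, not the exact messages this toolchain emits, so they would need adjusting against real logs.

```rust
#[derive(Debug, PartialEq)]
enum FailureMode {
    Oom,
    Timeout,
    Other,
}

/// Bucket an error message into a failure mode so error-rate reports can
/// separate OOM from unrelated failures. Substrings are assumptions about
/// typical CUDA error text, not confirmed against this stack.
fn classify_error(msg: &str) -> FailureMode {
    let m = msg.to_lowercase();
    if m.contains("out of memory") || m.contains("oom") {
        FailureMode::Oom
    } else if m.contains("timeout") || m.contains("timed out") {
        FailureMode::Timeout
    } else {
        FailureMode::Other
    }
}
```

Counting each bucket per concurrency level yields the failure-mode table the deliverables call for.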
## Expected Outcomes

### Success Criteria

- C=50:
  - 1K tokens: ≥ 90% success rate
  - 4K tokens: ≥ 80% success rate
  - 8K tokens: ≥ 70% success rate
- C=100:
  - 1K tokens: ≥ 80% success rate
  - 4K tokens: ≥ 70% success rate
  - 8K tokens: ≥ 60% success rate
- Latency:
  - P95 latency within the targets in the tables above
  - No excessive tail latency
- Memory:
  - Memory usage within GPU limits
  - No memory leaks
  - Acceptable fragmentation
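The success-rate thresholds above can be encoded as a lookup so a results script flags pass/fail mechanically; a sketch with illustrative function names:

```rust
/// Success-rate thresholds from the success criteria,
/// keyed by (concurrency, context length in tokens).
fn required_success_rate(concurrency: usize, context_tokens: usize) -> Option<f64> {
    match (concurrency, context_tokens) {
        (50, 1024) => Some(0.90),
        (50, 4096) => Some(0.80),
        (50, 8192) => Some(0.70),
        (100, 1024) => Some(0.80),
        (100, 4096) => Some(0.70),
        (100, 8192) => Some(0.60),
        _ => None, // configuration not covered by this plan
    }
}

/// None if the configuration has no defined threshold.
fn meets_criteria(concurrency: usize, context_tokens: usize, observed: f64) -> Option<bool> {
    required_success_rate(concurrency, context_tokens).map(|req| observed >= req)
}
```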
## Infrastructure Readiness

### Tools Ready

- `benchmark_concurrent.rs` - supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - performance profiling ready
- Flash Attention 2 enabled
- All dependencies installed
## How to Run Tests

1. Ensure sufficient GPU memory (A100 40GB+)
2. Run the benchmark with high concurrency: `cargo run --example benchmark_concurrent --release --features cuda,flash-attn`
3. Monitor memory usage in a second terminal: `watch -n 1 nvidia-smi`
## Resource Estimates

### Time Estimates
- High Concurrency Testing: 6-8 hours
- Throughput Analysis: 2-3 hours
- Memory Fragmentation Analysis: 3-4 hours
- Latency Distribution Analysis: 2-3 hours
- Error Rate Analysis: 2-3 hours
Total: ~15-21 hours
### Resource Requirements
- GPU: A100 (40GB VRAM) - Required
- System RAM: 64GB+ recommended
- Storage: ~10GB for test data and results
## Deliverables

- Test Results Report
  - Latency measurements (C=50, C=100)
  - Success/error rates
  - Throughput measurements
  - Memory usage profiles
- Performance Analysis
  - Throughput scaling analysis
  - Memory fragmentation analysis
  - Latency distribution analysis
  - Error rate patterns
- Recommendations
  - Production concurrency limits
  - Memory management strategies
  - Scaling recommendations
  - Error handling strategies