Big Batch Test Plan (High Concurrency C=50+)
Project: Issue #995 - ModernBERT-base-32k Integration
Required: NVIDIA A100 GPU (40GB+ VRAM)
Overview
This test plan covers validation of ModernBERT-base-32k under high-concurrency (big batch) scenarios. These tests cannot be completed in the current environment (NVIDIA L4 GPU, 23GB VRAM) due to memory fragmentation issues; they require an A100 GPU with 40GB+ VRAM.
Infrastructure Status: Ready - All tools and test frameworks are prepared
Test Requirements
Hardware Requirements
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- System RAM: 64GB+ recommended
- CUDA: Version 12.0+
- Driver: Latest NVIDIA driver
Software Requirements
- `benchmark_concurrent.rs` - Supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - Performance profiling tool
- Flash Attention 2 enabled
- All dependencies installed
Test Cases
1. High Concurrency Testing (C=50, C=100)
1.1 Low Context Length (1K-4K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 1024 tokens | C=50 | ≥ 90% | < 2000ms (p95) |
| 1024 tokens | C=100 | ≥ 80% | < 3000ms (p95) |
| 4096 tokens | C=50 | ≥ 80% | < 15000ms (p95) |
| 4096 tokens | C=100 | ≥ 70% | < 20000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50, C=100
- Test with 1K and 4K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage
Deliverables:
- Latency statistics for each concurrency level
- Success/error rates
- Memory usage profiles
- Throughput measurements
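The mean/p50/p95/p99 figures in the tables are derived from per-request latency samples. A minimal sketch of the percentile computation (a hypothetical helper shown for illustration, not code from `benchmark_concurrent.rs`), using the nearest-rank method:

```rust
// Hypothetical helper: derive a latency percentile from collected samples
// using the nearest-rank method (index = ceil(pct/100 * N) - 1 after sorting).
fn percentile(samples: &mut [u64], pct: f64) -> u64 {
    samples.sort_unstable();
    let rank = ((pct / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    // Simulated per-request latencies in milliseconds.
    let mut latencies: Vec<u64> = (1..=100).collect();
    println!("p50 = {} ms", percentile(&mut latencies, 50.0)); // 50
    println!("p95 = {} ms", percentile(&mut latencies, 95.0)); // 95
    println!("p99 = {} ms", percentile(&mut latencies, 99.0)); // 99
}
```

Comparing p95 against the thresholds in the table above then gives a pass/fail per row.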
1.2 Medium Context Length (8K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 8192 tokens | C=50 | ≥ 70% | < 25000ms (p95) |
| 8192 tokens | C=100 | ≥ 60% | < 35000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50, C=100
- Test with 8K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage and fragmentation
Deliverables:
- Latency statistics
- Success/error rates
- Memory fragmentation analysis
- Recommendations for production
2. Throughput Analysis
2.1 Requests Per Second (RPS)
Test Steps:
- Measure throughput at different concurrency levels
- Calculate RPS for C=50, C=100
- Compare with C=1, C=10 baseline
- Document throughput scaling
Deliverables:
- RPS measurements
- Throughput scaling analysis
- Bottleneck identification
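The RPS measurement above amounts to: run C workers, count completed requests, divide by wall-clock time. A simulation sketch of that idea (the real measurement is done by `benchmark_concurrent.rs`; here a sleep stands in for one inference request):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Sketch: measure requests-per-second at a given concurrency level by
// spawning `concurrency` workers and dividing completed requests by
// elapsed wall-clock time. The sleep simulates one inference request.
fn measure_rps(concurrency: usize, requests_per_worker: usize) -> f64 {
    let completed = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..concurrency)
        .map(|_| {
            let completed = Arc::clone(&completed);
            thread::spawn(move || {
                for _ in 0..requests_per_worker {
                    thread::sleep(Duration::from_millis(1)); // stand-in for a request
                    completed.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    completed.load(Ordering::Relaxed) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    println!("simulated RPS at C=50: {:.0}", measure_rps(50, 10));
}
```

Running this at C=1, C=10, C=50, C=100 and plotting RPS against C exposes where throughput stops scaling, which is the bottleneck-identification step.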
3. Memory Fragmentation Analysis
3.1 Memory Management Under High Concurrency
Test Steps:
- Monitor GPU memory usage during high concurrency tests
- Track memory allocation/deallocation patterns
- Identify memory fragmentation issues
- Test memory cleanup strategies
- Document recommendations
Deliverables:
- Memory usage profiles
- Fragmentation analysis
- Cleanup strategy recommendations
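For the memory-usage profiles, one option is to sample `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` (one MiB value per line) once per second alongside the benchmark. A hypothetical parser for that output, shown only as a sketch; a floor that rises between runs suggests fragmentation or unreleased allocations:

```rust
// Hypothetical helper: parse the output of
//   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
// which prints one used-memory value in MiB per GPU, one per line.
fn parse_used_mib(nvidia_smi_output: &str) -> Vec<u64> {
    nvidia_smi_output
        .lines()
        .filter_map(|l| l.trim().parse::<u64>().ok())
        .collect()
}

fn main() {
    // A captured sample line (single-GPU system).
    let sample = "18342\n";
    println!("used MiB per GPU: {:?}", parse_used_mib(sample));
}
```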
4. Latency Distribution Analysis
4.1 P95/P99 Latency Under High Concurrency
Test Steps:
- Measure latency distribution at C=50, C=100
- Analyze p50, p95, p99 percentiles
- Identify latency spikes
- Document tail latency behavior
Deliverables:
- Latency distribution charts
- P95/P99 analysis
- Tail latency recommendations
5. Error Rate Analysis
5.1 OOM Error Patterns
Test Steps:
- Track OOM errors at different concurrency levels
- Analyze error patterns
- Identify failure modes
- Document error recovery strategies
Deliverables:
- Error rate analysis
- Failure mode documentation
- Recovery strategy recommendations
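For the error-pattern analysis, failures can be bucketed by message. A hypothetical classifier sketch; the matched substrings are assumptions and may not match the runtime's exact wording:

```rust
// Hypothetical bucketing of benchmark failure messages into coarse modes.
// The substrings matched below are assumptions about the error text.
#[derive(Debug, PartialEq)]
enum FailureMode {
    Oom,
    Timeout,
    Other,
}

fn classify(err: &str) -> FailureMode {
    let e = err.to_lowercase();
    if e.contains("out of memory") || e.contains("oom") {
        FailureMode::Oom
    } else if e.contains("timed out") || e.contains("timeout") {
        FailureMode::Timeout
    } else {
        FailureMode::Other
    }
}

fn main() {
    println!("{:?}", classify("CUDA error: out of memory")); // Oom
}
```

Tallying buckets per concurrency level and context length gives the error-rate breakdown called for above.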
Expected Outcomes
Success Criteria
- C=50:
  - 1K tokens: ≥ 90% success rate
  - 4K tokens: ≥ 80% success rate
  - 8K tokens: ≥ 70% success rate
- C=100:
  - 1K tokens: ≥ 80% success rate
  - 4K tokens: ≥ 70% success rate
  - 8K tokens: ≥ 60% success rate
- Latency:
  - P95 latency within acceptable limits
  - No excessive tail latency
- Memory:
  - Memory usage within GPU limits
  - No memory leaks
  - Acceptable fragmentation
Infrastructure Readiness
Tools Ready
- `benchmark_concurrent.rs` - Supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - Performance profiling ready
- Flash Attention 2 enabled
- All dependencies installed
How to Run Tests
1. Ensure sufficient GPU memory (A100 40GB+)
2. Run the benchmark with high concurrency:
   cargo run --example benchmark_concurrent --release --features cuda,flash-attn
3. Monitor memory usage:
   watch -n 1 nvidia-smi
Resource Estimates
Time Estimates
- High Concurrency Testing: 6-8 hours
- Throughput Analysis: 2-3 hours
- Memory Fragmentation Analysis: 3-4 hours
- Latency Distribution Analysis: 2-3 hours
- Error Rate Analysis: 2-3 hours
Total: ~15-21 hours
Resource Requirements
- GPU: A100 (40GB VRAM) - Required
- System RAM: 64GB+ recommended
- Storage: ~10GB for test data and results
Deliverables
1. Test Results Report
   - Latency measurements (C=50, C=100)
   - Success/error rates
   - Throughput measurements
   - Memory usage profiles
2. Performance Analysis
   - Throughput scaling analysis
   - Memory fragmentation analysis
   - Latency distribution analysis
   - Error rate patterns
3. Recommendations
   - Production concurrency limits
   - Memory management strategies
   - Scaling recommendations
   - Error handling strategies
Known Issues (From Current Testing)
Memory Fragmentation
- Issue: GPU memory not released between tests
- Impact: Subsequent high-concurrency tests fail
- Workaround: Test with fresh GPU state or longer wait times
- Solution: Investigate memory cleanup strategies on A100
OOM Errors at High Concurrency
- Issue: OOM errors at C=50+ for 8K+ tokens
- Impact: Cannot test high concurrency with current environment
- Solution: A100 with 40GB+ VRAM required
References
- Performance Validation
- Deployment Guide
- Benchmark Tool: `candle-binding/examples/benchmark_concurrent.rs`
- Performance Tool: `candle-binding/examples/benchmark_performance.rs`