Big Batch Test Plan (High Concurrency C=50+)
Project: Issue #995 - ModernBERT-base-32k Integration
Required: NVIDIA A100 GPU (40GB+ VRAM)
Overview
This test plan covers validating ModernBERT-base-32k under high-concurrency (big batch) scenarios. These tests cannot be completed in the current environment (NVIDIA L4 GPU, 23GB VRAM) because of memory fragmentation issues; they require an A100 GPU with 40GB+ VRAM.
Infrastructure Status: Ready - All tools and test frameworks are prepared
Test Requirements
Hardware Requirements
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- System RAM: 64GB+ recommended
- CUDA: Version 12.0+
- Driver: Latest NVIDIA driver that supports the installed CUDA version
Software Requirements
- benchmark_concurrent.rs - Supports C=50, C=100 (currently fails due to memory)
- benchmark_performance.rs - Performance profiling tool
- Flash Attention 2 enabled
- All dependencies installed
Test Cases
1. High Concurrency Testing (C=50, C=100)
1.1 Low Context Length (1K-4K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 1024 tokens | C=50 | ≥ 90% | < 2000ms (p95) |
| 1024 tokens | C=100 | ≥ 80% | < 3000ms (p95) |
| 4096 tokens | C=50 | ≥ 80% | < 15000ms (p95) |
| 4096 tokens | C=100 | ≥ 70% | < 20000ms (p95) |
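The thresholds in the table above can be encoded as data so that each run is checked mechanically. The following is a minimal sketch, not part of `benchmark_concurrent.rs`; the names `Criterion`, `CRITERIA`, and `meets_criteria` are hypothetical, and only the numeric values come from the table.

```rust
/// One row of the expected-results table (hypothetical helper type).
struct Criterion {
    context_tokens: u32,
    concurrency: u32,
    /// Minimum success rate as a fraction, e.g. 0.90 for 90%.
    min_success_rate: f64,
    /// Strict upper bound on p95 latency, in milliseconds.
    max_p95_ms: f64,
}

/// Thresholds copied from the low-context-length table.
const CRITERIA: [Criterion; 4] = [
    Criterion { context_tokens: 1024, concurrency: 50, min_success_rate: 0.90, max_p95_ms: 2000.0 },
    Criterion { context_tokens: 1024, concurrency: 100, min_success_rate: 0.80, max_p95_ms: 3000.0 },
    Criterion { context_tokens: 4096, concurrency: 50, min_success_rate: 0.80, max_p95_ms: 15000.0 },
    Criterion { context_tokens: 4096, concurrency: 100, min_success_rate: 0.70, max_p95_ms: 20000.0 },
];

/// Returns `Some(pass)` when the table has a row for this configuration,
/// or `None` when the configuration is not covered by the table.
fn meets_criteria(
    context_tokens: u32,
    concurrency: u32,
    success_rate: f64,
    p95_ms: f64,
) -> Option<bool> {
    CRITERIA
        .iter()
        .find(|c| c.context_tokens == context_tokens && c.concurrency == concurrency)
        .map(|c| success_rate >= c.min_success_rate && p95_ms < c.max_p95_ms)
}
```

A run at 1024 tokens and C=50 with a 95% success rate and a 1500ms p95 would pass; the same run with an 85% success rate would fail.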
Test Steps:
- Run benchmark_concurrent.rs with C=50, C=100
- Test with 1K, 4K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage
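The latency statistics named in the steps above (mean, p50, p95, p99) could be computed from the raw per-request samples as sketched below. This assumes latencies are collected in milliseconds as `f64` and uses the nearest-rank percentile definition; `LatencySummary`, `percentile`, and `summarize` are illustrative names, and the benchmark tool may compute its statistics differently.

```rust
/// Summary statistics for one benchmark run (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct LatencySummary {
    mean_ms: f64,
    p50_ms: f64,
    p95_ms: f64,
    p99_ms: f64,
}

/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `p` is a percentage in the range (0, 100].
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Returns `None` when there are no samples to summarize.
fn summarize(latencies_ms: &[f64]) -> Option<LatencySummary> {
    if latencies_ms.is_empty() {
        return None;
    }
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mean_ms = sorted.iter().sum::<f64>() / sorted.len() as f64;
    Some(LatencySummary {
        mean_ms,
        p50_ms: percentile(&sorted, 50.0),
        p95_ms: percentile(&sorted, 95.0),
        p99_ms: percentile(&sorted, 99.0),
    })
}
```

Only successful requests should be passed to `summarize`; failed requests belong in the success/error rate instead, otherwise early OOM failures would pull the percentiles down.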
Deliverables:
- Latency statistics for each concurrency level
- Success/error rates
- Memory usage profiles
- Throughput measurements
1.2 Medium Context Length (8K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 8192 tokens | C=50 | ≥ 70% | < 25000ms (p95) |
| 8192 tokens | C=100 | ≥ 60% | < 35000ms (p95) |
Test Steps:
- Run benchmark_concurrent.rs with C=50, C=100
- Test with 8K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage and fragmentation
Deliverables:
- Latency statistics
- Success/error rates
- Memory fragmentation analysis
- Recommendations for production
2. Throughput Analysis
2.1 Requests Per Second (RPS)
Test Steps:
- Measure throughput at different concurrency levels
- Calculate RPS for C=50, C=100
- Compare with C=1, C=10 baseline
- Document throughput scaling
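One way to express the throughput comparison in the steps above is to compute RPS from completed requests and wall-clock time, then compare it with what linear scaling from the baseline would give. This is a sketch under those assumptions; `requests_per_second` and `scaling_efficiency` are hypothetical helpers, not functions from the benchmark tools.

```rust
use std::time::Duration;

/// Throughput in requests per second; returns 0.0 for a zero-length run.
fn requests_per_second(completed: u64, wall_clock: Duration) -> f64 {
    let secs = wall_clock.as_secs_f64();
    if secs > 0.0 {
        completed as f64 / secs
    } else {
        0.0
    }
}

/// Ratio of measured RPS to the RPS expected if throughput scaled linearly
/// from the baseline concurrency. 1.0 means perfectly linear scaling;
/// values well below 1.0 suggest a bottleneck.
fn scaling_efficiency(
    baseline_rps: f64,
    baseline_concurrency: u32,
    rps: f64,
    concurrency: u32,
) -> f64 {
    let ideal_rps = baseline_rps * (concurrency as f64 / baseline_concurrency as f64);
    rps / ideal_rps
}
```

For example, if C=1 sustains 10 RPS and C=50 sustains 250 RPS, the efficiency is 0.5: throughput grew 25x for a 50x increase in concurrency.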
Deliverables:
- RPS measurements
- Throughput scaling analysis
- Bottleneck identification
3. Memory Fragmentation Analysis
3.1 Memory Management Under High Concurrency
Test Steps:
- Monitor GPU memory usage during high concurrency tests
- Track memory allocation/deallocation patterns
- Identify memory fragmentation issues
- Test memory cleanup strategies
- Document recommendations
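GPU memory usage during a run could be sampled by polling `nvidia-smi` with its CSV query output, as sketched below. This is an assumption about how monitoring might be done, not a description of the existing tooling; `GpuMemory`, `parse_memory_line`, and `sample_gpu_memory` are hypothetical names. Note that this reports device-level usage only. Fragmentation itself is not visible to `nvidia-smi` and would need allocator-level statistics from the inference framework.

```rust
use std::process::Command;

/// Memory reading for one GPU, in MiB (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct GpuMemory {
    used_mib: u64,
    total_mib: u64,
}

/// Parses one line of `nvidia-smi` CSV output such as "1234, 23034".
/// Returns `None` if either field is missing or not a number.
fn parse_memory_line(line: &str) -> Option<GpuMemory> {
    let mut fields = line.split(',').map(|f| f.trim().parse::<u64>());
    let used_mib = fields.next()?.ok()?;
    let total_mib = fields.next()?.ok()?;
    Some(GpuMemory { used_mib, total_mib })
}

/// Takes one sample per visible GPU. Returns `None` if `nvidia-smi`
/// is unavailable, exits with an error, or prints unparseable output.
fn sample_gpu_memory() -> Option<Vec<GpuMemory>> {
    let output = Command::new("nvidia-smi")
        .args([
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ])
        .output()
        .ok()?;
    if !output.status.success() {
        return None;
    }
    let text = String::from_utf8(output.stdout).ok()?;
    let parsed: Option<Vec<GpuMemory>> = text.lines().map(parse_memory_line).collect();
    parsed
}
```

Calling `sample_gpu_memory` at a fixed interval alongside the benchmark, and recording the peak `used_mib` per concurrency level, would give the memory usage profile listed in the deliverables.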
Deliverables:
- Memory usage profiles
- Fragmentation analysis
- Cleanup strategy recommendations
4. Latency Distribution Analysis
4.1 P95/P99 Latency Under High Concurrency
Test Steps:
- Measure latency distribution at C=50, C=100
- Analyze p50, p95, p99 percentiles
- Identify latency spikes
- Document tail latency behavior
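Latency spikes can be flagged with a simple rule: any request slower than some multiple of the median. The sketch below assumes that rule and a caller-chosen `factor`; the plan does not define what counts as a spike, so both the rule and the name `find_spikes` are illustrative.

```rust
/// Returns the indices (in arrival order) of samples whose latency exceeds
/// `factor` times the median. For even-length inputs the upper median is used.
fn find_spikes(latencies_ms: &[f64], factor: f64) -> Vec<usize> {
    if latencies_ms.is_empty() {
        return Vec::new();
    }
    let mut sorted = latencies_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = sorted[sorted.len() / 2];
    let threshold = factor * median;
    latencies_ms
        .iter()
        .enumerate()
        .filter_map(|(i, &ms)| if ms > threshold { Some(i) } else { None })
        .collect()
}
```

Keeping the original indices makes it possible to line spikes up against the request timeline, for instance to see whether they cluster at the start of a run or recur periodically.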
Deliverables:
- Latency distribution charts
- P95/P99 analysis
- Tail latency recommendations
5. Error Rate Analysis
5.1 OOM Error Patterns
Test Steps:
- Track OOM errors at different concurrency levels
- Analyze error patterns
- Identify failure modes
- Document error recovery strategies
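To track OOM errors separately from other failures, error messages can be bucketed by kind and counted per concurrency level. The sketch below matches on message text, which is an assumption: the exact error strings produced by the benchmark are not given in this plan. The substrings used are the CUDA runtime's "out of memory" text and the driver error name `CUDA_ERROR_OUT_OF_MEMORY`; `FailureKind`, `classify`, and `tally` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Coarse failure categories (hypothetical; extend as new modes appear).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FailureKind {
    OutOfMemory,
    Timeout,
    Other,
}

/// Classifies an error message by case-insensitive substring match.
fn classify(message: &str) -> FailureKind {
    let m = message.to_lowercase();
    if m.contains("out of memory") || m.contains("out_of_memory") {
        FailureKind::OutOfMemory
    } else if m.contains("timed out") || m.contains("timeout") {
        FailureKind::Timeout
    } else {
        FailureKind::Other
    }
}

/// Counts failures per category for one benchmark run.
fn tally(messages: &[&str]) -> BTreeMap<FailureKind, usize> {
    let mut counts = BTreeMap::new();
    for message in messages {
        *counts.entry(classify(message)).or_insert(0) += 1;
    }
    counts
}
```

Running `tally` once per concurrency level gives a small table of failure counts, which makes it easy to see at which level OOM errors start to dominate.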
Deliverables:
- Error rate analysis
- Failure mode documentation
- Recovery strategy recommendations