Big Batch Test Plan (High Concurrency C=50+)
Project: Issue #995 - ModernBERT-base-32k Integration
Required: NVIDIA A100 GPU (40GB+ VRAM)
Overview
This test plan covers validation of ModernBERT-base-32k under high-concurrency (big batch) scenarios. These tests cannot be completed in the current environment (NVIDIA L4 GPU, 23GB VRAM) due to memory fragmentation issues; they require an A100 GPU with 40GB+ VRAM.
Infrastructure Status: Ready - All tools and test frameworks are prepared
Test Requirements
Hardware Requirements
- GPU: NVIDIA A100 (40GB+ VRAM) - Required
- System RAM: 64GB+ recommended
- CUDA: Version 12.0+
- Driver: Latest NVIDIA driver
Software Requirements
- `benchmark_concurrent.rs` - Supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - Performance profiling tool
- Flash Attention 2 enabled
- All dependencies installed
Test Cases
1. High Concurrency Testing (C=50, C=100)
1.1 Low Context Length (1K-4K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 1024 tokens | C=50 | ≥ 90% | < 2000ms (p95) |
| 1024 tokens | C=100 | ≥ 80% | < 3000ms (p95) |
| 4096 tokens | C=50 | ≥ 80% | < 15000ms (p95) |
| 4096 tokens | C=100 | ≥ 70% | < 20000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50, C=100
- Test with 1K and 4K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage
Deliverables:
- Latency statistics for each concurrency level
- Success/error rates
- Memory usage profiles
- Throughput measurements
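The mean/p50/p95/p99 figures in the tables are derived from per-request latency samples. A minimal sketch of the percentile computation (a hypothetical helper shown for illustration, not code from `benchmark_concurrent.rs`), using the nearest-rank method:

```rust
// Hypothetical helper: derive a latency percentile from collected samples
// using the nearest-rank method (index = ceil(pct/100 * N) - 1 after sorting).
fn percentile(samples: &mut [u64], pct: f64) -> u64 {
    samples.sort_unstable();
    let rank = ((pct / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}

fn main() {
    // Simulated per-request latencies in milliseconds.
    let mut latencies: Vec<u64> = (1..=100).collect();
    println!("p50 = {} ms", percentile(&mut latencies, 50.0)); // 50
    println!("p95 = {} ms", percentile(&mut latencies, 95.0)); // 95
    println!("p99 = {} ms", percentile(&mut latencies, 99.0)); // 99
}
```

Comparing p95 against the thresholds in the table above then gives a pass/fail per row.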
1.2 Medium Context Length (8K tokens)
| Context Length | Concurrency | Expected Success Rate | Expected Latency |
|---|---|---|---|
| 8192 tokens | C=50 | ≥ 70% | < 25000ms (p95) |
| 8192 tokens | C=100 | ≥ 60% | < 35000ms (p95) |
Test Steps:
- Run `benchmark_concurrent.rs` with C=50, C=100
- Test with 8K tokens
- Measure latency (mean, p50, p95, p99)
- Track success/error rates
- Document memory usage and fragmentation
Deliverables:
- Latency statistics
- Success/error rates
- Memory fragmentation analysis
- Recommendations for production
2. Throughput Analysis
2.1 Requests Per Second (RPS)
Test Steps:
- Measure throughput at different concurrency levels
- Calculate RPS for C=50, C=100
- Compare with C=1, C=10 baseline
- Document throughput scaling
Deliverables:
- RPS measurements
- Throughput scaling analysis
- Bottleneck identification
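The RPS measurement above amounts to: run C workers, count completed requests, divide by wall-clock time. A simulation sketch of that idea (the real measurement is done by `benchmark_concurrent.rs`; here a sleep stands in for one inference request):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Sketch: measure requests-per-second at a given concurrency level by
// spawning `concurrency` workers and dividing completed requests by
// elapsed wall-clock time. The sleep simulates one inference request.
fn measure_rps(concurrency: usize, requests_per_worker: usize) -> f64 {
    let completed = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..concurrency)
        .map(|_| {
            let completed = Arc::clone(&completed);
            thread::spawn(move || {
                for _ in 0..requests_per_worker {
                    thread::sleep(Duration::from_millis(1)); // stand-in for a request
                    completed.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    completed.load(Ordering::Relaxed) as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    println!("simulated RPS at C=50: {:.0}", measure_rps(50, 10));
}
```

Running this at C=1, C=10, C=50, C=100 and plotting RPS against C exposes where throughput stops scaling, which is the bottleneck-identification step.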
3. Memory Fragmentation Analysis
3.1 Memory Management Under High Concurrency
Test Steps:
- Monitor GPU memory usage during high concurrency tests
- Track memory allocation/deallocation patterns
- Identify memory fragmentation issues
- Test memory cleanup strategies
- Document recommendations
Deliverables:
- Memory usage profiles
- Fragmentation analysis
- Cleanup strategy recommendations
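For the memory-usage profiles, one option is to sample `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` (one MiB value per line) once per second alongside the benchmark. A hypothetical parser for that output, shown only as a sketch; a floor that rises between runs suggests fragmentation or unreleased allocations:

```rust
// Hypothetical helper: parse the output of
//   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
// which prints one used-memory value in MiB per GPU, one per line.
fn parse_used_mib(nvidia_smi_output: &str) -> Vec<u64> {
    nvidia_smi_output
        .lines()
        .filter_map(|l| l.trim().parse::<u64>().ok())
        .collect()
}

fn main() {
    // A captured sample line (single-GPU system).
    let sample = "18342\n";
    println!("used MiB per GPU: {:?}", parse_used_mib(sample));
}
```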
4. Latency Distribution Analysis
4.1 P95/P99 Latency Under High Concurrency
Test Steps:
- Measure latency distribution at C=50, C=100
- Analyze p50, p95, p99 percentiles
- Identify latency spikes
- Document tail latency behavior
Deliverables:
- Latency distribution charts
- P95/P99 analysis
- Tail latency recommendations
5. Error Rate Analysis
5.1 OOM Error Patterns
Test Steps:
- Track OOM errors at different concurrency levels
- Analyze error patterns
- Identify failure modes
- Document error recovery strategies
Deliverables:
- Error rate analysis
- Failure mode documentation
- Recovery strategy recommendations
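For the error-pattern analysis, failures can be bucketed by message. A hypothetical classifier sketch; the matched substrings are assumptions and may not match the runtime's exact wording:

```rust
// Hypothetical bucketing of benchmark failure messages into coarse modes.
// The substrings matched below are assumptions about the error text.
#[derive(Debug, PartialEq)]
enum FailureMode {
    Oom,
    Timeout,
    Other,
}

fn classify(err: &str) -> FailureMode {
    let e = err.to_lowercase();
    if e.contains("out of memory") || e.contains("oom") {
        FailureMode::Oom
    } else if e.contains("timed out") || e.contains("timeout") {
        FailureMode::Timeout
    } else {
        FailureMode::Other
    }
}

fn main() {
    println!("{:?}", classify("CUDA error: out of memory")); // Oom
}
```

Tallying buckets per concurrency level and context length gives the error-rate breakdown called for above.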
Expected Outcomes
Success Criteria
- C=50:
  - 1K tokens: ≥ 90% success rate
  - 4K tokens: ≥ 80% success rate
  - 8K tokens: ≥ 70% success rate
- C=100:
  - 1K tokens: ≥ 80% success rate
  - 4K tokens: ≥ 70% success rate
  - 8K tokens: ≥ 60% success rate
- Latency:
  - P95 latency within acceptable limits
  - No excessive tail latency
- Memory:
  - Memory usage within GPU limits
  - No memory leaks
  - Acceptable fragmentation
Infrastructure Readiness
Tools Ready
- `benchmark_concurrent.rs` - Supports C=50, C=100 (currently fails due to memory)
- `benchmark_performance.rs` - Performance profiling ready
- Flash Attention 2 enabled
- All dependencies installed
How to Run Tests
1. Ensure sufficient GPU memory (A100 40GB+)
2. Run the benchmark with high concurrency:
   cargo run --example benchmark_concurrent --release --features cuda,flash-attn
3. Monitor memory usage:
   watch -n 1 nvidia-smi
Resource Estimates
Time Estimates
- High Concurrency Testing: 6-8 hours
- Throughput Analysis: 2-3 hours
- Memory Fragmentation Analysis: 3-4 hours
- Latency Distribution Analysis: 2-3 hours
- Error Rate Analysis: 2-3 hours
Total: ~15-21 hours
Resource Requirements
- GPU: A100 (40GB VRAM) - Required
- System RAM: 64GB+ recommended
- Storage: ~10GB for test data and results
Deliverables
1. Test Results Report
   - Latency measurements (C=50, C=100)
   - Success/error rates
   - Throughput measurements
   - Memory usage profiles
2. Performance Analysis
   - Throughput scaling analysis
   - Memory fragmentation analysis
   - Latency distribution analysis
   - Error rate patterns
3. Recommendations
   - Production concurrency limits
   - Memory management strategies
   - Scaling recommendations
   - Error handling strategies
Known Issues (From Current Testing)
Memory Fragmentation
- Issue: GPU memory not released between tests
- Impact: Subsequent high-concurrency tests fail
- Workaround: Test with fresh GPU state or longer wait times
- Solution: Investigate memory cleanup strategies on A100
OOM Errors at High Concurrency
- Issue: OOM errors at C=50+ for 8K+ tokens
- Impact: Cannot test high concurrency with current environment
- Solution: A100 with 40GB+ VRAM required
References
- Performance Validation
- Deployment Guide
- Benchmark Tool: `candle-binding/examples/benchmark_concurrent.rs`
- Performance Tool: `candle-binding/examples/benchmark_performance.rs`