Version: 🚧 Next

Long Context Test Plan (16K-32K Tokens)

Project: Issue #995 - ModernBERT-base-32k Integration
Required: NVIDIA A100 GPU (40GB+ VRAM)

Overview

This test plan covers validation of ModernBERT-base-32k for long context sequences (16K-32K tokens). These tests cannot be completed with the current environment (NVIDIA L4 GPU, 23GB VRAM) and require an A100 GPU with 40GB+ VRAM.

Infrastructure Status: Ready - All tools and test frameworks are prepared

Test Requirements

Hardware Requirements

GPU: NVIDIA A100 (40GB+ VRAM) - Required
System RAM: 64GB+ recommended
CUDA: Version 12.0+
Driver: Latest NVIDIA driver

Software Requirements

benchmark_concurrent.rs - Supports 16K/32K (currently commented out)
benchmark_performance.rs - Performance profiling tool
Flash Attention 2 enabled
All dependencies installed

Test Cases

1. Basic Inference Testing

1.1 Single Request Latency (C=1)

Context Length	Expected Latency	Success Criteria
16384 tokens	< 10s	Latency < 10s
24576 tokens	< 15s	Latency < 15s
32768 tokens	< 20s	Latency < 20s

Test Steps:

Load ModernBERT-base-32k model
Create test sequences of 16K, 24K, 32K tokens
Measure inference latency for each
Verify no OOM errors
Document results

Deliverables:

Latency measurements for each context length
Memory usage profiles
Success/failure status

2. Concurrent Request Testing

2.1 Low Concurrency (C=1, C=10)

Context Length	Concurrency	Expected Success Rate
16384 tokens	C=1	100%
16384 tokens	C=10	≥ 80%
32768 tokens	C=1	100%
32768 tokens	C=10	≥ 80%

Test Steps:

Run benchmark_concurrent.rs with 16K, 32K tokens
Test with C=1 and C=10
Measure latency (mean, p50, p95, p99)
Track success/error rates
Document memory usage

Deliverables:

Latency statistics for each concurrency level
Success/error rates
Memory usage profiles

3. Performance Profiling

3.1 Component Breakdown

Test Steps:

Run benchmark_performance.rs for 16K, 32K tokens
Measure:
- Tokenization time
- Tensor creation time
- Forward pass time
- Total latency
Compare with Flash Attention 2 enabled/disabled
Document performance breakdown

Deliverables:

Performance breakdown by component
Flash Attention 2 impact
Bottleneck identification

4. Memory Profiling

4.1 Memory Usage Analysis

Test Steps:

Measure GPU memory usage for each context length
Track memory allocation patterns
Identify memory peaks
Document memory requirements

Deliverables:

Memory usage profiles
Peak memory requirements
Memory efficiency metrics

5. Accuracy Validation

5.1 Signal Extraction Accuracy

Test Steps:

Test domain classification accuracy at 16K, 32K tokens
Test PII detection accuracy at 16K, 32K tokens
Test jailbreak detection accuracy at 16K, 32K tokens
Compare with baseline (512 tokens)
Document accuracy degradation (if any)

Deliverables:

Accuracy measurements for each signal type
Comparison with baseline
Accuracy degradation analysis

6. Position Accuracy Testing

6.1 Information Retrieval at Different Positions

Test Steps:

Place test information at beginning, middle, end of sequence
Test with 16K, 32K tokens
Measure retrieval accuracy for each position
Compare with model card baseline
Document position accuracy

Deliverables:

Position accuracy results
Comparison with baseline
Recommendations

Expected Outcomes

Success Criteria

16K tokens:
- C=1 latency < 10s
- C=10 success rate ≥ 80%
- No OOM errors
32K tokens:
- C=1 latency < 20s
- C=10 success rate ≥ 80%
- No OOM errors
Accuracy:
- Signal extraction accuracy maintained (≥ 0.90 for domain, ≥ 0.85 for PII)
- Position accuracy comparable to model card baseline
Memory:
- Memory usage within GPU limits
- No memory leaks

Infrastructure Readiness

Tools Ready

benchmark_concurrent.rs - Supports 16K/32K (uncomment test cases)
benchmark_performance.rs - Performance profiling ready
Flash Attention 2 enabled
All dependencies installed

How to Run Tests

Uncomment test cases in benchmark_concurrent.rs:

let context_lengths = vec![
    1024,
    4096,
    8192,
    16384,  // Uncomment this
    32768,  // Uncomment this
];

Run benchmark:

cargo run --example benchmark_concurrent --release --features cuda,flash-attn

Run performance profiling:

cargo run --example benchmark_performance --release --features cuda,flash-attn

Resource Estimates

Time Estimates

Basic Inference Testing: 2-3 hours
Concurrent Request Testing: 4-6 hours
Performance Profiling: 2-3 hours
Memory Profiling: 1-2 hours
Accuracy Validation: 3-4 hours
Position Accuracy Testing: 2-3 hours

Total: ~14-21 hours

Resource Requirements

GPU: A100 (40GB VRAM) - Required
System RAM: 64GB+ recommended
Storage: ~10GB for test data and results

Deliverables

Test Results Report
- Latency measurements (16K, 32K tokens)
- Concurrency results
- Memory usage profiles
- Accuracy measurements
Performance Analysis
- Component breakdown
- Flash Attention 2 impact
- Bottleneck identification
Recommendations
- Deployment recommendations for 16K-32K tokens
- Chunking strategy recommendations
- Resource requirements

References

Performance Validation: Performance Validation
Deployment Guide: Deployment Guide
Benchmark Tool: candle-binding/examples/benchmark_concurrent.rs
Performance Tool: candle-binding/examples/benchmark_performance.rs

Overview​

Test Requirements​

Hardware Requirements​

Software Requirements​

Test Cases​

1. Basic Inference Testing​

1.1 Single Request Latency (C=1)​

2. Concurrent Request Testing​

2.1 Low Concurrency (C=1, C=10)​

3. Performance Profiling​

3.1 Component Breakdown​

4. Memory Profiling​

4.1 Memory Usage Analysis​

5. Accuracy Validation​

5.1 Signal Extraction Accuracy​

6. Position Accuracy Testing​

6.1 Information Retrieval at Different Positions​

Expected Outcomes​

Success Criteria​

Infrastructure Readiness​

Tools Ready​

How to Run Tests​

Resource Estimates​

Time Estimates​

Resource Requirements​

Deliverables​

References​

Overview

Test Requirements

Hardware Requirements

Software Requirements

Test Cases

1. Basic Inference Testing

1.1 Single Request Latency (C=1)

2. Concurrent Request Testing

2.1 Low Concurrency (C=1, C=10)

3. Performance Profiling

3.1 Component Breakdown

4. Memory Profiling

4.1 Memory Usage Analysis

5. Accuracy Validation

5.1 Signal Extraction Accuracy

6. Position Accuracy Testing

6.1 Information Retrieval at Different Positions

Expected Outcomes

Success Criteria

Infrastructure Readiness

Tools Ready

How to Run Tests

Resource Estimates

Time Estimates

Resource Requirements

Deliverables

References