Performance Tuning Guide

Overview

DevOps RAG is tuned for production use in three areas: chunking strategy, retrieval parameters, and embedding model choice. This guide lists the recommended settings and the evaluation results behind them.

Recommended Configuration

CHUNK_SIZE = 512      # Optimal balance of precision and rank-1 accuracy  
CHUNK_OVERLAP = 64    # Minimal overlap for 100% rank-1 accuracy
TOP_K = 5             # Good balance; 3 works for simple queries
EMBED_MODEL = "text-embedding-3-small"  # Cost-effective, good quality

Key Performance Metrics

Based on evaluation across 18 runbooks and 20 test queries:

Source Hit Rate: 100% (relevant documentation was retrieved for every test query)
Rank-1 Accuracy: 100% (the correct document was ranked first for every test query)
Average Latency: 433 ms
P95 Latency: 1683 ms
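
As an illustration of how the four metrics above can be computed from per-query evaluation records, here is a minimal sketch. The `QueryResult` record, its field names, and the nearest-rank percentile definition are assumptions made for this example, not the project's actual evaluation harness.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryResult:
    # Hypothetical record for one test query (field names are illustrative).
    expected_source: str          # runbook that should answer the query
    retrieved_sources: list[str]  # sources of the top-k chunks, best first
    latency_ms: float             # end-to-end query latency


def percentile(values, pct):
    """Nearest-rank percentile; other definitions interpolate instead."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Return hit rate, rank-1 accuracy, and latency statistics."""
    n = len(results)
    # A hit means the expected source appears anywhere in the top-k.
    hits = sum(r.expected_source in r.retrieved_sources for r in results)
    # Rank-1 means the expected source is the first retrieved source.
    rank1 = sum(
        bool(r.retrieved_sources) and r.retrieved_sources[0] == r.expected_source
        for r in results
    )
    latencies = [r.latency_ms for r in results]
    return {
        "source_hit_rate": hits / n,
        "rank1_accuracy": rank1 / n,
        "avg_latency_ms": sum(latencies) / n,
        "p95_latency_ms": percentile(latencies, 95),
    }
```

With only 20 test queries, a nearest-rank P95 is effectively the second-slowest query, so one or two slow outliers can dominate that figure.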

Chunking Strategy Insights

512-token chunks achieve perfect rank-1 accuracy while keeping the corpus size manageable.
A 64-token overlap is essential for quality: removing the overlap drops rank-1 accuracy to 90%.
Smaller chunks (256 tokens) have higher similarity scores but create 2x as many chunks without improving ranking.
Larger chunks (768 tokens and up) sacrifice precision without improving recall.
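
The chunk size and overlap discussed above can be pictured as a sliding window over the token sequence. The sketch below is one simple way to implement that window and is not necessarily how DevOps RAG chunks documents. The function name `chunk_tokens` is hypothetical, and the input is assumed to be a list of tokens already produced by whatever tokenizer the pipeline uses.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    # Each new chunk starts `stride` tokens after the previous one.
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the document,
        # so no trailing chunk consists only of already-covered tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the recommended settings the stride is 512 - 64 = 448 tokens, so a 1000-token runbook yields three chunks starting at tokens 0, 448, and 896.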

Production Scaling

For larger deployments, consider:

Metadata filtering by service, team, or severity to narrow retrieval scope
Hybrid search (BM25 + semantic) for exact-match queries like error codes
Cross-encoder reranking for top-k refinement
text-embedding-3-large for domains with high semantic overlap
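
For the hybrid search option above, the BM25 and semantic result lists have to be merged into a single ranking. One common technique is reciprocal rank fusion (RRF), sketched below; it is offered as an example, not as the method DevOps RAG uses. The function name, the document IDs in the usage note, and the constant k=60 (a conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Higher total score ranks first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 list `["err-503.md", "deploy.md", "db.md"]` with a semantic list `["err-503.md", "cache.md", "deploy.md"]` puts `err-503.md` first, because it leads both lists, followed by `deploy.md`, which appears in both.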

Monitoring

Monitor these key metrics in production:

Query latency (target: p95 under 500 ms; the evaluation above measured a p95 of 1683 ms, so this target is not yet met)
Source hit rate (should stay above 95%)
Embedding model cost versus accuracy
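
The latency and hit-rate targets above can be turned into a simple threshold check. The sketch below assumes the two metrics are already being collected elsewhere; the function name `check_thresholds` and the alert message wording are hypothetical.

```python
# Targets taken from the monitoring list above.
P95_LATENCY_TARGET_MS = 500.0
MIN_SOURCE_HIT_RATE = 0.95


def check_thresholds(p95_latency_ms, source_hit_rate):
    """Return a list of human-readable alerts; empty when both targets are met."""
    alerts = []
    if p95_latency_ms >= P95_LATENCY_TARGET_MS:
        alerts.append(
            f"p95 latency {p95_latency_ms:.0f} ms is not under "
            f"{P95_LATENCY_TARGET_MS:.0f} ms"
        )
    if source_hit_rate <= MIN_SOURCE_HIT_RATE:
        alerts.append(
            f"source hit rate {source_hit_rate:.0%} is not above "
            f"{MIN_SOURCE_HIT_RATE:.0%}"
        )
    return alerts
```

Feeding in the evaluation figures from this guide (p95 of 1683 ms, hit rate of 100%) would raise the latency alert only.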

Monitoring Use Cases