Performance Tuning Guide
Overview
DevOps RAG has been tuned for production use across three areas: chunking strategy, retrieval parameters, and embedding model choice. The settings and measurements below come from that tuning.
Recommended Configuration
CHUNK_SIZE = 512 # Optimal balance of precision and rank-1 accuracy
CHUNK_OVERLAP = 64 # Minimal overlap for 100% rank-1 accuracy
TOP_K = 5 # Good balance; 3 works for simple queries
EMBED_MODEL = "text-embedding-3-small" # Cost-effective, good quality
Key Performance Metrics
Based on evaluation across 18 runbooks and 20 test queries:
- Source Hit Rate: 100% (relevant documentation retrieved for every test query)
- Rank-1 Accuracy: 100% (correct document ranked first)
- Average Latency: 433ms
- P95 Latency: 1683ms
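The sketch below shows one way metrics like these could be computed from per-query evaluation records. It is illustrative only: the `QueryResult` record and `summarize` function are hypothetical names, not part of DevOps RAG, and the nearest-rank percentile is an assumed definition of P95, since this guide does not say how the evaluation computed it.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One evaluated query (hypothetical record layout)."""
    expected_source: str      # runbook that should answer the query
    retrieved_sources: list   # retrieved runbook names, best match first
    latency_ms: float         # end-to-end latency for this query


def summarize(results):
    """Aggregate hit rate, rank-1 accuracy, and latency statistics."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one result")

    # Hit: the expected runbook appears anywhere in the retrieved list.
    hits = sum(r.expected_source in r.retrieved_sources for r in results)
    # Rank-1: the expected runbook is the first retrieved item.
    rank1 = sum(
        bool(r.retrieved_sources) and r.retrieved_sources[0] == r.expected_source
        for r in results
    )

    latencies = sorted(r.latency_ms for r in results)
    # Nearest-rank P95: smallest sample with at least 95% of samples
    # at or below it (an assumed definition).
    p95 = latencies[math.ceil(n * 95 / 100) - 1]

    return {
        "source_hit_rate": hits / n,
        "rank1_accuracy": rank1 / n,
        "avg_latency_ms": sum(latencies) / n,
        "p95_latency_ms": p95,
    }
```

With only 20 test queries, a nearest-rank P95 is simply the second-slowest query, so a single slow request can move that figure substantially.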
Chunking Strategy Insights
- 512-token chunks achieve perfect rank-1 accuracy while maintaining manageable corpus size
- 64-token overlap is essential for quality: removing overlap entirely drops rank-1 accuracy to 90%
- Smaller chunks (256 tokens) score higher on similarity but produce twice as many chunks without improving ranking
- Larger chunks (768+) sacrifice precision without improving recall
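To make the chunk size and overlap settings concrete, here is a minimal sliding-window sketch. The function name `chunk_tokens` is hypothetical, and the input is assumed to be an already tokenized list (for example, token IDs from the embedding model's tokenizer); this guide does not describe how DevOps RAG tokenizes or splits documents internally.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document,
        # so no trailing chunk consists only of repeated tokens.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Example: a 1000-token document with the recommended settings.
tokens = list(range(1000))
chunks = chunk_tokens(tokens)
print(len(chunks))               # number of chunks produced
print([len(c) for c in chunks])  # the final chunk is usually shorter
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is a plausible reason the no-overlap configuration ranks worse.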
Production Scaling
For larger deployments, consider:
- Metadata filtering by service, team, or severity to narrow retrieval scope
- Hybrid search (BM25 + semantic) for exact-match queries like error codes
- Cross-encoder reranking for top-k refinement
- text-embedding-3-large for domains with high semantic overlap
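As a rough illustration of the first two options, the sketch below filters chunks by metadata and then merges a BM25 ranking with a semantic ranking. Both function names and the chunk dictionary layout are hypothetical. The merge uses reciprocal rank fusion with the commonly used constant k=60, which is one assumed way to combine the two rankings; this guide does not prescribe a fusion method.

```python
from collections import defaultdict


def filter_by_metadata(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value.
    Assumes each chunk is a dict with a "metadata" dict (hypothetical layout)."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.
    Each document scores 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example with made-up runbook IDs.
bm25_ranking = ["disk-full", "oom-kill", "cert-expiry"]
semantic_ranking = ["oom-kill", "pod-restart", "disk-full"]
print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
```

Rank-based fusion avoids having to normalize BM25 scores and cosine similarities onto a common scale, which is why it is a convenient starting point before investing in a cross-encoder reranker.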
Monitoring
Monitor these key metrics in production:
- Query latency (target <500ms p95; for reference, the evaluation above measured 1683ms p95)
- Source hit rate (should maintain >95%)
- Embedding model costs vs. accuracy tradeoffs
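The latency and hit-rate targets above can be expressed as a simple threshold check. This is a sketch only: `check_health` and the constant names are hypothetical, and how the inputs are collected (metrics backend, sampling window) is left open because this guide does not specify it.

```python
# Targets taken from the list above.
P95_LATENCY_TARGET_MS = 500.0   # p95 latency should stay below this
MIN_SOURCE_HIT_RATE = 0.95      # hit rate should stay above this


def check_health(p95_latency_ms, source_hit_rate):
    """Return a list of human-readable alerts; empty means all targets met."""
    alerts = []
    if p95_latency_ms >= P95_LATENCY_TARGET_MS:
        alerts.append(
            f"p95 latency {p95_latency_ms:.0f}ms is not below the "
            f"{P95_LATENCY_TARGET_MS:.0f}ms target"
        )
    if source_hit_rate <= MIN_SOURCE_HIT_RATE:
        alerts.append(
            f"source hit rate {source_hit_rate:.0%} is not above "
            f"{MIN_SOURCE_HIT_RATE:.0%}"
        )
    return alerts


# Example: one healthy window and one degraded window (made-up values).
print(check_health(p95_latency_ms=420.0, source_hit_rate=0.98))
print(check_health(p95_latency_ms=900.0, source_hit_rate=0.90))
```

Embedding cost is not included here because it is a tradeoff to review periodically rather than a threshold to alert on.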