Performance Tuning Guide

Overview

DevOps RAG has been tuned for production use across chunking strategy, retrieval parameters, and embedding model. The recommended settings are:

CHUNK_SIZE = 512      # Optimal balance of precision and rank-1 accuracy  
CHUNK_OVERLAP = 64    # Minimal overlap for 100% rank-1 accuracy
TOP_K = 5             # Good balance; 3 works for simple queries
EMBED_MODEL = "text-embedding-3-small"  # Cost-effective, good quality

Key Performance Metrics

Based on an evaluation across 18 runbooks and 20 test queries:

  • Source Hit Rate: 100% (relevant documentation retrieved for every test query)
  • Rank-1 Accuracy: 100% (correct document ranked first for every test query)
  • Average Latency: 433ms
  • P95 Latency: 1683ms
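
The metrics above can be computed from per-query evaluation records. The following is a minimal sketch, not the project's actual evaluation harness: the `QueryResult` type, its field names, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class QueryResult:
    """One evaluated test query (hypothetical record layout)."""
    expected_source: str          # runbook that should answer the query
    retrieved_sources: List[str]  # retrieved runbooks, best match first
    latency_ms: float             # end-to-end latency for this query


def source_hit_rate(results: List[QueryResult]) -> float:
    """Fraction of queries whose expected runbook appears anywhere in the results."""
    hits = sum(r.expected_source in r.retrieved_sources for r in results)
    return hits / len(results)


def rank1_accuracy(results: List[QueryResult]) -> float:
    """Fraction of queries whose expected runbook is ranked first."""
    correct = sum(
        bool(r.retrieved_sources) and r.retrieved_sources[0] == r.expected_source
        for r in results
    )
    return correct / len(results)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

With only 20 test queries, the nearest-rank P95 is simply the second-slowest query, so a single slow outlier can move it substantially.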

Chunking Strategy Insights

  1. 512-token chunks achieve perfect rank-1 accuracy while keeping the corpus size manageable
  2. 64-token overlap is essential for quality: removing overlap drops rank-1 accuracy to 90%
  3. Smaller chunks (256 tokens) have higher similarity scores but produce about twice as many chunks without improving ranking
  4. Larger chunks (768+ tokens) sacrifice precision without improving recall
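
To make the chunk size and overlap settings concrete, here is a minimal sliding-window sketch. It assumes the document has already been tokenized into a list; the function name `chunk_tokens` is hypothetical, and the tokenizer and chunker actually used by DevOps RAG may differ.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 512, overlap: int = 64) -> List[Sequence]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # each new window starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each window starts 448 tokens after the previous one, so the last 64 tokens of one chunk are repeated at the start of the next. That keeps content near a chunk boundary retrievable from both neighbouring chunks.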

Production Scaling

For larger deployments, consider:

  • Metadata filtering by service, team, or severity to narrow retrieval scope
  • Hybrid search (BM25 + semantic) for exact-match queries like error codes
  • Cross-encoder reranking for top-k refinement
  • text-embedding-3-large for domains with high semantic overlap
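
One common way to combine BM25 and semantic results for hybrid search is reciprocal rank fusion (RRF). The sketch below is illustrative only: this guide does not prescribe a fusion method, the function name is hypothetical, and `k = 60` is a conventional default rather than a value tuned for this corpus.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_k: int = 5) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1);
    scores are summed across lists and the highest totals come first.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for stable output.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]
```

A caller would pass the ranked IDs from each retriever, for example `reciprocal_rank_fusion([bm25_ids, semantic_ids])`. Because RRF uses only ranks, it avoids having to normalize BM25 scores against cosine similarities.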

Monitoring

Monitor these key metrics in production:

  • Query latency (target <500ms p95; the evaluation above measured a P95 of 1683ms, so this target is not yet met)
  • Source hit rate (should maintain >95%)
  • Embedding model costs vs. accuracy tradeoffs
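
The latency and hit-rate targets above could be checked over a window of recent queries roughly as follows. This is a sketch under assumptions: the function name, the input shapes, and the alert strings are hypothetical, and a real deployment would more likely express these thresholds as rules in its existing monitoring system.

```python
from typing import List

P95_LATENCY_TARGET_MS = 500.0  # target from this guide: p95 below 500ms
MIN_SOURCE_HIT_RATE = 0.95     # target from this guide: hit rate above 95%


def check_window(latencies_ms: List[float], hits: List[bool]) -> List[str]:
    """Return alert messages for one window of recent queries.

    latencies_ms holds the per-query latency; hits records whether each
    sampled query retrieved its relevant source.
    """
    alerts = []
    if latencies_ms:
        ordered = sorted(latencies_ms)
        # Nearest-rank p95, using integer arithmetic for the ceiling.
        rank = max(1, (95 * len(ordered) + 99) // 100)
        p95 = ordered[rank - 1]
        if p95 >= P95_LATENCY_TARGET_MS:
            alerts.append(f"p95 latency {p95:.0f}ms is not below {P95_LATENCY_TARGET_MS:.0f}ms")
    if hits:
        hit_rate = sum(hits) / len(hits)
        if hit_rate <= MIN_SOURCE_HIT_RATE:
            alerts.append(f"source hit rate {hit_rate:.0%} is not above {MIN_SOURCE_HIT_RATE:.0%}")
    return alerts
```

Source hit rate needs labeled queries, so in production it is usually measured on a fixed set of probe queries or a sampled, reviewed subset rather than on all traffic.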