# Knowledge Base Guide - DevOps RAG
This guide explains how to write, organize, and optimize runbooks for retrieval quality and accuracy. It covers content structure, chunking optimization, and continuous improvement.
## 📚 Runbook Writing Best Practices
### Markdown Structure
DevOps RAG works best with well-structured Markdown documents. Follow these conventions:
````markdown
# Runbook Title - Clear and Descriptive

## Problem Description
Brief overview of when to use this runbook.

## Prerequisites
- Required tools
- Access requirements
- Knowledge assumptions

## Step-by-Step Solution

### 1. Diagnosis Commands
```bash
# Commands to identify the issue
kubectl get pods --all-namespaces
```

### 2. Root Cause Analysis
Explain common causes and how to identify them.

### 3. Resolution Steps
```bash
# Commands to fix the issue
kubectl scale deployment myapp --replicas=3
```

## Verification
How to confirm the issue is resolved.

## Prevention
Steps to prevent recurrence.

## Related Issues
Links to similar problems or follow-up actions.
````
### Content Guidelines
**DO:**
- ✅ Use clear, imperative headings ("Fix CrashLoopBackOff")
- ✅ Include actual commands and expected outputs
- ✅ Explain the "why" behind each step
- ✅ Add error message examples users might see
- ✅ Include verification steps
- ✅ Cross-reference related runbooks
**DON'T:**
- ❌ Use vague titles ("Kubernetes Issues")
- ❌ Skip command explanations
- ❌ Forget error examples
- ❌ Make assumptions about user knowledge
- ❌ Include outdated commands or URLs
---
## 🏗️ Chunking Optimization
### Optimal Chunk Boundaries
The system chunks documents at 512 tokens with 64-token overlap. Write content that aligns well with these boundaries:
**Good Chunking Example:**
````markdown
## Kubernetes Pod Troubleshooting
### CrashLoopBackOff Diagnosis
When a pod shows CrashLoopBackOff status, it indicates the container is
crashing repeatedly after startup.

**Diagnosis Commands:**
```bash
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
```

**Common Causes:**
- Application startup failures
- Misconfigured health checks
- Resource constraints
- Missing dependencies

### Resolution Steps
[Next chunk starts here…]
````
**Why This Works:**
- Each section is ~400-600 tokens (good chunk size)
- Headers create natural boundaries
- Related information stays together
- Commands and context are grouped
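
To see how the 512-token window and 64-token overlap interact, the sliding window can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual chunker: `chunk_tokens` is a hypothetical helper, and the caller is assumed to supply a token list from whatever tokenizer the system really uses.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens.

    Hypothetical helper for illustration; the real chunker may differ.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each new chunk starts this many tokens later
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    tokens = list(range(1000))  # stand-in for 1,000 real tokens
    for chunk in chunk_tokens(tokens):
        print(chunk[0], chunk[-1], len(chunk))
```

With 1,000 tokens this yields three chunks starting at tokens 0, 448, and 896, and each chunk repeats the last 64 tokens of the one before it.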
### Avoid These Anti-Patterns
**❌ Wall of Text:**
```markdown
# Kubernetes Issues
This document covers all Kubernetes problems you might encounter including
CrashLoopBackOff ImagePullBackOff NodeNotReady ServiceUnavailable
PersistentVolumeClaimPending InsufficientResources OutOfMemory
DiskPressure NetworkPolicyViolation RBAC issues certificate expiration...
```
**Problem:** No structure, poor chunking, topics mixed together

**❌ Fragmented Commands:**
````markdown
Run this:
```bash
kubectl get pods
```
Then this:
```bash
kubectl describe pod <name>
```
Finally this:
```bash
kubectl logs <name>
```
````
**Problem:** Commands split across chunks, missing context
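
One way to spot oversized or fragmented sections before ingestion is to measure each header-delimited section. The sketch below is an illustration under stated assumptions: `section_sizes` is a hypothetical helper, not part of the DevOps RAG CLI, and it approximates tokens by whitespace-separated words, which typically undercounts relative to a real tokenizer.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^#{1,6}[ \t]+\S")


def section_sizes(markdown_text):
    """Return (heading, approximate_token_count) for each header-delimited section.

    Lines starting with '#' inside fenced code blocks are treated as
    comments, not headings.
    """
    sections = []
    title, count, in_fence = "(preamble)", 0, False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line):
            sections.append((title, count))
            title, count = line.lstrip("#").strip(), 0
            continue
        count += len(line.split())
    sections.append((title, count))
    # Drop the preamble entry when nothing precedes the first heading
    return [(t, c) for t, c in sections if c > 0 or t != "(preamble)"]


if __name__ == "__main__":
    doc = "\n".join([
        "# Title", "intro words here", "## Fix",
        FENCE + "bash", "# not a heading", "kubectl get pods", FENCE,
    ])
    for heading, size in section_sizes(doc):
        print(f"{size:5d}  {heading}")
```

Sections far above the chunk size suggest a wall of text that will be split arbitrarily, while a run of very small sections suggests commands that have been separated from their context.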
## 📁 Directory Organization
### Recommended Structure
```
runbooks/
├── 01-kubernetes/
│   ├── pod-troubleshooting.md
│   ├── networking-debug.md
│   ├── rbac-security.md
│   └── scaling-performance.md
├── 02-databases/
│   ├── postgresql-operations.md
│   ├── mysql-backup-restore.md
│   └── redis-troubleshooting.md
├── 03-monitoring/
│   ├── datadog-alerts.md
│   ├── prometheus-queries.md
│   └── log-analysis.md
├── 04-cloud/
│   ├── aws-incident-response.md
│   ├── eks-operations.md
│   └── terraform-debugging.md
└── 05-infrastructure/
    ├── linux-maintenance.md
    ├── ssl-certificate-renewal.md
    └── backup-procedures.md
```
### Naming Conventions
**File Names:**
- Use descriptive, hyphenated names
- Include the primary technology/service
- Be specific: `kubernetes-crashloop-debugging.md`, not `k8s-issues.md`

**Headers:**
- Start with the problem/goal: “Fix CrashLoopBackOff”
- Use consistent hierarchy (H1 for document, H2 for sections, H3 for steps)
- Include technology in headers: “PostgreSQL Backup Procedures”
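
The file-name rules above could be enforced with a small pre-commit check. The following is only a sketch: `check_runbook_name`, the lowercase requirement, and the `VAGUE_WORDS` list are assumptions made for illustration and are not defined by DevOps RAG.

```python
import re
from pathlib import PurePosixPath

# Assumed form: lowercase words joined by hyphens, ending in .md
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+\.md$")
# Hypothetical list of words that make a file name too generic
VAGUE_WORDS = {"issues", "problems", "misc", "stuff"}


def check_runbook_name(path):
    """Return a list of naming problems for one runbook path (empty if none)."""
    name = PurePosixPath(path).name
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("use lowercase, hyphen-separated words ending in .md")
    vague = VAGUE_WORDS.intersection(PurePosixPath(name).stem.split("-"))
    if vague:
        problems.append("vague term: " + ", ".join(sorted(vague)))
    return problems


if __name__ == "__main__":
    for candidate in ("kubernetes-crashloop-debugging.md", "k8s-issues.md"):
        print(candidate, check_runbook_name(candidate))
```

A pattern can only check the form of a name; whether a name is specific enough still needs a reviewer, which is why the vague-word list is kept separate and easy to adjust.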
## 🔍 Content Optimization for RAG
### Keywords and Search Terms
Include variations of terms users might search for:
```markdown
# PostgreSQL Database Backup and Recovery
## Overview
This runbook covers PostgreSQL (postgres, pg) database backup procedures,
including pg_dump, pg_basebackup, and WAL archiving for disaster recovery.

Common search terms: postgres backup, postgresql restore, pg_dump,
database recovery, WAL files, PITR (Point-in-Time Recovery)
```
### Error Message Examples
Always include actual error messages users will see:
````markdown
## Symptoms
Users may encounter these errors:
```
Error: CrashLoopBackOff
Warning BackOff pod/myapp-7d4f8c4c8-x9k2m Back-off restarting failed container
ERRO[0001] container died unexpectedly: exit code 137
```
This helps the RAG system match user queries with exact error text.
````
### Context and Prerequisites
Provide context so chunks are self-contained:
```markdown
## Fix EKS Node NotReady Status
### Context
This applies to Amazon EKS worker nodes showing "NotReady" status in
`kubectl get nodes`. Common in EKS clusters using managed node groups
or self-managed nodes.
### Prerequisites
- kubectl configured for your EKS cluster
- AWS CLI configured with appropriate permissions
- Access to AWS Console (optional)
```
## 📊 Quality Metrics and Testing
### Test Your Runbooks
Use the built-in testing framework to validate retrieval quality:
```bash
# Test query coverage
python cli.py test --query "How do I fix CrashLoopBackOff?"
# Test all common scenarios
python cli.py test --comprehensive
# Generate quality report
python cli.py test --report quality_report.json
```
### Retrieval Quality Checklist
For each runbook, verify:
- Title matches common search queries
- Error messages included verbatim
- Commands include expected outputs
- Prerequisites clearly stated
- Verification steps provided
- Related issues cross-referenced
- Technology variations mentioned (k8s/kubernetes, pg/postgresql)
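
Parts of this checklist can be approximated automatically. The sketch below is a rough illustration rather than a feature of the testing framework: `lint_runbook`, the required heading names, and the alias groups are all assumptions, and items such as "title matches common search queries" still need human review.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
# Assumed section names; adjust to match your own template
REQUIRED_HEADINGS = ("prerequisites", "verification", "related")
# Assumed alias groups: if one variant appears, the others should too
ALIAS_GROUPS = (("k8s", "kubernetes"), ("pg", "postgresql"))


def lint_runbook(text, alias_groups=ALIAS_GROUPS):
    """Return the checklist items a runbook appears to miss.

    Heading detection is naive: '#' comments inside code blocks also count.
    """
    problems = []
    headings = [h.lower() for h in
                re.findall(r"^#{1,6}[ \t]+(.+)$", text, flags=re.MULTILINE)]
    for required in REQUIRED_HEADINGS:
        if not any(required in h for h in headings):
            problems.append(f"missing section: {required}")
    if FENCE not in text:
        problems.append("no fenced commands or error output")
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    for group in alias_groups:
        present = [alias for alias in group if alias in words]
        if present and len(present) < len(group):
            missing = sorted(set(group) - set(present))
            problems.append("missing term variation: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    sample = ("# Fix Kubernetes Pod CrashLoopBackOff\n"
              "## Prerequisites\n- kubectl\n"
              "## Verification\nCheck pods.\n")
    for problem in lint_runbook(sample):
        print(problem)
```

Run against the sample, the check reports a missing related-issues section, no fenced commands, and the absent `k8s` variation.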
### Common Query Patterns
Structure content to match how users ask questions:

**User Query:** “Pod keeps crashing”
**Runbook Title:** “Fix Pod CrashLoopBackOff - Container Restart Issues”

**User Query:** “Database backup failed”
**Runbook Title:** “PostgreSQL Backup Troubleshooting - pg_dump Errors”

**User Query:** “SSL certificate expired”
**Runbook Title:** “Renew SSL/TLS Certificates - Nginx and Apache”
## 🔄 Auto-Ingestion and Updates
### Git Integration
Set up automated ingestion from your documentation repository:
```bash
# Watch for changes and auto-ingest
python cli.py watch --directory /app/runbooks --auto-ingest
# Ingest from Git repository
python cli.py ingest --source git \
  --repo https://github.com/yourorg/runbooks.git \
  --branch main
```
### Continuous Improvement
#### Feedback Loop
1. **Monitor Query Performance**
   ```bash
   # Identify poorly performing queries
   python cli.py analyze --poor-matches --threshold 0.6
   ```
2. **Content Gap Analysis**
   ```bash
   # Find topics with no good matches
   python cli.py gaps --query-log /app/logs/queries.log
   ```
3. **Update Runbooks**
   - Add missing error messages
   - Improve section headers
   - Add cross-references
   - Include more keyword variations
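
The first step of this loop amounts to filtering logged queries by their best retrieval score. A minimal sketch of that idea follows; `QueryRecord` and `poor_matches` are hypothetical names, the record layout is assumed rather than taken from the actual query log, and 0.6 simply mirrors the threshold used in the command above.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One logged query and its best retrieval score (assumed layout)."""
    query: str
    top_score: float


def poor_matches(records, threshold=0.6):
    """Return records scoring below `threshold`, worst first."""
    return sorted(
        (r for r in records if r.top_score < threshold),
        key=lambda r: r.top_score,
    )


if __name__ == "__main__":
    log = [
        QueryRecord("pod crash", 0.82),
        QueryRecord("redis oom", 0.41),
        QueryRecord("tls renew", 0.55),
    ]
    for record in poor_matches(log):
        print(f"{record.top_score:.2f}  {record.query}")
```

Sorting worst first puts the queries most in need of a new or improved runbook at the top of the review list.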
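#### Version Control Integration
```yaml
# .github/workflows/rag-update.yml
name: Update DevOps RAG
on:
  push:
    paths: ['runbooks/**']
jobs:
  update-rag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Trigger RAG Re-ingestion
        # Assumes repository secrets with these names have been created
        env:
          RAG_WEBHOOK_URL: ${{ secrets.RAG_WEBHOOK_URL }}
          RAG_API_KEY: ${{ secrets.RAG_API_KEY }}
        # --fail makes the step fail when the webhook returns an HTTP error
        run: |
          curl --fail -X POST "$RAG_WEBHOOK_URL/ingest" \
            -H "Authorization: Bearer $RAG_API_KEY"
```
## 📖 Content Templates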
### Troubleshooting Template
````markdown
# [Technology] [Problem] - [Brief Description]

## Problem Description
What users experience and when to use this runbook.

## Symptoms
- Error messages (exact text)
- Observable behaviors
- System status indicators

## Prerequisites
- Required tools and access
- Skill level assumptions
- Environmental requirements

## Diagnosis
### Step 1: Initial Assessment
```bash
# Commands to gather information
command --status
```
Expected output:
```
Status: Error
Message: Connection refused
```
### Step 2: Root Cause Analysis
Common causes include:
- Network connectivity issues
- Service configuration problems
- Resource constraints

## Resolution
### Option 1: Quick Fix
```bash
# Commands for immediate resolution
sudo systemctl restart service
```
### Option 2: Comprehensive Fix
```bash
# Commands for permanent resolution
sudo systemctl disable service
sudo systemctl enable service --now
```

## Verification
```bash
# Verify the fix worked
command --health-check
```
Expected result: `Status: Healthy`

## Prevention
- Monitoring recommendations
- Configuration best practices
- Maintenance procedures

## Related Runbooks

## Additional Resources
- Official documentation links
- Useful blog posts or guides
````
### Operations Template
````markdown
# [Technology] [Operation] - [Brief Description]

## Overview
When and why to perform this operation.

## Prerequisites
- Access requirements
- Backup recommendations
- Time estimates

## Preparation
### Step 1: Pre-operation Checks
```bash
# Validation commands
check-system-status
```
### Step 2: Backup (if applicable)
```bash
# Backup commands
create-backup --target /safe/location
```

## Execution
### Step 1: [Operation Phase 1]
```bash
# Commands with explanations
operation-command --option value
```
### Step 2: [Operation Phase 2]
```bash
# Next set of commands
next-operation --verify
```

## Verification
```bash
# Confirm operation success
verify-operation --complete
```

## Rollback (if needed)
```bash
# Commands to undo changes
rollback-operation --to-backup
```

## Post-Operation
- Cleanup tasks
- Documentation updates
- Notification requirements

## Monitoring
What to watch for after completion.
````
---
## 🧪 Testing and Validation
### Content Quality Tests
```python
# test_runbooks.py
import pytest
from rag_engine import RAGEngine


class TestRunbookQuality:
    @classmethod
    def setup_class(cls):
        # Build the engine once and share it across all tests in this class
        cls.rag = RAGEngine()

    def test_kubernetes_queries(self):
        """Test common Kubernetes troubleshooting queries"""
        queries = [
            "Pod keeps crashing CrashLoopBackOff",
            "Kubernetes node not ready",
            "Failed to pull image ImagePullBackOff",
            "Service not accessible from outside cluster",
        ]
        for query in queries:
            result = self.rag.query(query, top_k=3)
            assert result.top_score > 0.75, f"Low score for: {query}"
            assert len(result.sources) > 0, f"No sources for: {query}"

    def test_database_operations(self):
        """Test database operation queries"""
        queries = [
            "PostgreSQL backup with pg_dump",
            "MySQL database restore procedure",
            "Redis memory usage too high",
        ]
        for query in queries:
            result = self.rag.query(query)
            # Guard against an IndexError when nothing is retrieved
            assert result.sources, f"No sources for: {query}"
            assert "database" in result.sources[0].lower()

    def test_error_message_matching(self):
        """Test exact error message retrieval"""
        error_queries = [
            "Error: CrashLoopBackOff",
            "connection refused",
            "certificate expired",
            "permission denied",
        ]
        for query in error_queries:
            result = self.rag.query(query, top_k=1)
            assert result.top_score > 0.8, f"Error matching failed: {query}"
```
Run tests:
```bash
pytest test_runbooks.py -v
```
### Performance Benchmarks
```bash
# Measure query performance across all runbooks
python cli.py benchmark --queries test_queries.txt --output benchmark.json
# Analyze retrieval quality
python cli.py analyze --metrics retrieval_quality.json
```
## 📈 Advanced Optimization
### Semantic Enhancement
Add semantic context to improve matching:
```markdown
# Kubernetes Pod Troubleshooting Guide
<!-- Semantic keywords for better matching -->
Topics: container orchestration, pod lifecycle, application debugging
Technologies: Kubernetes, K8s, containers, Docker
Error codes: CrashLoopBackOff, ImagePullBackOff, Pending, Failed
Related terms: kubectl, deployment, replica set, service mesh
```
### Cross-Reference Network
Create a web of related content:
```markdown
## Related Issues
- **Networking**: [Service Discovery Problems](kubernetes-service-debug.md)
- **Storage**: [PersistentVolume Issues](kubernetes-storage-debug.md)
- **Security**: [RBAC Permission Errors](kubernetes-rbac-debug.md)
- **Scaling**: [Resource Limits and Requests](kubernetes-resource-management.md)
## See Also
- [EKS-specific Pod Issues](aws-eks-pod-troubleshooting.md)
- [Docker Container Debugging](docker-container-debug.md)
- [Application Monitoring Setup](datadog-apm-setup.md)
```
### Metadata Enhancement
Add structured metadata for filtering:
```yaml
---
title: "Kubernetes Pod CrashLoopBackOff Troubleshooting"
category: "kubernetes"
severity: "high"
technologies: ["kubernetes", "docker", "containers"]
platforms: ["aws", "gcp", "azure", "on-premise"]
audience: ["sre", "devops", "developer"]
last_updated: "2024-05-11"
reviewed_by: "sre-team"
---
# Content starts here...
```
**Next Steps:**
- Configuration Guide - Tune chunking and retrieval
- Monitoring Guide - Track content performance
- Use Cases - Real-world implementation examples