# Knowledge Base Guide - DevOps RAG

This guide explains how to write, organize, and optimize runbooks for retrieval quality and accuracy. It covers best practices for content structure, chunking optimization, and continuous improvement.

## 📚 Runbook Writing Best Practices

### Markdown Structure

DevOps RAG works best with well-structured Markdown documents. Follow these conventions:

````markdown
# Runbook Title - Clear and Descriptive

## Problem Description
Brief overview of when to use this runbook.

## Prerequisites
- Required tools
- Access requirements
- Knowledge assumptions

## Step-by-Step Solution

### 1. Diagnosis Commands
```bash
# Commands to identify the issue
kubectl get pods --all-namespaces
```

### 2. Root Cause Analysis
Explain common causes and how to identify them.

### 3. Resolution Steps
```bash
# Commands to fix the issue
kubectl scale deployment myapp --replicas=3
```

## Verification
How to confirm the issue is resolved.

## Prevention
Steps to prevent recurrence.

## Related Issues
Links to similar problems or follow-up actions.
````


### Content Guidelines

**DO:**
- ✅ Use clear, imperative headings ("Fix CrashLoopBackOff")
- ✅ Include actual commands and expected outputs
- ✅ Explain the "why" behind each step
- ✅ Add error message examples users might see
- ✅ Include verification steps
- ✅ Cross-reference related runbooks

**DON'T:**
- ❌ Use vague titles ("Kubernetes Issues") 
- ❌ Skip command explanations
- ❌ Forget error examples
- ❌ Make assumptions about user knowledge
- ❌ Include outdated commands or URLs

---

## 🏗️ Chunking Optimization

### Optimal Chunk Boundaries

The system chunks documents at 512 tokens with 64-token overlap. Write content that aligns well with these boundaries:

**Good Chunking Example:**
````markdown
## Kubernetes Pod Troubleshooting

### CrashLoopBackOff Diagnosis
When a pod shows CrashLoopBackOff status, it indicates the container is 
crashing repeatedly after startup.

**Diagnosis Commands:**
```bash
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
```

**Common Causes:**
1. Application startup failures
2. Misconfigured health checks
3. Resource constraints
4. Missing dependencies

### Resolution Steps
[Next chunk starts here…]
````


**Why This Works:**
- Each section is ~400-500 tokens, so it fits inside one 512-token chunk
- Headers create natural boundaries
- Related information stays together
- Commands and context are grouped
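
To see how the 512-token window and the 64-token overlap interact with section length, here is a minimal sketch of fixed-size chunking. It assumes that whitespace-separated words stand in for tokens and that the window advances by size minus overlap. The tokenizer and chunker that DevOps RAG actually uses may behave differently, and `chunk_tokens` is a hypothetical helper.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: whitespace-separated words approximate tokens.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [512, 512, 104]
```

Under these assumptions a 1,000-word section is cut into three chunks, while a section of about 450 words stays in one, which is why headers that keep sections short help retrieval.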

### Avoid These Anti-Patterns

**❌ Wall of Text:**
```markdown
# Kubernetes Issues
This document covers all Kubernetes problems you might encounter including 
CrashLoopBackOff ImagePullBackOff NodeNotReady ServiceUnavailable 
PersistentVolumeClaimPending InsufficientResources OutOfMemory 
DiskPressure NetworkPolicyViolation RBAC issues certificate expiration...
```

**Problem:** No structure, poor chunking, topics mixed together

**❌ Fragmented Commands:**
````markdown
Run this:
```bash
kubectl get pods
```

Then this:
```bash
kubectl describe pod <name>
```

Finally this:
```bash
kubectl logs <name>
```
````

**Problem:** Commands split across chunks, missing context
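
One way to catch oversized sections before ingestion is to measure each section of a runbook. The sketch below splits a Markdown document at `#` to `###` headers and flags sections whose approximate token count exceeds the chunk size. The whitespace-based token estimate and the helper names `section_lengths` and `oversized_sections` are assumptions for illustration, not part of DevOps RAG.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def section_lengths(markdown: str) -> Dict[str, int]:
    """Map each header to the approximate token count of its section body."""
    lengths: Dict[str, int] = {}
    current = "(preamble)"
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' inside a code fence is a comment, not a header
        match = None if in_fence else HEADER.match(line)
        if match:
            current = match.group(2).strip()
            lengths.setdefault(current, 0)
        else:
            lengths[current] = lengths.get(current, 0) + len(line.split())
    return lengths


def oversized_sections(markdown: str, limit: int = 512) -> List[str]:
    """Return the headers of sections longer than `limit` approximate tokens."""
    return [name for name, count in section_lengths(markdown).items() if count > limit]


doc = "\n".join([
    "# Kubernetes Issues",
    "word " * 600,
    "## Fix CrashLoopBackOff",
    "Short section.",
    FENCE + "bash",
    "# not a header",
    "kubectl get pods",
    FENCE,
])
print(oversized_sections(doc))  # ['Kubernetes Issues']
```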


## 📁 Directory Organization

```
runbooks/
├── 01-kubernetes/
│   ├── pod-troubleshooting.md
│   ├── networking-debug.md
│   ├── rbac-security.md
│   └── scaling-performance.md
├── 02-databases/
│   ├── postgresql-operations.md
│   ├── mysql-backup-restore.md
│   └── redis-troubleshooting.md
├── 03-monitoring/
│   ├── datadog-alerts.md
│   ├── prometheus-queries.md
│   └── log-analysis.md
├── 04-cloud/
│   ├── aws-incident-response.md
│   ├── eks-operations.md
│   └── terraform-debugging.md
└── 05-infrastructure/
    ├── linux-maintenance.md
    ├── ssl-certificate-renewal.md
    └── backup-procedures.md
```

### Naming Conventions

**File Names:**
- Use descriptive, hyphenated names
- Include the primary technology/service
- Be specific: `kubernetes-crashloop-debugging.md` not `k8s-issues.md`

**Headers:**
- Start with the problem/goal: “Fix CrashLoopBackOff”
- Use consistent hierarchy (H1 for document, H2 for sections, H3 for steps)
- Include technology in headers: “PostgreSQL Backup Procedures”
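
The file-name rules can be checked mechanically before a runbook is merged. The following is a rough sketch: the regular expression encodes one reading of "descriptive, hyphenated names", and both the `VAGUE_WORDS` set and the `check_filename` helper are hypothetical.

```python
import re
from typing import List

# Assumption: lowercase words separated by single hyphens, ending in .md
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)+\.md$")
# Hypothetical list of words considered too vague to describe a runbook
VAGUE_WORDS = {"issues", "misc", "stuff", "notes"}


def check_filename(name: str) -> List[str]:
    """Return a list of naming problems; an empty list means the name passes."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("use lowercase, hyphen-separated words and a .md extension")
    stem = name.rsplit(".", 1)[0]
    if set(stem.split("-")) & VAGUE_WORDS:
        problems.append("replace vague words with the specific problem or task")
    return problems


print(check_filename("kubernetes-crashloop-debugging.md"))  # []
print(check_filename("k8s-issues.md"))
```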

## 🔍 Content Optimization for RAG

### Keywords and Search Terms

Include variations of terms users might search for:

```markdown
# PostgreSQL Database Backup and Recovery

## Overview
This runbook covers PostgreSQL (postgres, pg) database backup procedures,
including pg_dump, pg_basebackup, and WAL archiving for disaster recovery.

Common search terms: postgres backup, postgresql restore, pg_dump, 
database recovery, WAL files, PITR (Point-in-Time Recovery)
```
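
A small script can report which term variations a runbook overview is still missing. This is only a sketch: the `ALIASES` table is a hypothetical starting point that you would replace with the terms your team actually searches for, and `missing_variants` is not part of DevOps RAG.

```python
import re
from typing import Dict, List

# Hypothetical alias groups; extend with the terms your team actually uses.
ALIASES: Dict[str, List[str]] = {
    "postgresql": ["postgresql", "postgres", "pg"],
    "kubernetes": ["kubernetes", "k8s"],
}


def missing_variants(text: str, canonical: str) -> List[str]:
    """List the aliases of `canonical` that never appear as a whole word in `text`."""
    words = set(re.findall(r"[a-z0-9_]+", text.lower()))
    return [alias for alias in ALIASES[canonical] if alias not in words]


overview = (
    "This runbook covers PostgreSQL (postgres, pg) database backup procedures, "
    "including pg_dump, pg_basebackup, and WAL archiving for disaster recovery."
)
print(missing_variants(overview, "postgresql"))                     # []
print(missing_variants("Restore a PostgreSQL dump", "postgresql"))  # ['postgres', 'pg']
```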

### Error Message Examples

Always include actual error messages users will see:

````markdown
## Symptoms
Users may encounter these errors:

```
Error: CrashLoopBackOff
Warning  BackOff  pod/myapp-7d4f8c4c8-x9k2m  Back-off restarting failed container
```

```
ERRO[0001] container died unexpectedly: exit code 137
```
````

This helps the RAG system match user queries with exact error text.

### Context and Prerequisites

Provide context so chunks are self-contained:

```markdown
## Fix EKS Node NotReady Status

### Context
This applies to Amazon EKS worker nodes showing "NotReady" status in 
`kubectl get nodes`. Common in EKS clusters using managed node groups 
or self-managed nodes.

### Prerequisites
- kubectl configured for your EKS cluster
- AWS CLI configured with appropriate permissions
- Access to AWS Console (optional)
```

## 📊 Quality Metrics and Testing

### Test Your Runbooks

Use the built-in testing framework to validate retrieval quality:

```bash
# Test query coverage
python cli.py test --query "How do I fix CrashLoopBackOff?"

# Test all common scenarios
python cli.py test --comprehensive

# Generate quality report
python cli.py test --report quality_report.json
```

### Retrieval Quality Checklist

For each runbook, verify:

- Title matches common search queries
- Error messages included verbatim
- Commands include expected outputs
- Prerequisites clearly stated
- Verification steps provided
- Related issues cross-referenced
- Technology variations mentioned (k8s/kubernetes, pg/postgresql)
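
Some checklist items can be verified automatically. The sketch below runs a few heuristic checks against one runbook. The required section names are taken from the templates in this guide, the `lint_runbook` helper is hypothetical, and items such as "title matches common search queries" still need human review.

```python
import re
from typing import Dict

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block
# Assumed section names, taken from the templates in this guide.
REQUIRED_SECTIONS = ["Prerequisites", "Verification"]


def lint_runbook(markdown: str) -> Dict[str, bool]:
    """Run a few mechanical checklist items against one runbook (heuristic only)."""
    headers = re.findall(r"^#{1,6}\s+(.*)$", markdown, flags=re.MULTILINE)
    results = {
        f"has {name} section": any(name.lower() in h.lower() for h in headers)
        for name in REQUIRED_SECTIONS
    }
    results["has code block"] = FENCE in markdown
    # Cross-references are assumed to be relative links to other .md files.
    results["has related links"] = bool(re.search(r"\]\([^)]+\.md\)", markdown))
    return results


sample = "\n".join([
    "# Fix CrashLoopBackOff",
    "## Prerequisites",
    "- kubectl",
    "## Diagnosis",
    FENCE + "bash",
    "kubectl describe pod <pod-name>",
    FENCE,
])
report = lint_runbook(sample)
for check, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {check}")
```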

### Common Query Patterns

Structure content to match how users ask questions:

User Query: “Pod keeps crashing”
Runbook Title: “Fix Pod CrashLoopBackOff - Container Restart Issues”

User Query: “Database backup failed”
Runbook Title: “PostgreSQL Backup Troubleshooting - pg_dump Errors”

User Query: “SSL certificate expired”
Runbook Title: “Renew SSL/TLS Certificates - Nginx and Apache”


## 🔄 Auto-Ingestion and Updates

### Git Integration

Set up automated ingestion from your documentation repository:

```bash
# Watch for changes and auto-ingest
python cli.py watch --directory /app/runbooks --auto-ingest

# Ingest from Git repository
python cli.py ingest --source git \
  --repo https://github.com/yourorg/runbooks.git \
  --branch main
```

### Continuous Improvement

#### Feedback Loop

1. **Monitor Query Performance**

   ```bash
   # Identify poorly performing queries
   python cli.py analyze --poor-matches --threshold 0.6
   ```

2. **Content Gap Analysis**

   ```bash
   # Find topics with no good matches
   python cli.py gaps --query-log /app/logs/queries.log
   ```

3. **Update Runbooks**
   - Add missing error messages
   - Improve section headers
   - Add cross-references
   - Include more keyword variations
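
To illustrate what the first two steps are looking for, here is a minimal sketch that counts queries whose best retrieval score fell below the 0.6 threshold. The log format is an assumption (one JSON object per line with `query` and `top_score` fields); the real layout of `/app/logs/queries.log` may differ, and `poor_matches` is a hypothetical helper rather than the implementation behind `cli.py analyze`.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def poor_matches(lines: Iterable[str], threshold: float = 0.6) -> List[Tuple[str, int]]:
    """Count queries whose best retrieval score fell below `threshold`."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # assumed shape: {"query": str, "top_score": float}
        if record["top_score"] < threshold:
            counts[record["query"].lower()] += 1
    return counts.most_common()  # most frequent gaps first


log = [
    '{"query": "Pod keeps crashing", "top_score": 0.82}',
    '{"query": "vault token expired", "top_score": 0.41}',
    '{"query": "Vault token expired", "top_score": 0.38}',
    '{"query": "kafka consumer lag", "top_score": 0.55}',
]
print(poor_matches(log))  # [('vault token expired', 2), ('kafka consumer lag', 1)]
```

Queries that recur near the top of this list are good candidates for a new runbook or for extra keywords in an existing one.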

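### Version Control Integration

```yaml
# .github/workflows/rag-update.yml
name: Update DevOps RAG
on:
  push:
    paths: ['runbooks/**']

jobs:
  update-rag:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Trigger RAG Re-ingestion
        # Assumes repository secrets named RAG_WEBHOOK_URL and RAG_API_KEY;
        # they must be mapped into the step's environment to be visible to curl.
        env:
          RAG_WEBHOOK_URL: ${{ secrets.RAG_WEBHOOK_URL }}
          RAG_API_KEY: ${{ secrets.RAG_API_KEY }}
        run: |
          curl --fail -X POST "$RAG_WEBHOOK_URL/ingest" \
            -H "Authorization: Bearer $RAG_API_KEY"
```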

## 📖 Content Templates

### Troubleshooting Template

````markdown
# [Technology] [Problem] - [Brief Description]

## Problem Description
What users experience and when to use this runbook.

## Symptoms
- Error messages (exact text)
- Observable behaviors
- System status indicators

## Prerequisites
- Required tools and access
- Skill level assumptions
- Environmental requirements

## Diagnosis

### Step 1: Initial Assessment
```bash
# Commands to gather information
command --status
```

**Expected output:**
```
Status: Error
Message: Connection refused
```

### Step 2: Root Cause Analysis
Common causes include:
1. Network connectivity issues
2. Service configuration problems
3. Resource constraints

## Resolution

### Option 1: Quick Fix
```bash
# Commands for immediate resolution
sudo systemctl restart service
```

### Option 2: Comprehensive Fix
```bash
# Commands for permanent resolution
sudo systemctl disable service
sudo systemctl enable service --now
```

## Verification
```bash
# Verify the fix worked
command --health-check
```

Expected result: `Status: Healthy`

## Prevention
- Monitoring recommendations
- Configuration best practices
- Maintenance procedures

## Additional Resources
- Official documentation links
- Useful blog posts or guides
````

### Operations Template

````markdown
# [Technology] [Operation] - [Brief Description]

## Overview
When and why to perform this operation.

## Prerequisites
- Access requirements
- Backup recommendations
- Time estimates

## Preparation

### Step 1: Pre-operation Checks
```bash
# Validation commands
check-system-status
```

### Step 2: Backup (if applicable)
```bash
# Backup commands
create-backup --target /safe/location
```

## Execution

### Step 1: [Operation Phase 1]
```bash
# Commands with explanations
operation-command --option value
```

### Step 2: [Operation Phase 2]
```bash
# Next set of commands
next-operation --verify
```

## Verification
```bash
# Confirm operation success
verify-operation --complete
```

## Rollback (if needed)
```bash
# Commands to undo changes
rollback-operation --to-backup
```

## Post-Operation
- Cleanup tasks
- Documentation updates
- Notification requirements

## Monitoring
What to watch for after completion.
````


---

## 🧪 Testing and Validation

### Content Quality Tests

```python
# test_runbooks.py
import pytest
from rag_engine import RAGEngine

class TestRunbookQuality:
    @classmethod
    def setup_class(cls):
        # One shared engine for every test in this class
        cls.rag = RAGEngine()
    
    def test_kubernetes_queries(self):
        """Test common Kubernetes troubleshooting queries"""
        queries = [
            "Pod keeps crashing CrashLoopBackOff",
            "Kubernetes node not ready",
            "Failed to pull image ImagePullBackOff",
            "Service not accessible from outside cluster"
        ]
        
        for query in queries:
            result = self.rag.query(query, top_k=3)
            assert result.top_score > 0.75, f"Low score for: {query}"
            assert len(result.sources) > 0, f"No sources for: {query}"
    
    def test_database_operations(self):
        """Test database operation queries"""
        queries = [
            "PostgreSQL backup with pg_dump",
            "MySQL database restore procedure",
            "Redis memory usage too high"
        ]
        
        for query in queries:
            result = self.rag.query(query)
            assert "database" in result.sources[0].lower()
    
    def test_error_message_matching(self):
        """Test exact error message retrieval"""
        error_queries = [
            "Error: CrashLoopBackOff",
            "connection refused",
            "certificate expired",
            "permission denied"
        ]
        
        for query in error_queries:
            result = self.rag.query(query, top_k=1)
            assert result.top_score > 0.8, f"Error matching failed: {query}"
```

Run tests:

```bash
pytest test_runbooks.py -v
```

### Performance Benchmarks

```bash
# Measure query performance across all runbooks
python cli.py benchmark --queries test_queries.txt --output benchmark.json

# Analyze retrieval quality
python cli.py analyze --metrics retrieval_quality.json
```

## 📈 Advanced Optimization

### Semantic Enhancement

Add semantic context to improve matching:

```markdown
# Kubernetes Pod Troubleshooting Guide

<!-- Semantic keywords for better matching -->
Topics: container orchestration, pod lifecycle, application debugging
Technologies: Kubernetes, K8s, containers, Docker
Error codes: CrashLoopBackOff, ImagePullBackOff, Pending, Failed
Related terms: kubectl, deployment, replica set, service mesh
```

### Cross-Reference Network

Create a web of related content:

```markdown
## Related Issues
- **Networking**: [Service Discovery Problems](kubernetes-service-debug.md)
- **Storage**: [PersistentVolume Issues](kubernetes-storage-debug.md)
- **Security**: [RBAC Permission Errors](kubernetes-rbac-debug.md)
- **Scaling**: [Resource Limits and Requests](kubernetes-resource-management.md)

## See Also
- [EKS-specific Pod Issues](aws-eks-pod-troubleshooting.md)
- [Docker Container Debugging](docker-container-debug.md)
- [Application Monitoring Setup](datadog-apm-setup.md)
```
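
Cross-references only help if their targets exist. The following sketch lists relative `.md` links in a runbook that point at missing files. It assumes links are written relative to the runbook's own directory, and `broken_links` is a hypothetical helper. The demo runs in a temporary directory with one valid and one dangling link.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Matches [text](target.md) and [text](target.md#anchor); captures the target.
LINK = re.compile(r"\[[^\]]+\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def broken_links(runbook: Path) -> List[str]:
    """Return relative .md link targets that do not exist next to `runbook`."""
    text = runbook.read_text(encoding="utf-8")
    return [
        target
        for target in LINK.findall(text)
        if "://" not in target and not (runbook.parent / target).exists()
    ]


# Demo in a throwaway directory with one valid and one dangling link.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "kubernetes-rbac-debug.md").write_text("# RBAC\n", encoding="utf-8")
    page = root / "pod-troubleshooting.md"
    page.write_text(
        "- [RBAC Permission Errors](kubernetes-rbac-debug.md)\n"
        "- [PersistentVolume Issues](kubernetes-storage-debug.md)\n",
        encoding="utf-8",
    )
    result = broken_links(page)

print(result)  # ['kubernetes-storage-debug.md']
```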

### Metadata Enhancement

Add structured metadata for filtering:

```markdown
---
title: "Kubernetes Pod CrashLoopBackOff Troubleshooting"
category: "kubernetes"
severity: "high"
technologies: ["kubernetes", "docker", "containers"]
platforms: ["aws", "gcp", "azure", "on-premise"]
audience: ["sre", "devops", "developer"]
last_updated: "2024-05-11"
reviewed_by: "sre-team"
---

# Content starts here...
```
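
As an illustration of how such metadata could drive filtering, here is a minimal sketch that reads the front matter and tests it against filter values. It is not a general YAML parser: it assumes every value is a double-quoted string or a flat list of double-quoted strings, as in the example, so that `json.loads` can read it. The helpers `parse_front_matter` and `matches` are hypothetical, and how DevOps RAG itself consumes metadata is not specified here.

```python
import json
from typing import Dict


def parse_front_matter(document: str) -> Dict[str, object]:
    """Parse a simple `key: value` front matter block delimited by '---' lines."""
    lines = document.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata: Dict[str, object] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter block
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        metadata[key.strip()] = json.loads(value.strip())
    return metadata


def matches(metadata: Dict[str, object], **filters: str) -> bool:
    """True when every filter equals the field or is contained in a list field."""
    for key, wanted in filters.items():
        value = metadata.get(key)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


doc = "\n".join([
    "---",
    'title: "Kubernetes Pod CrashLoopBackOff Troubleshooting"',
    'category: "kubernetes"',
    'severity: "high"',
    'platforms: ["aws", "gcp", "azure", "on-premise"]',
    "---",
    "",
    "# Content starts here...",
])
meta = parse_front_matter(doc)
print(matches(meta, category="kubernetes", platforms="aws"))  # True
print(matches(meta, severity="low"))                          # False
```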

Next Steps: