# Use Cases - DevOps RAG
Real-world applications and implementation patterns for DevOps RAG across different operational scenarios. Learn how teams use intelligent runbook retrieval to improve incident response, onboarding, and operational efficiency.
## 🚨 Use Case 1: On-Call Assistant
Transform your on-call experience with instant access to relevant troubleshooting procedures during incidents.
### The Challenge
- 3 AM pages: Engineers need answers fast during incident response
- Context switching: Searching through wikis while troubleshooting live issues
- Knowledge silos: Critical procedures locked in individual team members’ heads
- Time pressure: Every minute counts during outages
### The Solution

**Slack Integration for Instant Answers:**

```python
# slack_rag_bot.py
import os

from slack_bolt import App
from devops_rag_client import DevOpsRAG

app = App(token=os.environ["SLACK_BOT_TOKEN"])
rag = DevOpsRAG("https://devops-rag.company.com")


@app.message("help")
def handle_help_requests(message, say):
    """Respond to help requests in incident channels."""
    # Strip only the first "help" keyword so the rest of the question stays intact
    query = message["text"].replace("help", "", 1).strip()

    if query:
        result = rag.ask(query, top_k=3)

        response = f"🔍 **Found help for:** {query}\n\n"
        response += f"**Answer:**\n{result['answer']}\n\n"
        response += f"**Sources:** {', '.join(result['sources'])}\n"
        response += f"**Confidence:** {result['top_score']:.1%}"

        say(response)
```

**PagerDuty Integration:**
```yaml
# pagerduty-runbook-webhook.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pagerduty-webhook
data:
  webhook.py: |
    import requests

    def handle_incident(incident_data):
        # Extract service and alert details
        service = incident_data.get('service', {}).get('name', '')
        alert_msg = incident_data.get('title', '')

        # Query RAG for relevant runbooks
        query = f"{service} {alert_msg} troubleshooting"
        response = requests.post(
            "https://devops-rag.company.com/ask",
            json={"question": query, "top_k": 2},
            timeout=10,  # do not block incident handling on a slow RAG call
        )

        if response.status_code == 200:
            result = response.json()

            # Add runbook links to incident notes
            runbook_note = (
                "🤖 **Suggested Runbooks:**\n"
                f"{result['answer']}\n"
                f"**Sources:** {', '.join(result['sources'])}\n"
            )

            # Update PagerDuty incident. add_incident_note is assumed to be
            # defined elsewhere (for example, a wrapper around the PagerDuty API).
            add_incident_note(incident_data['id'], runbook_note)
```

### Implementation Example
Team: SRE team at a fintech company
Problem: Average incident resolution time of 45 minutes
Solution: DevOps RAG integrated with Slack and PagerDuty
```text
# During a Kubernetes incident
Engineer: "help pod crashloopbackoff payment-service"

RAG Bot: 🔍 Found help for: pod crashloopbackoff payment-service

**Answer:**
1. Check pod logs: `kubectl logs payment-service-xxx --previous`
2. Describe pod: `kubectl describe pod payment-service-xxx`
3. Common causes:
   - Database connection failures
   - Missing config maps
   - Resource limits exceeded

**Next Steps:**
- Verify database connectivity
- Check payment-service ConfigMap
- Review resource requests/limits

**Sources:** kubernetes-troubleshooting.md, payment-service-runbook.md
**Confidence:** 94%
```

**Results:**
- ⚡ Resolution time: 45min → 18min (60% improvement)
- 📚 Runbook usage: Up 300%
- 🎯 First-time fix rate: 65% → 87%
- 😴 Fewer escalations: 40% reduction in senior engineer callouts
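
The Slack bot above always posts the top answer together with its confidence score. During an incident it may be safer to flag weak matches instead of presenting them as answers. The following is a minimal sketch of that idea, assuming the result dictionary has the `answer`, `sources`, and `top_score` fields used earlier; the function name `format_incident_reply` and the `min_confidence` default of 0.6 are hypothetical and would need tuning against real retrieval scores.

```python
from typing import Any


def format_incident_reply(query: str, result: dict[str, Any],
                          min_confidence: float = 0.6) -> str:
    """Format a RAG result for chat, flagging low-confidence matches.

    `min_confidence` is an illustrative threshold, not a recommended value.
    """
    score = float(result.get("top_score", 0.0))
    sources = ", ".join(result.get("sources", [])) or "none"

    if score < min_confidence:
        # Do not present a weak match as an answer during an incident.
        return (
            f"No confident match for: {query} (best score {score:.0%}).\n"
            f"Closest sources: {sources}\n"
            "Consider rephrasing the query or escalating."
        )

    return (
        f"Found help for: {query}\n\n"
        f"Answer:\n{result['answer']}\n\n"
        f"Sources: {sources}\n"
        f"Confidence: {score:.0%}"
    )
```

Inside `handle_help_requests`, the bot could then call `say(format_incident_reply(query, result))` in place of building the response string inline.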
## 🔄 Use Case 2: Incident Response Auto-Retrieval
Automatically surface relevant runbooks based on alert context and historical incident patterns.
### The Challenge
- Alert fatigue: Too many alerts, not enough context
- Manual correlation: Engineers manually linking alerts to procedures
- Inconsistent response: Different approaches to similar incidents
- Knowledge gaps: New team members unsure how to respond
### The Solution

**Datadog Alert Integration:**

```python
# datadog_webhook_handler.py
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)


@app.route('/webhook/datadog', methods=['POST'])
def handle_datadog_alert():
    alert_data = request.get_json(silent=True) or {}

    # Extract alert context.
    # Assumes the webhook payload template sends `tags` as a JSON object.
    alert_title = alert_data.get('title', '')
    service = alert_data.get('tags', {}).get('service', '')
    alert_type = alert_data.get('alert_type', '')

    # Build context-aware query. Start from a generic fallback so that
    # `query` is always defined, even when no rule below matches.
    query = f"{service} {alert_title} troubleshooting"
    if alert_type == 'metric alert':
        if 'high_cpu' in alert_title:
            query = f"{service} high CPU usage troubleshooting"
        elif 'memory' in alert_title:
            query = f"{service} memory leak investigation"
        elif 'disk' in alert_title:
            query = f"{service} disk space cleanup"
    elif alert_type == 'log alert':
        query = f"{service} {alert_title} error investigation"

    # Get relevant runbooks
    rag_response = requests.post(
        "https://devops-rag.company.com/ask",
        json={"question": query, "top_k": 3, "verbose": True},
        timeout=10,
    )

    if rag_response.status_code == 200:
        result = rag_response.json()

        # Enrich alert with runbook context
        enriched_alert = {
            "alert_id": alert_data.get('id'),
            "suggested_runbooks": result['sources'],
            "initial_steps": result['answer'][:500],
            "confidence": result['top_score']
        }

        # Send to incident management system
        # (create_incident_with_context is assumed to be defined elsewhere)
        create_incident_with_context(enriched_alert)

    return jsonify({"status": "processed"})
```

**Prometheus Alertmanager Integration:**
```yaml
# prometheus-rules.yml (alerting rules are loaded by Prometheus, not Alertmanager)
groups:
  - name: devops-rag-enrichment
    rules:
      - alert: HighMemoryUsage
        expr: memory_usage_percent > 85
        for: 5m
        labels:
          severity: warning
          service: "{{ $labels.service }}"
        annotations:
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
          runbook_query: "{{ $labels.service }} high memory usage troubleshooting"
      - alert: DatabaseConnectionErrors
        expr: increase(db_connection_errors[5m]) > 10
        labels:
          severity: critical
          service: "{{ $labels.service }}"
        annotations:
          description: "Database connection errors detected"
          runbook_query: "{{ $labels.service }} database connection troubleshooting"
---
# alertmanager.yml (webhook configuration)
route:
  receiver: devops-rag-webhook  # minimal route so the receiver is actually used
receivers:
  - name: devops-rag-webhook
    webhook_configs:
      - url: 'https://rag-webhook.company.com/prometheus'
        send_resolved: true
```

### Implementation Example
Team: Platform engineering team managing 50+ microservices
Problem: Inconsistent incident response, knowledge scattered across teams
Solution: Automated runbook retrieval integrated with monitoring stack
**Alert Flow:**

```text
1. Prometheus Alert: "payment-api high error rate"
   ↓
2. Alertmanager webhook to DevOps RAG
   ↓
3. Query: "payment-api error rate troubleshooting"
   ↓
4. RAG Response: Relevant runbooks + initial steps
   ↓
5. Incident created in PagerDuty with context
   ↓
6. On-call engineer gets alert + runbooks
```

**Example Enriched Alert:**
```json
{
  "alert": {
    "title": "Payment API High Error Rate",
    "service": "payment-api",
    "severity": "critical"
  },
  "rag_context": {
    "suggested_actions": [
      "Check payment gateway connectivity",
      "Review recent deployments",
      "Verify database connections",
      "Check rate limiting configs"
    ],
    "runbooks": [
      "payment-api-troubleshooting.md",
      "payment-gateway-connectivity.md",
      "database-connection-debug.md"
    ],
    "confidence": 0.91
  }
}
```

**Results:**
- 🎯 Faster triage: 23min → 8min average time to start investigation
- 📈 Response consistency: 85% of incidents follow standard procedures
- 🧠 Knowledge retention: New engineers 60% more effective in first month
- 📚 Runbook coverage: Identified 12 missing procedures from query patterns
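
The Alertmanager configuration above posts alerts to a webhook, and each alert carries a `runbook_query` annotation, but the receiving side is not shown. The sketch below illustrates one way a receiver might turn the Alertmanager webhook body (a JSON object with an `alerts` list, where each alert has `status`, `labels`, and `annotations`) into RAG queries. The function name `extract_runbook_queries` and the fallback query format are assumptions for illustration.

```python
from typing import Any


def extract_runbook_queries(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Build one RAG query per firing alert in an Alertmanager webhook body."""
    queries = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # resolved notifications need no runbook lookup

        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})

        # Prefer the explicit annotation; otherwise fall back to a query
        # built from the service and alert name (hypothetical format).
        fallback = (
            f"{labels.get('service', '')} {labels.get('alertname', '')} "
            "troubleshooting"
        ).strip()
        query = annotations.get("runbook_query") or fallback

        queries.append({
            "alertname": labels.get("alertname", ""),
            "service": labels.get("service", ""),
            "query": query,
        })
    return queries
```

Each returned `query` can be sent to the `/ask` endpoint in the same way as the Datadog handler does. Because `send_resolved: true` is set, skipping non-firing alerts avoids a lookup for every resolution notice.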
## 👥 Use Case 3: New Engineer Onboarding
Accelerate new team member productivity with intelligent access to operational knowledge.
### The Challenge
- Information overload: 200+ wiki pages, unclear what’s current
- Context missing: Procedures without background or reasoning
- Learning curve: Takes 6+ months to become operationally effective
- Mentorship bottleneck: Senior engineers spend 40% time answering questions
### The Solution

**Interactive Learning Assistant:**

```python
# onboarding_assistant.py
from devops_rag_client import DevOpsRAG


class OnboardingAssistant:
    # The helper methods used below (extract_next_steps, add_beginner_context,
    # find_related_topics, suggest_exercises) are assumed to be defined elsewhere.

    def __init__(self):
        self.rag = DevOpsRAG("https://devops-rag.company.com")
        self.user_progress = {}

    def guided_learning(self, user_id, topic):
        """Provide a structured learning path."""
        # Progressive complexity: (level name, query) pairs
        stages = [
            ("overview", f"{topic} basics and overview"),
            ("common operations", f"{topic} common operations"),
            ("troubleshooting", f"{topic} troubleshooting guide"),
            ("advanced configuration", f"{topic} advanced configuration"),
        ]

        learning_path = []
        for level, query in stages:
            result = self.rag.ask(query)
            learning_path.append({
                "level": level,
                "content": result['answer'],
                "sources": result['sources'],
                "next_steps": self.extract_next_steps(result['answer'])
            })

        return learning_path

    def answer_with_context(self, question, user_level="beginner"):
        """Tailor responses based on user experience."""
        # Enhance query with experience level
        enhanced_query = f"{question} {user_level} explanation"
        result = self.rag.ask(enhanced_query)

        if user_level == "beginner":
            # Add more context and explanations
            enhanced_answer = self.add_beginner_context(result['answer'])
        else:
            enhanced_answer = result['answer']

        return {
            "answer": enhanced_answer,
            "sources": result['sources'],
            "related_topics": self.find_related_topics(question),
            "hands_on_exercises": self.suggest_exercises(question)
        }
```

**Slack Learning Bot:**
```python
import re

# `app` is the slack_bolt App instance created in slack_rag_bot.py.
# get_user_profile and track_learning_progress are assumed to be defined elsewhere.


@app.message(re.compile(r"^learn (.*)"))
def handle_learning_request(message, say, context):
    topic = context['matches'][0]
    user_id = message['user']

    # Get user's experience level (not yet used by guided_learning;
    # it can be passed to answer_with_context for follow-up questions)
    user_profile = get_user_profile(user_id)
    experience_level = user_profile.get('level', 'beginner')

    # Generate learning content
    assistant = OnboardingAssistant()
    learning_content = assistant.guided_learning(user_id, topic)

    response = f"📚 **Learning Path: {topic.title()}**\n\n"
    for i, level in enumerate(learning_content, 1):
        response += f"**Level {i}: {level['level'].title()}**\n"
        response += f"{level['content'][:300]}...\n"
        response += f"📖 *Sources: {', '.join(level['sources'])}*\n\n"

    say(response)

    # Track progress
    track_learning_progress(user_id, topic, learning_content)
```

### Implementation Example
Team: DevOps team at a scale-up (50 → 200 engineers)
Problem: New hires taking 4-6 months to become productive
Solution: RAG-powered learning assistant with progressive knowledge delivery
**Learning Journey Example:**

Week 1: Foundations

```text
New Engineer: "learn kubernetes basics"

RAG Assistant: 📚 Learning Path: Kubernetes Basics

Level 1: Overview
Kubernetes is a container orchestration platform that automates deployment,
scaling, and management of containerized applications. Key concepts include:
- Pods: Smallest deployable units
- Services: Network access to pod groups
- Deployments: Declarative pod management
📖 Sources: kubernetes-overview.md, getting-started.md

Level 2: Common Operations
Basic kubectl commands you'll use daily:
- `kubectl get pods` - List running pods
- `kubectl describe pod <name>` - Get detailed pod info
- `kubectl logs <pod>` - View application logs
🎯 Try This: Run `kubectl get nodes` in our dev cluster
```

Week 3: Problem Solving

```text
New Engineer: "pod showing CrashLoopBackOff what does that mean?"

RAG Assistant: 🔍 CrashLoopBackOff Analysis

**What it means:** Your container is crashing repeatedly after startup.
Kubernetes tries to restart it, but it keeps failing.

**Immediate steps:**
1. Check logs: `kubectl logs <pod> --previous`
2. Check events: `kubectl describe pod <pod>`
3. Look for common patterns:
   - Exit code 137: Out of memory
   - Exit code 1: Application error
   - Connection refused: Service dependencies

**For beginners:** Start with logs - they usually tell you exactly
what's wrong. Don't worry about complex debugging yet.

📖 Sources: kubernetes-troubleshooting.md, pod-lifecycle.md
🎯 Next: Try fixing a broken pod in our sandbox environment
```

**Results:**
- ⚡ Time to productivity: 6 months → 6 weeks
- 💡 Self-service learning: 80% of questions answered without escalation
- 📊 Knowledge retention: 90% pass operational readiness tests
- 👥 Mentor efficiency: Senior engineers save 15 hours/week
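
`OnboardingAssistant.guided_learning` calls an `extract_next_steps` helper that this page does not define. One possible implementation, shown here only as a sketch, is to pull the numbered steps out of the answer text with a regular expression. It assumes answers list their steps as `1. ...` or `1) ...` lines, as in the CrashLoopBackOff example above; the `limit` parameter is hypothetical.

```python
import re

# Matches lines such as "1. Check logs" or "2) Describe the pod".
_STEP_PATTERN = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def extract_next_steps(answer: str, limit: int = 3) -> list[str]:
    """Return up to `limit` numbered steps found in a RAG answer."""
    steps: list[str] = []
    for line in answer.splitlines():
        if len(steps) >= limit:
            break
        match = _STEP_PATTERN.match(line)
        if match:
            steps.append(match.group(1))
    return steps
```

Answers without numbered lines yield an empty list, so callers should treat "no next steps" as a normal outcome rather than an error.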
## 🔍 Use Case 4: Post-Mortem Analysis
Extract insights from incident history to improve procedures and prevent recurrence.
### The Challenge
- Pattern recognition: Hard to spot trends across hundreds of incidents
- Knowledge gaps: Post-mortems identify missing runbooks but creation is slow
- Procedure drift: Documented procedures diverge from actual practice
- Learning loops: Lessons learned don’t feed back into operational knowledge
### The Solution

**Incident Analysis Pipeline:**

```python
# postmortem_analyzer.py
from devops_rag_client import DevOpsRAG

# IncidentDatabase and the helper methods used below (cluster_incidents,
# extract_resolution_patterns, extract_resolution_steps, load_runbook,
# diff_procedures) are assumed to be defined elsewhere.


class PostMortemAnalyzer:
    def __init__(self):
        self.rag = DevOpsRAG("https://devops-rag.company.com")
        self.incident_db = IncidentDatabase()

    def analyze_incident_patterns(self, days=90):
        """Identify common incident patterns and knowledge gaps."""
        incidents = self.incident_db.get_incidents(days=days)

        # Group incidents by service and error type
        patterns = self.cluster_incidents(incidents)

        analysis = {}
        for pattern in patterns:
            # Query RAG for existing coverage
            query = f"{pattern['service']} {pattern['error_type']} troubleshooting"
            rag_result = self.rag.ask(query)

            pattern['rag_coverage'] = {
                'score': rag_result['top_score'],
                'sources': rag_result['sources'],
                'has_good_coverage': rag_result['top_score'] > 0.8
            }

            # Identify gaps
            if not pattern['rag_coverage']['has_good_coverage']:
                pattern['needs_runbook'] = True
                pattern['suggested_content'] = self.extract_resolution_patterns(
                    pattern['incidents']
                )

            analysis[pattern['id']] = pattern

        return analysis

    def suggest_runbook_improvements(self, incident_id):
        """Suggest improvements to existing runbooks based on an incident."""
        incident = self.incident_db.get_incident(incident_id)
        resolution_steps = self.extract_resolution_steps(incident)

        # Find related runbooks
        query = f"{incident['service']} {incident['problem']} troubleshooting"
        rag_result = self.rag.ask(query, top_k=3)

        improvements = []
        for source in rag_result['sources']:
            current_content = self.load_runbook(source)
            suggested_additions = self.diff_procedures(
                current_content,
                resolution_steps
            )

            if suggested_additions:
                improvements.append({
                    'runbook': source,
                    'additions': suggested_additions,
                    'incident_reference': incident_id
                })

        return improvements
```

**Automated Runbook Generation:**
```python
# The extract_* and format_incident_references helpers are assumed to be
# defined elsewhere.

# Kept at module level so the Markdown headings stay flush-left in the output.
RUNBOOK_TEMPLATE = """
# {service} {problem} Troubleshooting

## Problem Description
Based on {incident_count} recent incidents, this issue occurs when:
{problem_description}

## Common Symptoms
{symptoms}

## Root Causes Analysis
{root_causes}

## Resolution Steps
{resolution_steps}

## Prevention
{prevention_measures}

## Related Incidents
{incident_references}
"""


def generate_runbook_from_incidents(incident_pattern):
    """Generate a runbook draft from incident pattern analysis."""
    # Extract common elements from incident pattern
    runbook_content = RUNBOOK_TEMPLATE.format(
        service=incident_pattern['service'],
        problem=incident_pattern['error_type'],
        incident_count=len(incident_pattern['incidents']),
        problem_description=extract_problem_description(incident_pattern),
        symptoms=extract_common_symptoms(incident_pattern),
        root_causes=extract_root_causes(incident_pattern),
        resolution_steps=extract_resolution_steps(incident_pattern),
        prevention_measures=extract_prevention_measures(incident_pattern),
        incident_references=format_incident_references(incident_pattern)
    )

    return runbook_content
```

### Implementation Example
Team: Site Reliability Engineering team at a cloud provider
Problem: 300+ incidents per quarter, limited pattern recognition
Solution: Automated post-mortem analysis with RAG-driven gap identification
**Analysis Flow:**

```text
1. Weekly automated analysis of closed incidents
   ↓
2. Pattern clustering by service + error type
   ↓
3. RAG query for existing runbook coverage
   ↓
4. Gap identification and priority scoring
   ↓
5. Auto-generated runbook drafts for review
   ↓
6. Engineer review and publishing
```

**Example Analysis Output:**
```json
{
  "pattern_id": "payment-api_database-timeout_2024q1",
  "service": "payment-api",
  "error_type": "database timeout",
  "incidents": 12,
  "frequency": "2.3 incidents/week",
  "rag_coverage": {
    "score": 0.42,
    "sources": ["database-troubleshooting.md"],
    "has_good_coverage": false
  },
  "needs_runbook": true,
  "priority": "high",
  "suggested_content": {
    "title": "Payment API Database Timeout Troubleshooting",
    "common_symptoms": [
      "HTTP 500 errors in payment endpoints",
      "Database connection pool exhaustion",
      "Query execution time >30s"
    ],
    "resolution_patterns": [
      "Restart payment-api pods (90% effective)",
      "Scale database read replicas (75% effective)",
      "Clear connection pool (60% effective)"
    ]
  }
}
```

**Generated Runbook Draft:**
````markdown
# Payment API Database Timeout Troubleshooting

## Problem Description
Based on 12 incidents in Q1 2024, database timeouts in the payment API
typically occur during high-traffic periods (Black Friday, end-of-month)
when connection pool exhaustion leads to query timeouts.

## Common Symptoms
- HTTP 500 errors on `/api/payments/*` endpoints
- Database logs show "connection timeout after 30s"
- Connection pool metrics show 100% utilization
- Payment processing latency >10s (normal: <200ms)

## Immediate Response
1. **Scale payment-api pods** (90% effective resolution)
   ```bash
   kubectl scale deployment payment-api --replicas=8
   ```
2. **Check database connection pool**
   ```bash
   kubectl exec payment-api-xxx -- curl localhost:8080/health/db
   ```

[Rest of auto-generated content…]
````

**Results:**
- 📈 **Knowledge coverage**: 60% → 85% of incident types have runbooks
- ⚡ **Runbook creation**: 6 weeks → 2 days average time
- 🔄 **Prevention**: 30% reduction in repeat incidents
- 📊 **Pattern recognition**: Identified 8 systemic issues in first quarter
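
The analysis flow above relies on two steps that `PostMortemAnalyzer` leaves undefined: clustering incidents by service and error type, and flagging patterns with weak runbook coverage. The sketch below shows one simple way to do both. It assumes each incident is a dictionary with `service` and `error_type` keys, reuses the 0.8 coverage threshold from `analyze_incident_patterns`, and takes a `coverage_score` callable (a hypothetical stand-in for `rag.ask(query)['top_score']`) so the logic can be exercised without a live RAG service.

```python
from collections import defaultdict
from typing import Any, Callable


def cluster_incidents(incidents: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group incidents that share a service and an error type."""
    groups: dict[tuple[str, str], list[dict[str, Any]]] = defaultdict(list)
    for incident in incidents:
        groups[(incident["service"], incident["error_type"])].append(incident)

    return [
        {
            "id": f"{service}_{error_type}".replace(" ", "-"),
            "service": service,
            "error_type": error_type,
            "incidents": members,
        }
        for (service, error_type), members in groups.items()
    ]


def find_runbook_gaps(patterns: list[dict[str, Any]],
                      coverage_score: Callable[[str], float],
                      threshold: float = 0.8) -> list[dict[str, Any]]:
    """Return patterns whose best retrieval score is not above `threshold`,
    ordered by incident count (most frequent first)."""
    gaps = []
    for pattern in patterns:
        query = f"{pattern['service']} {pattern['error_type']} troubleshooting"
        score = coverage_score(query)
        if score <= threshold:
            gaps.append({**pattern, "score": score})
    return sorted(gaps, key=lambda p: len(p["incidents"]), reverse=True)
```

Sorting by incident count is only one possible priority signal; severity or customer impact could be weighted in as well.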
---
## 🎯 Implementation Patterns
### Common Integration Points
1. **ChatOps (Slack/Discord)**
   - Instant answers during incidents
   - Progressive learning for new team members
   - Knowledge sharing in team channels
2. **Monitoring & Alerting**
   - Auto-attach runbooks to alerts
   - Context-aware incident creation
   - Trend analysis and gap identification
3. **Incident Management**
   - PagerDuty runbook suggestions
   - ServiceNow knowledge integration
   - ITSM workflow automation
4. **CI/CD Pipelines**
   - Deployment failure runbooks
   - Rollback procedure automation
   - Change management integration
### Success Metrics
**Operational Efficiency:**
- Mean Time to Resolution (MTTR)
- First-time fix rate
- Escalation frequency
- Knowledge base coverage
**Learning & Development:**
- Time to operational productivity
- Self-service question resolution
- Cross-team knowledge sharing
- Runbook usage patterns
**Quality Metrics:**
- Runbook accuracy scores
- User satisfaction ratings
- Content freshness tracking
- Gap identification effectiveness
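
Two of the operational efficiency metrics listed above, MTTR and first-time fix rate, are straightforward to compute once incident records are exported. The sketch below assumes each record is a dictionary with `opened_at` and `resolved_at` datetimes and an optional `reopened` flag; these field names are hypothetical and will differ between incident management systems.

```python
from datetime import datetime, timedelta
from typing import Any


def mean_time_to_resolution(incidents: list[dict[str, Any]]) -> timedelta:
    """Average of (resolved_at - opened_at) across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    durations = [i["resolved_at"] - i["opened_at"] for i in incidents]
    return sum(durations, timedelta()) / len(durations)


def first_time_fix_rate(incidents: list[dict[str, Any]]) -> float:
    """Fraction of incidents that were resolved without being reopened."""
    if not incidents:
        raise ValueError("at least one incident is required")
    fixed = sum(1 for i in incidents if not i.get("reopened", False))
    return fixed / len(incidents)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 3, 0)
    sample = [
        {"opened_at": start, "resolved_at": start + timedelta(minutes=12)},
        {"opened_at": start, "resolved_at": start + timedelta(minutes=24),
         "reopened": True},
    ]
    print(mean_time_to_resolution(sample))  # 0:18:00
    print(first_time_fix_rate(sample))      # 0.5
```

Tracking both values before and after rollout gives a baseline for comparisons like the ones reported in the use cases above.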
---
**Next Steps:**
- [Deployment Guide](deployment.md) - Production implementation
- [Monitoring Guide](monitoring.md) - Track usage and effectiveness
- [Configuration Guide](configuration.md) - Optimize for your use case