Monitoring Guide - DevOps RAG
Comprehensive observability setup for DevOps RAG using Datadog. Track query performance, retrieval quality, system health, and user patterns to optimize your knowledge base.
📊 Monitoring Overview
DevOps RAG provides built-in Datadog integration for comprehensive observability across:
- Application Performance: Query latency, throughput, errors
- Retrieval Quality: Similarity scores, source relevance, answer accuracy
- System Health: Memory usage, index size, OpenAI API health
- User Patterns: Query frequency, topic trends, knowledge gaps
🏗️ Datadog Setup
Required Datadog Configuration
# Environment variables for Datadog integration
DD_API_KEY=your_datadog_api_key
DD_SITE=datadoghq.com
DD_SERVICE=devops-rag
DD_ENV=production
DD_VERSION=1.0.0
DD_TRACE_ENABLED=true
DD_TRACE_SAMPLE_RATE=0.1
DD_RUNTIME_METRICS_ENABLED=true
DD_PROFILING_ENABLED=true
DD_LOGS_ENABLED=true

Datadog Agent Configuration
# datadog-agent.yaml
logs_enabled: true
log_level: INFO
logs_config:
  container_collect_all: true
  processing_rules:
    # Join multi-line entries: a new log starts at a line beginning with a date
    - type: multi_line
      name: rag_query_logs
      pattern: "^\\d{4}-\\d{2}-\\d{2}"
    # Mask OpenAI API keys before logs leave the host
    - type: mask_sequences
      name: mask_api_keys
      pattern: "sk-[a-zA-Z0-9]{48}"
      replace_placeholder: "[REDACTED_API_KEY]"
apm_config:
  enabled: true
  env: production
# Custom metrics collection (DogStatsD)
use_dogstatsd: true
dogstatsd_port: 8125
dogstatsd_non_local_traffic: true

Application Metrics Configuration
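Custom metrics travel from the application to the agent as DogStatsD datagrams over UDP, on the port set in the agent configuration (8125 in this guide). The sketch below shows that wire format in its simplest form and sends one test counter by hand. `format_dogstatsd` and `send_test_metric` are hypothetical helper names, not part of DevOps RAG, and the default host is an assumption about where the agent runs.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_test_metric(host="127.0.0.1", port=8125):
    """Send one counter increment over UDP.

    UDP is fire-and-forget, so no reply is expected; check the metric
    in Datadog to confirm it arrived.
    """
    payload = format_dogstatsd("rag.query.success", 1, "c",
                               ["service:devops-rag"])
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))
    return payload
```

Metric types used in this guide map to `c` (counter), `g` (gauge), and `h` (histogram).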
DevOps RAG automatically emits custom metrics when Datadog is enabled:
# metrics.py - Built into DevOps RAG
import functools
import os
import time

from datadog import statsd

# Environment tag, read from the same DD_ENV variable the agent uses
DD_ENV = os.getenv('DD_ENV', 'production')


def track_query_performance(func):
    """Decorator to track query performance metrics"""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            # Track successful query metrics (latency in milliseconds)
            latency = (time.time() - start_time) * 1000
            statsd.histogram('rag.query.latency', latency,
                             tags=['service:devops-rag', f'env:{DD_ENV}'])
            statsd.histogram('rag.query.top_score', result.get('top_score', 0),
                             tags=['service:devops-rag'])
            statsd.histogram('rag.query.chunks_retrieved',
                             result.get('context_chunks', 0),
                             tags=['service:devops-rag'])
            statsd.increment('rag.query.success',
                             tags=['service:devops-rag'])
            return result
        except Exception as e:
            # Track errors, tagged by exception class
            statsd.increment('rag.query.error',
                             tags=['service:devops-rag',
                                   f'error_type:{type(e).__name__}'])
            raise
    return wrapper

📈 Key Metrics and Dashboards
Core Performance Metrics
| Metric | Description | Good Threshold | Alert Threshold |
|---|---|---|---|
| rag.query.latency | End-to-end query time | <500ms p95 | >2000ms p95 |
| rag.query.top_score | Highest similarity score | >0.75 avg | <0.60 avg |
| rag.query.success_rate | Successful query ratio (derived from rag.query.success and rag.query.error) | >99% | <95% |
| rag.index.chunks | Number of indexed chunks | Monitor growth | - |
| rag.openai.tokens | OpenAI API token consumption | Track costs | Budget limit |
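To make the thresholds in the table concrete, the sketch below classifies a window of latency samples by its p95 and computes a success rate from success and error counts. It is an offline illustration only: the nearest-rank percentile method, the function names, and the "warn" label for values between the two thresholds are assumptions, and Datadog computes its own percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_latency(latencies_ms):
    """Return 'good', 'warn' or 'alert' from the p95 of a latency window."""
    p95 = percentile(latencies_ms, 95)
    if p95 > 2000:
        return "alert"
    if p95 < 500:
        return "good"
    return "warn"


def success_rate(success_count, error_count):
    """Success ratio in percent; an empty window is treated as healthy."""
    total = success_count + error_count
    return 100.0 if total == 0 else 100.0 * success_count / total
```

With 20 samples, a single slow query does not move the p95, while two slow queries do. That is why the table expresses latency thresholds as percentiles rather than maxima.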
Datadog Dashboard JSON
{
"title": "DevOps RAG - Performance Dashboard",
"description": "Monitor query performance, retrieval quality, and system health",
"layout_type": "ordered",
"widgets": [
{
"definition": {
"title": "Query Latency Distribution",
"type": "timeseries",
"requests": [
{
"q": "avg:rag.query.latency{service:devops-rag} by {host}",
"display_type": "line",
"style": {"palette": "dog_classic", "line_type": "solid"}
}
],
"yaxis": {"label": "Latency (ms)", "scale": "linear"},
"markers": [
{"value": "y = 500", "display_type": "error dashed"}
]
}
},
{
"definition": {
"title": "Retrieval Quality Scores",
"type": "timeseries",
"requests": [
{
"q": "avg:rag.query.top_score{service:devops-rag}",
"display_type": "line"
},
{
"q": "avg:rag.query.avg_score{service:devops-rag}",
"display_type": "line"
}
],
"yaxis": {"min": 0, "max": 1}
}
},
{
"definition": {
"title": "Query Volume",
"type": "timeseries",
"requests": [
{
"q": "sum:rag.query.success{service:devops-rag}.as_rate()",
"display_type": "bars"
}
]
}
},
{
"definition": {
"title": "Error Rate",
"type": "query_value",
"requests": [
{
"q": "100 * (sum:rag.query.error{service:devops-rag}.as_rate() / (sum:rag.query.success{service:devops-rag}.as_rate() + sum:rag.query.error{service:devops-rag}.as_rate()))",
"aggregator": "avg"
}
],
"conditional_formats": [
{"comparator": ">", "value": 1, "palette": "red_on_white"},
{"comparator": ">", "value": 0.1, "palette": "yellow_on_white"},
{"comparator": "<=", "value": 0.1, "palette": "green_on_white"}
]
}
}
]
}

Import Dashboard
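Before importing, a quick local check can catch a malformed dashboard file. The sketch below is only a structural sanity check, under the assumption that every widget should carry a `type` and at least one request, as the widgets in this guide do. It is not Datadog's own validation, and `check_dashboard` and `check_dashboard_file` are hypothetical helpers.

```python
import json


def check_dashboard(dashboard):
    """Return a list of problems found in a dashboard definition.

    An empty list means this check found nothing wrong.
    """
    problems = []
    if not dashboard.get("title"):
        problems.append("missing title")
    widgets = dashboard.get("widgets", [])
    if not widgets:
        problems.append("no widgets")
    for i, widget in enumerate(widgets):
        definition = widget.get("definition", {})
        if "type" not in definition:
            problems.append(f"widget {i}: missing type")
        if not definition.get("requests"):
            problems.append(f"widget {i}: no requests")
    return problems


def check_dashboard_file(path):
    """Parse the file (raises json.JSONDecodeError on invalid JSON)
    and run the structural check on its contents."""
    with open(path, "r", encoding="utf-8") as f:
        return check_dashboard(json.load(f))
```

Running `check_dashboard_file("devops-rag-dashboard.json")` and importing only when the result is empty keeps obviously broken files out of the API call.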
# Save the dashboard JSON, then import it via the Datadog API.
# Requires an application key (DD_APP_KEY) in addition to DD_API_KEY;
# use the API host that matches your DD_SITE.
curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d @devops-rag-dashboard.json

🚨 Alerting Configuration
Critical Alerts
High Query Latency:
{
"name": "DevOps RAG - High Query Latency",
"type": "metric alert",
"query": "avg(last_10m):avg:rag.query.latency{service:devops-rag} > 2000",
"message": "Query latency is above 2 seconds. Check system load and OpenAI API status.\n\n@slack-devops-alerts",
"options": {
"thresholds": {
"critical": 2000,
"warning": 1000
},
"notify_no_data": true,
"no_data_timeframe": 20
}
}

Poor Retrieval Quality:
{
"name": "DevOps RAG - Poor Retrieval Quality",
"type": "metric alert",
"query": "avg(last_30m):avg:rag.query.top_score{service:devops-rag} < 0.5",
"message": "Average retrieval quality is poor. Check index integrity and consider re-indexing.\n\n@slack-devops-alerts",
"options": {
"thresholds": {
"critical": 0.5,
"warning": 0.6
}
}
}

High Error Rate:
{
"name": "DevOps RAG - High Error Rate",
"type": "metric alert",
"query": "avg(last_5m):100 * (sum:rag.query.error{service:devops-rag}.as_rate() / (sum:rag.query.success{service:devops-rag}.as_rate() + sum:rag.query.error{service:devops-rag}.as_rate())) > 5",
"message": "Error rate is above 5%. Check logs and OpenAI API status.\n\n@pagerduty-devops",
"options": {
"thresholds": {
"critical": 5,
"warning": 1
}
}
}

OpenAI API Issues:
{
"name": "DevOps RAG - OpenAI API Errors",
"type": "log alert",
"query": "logs(\"service:devops-rag status:error @error.kind:openai_api_error\").index(\"main\").rollup(\"count\").last(\"5m\") > 10",
"message": "Multiple OpenAI API errors detected. Check API key and rate limits.\n\n@slack-devops-alerts"
}

Deployment via Terraform
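Whether monitors are created through the API or through Terraform, Datadog generally expects the critical threshold to equal the number at the end of the query string, with the warning threshold on the less severe side of it. A pre-deployment check along the following lines can flag a mismatch early. `query_threshold` and `check_monitor` are hypothetical helpers, and the regular expression assumes the simple `... > N` or `... < N` query shape used in this guide.

```python
import re


def query_threshold(query):
    """Extract the comparison operator and number at the end of a query."""
    match = re.search(r"(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$", query)
    if match is None:
        raise ValueError(f"no trailing threshold in query: {query!r}")
    return match.group(1), float(match.group(2))


def check_monitor(monitor):
    """Return inconsistencies between the query and options.thresholds."""
    problems = []
    operator, value = query_threshold(monitor["query"])
    thresholds = monitor.get("options", {}).get("thresholds", {})
    critical = thresholds.get("critical")
    warning = thresholds.get("warning")
    if critical is not None and float(critical) != value:
        problems.append(f"query threshold {value} != critical {critical}")
    if critical is not None and warning is not None:
        # For '>' alerts the warning sits below critical; for '<' alerts, above.
        if operator in (">", ">="):
            ok = warning < critical
        else:
            ok = warning > critical
        if not ok:
            problems.append(
                f"warning {warning} is on the wrong side of critical {critical}")
    return problems
```

The same check works on values copied from a Terraform resource if they are first placed in a dictionary with `query` and `options.thresholds` keys.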
# datadog-alerts.tf
resource "datadog_monitor" "rag_query_latency" {
name = "DevOps RAG - High Query Latency"
type = "metric alert"
message = "Query latency is above 2 seconds @slack-devops-alerts"
query = "avg(last_10m):avg:rag.query.latency{service:devops-rag} > 2000"
monitor_thresholds {
warning = 1000
critical = 2000
}
notify_no_data = true
no_data_timeframe = 20
tags = ["service:devops-rag", "team:platform"]
}
resource "datadog_monitor" "rag_retrieval_quality" {
name = "DevOps RAG - Poor Retrieval Quality"
type = "metric alert"
message = "Average retrieval quality is poor @slack-devops-alerts"
query = "avg(last_30m):avg:rag.query.top_score{service:devops-rag} < 0.5"
monitor_thresholds {
warning = 0.6
critical = 0.5
}
tags = ["service:devops-rag", "team:platform"]
}

📝 Log Analysis
Structured Logging Format
DevOps RAG emits structured JSON logs for easy analysis:
{
"timestamp": "2024-05-11T14:30:25.123Z",
"level": "INFO",
"service": "devops-rag",
"logger": "rag_engine",
"message": "Query processed successfully",
"trace_id": "abc123def456",
"span_id": "789xyz012",
"query": {
"text": "kubernetes pod troubleshooting",
"top_k": 5,
"user_id": "engineer@company.com"
},
"result": {
"top_score": 0.8456,
"avg_score": 0.7123,
"chunks_retrieved": 5,
"sources": ["kubernetes-troubleshooting.md"],
"latency_ms": 1247.8
},
"usage": {
"embedding_tokens": 12,
"generation_tokens": 456,
"cost_cents": 0.078
}
}

Log-based Metrics
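The analyses in this section can also be prototyped offline against exported log lines in the structured format described under Structured Logging Format. The following is a minimal sketch, assuming one JSON object per line. `summarize_logs` and `TOPIC_PATTERN` are illustrative names, and the topic list mirrors the one used in the query pattern analysis.

```python
import json
import re
from collections import Counter

TOPIC_PATTERN = re.compile(r"kubernetes|database|monitoring|aws")


def summarize_logs(lines, score_threshold=0.6):
    """Count query topics and low-scoring queries from JSON log lines."""
    topics = Counter()
    low_quality = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("service") != "devops-rag":
            continue
        text = record.get("query", {}).get("text", "")
        match = TOPIC_PATTERN.search(text.lower())
        if match:
            topics[match.group(0)] += 1
        # Records without a score are treated as fine rather than poor
        if record.get("result", {}).get("top_score", 1.0) < score_threshold:
            low_quality[text] += 1
    return topics, low_quality
```

`topics.most_common()` then gives the topic frequency ranking, and `low_quality.most_common()` lists the queries most often answered with a weak match.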
Query Pattern Analysis:
-- Most common query topics
SELECT
REGEXP_EXTRACT(query.text, r'(kubernetes|database|monitoring|aws)') as topic,
COUNT(*) as frequency
FROM logs
WHERE service = 'devops-rag'
AND @timestamp >= NOW() - INTERVAL 7 DAY
GROUP BY topic
ORDER BY frequency DESC

Poor Quality Queries:
-- Identify queries with poor retrieval quality
SELECT
query.text,
result.top_score,
result.sources,
COUNT(*) as frequency
FROM logs
WHERE service = 'devops-rag'
AND result.top_score < 0.6
AND @timestamp >= NOW() - INTERVAL 1 DAY
GROUP BY query.text, result.top_score, result.sources
ORDER BY frequency DESC

Datadog Log Monitors
Knowledge Gap Detection:
{
"name": "DevOps RAG - Knowledge Gap Detected",
"type": "log alert",
"query": "logs(\"service:devops-rag @result.top_score:<0.5\").index(\"main\").rollup(\"count\").last(\"1h\") > 5",
"message": "Multiple queries with poor retrieval quality detected. Potential knowledge gap:\n\n{{#logs}}\n- {{query.text}} (score: {{result.top_score}})\n{{/logs}}\n\n@slack-devops-content",
"options": {
"group_by": ["query.text"],
"new_host_delay": 300
}
}

🔍 Quality Monitoring
Retrieval Quality Tracking
Similarity Score Distribution:
# quality_monitor.py
from datadog import statsd


class QualityMonitor:
    def __init__(self):
        # Upper bounds of the similarity-score buckets
        self.score_buckets = [0.3, 0.5, 0.7, 0.8, 0.9, 1.0]

    def track_retrieval_quality(self, query_result):
        """Track detailed quality metrics"""
        top_score = query_result.get('top_score', 0)
        avg_score = query_result.get('avg_score', 0)
        # Emit the average score queried by the Retrieval Quality Scores widget
        statsd.histogram('rag.query.avg_score', avg_score,
                         tags=['service:devops-rag'])
        # Track score distribution
        bucket = self._get_score_bucket(top_score)
        statsd.increment(f'rag.quality.score_bucket.{bucket}',
                         tags=['service:devops-rag'])
        # Track source diversity
        unique_sources = len(set(query_result.get('sources', [])))
        statsd.histogram('rag.quality.source_diversity', unique_sources)
        # Track query complexity by word count
        query_length = len(query_result.get('query', '').split())
        if query_length <= 3:
            complexity = 'simple'
        elif query_length <= 8:
            complexity = 'medium'
        else:
            complexity = 'complex'
        statsd.histogram(f'rag.quality.by_complexity.{complexity}', top_score)

    def _get_score_bucket(self, score):
        # Return the first bucket whose upper bound covers the score
        for i, threshold in enumerate(self.score_buckets):
            if score <= threshold:
                return f'bucket_{i}_{threshold}'
        return 'bucket_max'

User Satisfaction Tracking
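Implicit satisfaction in this guide is a weighted sum of observed user actions, shifted by a neutral baseline of 0.5 and clamped to the range 0 to 1. As a side-effect-free sketch of that scoring rule, using the same weights as the tracker in this section, the function below can be unit tested without a statsd client. `SIGNAL_WEIGHTS` and `implicit_satisfaction` are illustrative names.

```python
SIGNAL_WEIGHTS = {
    "clicked_source": 0.3,
    "copied_answer": 0.4,
    "shared_answer": 0.5,
    "refined_query": -0.2,
    "searched_elsewhere": -0.4,
}


def implicit_satisfaction(user_actions):
    """Sum the weights of observed actions, add the 0.5 baseline,
    and clamp the result to the range [0, 1]."""
    raw = sum(weight for action, weight in SIGNAL_WEIGHTS.items()
              if action in user_actions)
    return max(0.0, min(1.0, raw + 0.5))
```

A session with no recorded actions scores the neutral 0.5. Copying and sharing an answer together exceed the cap and score 1.0, while refining the query and then searching elsewhere falls below the floor and scores 0.0.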
Implicit Feedback Collection:
# feedback_tracker.py
from datadog import statsd


class FeedbackTracker:
    def track_query_satisfaction(self, query_id, user_actions):
        """Track user satisfaction based on behavior"""
        satisfaction_score = 0
        # Positive signals
        if 'clicked_source' in user_actions:
            satisfaction_score += 0.3
        if 'copied_answer' in user_actions:
            satisfaction_score += 0.4
        if 'shared_answer' in user_actions:
            satisfaction_score += 0.5
        # Negative signals
        if 'refined_query' in user_actions:
            satisfaction_score -= 0.2
        if 'searched_elsewhere' in user_actions:
            satisfaction_score -= 0.4
        # Normalize to 0-1 around a neutral baseline of 0.5
        satisfaction_score = max(0, min(1, satisfaction_score + 0.5))
        # Note: a per-query tag is high-cardinality and can inflate custom metric counts
        statsd.histogram('rag.satisfaction.implicit', satisfaction_score,
                         tags=[f'query_id:{query_id}'])
        return satisfaction_score

📊 Business Intelligence Dashboards
Usage Analytics Dashboard
{
"title": "DevOps RAG - Usage Analytics",
"widgets": [
{
"definition": {
"title": "Daily Active Users",
"type": "timeseries",
"requests": [
{
"q": "count_nonzero(sum:rag.query.success{service:devops-rag} by {user_id}.rollup(sum, 86400))",
"display_type": "line"
}
]
}
},
{
"definition": {
"title": "Top Query Topics",
"type": "toplist",
"requests": [
{
"q": "top(sum:rag.query.success{service:devops-rag} by {topic}, 10, 'sum', 'desc')",
"style": {"palette": "dog_classic"}
}
]
}
},
{
"definition": {
"title": "Knowledge Base Coverage (% High-Quality Matches)",
"type": "query_value",
"requests": [
{
"q": "100 * (sum:rag.query.success{service:devops-rag,top_score:>0.7}.as_count() / sum:rag.query.success{service:devops-rag}.as_count())",
"aggregator": "avg"
}
]
}
},
{
"definition": {
"title": "Cost Tracking",
"type": "timeseries",
"requests": [
{
"q": "sum:rag.openai.cost{service:devops-rag}",
"display_type": "line"
}
]
}
}
]
}

Team Productivity Dashboard
{
"title": "DevOps RAG - Team Impact",
"widgets": [
{
"definition": {
"title": "Self-Service Rate (Questions Answered Without Tickets)",
"type": "query_value",
"requests": [
{
"q": "100 * (sum:rag.query.success{service:devops-rag,top_score:>0.8}.as_count() / sum:support.ticket.created{category:operational}.as_count())"
}
]
}
},
{
"definition": {
"title": "Knowledge Sharing by Team",
"type": "timeseries",
"requests": [
{
"q": "sum:rag.query.success{service:devops-rag} by {team}",
"display_type": "line"
}
]
}
},
{
"definition": {
"title": "Onboarding Effectiveness (New Hire Query Success Rate)",
"type": "timeseries",
"requests": [
{
"q": "avg:rag.query.success{service:devops-rag,user_tenure:<90days}.rollup(avg, 604800)",
"display_type": "line"
}
]
}
}
]
}

🔧 Performance Optimization Monitoring
Resource Usage Tracking
# performance_monitor.py
import os

import psutil
from datadog import statsd

# Index location, matching the path used by the index health check
INDEX_PATH = '/app/data/index.json'


class PerformanceMonitor:
    def __init__(self):
        self.process = psutil.Process()

    def track_system_resources(self):
        """Monitor system resource usage"""
        # Memory usage (resident set size, in bytes)
        memory_info = self.process.memory_info()
        statsd.gauge('rag.system.memory.rss', memory_info.rss,
                     tags=['service:devops-rag'])
        # CPU usage; the first call returns 0.0 because no earlier sample exists
        cpu_percent = self.process.cpu_percent()
        statsd.gauge('rag.system.cpu.percent', cpu_percent,
                     tags=['service:devops-rag'])
        # Index size monitoring
        index_size = self._get_index_size()
        statsd.gauge('rag.index.size_bytes', index_size,
                     tags=['service:devops-rag'])

    def _get_index_size(self):
        """Return the on-disk index size in bytes, or 0 if the file is missing"""
        return os.path.getsize(INDEX_PATH) if os.path.exists(INDEX_PATH) else 0

    def track_openai_usage(self, tokens_used, cost):
        """Track OpenAI API usage and costs"""
        statsd.increment('rag.openai.tokens', tokens_used,
                         tags=['service:devops-rag'])
        statsd.increment('rag.openai.cost', cost,
                         tags=['service:devops-rag'])

Index Health Monitoring
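The corruption check in this section comes down to one rule: every sampled chunk needs an embedding of the expected length (1536 in this guide). A pure version of that rule, sketched below under the assumption that chunks are dictionaries with `id` and `embedding` keys, is easier to unit test because it touches neither the file system nor statsd. `find_corrupt_chunks` is an illustrative name.

```python
def find_corrupt_chunks(chunks, expected_dim=1536, sample_size=10):
    """Return the ids of sampled chunks whose embedding is missing
    or does not have the expected length."""
    corrupt = []
    for chunk in chunks[:sample_size]:
        embedding = chunk.get("embedding")
        if not embedding or len(embedding) != expected_dim:
            corrupt.append(chunk.get("id"))
    return corrupt
```

Only the first `sample_size` chunks are inspected, so an empty result means the sample looked healthy, not that the whole index is intact.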
import json
import logging

from datadog import statsd

logger = logging.getLogger(__name__)


def monitor_index_health():
    """Monitor vector index integrity and performance"""
    try:
        with open('/app/data/index.json', 'r') as f:
            index_data = json.load(f)
        # Track index metrics
        chunks = index_data.get('chunks', [])
        source_count = len(index_data.get('sources', []))
        statsd.gauge('rag.index.chunks', len(chunks))
        statsd.gauge('rag.index.sources', source_count)
        # Check for corruption
        for chunk in chunks[:10]:  # Sample check: first 10 chunks only
            if not chunk.get('embedding') or len(chunk['embedding']) != 1536:
                statsd.increment('rag.index.corruption_detected')
                logger.warning(f"Corrupted chunk detected: {chunk.get('id')}")
        statsd.increment('rag.index.health_check.success')
    except Exception as e:
        statsd.increment('rag.index.health_check.error')
        logger.error(f"Index health check failed: {e}")

📱 Notification Channels
Slack Integration
# slack_notifier.py
import os

import requests


def send_quality_alert(alert_data):
    """Send quality alerts to Slack"""
    webhook_url = os.getenv('SLACK_WEBHOOK_URL')
    message = {
        "text": "🤖 DevOps RAG Quality Alert",
        "attachments": [
            {
                "color": "warning" if alert_data['severity'] == 'warning' else "danger",
                "fields": [
                    {
                        "title": "Alert Type",
                        "value": alert_data['type'],
                        "short": True
                    },
                    {
                        "title": "Current Value",
                        "value": str(alert_data['current_value']),
                        "short": True
                    },
                    {
                        "title": "Threshold",
                        "value": str(alert_data['threshold']),
                        "short": True
                    },
                    {
                        "title": "Suggested Action",
                        "value": alert_data['suggested_action'],
                        "short": False
                    }
                ]
            }
        ]
    }
    # Time out rather than hang, and surface HTTP errors returned by Slack
    response = requests.post(webhook_url, json=message, timeout=10)
    response.raise_for_status()

Email Reports
# email_reporter.py
# get_metric, get_top_metrics, get_quality_trends, render_template and
# send_email are assumed to be project helpers defined elsewhere.
def generate_weekly_report():
    """Generate weekly performance report"""
    report_data = {
        'period': 'Last 7 days',
        'total_queries': get_metric('rag.query.success', period='7d'),
        'avg_latency': get_metric('rag.query.latency', aggregation='avg', period='7d'),
        'avg_quality': get_metric('rag.query.top_score', aggregation='avg', period='7d'),
        'top_topics': get_top_metrics('rag.query.success', group_by='topic', period='7d'),
        'quality_trends': get_quality_trends(period='7d')
    }
    html_report = render_template('weekly_report.html', **report_data)
    send_email(
        to=['devops-team@company.com'],
        subject=f"DevOps RAG Weekly Report - {report_data['period']}",
        html_body=html_report
    )

🚀 Continuous Improvement
A/B Testing Framework
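The tracker in this section calls a `calculate_significance` method that is not shown. One possible building block, offered as an assumption rather than as DevOps RAG's actual method, is Welch's t-test on the per-variant quality scores. The sketch below computes the t statistic and the approximate degrees of freedom using only the standard library. Each sample needs at least two values, and the two samples must not both have zero variance.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two
    independent samples with possibly unequal variances."""
    # Squared standard error of each sample mean
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df
```

A large absolute t value relative to the critical value for the returned degrees of freedom suggests the variants differ by more than noise. Turning t into a p-value needs a t-distribution, which the standard library does not provide.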
# ab_testing.py
from datadog import statsd


class RAGExperimentTracker:
    def __init__(self):
        self.experiments = {}

    def track_experiment(self, experiment_name, variant, query_result):
        """Track A/B test results"""
        statsd.histogram(
            f'rag.experiment.{experiment_name}.latency',
            query_result['latency_ms'],
            tags=[f'variant:{variant}']
        )
        statsd.histogram(
            f'rag.experiment.{experiment_name}.quality',
            query_result['top_score'],
            tags=[f'variant:{variant}']
        )

    def compare_variants(self, experiment_name):
        """Compare experiment variants"""
        # Query Datadog for experiment results
        # (query_datadog_metrics, calculate_significance and
        # generate_recommendation are not shown in this guide)
        results = self.query_datadog_metrics(experiment_name)
        return {
            'experiment': experiment_name,
            'variants': results,
            'statistical_significance': self.calculate_significance(results),
            'recommendation': self.generate_recommendation(results)
        }

Automated Quality Improvement
# quality_improver.py
class QualityImprover:
    def identify_improvement_opportunities(self):
        """Identify areas for quality improvement"""
        # Find frequent low-quality queries
        poor_queries = self.get_poor_quality_queries(threshold=0.6, min_frequency=5)
        improvements = []
        for query in poor_queries:
            improvement = {
                'query': query['text'],
                'current_score': query['avg_score'],
                'frequency': query['count'],
                'suggested_actions': [
                    'Add missing runbook content',
                    'Improve existing runbook structure',
                    'Add keyword variations',
                    'Include error message examples'
                ]
            }
            improvements.append(improvement)
        return improvements

    def generate_content_suggestions(self, query_pattern):
        """Generate suggestions for missing content"""
        return {
            'suggested_title': f"Runbook for {query_pattern['topic']}",
            'key_sections': query_pattern['common_terms'],
            'priority': self.calculate_priority(query_pattern),
            'estimated_impact': query_pattern['frequency'] * 0.3  # 30% quality improvement
        }

Implementation Summary:
- Set up Datadog integration with the required environment variables
- Import pre-built dashboards for instant visibility
- Configure critical alerts for proactive monitoring
- Enable structured logging for detailed analysis
- Track quality metrics to drive continuous improvement
- Set up notification channels for team collaboration
Next Steps:
- Tuning Guide - Optimize based on monitoring insights
- Configuration Guide - Adjust settings for performance
- Use Cases - Apply monitoring to specific scenarios