Monitoring and Observability
MĀRGA includes built-in monitoring with first-class Datadog integration.
Overview
MĀRGA exports detailed telemetry covering:
- Request routing decisions and latency
- Provider performance and errors
- Cost tracking and optimization
- Security and compliance metrics
- A/B testing results
Datadog Integration
Quick Setup
- Set Environment Variables
export DD_API_KEY="your_datadog_api_key"
export DD_APP_KEY="your_datadog_app_key" # For custom metrics
export DD_SITE="datadoghq.com" # or datadoghq.eu
export DD_SERVICE="marga"
export DD_ENV="production"
export DD_VERSION="1.0.0"
- Enable in Configuration
monitoring:
  datadog:
    enable: true
    trace_sample_rate: 1.0
    profile_sample_rate: 0.1
    custom_metrics: true
  metrics:
    enable: true
    namespace: "marga"
    tags:
      - "service:llm-router"
      - "team:platform"
- Start with APM
# Using Docker
docker run -d \
  --name marga \
  -e DD_API_KEY=$DD_API_KEY \
  -e DD_APM_ENABLED=true \
  -e DD_LOGS_ENABLED=true \
  -p 8080:8080 \
  ghcr.io/gaurav21/marga:latest
# Using binary
DD_TRACE_ENABLED=1 ./marga-server
APM Configuration
MĀRGA automatically instruments HTTP requests, database queries, and external API calls.
apm:
  enable: true
  service_name: "marga"
  sample_rate: 1.0
  # Custom span tags
  span_tags:
    - "provider"
    - "model"
    - "strategy"
    - "cost_tier"
  # Sensitive data filtering
  obfuscation:
    query_string: true
    headers: ["authorization", "x-api-key"]
    json_keys: ["api_key", "secret"]
Log Configuration
logging:
  level: "info"
  format: "json"
  datadog:
    enable: true
    source: "marga"
    service: "llm-router"
  # Structured logging
  fields:
    - "request_id"
    - "provider"
    - "model"
    - "latency_ms"
    - "tokens_used"
    - "cost_usd"
Metrics Reference
Core Metrics
Request Metrics
- marga.requests.total (counter) - Total requests processed
- marga.requests.duration (histogram) - Request processing time
- marga.requests.errors (counter) - Failed requests by error type
Routing Metrics
- marga.routing.decisions (counter) - Routing decisions by strategy
- marga.routing.fallbacks (counter) - Fallback activations
- marga.routing.cache_hits (counter) - Cache hit/miss ratio
Provider Metrics
- marga.provider.latency (histogram) - Provider response times
- marga.provider.errors (counter) - Provider-specific errors
- marga.provider.tokens (counter) - Tokens consumed per provider
- marga.provider.cost (counter) - Cost by provider and model
Quality Metrics
- marga.quality.score (histogram) - Response quality assessments
- marga.quality.degradation (counter) - Quality degradation events
- marga.ab_test.conversion (counter) - A/B test performance
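Raw counters such as marga.requests.total and marga.requests.errors are usually most informative when combined into ratios. The following Python sketch shows one way to do that, assuming the counter values have already been read from somewhere; the `ratio` helper and the sample readings are illustrative and not part of MĀRGA.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is not positive."""
    if denominator <= 0:
        return 0.0
    return numerator / denominator


# Illustrative counter readings (made-up values, not real measurements).
counters = {
    "marga.requests.total": 12847,
    "marga.requests.errors": 257,
    "marga.routing.fallbacks": 64,
}

error_rate = ratio(counters["marga.requests.errors"], counters["marga.requests.total"])
fallback_rate = ratio(counters["marga.routing.fallbacks"], counters["marga.requests.total"])

print(f"error rate: {error_rate:.2%}")
print(f"fallback rate: {fallback_rate:.2%}")
```

Guarding against a zero denominator matters right after startup, when no requests have been counted yet.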
Custom Metrics API
Access real-time metrics via the /v1/metrics endpoint:
# Get all metrics
curl http://localhost:8080/v1/metrics
# Filter by type (URL quoted so the shell does not interpret "?")
curl "http://localhost:8080/v1/metrics?type=provider"
# Time range query
curl "http://localhost:8080/v1/metrics?from=2024-01-01T00:00:00Z&to=2024-01-02T00:00:00Z"
# Specific metric with tags
curl "http://localhost:8080/v1/metrics/marga.provider.latency?provider=openai&model=gpt-4"
Response format:
{
  "metrics": [
    {
      "name": "marga.requests.total",
      "type": "counter",
      "value": 12847,
      "timestamp": "2024-01-15T10:30:00Z",
      "tags": {
        "provider": "openai",
        "model": "gpt-4",
        "status": "success"
      }
    }
  ],
  "metadata": {
    "count": 1,
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T10:30:00Z"
    }
  }
}
Datadog Dashboard
Automated Setup
Deploy the MĀRGA dashboard automatically:
curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d @dashboard.json
Dashboard Configuration
{
  "title": "MĀRGA - LLM Router Monitoring",
  "description": "Comprehensive monitoring for MĀRGA LLM routing service",
  "template_variables": [
    {
      "name": "provider",
      "default": "*",
      "prefix": "provider"
    },
    {
      "name": "environment",
      "default": "*",
      "prefix": "env"
    }
  ],
  "layout_type": "ordered",
  "widgets": [
    {
      "definition": {
        "title": "Request Volume & Latency",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:marga.requests.duration{$provider,$environment} by {provider}",
            "display_type": "line",
            "style": {
              "palette": "dog_classic",
              "line_type": "solid",
              "line_width": "normal"
            }
          }
        ],
        "yaxis": {
          "scale": "linear",
          "label": "Latency (ms)"
        }
      }
    },
    {
      "definition": {
        "title": "Provider Error Rates",
        "type": "query_value",
        "requests": [
          {
            "q": "sum:marga.provider.errors{$provider,$environment}.as_rate()",
            "aggregator": "avg"
          }
        ],
        "precision": 2
      }
    },
    {
      "definition": {
        "title": "Cost Tracking",
        "type": "toplist",
        "requests": [
          {
            "q": "top(sum:marga.provider.cost{$provider,$environment} by {provider,model}, 10, 'sum', 'desc')"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Routing Decisions",
        "type": "sunburst",
        "requests": [
          {
            "q": "sum:marga.routing.decisions{$provider,$environment} by {strategy,provider,model}"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Quality Score Distribution",
        "type": "distribution",
        "requests": [
          {
            "q": "avg:marga.quality.score{$provider,$environment}"
          }
        ]
      }
    }
  ]
}
Key Dashboard Widgets
- Request Overview
  - Request volume and success rate
  - Average latency by provider
  - Error rate trends
- Provider Performance
  - Latency percentiles (p50, p95, p99)
  - Error breakdown by provider/model
  - Availability and uptime
- Cost Analysis
  - Cost per provider and model
  - Daily/monthly spend trends
  - Cost per successful request
- Routing Intelligence
  - Strategy effectiveness
  - Fallback activation frequency
  - Cache hit ratios
- Quality Monitoring
  - Response quality scores
  - A/B test performance
  - Quality degradation alerts
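For readers less familiar with the latency percentiles (p50, p95, p99) shown in the Provider Performance widgets, the sketch below computes them with the nearest-rank method over a small list of samples. Datadog computes percentiles server-side and may use a different estimation method; this Python example, with made-up latency values, only illustrates what the numbers mean.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Illustrative request latencies in milliseconds (not real measurements).
latencies_ms = [120, 135, 150, 180, 210, 250, 400, 950, 1200, 4800]

for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With only ten samples, p95 and p99 both land on the slowest request, which is why tail percentiles need a reasonable sample count before they are worth alerting on.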
Alerting
Critical Alerts
alerts:
  - name: "High Error Rate"
    # Ratio of failed to total requests, so the 0.05 threshold means 5%
    query: "sum(last_5m):sum:marga.requests.errors{*}.as_count() / sum:marga.requests.total{*}.as_count() > 0.05"
    message: "MĀRGA error rate exceeded 5% @slack-platform-alerts"
    priority: "high"
  - name: "Provider Unavailable"
    query: "avg(last_2m):sum:marga.provider.errors{*} by {provider}.as_rate() > 0.8"
    message: "Provider {{provider.name}} is failing @pagerduty"
    priority: "critical"
  - name: "High Latency"
    query: "avg(last_10m):avg:marga.requests.duration{*} > 5000"
    message: "MĀRGA latency exceeded 5 seconds @slack-platform-alerts"
    priority: "medium"
  - name: "Budget Exceeded"
    query: "sum(last_1h):sum:marga.provider.cost{*} > 50"
    message: "Hourly spend exceeded $50 @slack-finance"
    priority: "high"
  - name: "Quality Degradation"
    query: "avg(last_15m):avg:marga.quality.score{*} < 0.7"
    message: "Response quality degraded below 70% @slack-platform-alerts"
    priority: "medium"
Alert Configuration
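The alert definitions listed under Critical Alerts map naturally onto Datadog monitor request bodies. As a rough sketch, assuming each alert carries a name, query, and message as in that list, the Python below builds bodies with the same top-level fields as the curl example in this section. `build_monitor` is a hypothetical helper, not part of MĀRGA or of any Datadog client library, and it omits the `options` object for brevity.

```python
import json


def build_monitor(name: str, query: str, message: str, tags: list[str]) -> dict:
    """Build a monitor request body with the fields used in this section."""
    return {
        "name": name,
        "type": "metric alert",
        "query": query,
        "message": message,
        "tags": tags,
    }


# Two of the alert definitions from the Critical Alerts list.
alerts = [
    {
        "name": "High Latency",
        "query": "avg(last_10m):avg:marga.requests.duration{*} > 5000",
        "message": "MĀRGA latency exceeded 5 seconds @slack-platform-alerts",
    },
    {
        "name": "Budget Exceeded",
        "query": "sum(last_1h):sum:marga.provider.cost{*} > 50",
        "message": "Hourly spend exceeded $50 @slack-finance",
    },
]

payloads = [
    build_monitor(f"MĀRGA {a['name']}", a["query"], a["message"], ["service:marga", "team:platform"])
    for a in alerts
]

# ensure_ascii=False keeps "MĀRGA" readable in the serialized JSON.
print(json.dumps(payloads[0], indent=2, ensure_ascii=False))
```

Each serialized payload could then be sent as the request body of the monitor API call, one request per alert.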
# Create alert via API
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d '{
    "name": "MĀRGA High Error Rate",
    "type": "metric alert",
    "query": "sum(last_5m):sum:marga.requests.errors{*}.as_count() / sum:marga.requests.total{*}.as_count() > 0.05",
    "message": "MĀRGA error rate exceeded 5%",
    "tags": ["service:marga", "team:platform"],
    "options": {
      "notify_audit": false,
      "require_full_window": false,
      "notify_no_data": true,
      "no_data_timeframe": 10
    }
  }'
Health Checks
Built-in Health Endpoints
MĀRGA exposes endpoints for basic health, detailed provider status, readiness, and liveness:
# Basic health check
curl http://localhost:8080/health
# Response: {"status": "healthy", "timestamp": "2024-01-15T10:30:00Z"}
# Detailed health with provider status
curl http://localhost:8080/health/detailed
# Response includes provider connectivity and model availability
# Readiness check (for Kubernetes)
curl http://localhost:8080/ready
# Liveness check
curl http://localhost:8080/live
Custom Health Checks
Configure provider-specific health monitoring:
health_checks:
  enable: true
  interval: "30s"
  timeout: "10s"
  providers:
    openai:
      endpoint: "/v1/models"
      expected_status: 200
    anthropic:
      endpoint: "/v1/messages"
      method: "POST"
      payload: '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"test"}]}'
    ollama:
      endpoint: "/api/generate"
      method: "POST"
      payload: '{"model":"llama2","prompt":"test","stream":false}'
  datadog_integration:
    publish_metrics: true
    service_checks: true
Performance Monitoring
Key Performance Indicators
Monitor these critical metrics:
- Request Latency - End-to-end response time
- Provider Latency - Time to get response from each provider
- Routing Latency - Time to make routing decisions
- Cache Performance - Hit ratio and cache latency
- Throughput - Requests per second sustained
- Error Rates - By provider, model, and error type
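Throughput is typically derived from a counter rather than measured directly. As a minimal sketch, assuming two readings of a monotonically increasing counter such as marga.requests.total taken a known number of seconds apart, the Python below computes requests per second and tolerates a counter reset. The function name and sample readings are illustrative.

```python
def per_second_rate(prev_value: float, curr_value: float, interval_s: float) -> float:
    """Per-second rate of a monotonically increasing counter between two readings.

    If the counter went backwards (for example after a process restart), the
    current value is treated as the increase since the reset.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr_value - prev_value
    if delta < 0:  # counter was reset between the two readings
        delta = curr_value
    return delta / interval_s


# Two illustrative readings taken 60 seconds apart (made-up values).
throughput = per_second_rate(12000, 12847, 60)
print(f"throughput: {throughput:.2f} requests/s")
```

Treating a backwards jump as a reset slightly undercounts the interval in which the restart happened, which is usually preferable to reporting a negative rate.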
SLI/SLO Configuration
Define Service Level Indicators and Objectives:
slo:
  - name: "API Availability"
    target: 99.9
    metric: "availability"
    window: "30d"
  - name: "Response Latency"
    target: 95 # 95th percentile
    threshold: 2000 # 2 seconds
    metric: "latency"
    window: "7d"
  - name: "Error Rate"
    target: 99 # 99% success rate
    metric: "success_rate"
    window: "24h"
Troubleshooting
Common Issues
- High Latency
  - Check provider response times
  - Verify network connectivity
  - Review routing complexity
- Request Failures
  - Examine provider error logs
  - Check API key validity
  - Verify model availability
- Cost Overruns
  - Review routing decisions
  - Check for retry loops
  - Analyze token usage patterns
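Checking for retry loops can be partly automated when logs are emitted as JSON with a request_id field, as in the log configuration earlier in this document. The Python sketch below counts log records per request and flags requests that appear unusually often. It assumes one log line per provider attempt, which this document does not confirm, and `find_retry_suspects` is a hypothetical helper rather than a MĀRGA API.

```python
import json
from collections import Counter


def find_retry_suspects(log_lines, max_attempts: int = 3) -> dict:
    """Return {request_id: count} for requests logged more than max_attempts times."""
    attempts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        request_id = record.get("request_id")
        if request_id is not None:
            attempts[request_id] += 1
    return {rid: n for rid, n in attempts.items() if n > max_attempts}


# Illustrative log lines (assumed shape: one JSON record per provider attempt).
sample_lines = [
    '{"request_id": "req-1", "provider": "openai", "latency_ms": 420}',
    '{"request_id": "req-2", "provider": "openai", "latency_ms": 5100}',
    '{"request_id": "req-2", "provider": "anthropic", "latency_ms": 4900}',
    'plain text line that is not JSON',
    '{"request_id": "req-2", "provider": "openai", "latency_ms": 5300}',
    '{"request_id": "req-2", "provider": "ollama", "latency_ms": 6100}',
]

print(find_retry_suspects(sample_lines))
```

Because the function accepts any iterable of lines, an open file handle for a log such as /var/log/marga/app.log can be passed in directly.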
Debug Mode
Enable verbose logging and tracing:
debug:
  enable: true
  log_level: "debug"
  trace_all_requests: true
  include_request_body: false # Security: avoid logging sensitive data
  include_response_body: false
Log Analysis
Key log patterns to monitor:
# High error rate pattern
grep "ERROR" /var/log/marga/app.log | grep -c "provider_error"
# Latency issues
grep "WARN.*latency" /var/log/marga/app.log
# Cost tracking
grep "cost_exceeded" /var/log/marga/app.log
MĀRGA’s monitoring capabilities provide complete visibility into your LLM routing operations, enabling proactive optimization and reliable service delivery.