Monitoring and Observability

MĀRGA exports metrics, traces, and logs, and ships with built-in Datadog integration for dashboards and alerting.

Overview

MĀRGA exports detailed telemetry covering:

Request routing decisions and latency
Provider performance and errors
Cost tracking and optimization
Security and compliance metrics
A/B testing results

Datadog Integration

Quick Setup

Set Environment Variables

export DD_API_KEY="your_datadog_api_key"
export DD_APP_KEY="your_datadog_app_key"  # Required for Datadog API calls (dashboards, monitors)
export DD_SITE="datadoghq.com"  # or datadoghq.eu
export DD_SERVICE="marga"
export DD_ENV="production"
export DD_VERSION="1.0.0"
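
Before starting the server, it can help to confirm that the variables above are actually set. The following is a minimal sketch and not part of MĀRGA: the helper name `missing_datadog_vars` and the decision to treat these five variables as required are assumptions made for illustration.

```python
import os

# Variables from the setup step above. Treating all of them as required
# is an assumption of this sketch; adjust the tuple to your deployment.
REQUIRED_VARS = ("DD_API_KEY", "DD_SITE", "DD_SERVICE", "DD_ENV", "DD_VERSION")


def missing_datadog_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_datadog_vars()` with no argument inspects the current process environment; an empty list means every listed variable has a value.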

Enable in Configuration

monitoring:
  datadog:
    enable: true
    trace_sample_rate: 1.0
    profile_sample_rate: 0.1
    custom_metrics: true
    
  metrics:
    enable: true
    namespace: "marga"
    tags:
      - "service:llm-router"
      - "team:platform"

Start with APM

# Using Docker
docker run -d \
  --name marga \
  -e DD_API_KEY=$DD_API_KEY \
  -e DD_APM_ENABLED=true \
  -e DD_LOGS_ENABLED=true \
  -p 8080:8080 \
  ghcr.io/gaurav21/marga:latest
 
# Using binary
DD_TRACE_ENABLED=1 ./marga-server

APM Configuration

MĀRGA automatically instruments HTTP requests, database queries, and external API calls.

apm:
  enable: true
  service_name: "marga"
  sample_rate: 1.0
  
  # Custom span tags
  span_tags:
    - "provider"
    - "model" 
    - "strategy"
    - "cost_tier"
    
  # Sensitive data filtering
  obfuscation:
    query_string: true
    headers: ["authorization", "x-api-key"]
    json_keys: ["api_key", "secret"]
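
To make the effect of the obfuscation settings concrete, here is a small sketch of key-based and header-based redaction. It illustrates the idea only and is not MĀRGA's implementation: the function names and the `[REDACTED]` placeholder are assumptions.

```python
# Key and header names mirror the obfuscation settings above.
SENSITIVE_JSON_KEYS = {"api_key", "secret"}
SENSITIVE_HEADERS = {"authorization", "x-api-key"}
REDACTED = "[REDACTED]"  # placeholder text chosen for this sketch


def redact_json(value):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(value, dict):
        return {
            key: REDACTED if key.lower() in SENSITIVE_JSON_KEYS else redact_json(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_json(item) for item in value]
    return value


def redact_headers(headers):
    """Replace sensitive header values; names are compared case-insensitively."""
    return {
        name: REDACTED if name.lower() in SENSITIVE_HEADERS else val
        for name, val in headers.items()
    }
```

Both functions return new objects and leave their inputs unchanged, so the original request can still be forwarded to the provider.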

Log Configuration

logging:
  level: "info"
  format: "json"
  
  datadog:
    enable: true
    source: "marga"
    service: "llm-router"
    
  # Structured logging
  fields:
    - "request_id"
    - "provider" 
    - "model"
    - "latency_ms"
    - "tokens_used"
    - "cost_usd"
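
For reference, a log line carrying the structured fields above might be assembled as in the sketch below. The `timestamp` and `level` keys, and the function name, are assumptions; only the six field names come from the configuration.

```python
import json
from datetime import datetime, timezone


def format_request_log(request_id, provider, model, latency_ms, tokens_used,
                       cost_usd, level="info"):
    """Build one JSON log line with the structured fields listed above."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed key
        "level": level,                                        # assumed key
        "request_id": request_id,
        "provider": provider,
        "model": model,
        "latency_ms": latency_ms,
        "tokens_used": tokens_used,
        "cost_usd": cost_usd,
    }
    return json.dumps(record)
```

Emitting one JSON object per line keeps the output easy to filter by `provider` or `model` once the logs reach Datadog.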

Metrics Reference

Core Metrics

Request Metrics

marga.requests.total (counter) - Total requests processed
marga.requests.duration (histogram) - Request processing time
marga.requests.errors (counter) - Failed requests by error type

Routing Metrics

marga.routing.decisions (counter) - Routing decisions by strategy
marga.routing.fallbacks (counter) - Fallback activations
marga.routing.cache_hits (counter) - Cache hits, used to compute the hit/miss ratio

Provider Metrics

marga.provider.latency (histogram) - Provider response times
marga.provider.errors (counter) - Provider-specific errors
marga.provider.tokens (counter) - Tokens consumed per provider
marga.provider.cost (counter) - Cost by provider and model

Quality Metrics

marga.quality.score (histogram) - Response quality assessments
marga.quality.degradation (counter) - Quality degradation events
marga.ab_test.conversion (counter) - A/B test performance

Custom Metrics API

Access real-time metrics via the /v1/metrics endpoint:

# Get all metrics
curl http://localhost:8080/v1/metrics
 
# Filter by type
curl "http://localhost:8080/v1/metrics?type=provider"
 
# Time range query
curl "http://localhost:8080/v1/metrics?from=2024-01-01T00:00:00Z&to=2024-01-02T00:00:00Z"
 
# Specific metric with tags
curl "http://localhost:8080/v1/metrics/marga.provider.latency?provider=openai&model=gpt-4"

Response format:

{
  "metrics": [
    {
      "name": "marga.requests.total",
      "type": "counter", 
      "value": 12847,
      "timestamp": "2024-01-15T10:30:00Z",
      "tags": {
        "provider": "openai",
        "model": "gpt-4",
        "status": "success"
      }
    }
  ],
  "metadata": {
    "count": 1,
    "time_range": {
      "from": "2024-01-15T10:00:00Z",
      "to": "2024-01-15T10:30:00Z"
    }
  }
}
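
A response in this shape can be aggregated client-side. The following sketch sums one metric per tag value; the function name is hypothetical, and the second sample entry uses an illustrative value rather than real output.

```python
def total_by_tag(payload, metric_name, tag):
    """Sum the values of one metric, grouped by the value of a tag."""
    totals = {}
    for metric in payload.get("metrics", []):
        if metric.get("name") != metric_name:
            continue
        key = metric.get("tags", {}).get(tag, "unknown")
        totals[key] = totals.get(key, 0) + metric["value"]
    return totals


# Sample payload in the response format above (second entry is illustrative).
sample = {
    "metrics": [
        {"name": "marga.requests.total", "type": "counter", "value": 12847,
         "tags": {"provider": "openai", "model": "gpt-4", "status": "success"}},
        {"name": "marga.requests.total", "type": "counter", "value": 300,
         "tags": {"provider": "anthropic", "status": "success"}},
    ]
}
per_provider = total_by_tag(sample, "marga.requests.total", "provider")
```

Entries without the requested tag are grouped under `unknown` rather than dropped, so the totals still add up to the overall count.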

Datadog Dashboard

Automated Setup

Deploy the MĀRGA dashboard automatically:

curl -X POST "https://api.${DD_SITE}/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d @dashboard.json

Dashboard Configuration

{
  "title": "MĀRGA - LLM Router Monitoring",
  "description": "Comprehensive monitoring for MĀRGA LLM routing service",
  "template_variables": [
    {
      "name": "provider",
      "default": "*",
      "prefix": "provider"
    },
    {
      "name": "environment", 
      "default": "*",
      "prefix": "env"
    }
  ],
  "layout_type": "ordered",
  "widgets": [
    {
      "definition": {
        "title": "Request Latency by Provider",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:marga.requests.duration{$provider,$environment} by {provider}",
            "display_type": "line",
            "style": {
              "palette": "dog_classic",
              "line_type": "solid",
              "line_width": "normal"
            }
          }
        ],
        "yaxis": {
          "scale": "linear",
          "label": "Latency (ms)"
        }
      }
    },
    {
      "definition": {
        "title": "Provider Error Rates",
        "type": "query_value",
        "requests": [
          {
            "q": "sum:marga.provider.errors{$provider,$environment}.as_rate()",
            "aggregator": "avg"
          }
        ],
        "precision": 2
      }
    },
    {
      "definition": {
        "title": "Cost Tracking",
        "type": "toplist", 
        "requests": [
          {
            "q": "top(sum:marga.provider.cost{$provider,$environment} by {provider,model}, 10, 'sum', 'desc')"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Routing Decisions",
        "type": "sunburst",
        "requests": [
          {
            "q": "sum:marga.routing.decisions{$provider,$environment} by {strategy,provider,model}"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Quality Score Distribution", 
        "type": "distribution",
        "requests": [
          {
            "q": "avg:marga.quality.score{$provider,$environment}"
          }
        ]
      }
    }
  ]
}
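
A typo in a widget query produces an empty graph rather than an error, so it can be worth checking `dashboard.json` against the metric names in the Metrics Reference before uploading. This is a sketch under stated assumptions: the helper name and the regular expression are illustrative, and the known-metric set is copied from this page.

```python
import re

# Metric names copied from the Metrics Reference section of this page.
KNOWN_METRICS = {
    "marga.requests.total", "marga.requests.duration", "marga.requests.errors",
    "marga.routing.decisions", "marga.routing.fallbacks", "marga.routing.cache_hits",
    "marga.provider.latency", "marga.provider.errors", "marga.provider.tokens",
    "marga.provider.cost", "marga.quality.score", "marga.quality.degradation",
    "marga.ab_test.conversion",
}

# Matches dotted names such as marga.provider.errors inside a query string.
METRIC_PATTERN = re.compile(r"marga(?:\.[a-z_]+)+")


def unknown_metrics(dashboard):
    """Return metric names used in widget queries that are not in KNOWN_METRICS."""
    found = set()
    for widget in dashboard.get("widgets", []):
        for request in widget.get("definition", {}).get("requests", []):
            found.update(METRIC_PATTERN.findall(request.get("q", "")))
    return sorted(found - KNOWN_METRICS)
```

Load the file with `json.load` and pass the resulting dictionary to `unknown_metrics`; an empty list means every query refers to a documented metric.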

Key Dashboard Widgets

Request Overview
- Request volume and success rate
- Average latency by provider
- Error rate trends

Provider Performance
- Latency percentiles (p50, p95, p99)
- Error breakdown by provider/model
- Availability and uptime

Cost Analysis
- Cost per provider and model
- Daily/monthly spend trends
- Cost per successful request

Routing Intelligence
- Strategy effectiveness
- Fallback activation frequency
- Cache hit ratios

Quality Monitoring
- Response quality scores
- A/B test performance
- Quality degradation alerts

Alerting

Critical Alerts

alerts:
  - name: "High Error Rate"
    query: "sum(last_5m):sum:marga.requests.errors{*}.as_count() / sum:marga.requests.total{*}.as_count() > 0.05"
    message: "MĀRGA error rate exceeded 5% @slack-platform-alerts"
    priority: "high"
    
  - name: "Provider Unavailable"
    query: "avg(last_2m):sum:marga.provider.errors{*} by {provider}.as_rate() > 0.8"
    message: "Provider {{provider.name}} is failing @pagerduty"
    priority: "critical"
    
  - name: "High Latency"
    query: "avg(last_10m):avg:marga.requests.duration{*} > 5000"
    message: "MĀRGA latency exceeded 5 seconds @slack-platform-alerts"
    priority: "medium"
    
  - name: "Budget Exceeded"
    query: "sum(last_1h):sum:marga.provider.cost{*} > 50"
    message: "Hourly spend exceeded $50 @slack-finance"
    priority: "high"
    
  - name: "Quality Degradation"
    query: "avg(last_15m):avg:marga.quality.score{*} < 0.7"
    message: "Response quality degraded below 70% @slack-platform-alerts"
    priority: "medium"

Alert Configuration

# Create alert via API (uses the Datadog site set in DD_SITE)
curl -X POST "https://api.${DD_SITE}/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  -d '{
    "name": "MĀRGA High Error Rate",
    "type": "metric alert",
    "query": "sum(last_5m):sum:marga.requests.errors{*}.as_count() / sum:marga.requests.total{*}.as_count() > 0.05",
    "message": "MĀRGA error rate exceeded 5%",
    "tags": ["service:marga", "team:platform"],
    "options": {
      "notify_audit": false,
      "require_full_window": false,
      "notify_no_data": true,
      "no_data_timeframe": 10
    }
  }'
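
To create several monitors from the alert definitions listed under Critical Alerts, each entry can be converted into a request body of the shape used above. This is a sketch, not a MĀRGA feature: the function name is hypothetical, the options simply repeat the values from the example request, and the `priority` field is left out because its mapping to Datadog is not specified here.

```python
import json


def monitor_payload(alert, tags=("service:marga", "team:platform")):
    """Convert one alert definition into a Datadog monitor request body."""
    return {
        "name": "MĀRGA " + alert["name"],
        "type": "metric alert",
        "query": alert["query"],
        "message": alert["message"],
        "tags": list(tags),
        "options": {
            "notify_audit": False,
            "require_full_window": False,
            "notify_no_data": True,
            "no_data_timeframe": 10,
        },
    }


# The "High Latency" alert from the Critical Alerts list.
high_latency = {
    "name": "High Latency",
    "query": "avg(last_10m):avg:marga.requests.duration{*} > 5000",
    "message": "MĀRGA latency exceeded 5 seconds @slack-platform-alerts",
}
body = json.dumps(monitor_payload(high_latency), ensure_ascii=False)
```

The resulting `body` string can be sent with the same headers as the curl request above.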

Health Checks

Built-in Health Endpoints

MĀRGA exposes the following health endpoints:

# Basic health check
curl http://localhost:8080/health
# Response: {"status": "healthy", "timestamp": "2024-01-15T10:30:00Z"}
 
# Detailed health with provider status  
curl http://localhost:8080/health/detailed
# Response includes provider connectivity and model availability
 
# Readiness check (for Kubernetes)
curl http://localhost:8080/ready
 
# Liveness check  
curl http://localhost:8080/live

Custom Health Checks

Configure provider-specific health monitoring:

health_checks:
  enable: true
  interval: "30s"
  timeout: "10s"
  
  providers:
    openai:
      endpoint: "/v1/models"
      expected_status: 200
      
    anthropic:
      endpoint: "/v1/messages"  
      method: "POST"
      payload: '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"test"}]}'
      
    ollama:
      endpoint: "/api/generate"
      method: "POST" 
      payload: '{"model":"llama2","prompt":"test","stream":false}'
 
  datadog_integration:
    publish_metrics: true
    service_checks: true

Performance Monitoring

Key Performance Indicators

Monitor these critical metrics:

Request Latency - End-to-end response time
Provider Latency - Time to get response from each provider
Routing Latency - Time to make routing decisions
Cache Performance - Hit ratio and cache latency
Throughput - Requests per second sustained
Error Rates - By provider, model, and error type
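
Latency indicators are usually reported as percentiles rather than averages, because a few slow requests can hide behind a healthy mean. The sketch below uses the nearest-rank method on raw samples; Datadog computes percentiles from its own aggregations, so treat this as an explanation of the concept and not a reproduction of its numbers.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so integer percentiles avoid float rounding.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

For example, with latencies of 80, 95, 120, and 2000 ms, the median is 95 ms while the 95th percentile is 2000 ms, which is the value that a p95 latency target would be compared against.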

SLI/SLO Configuration

Define Service Level Indicators and Objectives:

slo:
  - name: "API Availability"
    target: 99.9
    metric: "availability"
    window: "30d"
    
  - name: "Response Latency"  
    target: 95  # percent of requests that must finish within the threshold (p95)
    threshold: 2000  # milliseconds (2 seconds)
    metric: "latency"
    window: "7d"
    
  - name: "Success Rate"
    target: 99  # 99% success rate
    metric: "success_rate" 
    window: "24h"
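
The targets above translate into concrete budgets. The sketch below works through the arithmetic, assuming a latency objective means "at least `target` percent of requests finish within `threshold` milliseconds"; the function names and the window table are illustrative.

```python
# Window lengths in minutes for the windows used in the configuration above.
WINDOW_MINUTES = {"24h": 24 * 60, "7d": 7 * 24 * 60, "30d": 30 * 24 * 60}


def error_budget_minutes(target_pct, window):
    """Minutes of unavailability permitted by an availability target over a window."""
    return round((100 - target_pct) / 100 * WINDOW_MINUTES[window], 2)


def latency_slo_met(latencies_ms, threshold_ms, target_pct):
    """True if at least target_pct percent of requests finished within threshold_ms."""
    if not latencies_ms:
        return True  # no traffic means nothing was violated (a choice of this sketch)
    within = sum(1 for value in latencies_ms if value <= threshold_ms)
    return within * 100 >= target_pct * len(latencies_ms)
```

A 99.9% availability target over 30 days leaves 43.2 minutes of allowed downtime (0.1% of 43,200 minutes), which is a useful number to keep in mind when choosing alert thresholds.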

Troubleshooting

Common Issues

High Latency
- Check provider response times
- Verify network connectivity
- Review routing complexity

Request Failures
- Examine provider error logs
- Check API key validity
- Verify model availability

Cost Overruns
- Review routing decisions
- Check for retry loops
- Analyze token usage patterns

Debug Mode

Enable verbose logging and tracing:

debug:
  enable: true
  log_level: "debug"
  trace_all_requests: true
  include_request_body: false  # Security: avoid logging sensitive data
  include_response_body: false

Log Analysis

Key log patterns to monitor:

# High error rate pattern
grep "ERROR" /var/log/marga/app.log | grep -c "provider_error"
 
# Latency issues  
grep "WARN.*latency" /var/log/marga/app.log
 
# Cost tracking
grep "cost_exceeded" /var/log/marga/app.log
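
When logs are written in JSON format, parsing them gives more precise counts than pattern matching. The following sketch counts error-level records per provider; it assumes each line is a JSON object with a `level` key and the `provider` field from the logging configuration, which may not match the actual log schema.

```python
import json
from collections import Counter


def count_provider_errors(lines):
    """Count error-level log records per provider from JSON log lines."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        # "level" is an assumed key name; "provider" comes from the logging fields.
        if str(record.get("level", "")).lower() == "error" and "provider" in record:
            counts[record["provider"]] += 1
    return counts
```

Passing an open file object, for example `count_provider_errors(open("/var/log/marga/app.log"))`, works because the function only iterates over lines.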

MĀRGA’s monitoring capabilities provide complete visibility into your LLM routing operations, enabling proactive optimization and reliable service delivery.
