MĀRGA (मार्ग) High Availability & Load Balancing Guide

Architecture Overview

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Nginx / LB  │  ← Layer 7 load balancer
                    │ (port 80)   │     least_conn + health checks
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼─────┐┌────▼─────┐┌─────▼─────┐
        │  MĀRGA-1  ││ MĀRGA-2  ││  MĀRGA-3  │  ← Stateless instances
        │  :8080    ││  :8080   ││   :8080   │     with circuit breakers
        └─────┬─────┘└────┬─────┘└─────┬─────┘
              │           │            │
              └───────────┼────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    ┌────▼────┐     ┌─────▼─────┐    ┌─────▼─────┐
    │ OpenAI  │     │ Anthropic │    │  Ollama   │
    │   API   │     │    API    │    │  (local)  │
    └─────────┘     └───────────┘    └───────────┘

Components

1. Load Balancer (Nginx)

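The Nginx reverse proxy distributes traffic across MĀRGA instances using the least-connections algorithm. This suits LLM routing because request durations vary widely: each new request goes to the instance with the fewest active connections rather than simply the next one in rotation.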

Key features:

  • Least-connections load balancing
  • Passive upstream health checks (max_fails=3, fail_timeout=30s)
  • Request retry on upstream failure (proxy_next_upstream)
  • Rate limiting (100 req/s per client, burst 50)
  • Extended timeouts for LLM generation (300s read timeout for streaming)
  • SSE streaming support (buffering disabled for /v1/chat/completions)
  • Keepalive connections to upstreams (32 connections)

2. Circuit Breakers (Per-Provider)

Each LLM provider has an independent circuit breaker that protects against cascading failures.

States:

State      Behavior
CLOSED     Normal operation. Requests pass through. Failures are counted.
OPEN       Provider is failing. Requests are rejected immediately and routed to fallback.
HALF_OPEN  Recovery probe. Limited requests (2) are allowed through to test whether the provider has recovered.

Configuration:

routing:
  circuit_breaker:
    failure_threshold: 5        # 5 consecutive failures → OPEN
    success_threshold: 3        # 3 consecutive successes in HALF_OPEN → CLOSED
    open_timeout: 30s           # Stay OPEN for 30s before probing
    sampling_window: 60s        # Track failure rate over 60s
    failure_rate_threshold: 0.5 # 50% failure rate in window → OPEN
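
As a rough illustration of how these settings might interact, here is a minimal, single-threaded Python sketch of the state machine. The class name CircuitBreaker, its method names, and the injectable clock are hypothetical and are not MĀRGA's actual implementation. The sampling-window failure rate and the limit on concurrent probes are left out for brevity.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch of a CLOSED / OPEN / HALF_OPEN breaker."""

    def __init__(self, failure_threshold=5, success_threshold=3,
                 open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout = open_timeout
        self.clock = clock  # injectable so the timeout can be tested
        self.state = "CLOSED"
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        """Return True if a request may be sent to the provider."""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.open_timeout:
                # open_timeout elapsed: start probing the provider.
                self.state = "HALF_OPEN"
                self.consecutive_successes = 0
                return True
            return False
        return True

    def record_success(self):
        self.consecutive_failures = 0
        if self.state == "HALF_OPEN":
            self.consecutive_successes += 1
            if self.consecutive_successes >= self.success_threshold:
                self.state = "CLOSED"

    def record_failure(self):
        if self.state == "OPEN":
            return  # already open; nothing to count
        if self.state == "HALF_OPEN":
            # Assumption: any failed probe re-opens the breaker.
            self._open()
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.consecutive_failures = 0
```

Passing a fake clock makes it possible to exercise the 30s open_timeout without actually waiting.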

API endpoints:

# View circuit breaker status
curl http://localhost:8080/v1/circuit-breakers
 
# Reset a stuck circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/openai/reset

3. MĀRGA Instances (Stateless)

Each instance is identical and stateless. They can be added or removed without coordination.

Request flow within each instance:

  1. Receive request from load balancer
  2. Select provider via routing strategy
  3. Check circuit breaker — if OPEN, skip to failover
  4. Forward to provider
  5. On failure: record in circuit breaker, attempt failover
  6. On success: record in circuit breaker, return response
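
The six steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not MĀRGA's code: route_request, SimpleBreaker, and AllProvidersFailed are hypothetical names, providers are modeled as plain callables, and the routing strategy is reduced to "try in priority order".

```python
class SimpleBreaker:
    """Tiny stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=5):
        self.threshold = threshold
        self.failures = 0

    def is_open(self):
        return self.failures >= self.threshold

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1


class AllProvidersFailed(Exception):
    """Raised when no provider could serve the request."""


def route_request(request, providers, breakers):
    """Try providers in priority order, skipping any with an open breaker.

    `providers` is an ordered list of (name, callable) pairs; each callable
    takes the request and returns a response or raises an exception.
    """
    for name, call in providers:           # step 2: select provider
        breaker = breakers[name]
        if breaker.is_open():              # step 3: breaker OPEN, fail over
            continue
        try:
            response = call(request)       # step 4: forward to provider
        except Exception:
            breaker.record_failure()       # step 5: record, then fail over
            continue
        breaker.record_success()           # step 6: record, return response
        return name, response
    raise AllProvidersFailed("no healthy provider available")
```

Because each instance keeps its own breaker state in memory, this flow needs no coordination between instances.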

4. Auto-Scaling

Scaling rules based on observed metrics:

Metric              Scale Up          Scale Down
CPU Utilization     > 70% for 2 min   < 30% for 5 min
Request Rate        > 1000 rps        < 200 rps
Active Connections  > 500             < 50
P95 Latency         > 2s              < 500ms
Error Rate          > 5%              < 1%
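
One way to read these rules is sketched below in Python. The guide does not say how the metrics combine, so this sketch assumes that any single metric above its threshold triggers a scale-up, while a scale-down requires every metric to be below its lower bound. The function name scaling_decision and the metric keys are hypothetical, and the "for 2 min" / "for 5 min" durations are assumed to be handled upstream by whatever aggregates the metrics.

```python
# Thresholds copied from the table: (scale-up above, scale-down below).
THRESHOLDS = {
    "cpu_utilization": (0.70, 0.30),
    "request_rate_rps": (1000, 200),
    "active_connections": (500, 50),
    "p95_latency_s": (2.0, 0.5),
    "error_rate": (0.05, 0.01),
}


def scaling_decision(metrics):
    """Return "up", "down", or "hold" for a dict of observed metric values."""
    # Assumption: one hot metric is enough to scale up.
    if any(metrics[name] > up for name, (up, _) in THRESHOLDS.items()):
        return "up"
    # Assumption: scale down only when every metric is quiet.
    if all(metrics[name] < down for name, (_, down) in THRESHOLDS.items()):
        return "down"
    return "hold"
```

The gap between the two thresholds for each metric acts as hysteresis, which helps avoid flapping between scale-up and scale-down.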

Deployment Options

Option A: Docker Compose (Development / Small Scale)

cd deploy/
docker compose -f docker-compose.ha.yml up -d
 
# Scale to 5 instances (2 + 2 + 1 across the three services)
docker compose -f docker-compose.ha.yml up -d --scale marga-1=2 --scale marga-2=2 --scale marga-3=1

Option B: Kubernetes (Production)

kubectl apply -f deploy/kubernetes/deployment.yaml
 
# The HPA will auto-scale between 3-15 pods based on:
# - CPU utilization (target: 70%)
# - Memory utilization (target: 80%)
# - In-flight requests (target: 100 per pod)

Kubernetes features:

  • HPA — scales pods 3→15 based on CPU, memory, and custom metrics
  • PDB — min 2 pods always available during disruptions
  • Anti-affinity — pods spread across nodes and zones
  • Rolling updates — maxUnavailable: 1, maxSurge: 1
  • Probes — startup (50s budget), liveness (15s interval), readiness (10s interval)
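
For intuition about how the HPA picks a replica count, the Python sketch below applies the formula documented for the Kubernetes Horizontal Pod Autoscaler: for each metric, desired = ceil(current_replicas × current_value / target_value), taking the largest proposal and clamping it to the configured range. The function name desired_replicas is hypothetical, and the HPA's tolerance band and stabilization windows are ignored here.

```python
import math


def desired_replicas(current_replicas, metrics, min_replicas=3, max_replicas=15):
    """Estimate the HPA's target replica count.

    `metrics` is a list of (current_value, target_value) pairs, for example
    (90, 70) for 90% CPU utilization against a 70% target.
    """
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    # The largest proposal wins, clamped to the 3-15 pod range.
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, with 4 pods at 90% CPU (target 70%), 60% memory (target 80%), and 150 in-flight requests per pod (target 100), the CPU and in-flight metrics each propose 6 pods, so the result is 6.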

Option C: Single Instance with Built-in LB (Lightweight)

For routing across providers (not instances), use MĀRGA’s built-in load balancing:

routing:
  strategy: load_balance
  load_balance:
    algorithm: latency_weighted
    health_aware: true

Available algorithms:

  • round_robin — Equal distribution
  • weighted_round_robin — Weighted by provider priority
  • least_connections — Route to least-busy provider
  • latency_weighted — Favor lower-latency providers
  • ip_hash — Sticky sessions by client IP
  • random — Random selection

Load Balancing Algorithms

Round Robin

Best for: Uniform provider capabilities with similar latency.

Weighted Round Robin

Best for: Providers with different capacity. Assign higher weights to providers that can handle more traffic.
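
As an illustration, the sketch below uses the "smooth" weighted round-robin scheme popularized by Nginx, which spreads the heavier provider's turns out instead of sending them back to back. Whether MĀRGA uses this exact variant is not stated; the class name and the example weights are hypothetical.

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin over named providers."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def next(self):
        # Raise every provider's running score by its weight...
        for name, weight in self.weights.items():
            self.current[name] += weight
        # ...pick the highest score, then penalize it by the total weight.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen
```

Over any window whose length equals the total weight, each provider is chosen exactly as many times as its weight.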

Least Connections

Best for: LLM routing where request durations vary significantly. Routes to the provider with the fewest in-flight requests.
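
A minimal sketch of the idea in Python, assuming in-flight counts are tracked in memory per provider. The class and method names are hypothetical, and a real router would need locking or atomic counters under concurrency.

```python
class LeastConnections:
    """Route each request to the provider with the fewest in-flight requests."""

    def __init__(self, providers):
        self.in_flight = {name: 0 for name in providers}

    def acquire(self):
        # Ties go to the provider listed first.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name):
        # Call when the request finishes, whether it succeeded or failed.
        self.in_flight[name] -= 1
```

A provider stuck on a long generation keeps its count high, so new requests naturally drift to providers that are finishing faster.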

Latency Weighted

Best for: Multi-region deployments. Automatically favors providers with lower observed latency, tracked with an exponential moving average.
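
The sketch below shows one plausible way to combine an exponential moving average (EMA) with weighted selection: each provider's weight is the inverse of its smoothed latency. The smoothing factor alpha, the initial latency, and all names here are assumptions for illustration, not MĀRGA's actual parameters.

```python
import random


class LatencyWeighted:
    """Pick providers with probability inversely proportional to EMA latency."""

    def __init__(self, providers, alpha=0.2, initial_latency=1.0):
        self.alpha = alpha  # higher alpha reacts faster to new samples
        self.ema = {name: initial_latency for name in providers}

    def observe(self, name, latency_s):
        """Fold a new latency sample (seconds, > 0) into the moving average."""
        self.ema[name] = self.alpha * latency_s + (1 - self.alpha) * self.ema[name]

    def weights(self):
        """Return normalized selection weights that sum to 1."""
        inverse = {name: 1.0 / value for name, value in self.ema.items()}
        total = sum(inverse.values())
        return {name: w / total for name, w in inverse.items()}

    def pick(self, rng=random):
        names = list(self.ema)
        w = self.weights()
        return rng.choices(names, weights=[w[n] for n in names], k=1)[0]
```

Selecting probabilistically rather than always taking the fastest provider keeps some traffic flowing to slower providers, so their latency estimates stay current.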

IP Hash

Best for: When you need session stickiness (e.g., conversation continuity with a specific provider).
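
A minimal sketch of IP-based stickiness in Python, assuming a stable hash of the client address taken modulo the provider count. The function name pick_by_ip is hypothetical, and the guide does not specify which hash MĀRGA uses.

```python
import hashlib


def pick_by_ip(client_ip, providers):
    """Map a client IP to the same provider on every call."""
    # Use a stable digest; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(providers)
    return providers[index]
```

With plain modulo hashing, adding or removing a provider remaps most clients; consistent hashing is the usual remedy when that matters.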

Health Checks

Nginx → MĀRGA Instances

  • Path: /health
  • Interval: none; checks are passive, governed by max_fails and fail_timeout
  • Behavior: 3 failed attempts within 30s mark the instance as down for 30s, after which Nginx tries it again

MĀRGA → LLM Providers

  • Method: Minimal API call (1 token)
  • Interval: 30s (configurable)
  • Behavior: Updates provider health status; unhealthy providers are skipped during routing

Kubernetes Probes

  • Startup: /health — 10 attempts × 5s = 50s boot budget
  • Liveness: /health — every 15s, 3 failures = restart
  • Readiness: /health — every 10s, 2 failures = stop traffic
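
The arithmetic behind these budgets is simply failure count × probe period. The short Python sketch below works it out; the helper name failure_window is hypothetical, and the figures ignore per-probe timeouts and any initial delay, so real reaction times may be somewhat longer.

```python
def failure_window(attempts, period_s):
    """Approximate seconds of consecutive failures before the probe acts."""
    return attempts * period_s


startup_budget_s = failure_window(10, 5)     # 50s before the container is killed
liveness_restart_s = failure_window(3, 15)   # about 45s until a restart
readiness_removal_s = failure_window(2, 10)  # about 20s until traffic stops

print(startup_budget_s, liveness_restart_s, readiness_removal_s)  # 50 45 20
```

Readiness reacts faster than liveness by design: an instance is pulled out of rotation well before it is restarted.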

Failure Scenarios

Provider API Down

  1. Circuit breaker opens after 5 consecutive failures, or when the failure rate reaches 50% within the 60s sampling window
  2. Subsequent requests skip the failed provider
  3. Failover routes to next-priority healthy provider
  4. After 30s, circuit breaker enters HALF_OPEN
  5. 2 probe requests test recovery
  6. 3 consecutive successes close the circuit breaker

MĀRGA Instance Crash

  1. Nginx detects failure (3 failed attempts within 30s)
  2. Instance removed from upstream pool
  3. Docker/K8s restart policy brings it back
  4. Nginx re-adds to pool after recovery

All Providers Down

  1. All circuit breakers open
  2. Health endpoint returns {"status": "degraded"}
  3. Requests return 503 Service Unavailable
  4. Monitoring alerts fire (Datadog)
  5. Circuit breakers periodically probe for recovery

Network Partition

  1. Affected instances lose provider connectivity
  2. Nginx routes traffic to healthy instances
  3. Pod anti-affinity ensures instances are on different nodes
  4. Zone spread constraints keep a single-zone failure from taking down all instances

Monitoring

Key Metrics

Metric                          Description            Alert Threshold
marga_requests_total            Total requests         -
marga_requests_in_flight        Active requests        > 200 per instance
marga_request_duration_seconds  End-to-end latency     p95 > 5s
marga_provider_requests_total   Requests per provider  -
marga_provider_errors_total     Errors per provider    > 10/min
marga_provider_health           Provider health (0/1)  = 0

Circuit Breaker Monitoring

# Check all breakers
curl -s http://localhost:8080/v1/circuit-breakers | jq .
 
# Response:
{
  "circuit_breakers": [
    {
      "name": "openai",
      "state": "CLOSED",
      "total_failures": 2,
      "consecutive_failures": 0,
      "failure_rate": 0.02
    }
  ]
}
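
For scripted checks, a response of this shape can be filtered for breakers that are not CLOSED. The Python sketch below assumes exactly the JSON structure shown above; the function name unhealthy_breakers is hypothetical, and fetching the payload over HTTP is left to the caller.

```python
import json

# Sample payload copied from the response shown above.
SAMPLE_RESPONSE = """
{
  "circuit_breakers": [
    {
      "name": "openai",
      "state": "CLOSED",
      "total_failures": 2,
      "consecutive_failures": 0,
      "failure_rate": 0.02
    }
  ]
}
"""


def unhealthy_breakers(payload):
    """Return the names of breakers whose state is OPEN or HALF_OPEN."""
    data = json.loads(payload)
    return [
        breaker["name"]
        for breaker in data["circuit_breakers"]
        if breaker["state"] != "CLOSED"
    ]


print(unhealthy_breakers(SAMPLE_RESPONSE))  # []
```

An empty list means every provider's breaker is closed; any name in the list is a candidate for investigation or a manual reset.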

Configuration Reference

Full HA Config Example

server:
  port: 8080
  host: 0.0.0.0
  timeout: 30s
 
routing:
  strategy: failover
  
  circuit_breaker:
    enabled: true
    failure_threshold: 5
    success_threshold: 3
    open_timeout: 30s
    half_open_max_requests: 2
    sampling_window: 60s
    failure_rate_threshold: 0.5
  
  failover:
    max_retries: 3
    retry_delay: 1s
    health_check_interval: 30s
 
health:
  enabled: true
  path: /health
  check_providers: true
  timeout: 10s

MĀRGA (मार्ग) — The path that never fails.