MĀRGA (मार्ग) High Availability & Load Balancing Guide

Architecture Overview

                    ┌─────────────┐
                    │   Clients   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │ Nginx / LB  │  ← Layer 7 load balancer
                    │ (port 80)   │     least_conn + health checks
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────▼─────┐┌────▼─────┐┌─────▼─────┐
        │  MĀRGA-1  ││ MĀRGA-2  ││  MĀRGA-3  │  ← Stateless instances
        │  :8080    ││  :8080   ││   :8080   │     with circuit breakers
        └─────┬─────┘└────┬─────┘└─────┬─────┘
              │           │            │
              └───────────┼────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    ┌────▼────┐     ┌─────▼─────┐    ┌─────▼─────┐
    │ OpenAI  │     │ Anthropic │    │  Ollama   │
    │   API   │     │    API    │    │  (local)  │
    └─────────┘     └───────────┘    └───────────┘

Components

1. Load Balancer (Nginx)

The nginx reverse proxy distributes traffic across MĀRGA instances using the least-connections algorithm, which suits LLM routing because request durations vary widely.

Key features:

Least-connections load balancing
Passive upstream health checks (max_fails=3, fail_timeout=30s)
Request retry on upstream failure (proxy_next_upstream)
Rate limiting (100 req/s per client, burst 50)
Extended timeouts for LLM generation (300s read timeout for streaming)
SSE streaming support (buffering disabled for /v1/chat/completions)
Keepalive connections to upstreams (32 connections)

2. Circuit Breakers (Per-Provider)

Each LLM provider has an independent circuit breaker that protects against cascading failures.

States:

State	Behavior
CLOSED	Normal operation. Requests pass through. Failures are counted.
OPEN	Provider is failing. Requests are rejected immediately and routed to fallback.
HALF_OPEN	Recovery probe. A limited number of requests (half_open_max_requests, 2 in the example config) are allowed through to test whether the provider has recovered.

Configuration:

routing:
  circuit_breaker:
    failure_threshold: 5        # 5 consecutive failures → OPEN
    success_threshold: 3        # 3 consecutive successes in HALF_OPEN → CLOSED
    open_timeout: 30s           # Stay OPEN for 30s before probing
    sampling_window: 60s        # Track failure rate over 60s
    failure_rate_threshold: 0.5 # 50% failure rate in window → OPEN
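
The states and thresholds above can be sketched as a small state machine. This is only an illustration under stated assumptions, not MĀRGA's actual implementation: the class name CircuitBreaker, the injectable clock parameter, and the rule that the failure-rate check needs at least failure_threshold samples in the window are all hypothetical. The half-open request limit is left out for brevity.

```python
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """Illustrative breaker; names and details are assumptions."""

    def __init__(self, failure_threshold=5, success_threshold=3,
                 open_timeout=30.0, sampling_window=60.0,
                 failure_rate_threshold=0.5, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout = open_timeout
        self.sampling_window = sampling_window
        self.failure_rate_threshold = failure_rate_threshold
        self.clock = clock
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.opened_at = 0.0
        self.samples = deque()  # (timestamp, succeeded) pairs inside the window

    def allow_request(self):
        """Return True if a request may be sent to the provider."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.open_timeout:
                self.state = State.HALF_OPEN  # timeout elapsed: start probing
                self.consecutive_successes = 0
                return True
            return False
        return True

    def record(self, succeeded):
        """Record one request outcome and update the state."""
        now = self.clock()
        self.samples.append((now, succeeded))
        while self.samples and now - self.samples[0][0] > self.sampling_window:
            self.samples.popleft()  # drop outcomes older than the window

        if succeeded:
            self.consecutive_failures = 0
            if self.state is State.HALF_OPEN:
                self.consecutive_successes += 1
                if self.consecutive_successes >= self.success_threshold:
                    self.state = State.CLOSED
                    self.samples.clear()
            return

        self.consecutive_failures += 1
        self.consecutive_successes = 0
        if self.state is State.HALF_OPEN:
            self._open(now)  # a failed probe re-opens the breaker
            return
        failures = sum(1 for _, ok in self.samples if not ok)
        rate = failures / len(self.samples)
        # Assumption: the rate rule needs at least failure_threshold samples,
        # so a single early failure cannot trip the breaker on its own.
        rate_tripped = (len(self.samples) >= self.failure_threshold
                        and rate >= self.failure_rate_threshold)
        if self.consecutive_failures >= self.failure_threshold or rate_tripped:
            self._open(now)

    def _open(self, now):
        self.state = State.OPEN
        self.opened_at = now
```

Passing a fake clock (for example clock=lambda: t[0]) makes the 30s open timeout testable without sleeping.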

API endpoints:

# View circuit breaker status
curl http://localhost:8080/v1/circuit-breakers
 
# Reset a stuck circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/openai/reset

3. MĀRGA Instances (Stateless)

All instances are identical and stateless, so they can be added or removed without coordination.

Request flow within each instance:

Receive request from load balancer
Select provider via routing strategy
Check circuit breaker — if OPEN, skip to failover
Forward to provider
On failure: record in circuit breaker, attempt failover
On success: record in circuit breaker, return response
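
The request flow above can be sketched as a loop over candidate providers. The names handle_request, send, ProviderError, and AllProvidersUnavailable are hypothetical, and the breaker objects are assumed to expose allow_request() and record(succeeded). Retry limits and delays from the failover config are not modeled here.

```python
class ProviderError(Exception):
    """Raised by a provider call that failed."""


class AllProvidersUnavailable(Exception):
    """Raised when every candidate is skipped or fails."""


def handle_request(request, providers, breakers, send):
    """Try providers in priority order, honoring each circuit breaker.

    providers: provider names, already ordered by the routing strategy.
    breakers:  name -> object with allow_request() and record(succeeded).
    send:      callable(name, request) -> response; raises ProviderError.
    """
    for name in providers:
        breaker = breakers[name]
        if not breaker.allow_request():
            continue  # breaker is OPEN: skip straight to the next provider
        try:
            response = send(name, request)
        except ProviderError:
            breaker.record(False)  # count the failure, then fail over
            continue
        breaker.record(True)
        return name, response
    raise AllProvidersUnavailable("no healthy provider; caller should return 503")
```

Returning the provider name alongside the response lets the caller label metrics per provider.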

4. Auto-Scaling

Scaling rules based on observed metrics:

Metric	Scale Up	Scale Down
CPU Utilization	> 70% for 2 min	< 30% for 5 min
Request Rate	> 1000 rps	< 200 rps
Active Connections	> 500	< 50
P95 Latency	> 2s	< 500ms
Error Rate	> 5%	< 1%
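
The table does not say how the rules combine, so the sketch below makes two assumptions that should be checked against the real autoscaler: any single metric above its scale-up threshold triggers a scale-up, and a scale-down requires every metric to be below its scale-down threshold. The CPU duration conditions (2 min, 5 min) are ignored, and the names Metrics and scaling_decision are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_percent: float
    requests_per_second: float
    active_connections: int
    p95_latency_seconds: float
    error_rate_percent: float


# (scale-up threshold, scale-down threshold) per metric, from the table above.
THRESHOLDS = {
    "cpu_percent": (70, 30),
    "requests_per_second": (1000, 200),
    "active_connections": (500, 50),
    "p95_latency_seconds": (2.0, 0.5),
    "error_rate_percent": (5, 1),
}


def scaling_decision(m: Metrics) -> str:
    """Return "up", "down", or "hold" for one metrics sample."""
    values = {name: getattr(m, name) for name in THRESHOLDS}
    # Assumption: one hot metric is enough to scale up.
    if any(values[n] > up for n, (up, _) in THRESHOLDS.items()):
        return "up"
    # Assumption: scale down only when every metric is quiet.
    if all(values[n] < down for n, (_, down) in THRESHOLDS.items()):
        return "down"
    return "hold"
```

Keeping a gap between the two thresholds (for example 70% and 30% CPU) leaves a "hold" band that avoids flapping between scale-up and scale-down.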

Deployment Options

Option A: Docker Compose (Development / Small Scale)

cd deploy/
docker compose -f docker-compose.ha.yml up -d
 
# Set the replica count per service explicitly; as written this runs one replica of each service (3 instances in total)
docker compose -f docker-compose.ha.yml up -d --scale marga-1=1 --scale marga-2=1 --scale marga-3=1

Option B: Kubernetes (Production)

kubectl apply -f deploy/kubernetes/deployment.yaml
 
# The HPA will auto-scale between 3 and 15 pods based on:
# - CPU utilization (target: 70%)
# - Memory utilization (target: 80%)
# - In-flight requests (target: 100 per pod)

Kubernetes features:

HPA — scales pods 3→15 based on CPU, memory, and custom metrics
PDB — min 2 pods always available during disruptions
Anti-affinity — pods spread across nodes and zones
Rolling updates — maxUnavailable: 1, maxSurge: 1
Probes — startup (50s budget), liveness (15s interval), readiness (10s interval)

Option C: Single Instance with Built-in LB (Lightweight)

For routing across providers (not instances), use MĀRGA’s built-in load balancing:

routing:
  strategy: load_balance
  load_balance:
    algorithm: latency_weighted
    health_aware: true

Available algorithms:

round_robin — Equal distribution
weighted_round_robin — Weighted by provider priority
least_connections — Route to least-busy provider
latency_weighted — Favor lower-latency providers
ip_hash — Sticky sessions by client IP
random — Random selection

Load Balancing Algorithms

Round Robin

Best for: Uniform provider capabilities with similar latency.

Weighted Round Robin

Best for: Providers with different capacity. Assign higher weights to providers with more capacity.
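
One common way to implement weighted round robin is the "smooth" variant, which interleaves picks instead of sending bursts to the heaviest provider. The sketch below shows that technique; whether MĀRGA uses it is not stated in this guide, and the function name and example weights are hypothetical.

```python
def smooth_weighted_round_robin(weights, n):
    """Return n picks; each provider appears in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every provider earns credit equal to its weight each round.
        for name, w in weights.items():
            current[name] += w
        # The provider with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks
```

With weights of 3, 2, and 1, any six consecutive picks contain the three providers 3, 2, and 1 times respectively.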

Least Connections

Best for: LLM routing where request durations vary significantly. Routes to the provider with the fewest in-flight requests.
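
A minimal sketch of least-connections selection, assuming the router keeps a per-provider counter of in-flight requests. The class and method names are hypothetical, and a real implementation would need locking or atomics under concurrency.

```python
class LeastConnections:
    """Pick the provider with the fewest in-flight requests."""

    def __init__(self, providers):
        self.in_flight = {name: 0 for name in providers}

    def acquire(self):
        # Ties are broken by the order in which providers were listed.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name):
        # Call once the request finishes, whether it succeeded or failed.
        self.in_flight[name] -= 1
```

A long streaming completion keeps its counter raised until release() is called, so new requests drift toward providers that are currently idle.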

Latency Weighted

Best for: Multi-region deployments. Automatically favors providers with lower observed latency, tracked as an exponential moving average (EMA).
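
A sketch of how an exponential moving average can drive provider selection. The smoothing factor alpha, the initial latency, and the choice of weighting each provider by the inverse of its EMA are assumptions made for illustration; the guide does not specify them.

```python
import random


class LatencyWeighted:
    """Favor providers whose smoothed latency is lower."""

    def __init__(self, providers, alpha=0.2, initial_latency=1.0, rng=None):
        self.alpha = alpha
        self.ema = {name: initial_latency for name in providers}
        self.rng = rng or random.Random()

    def observe(self, name, latency_seconds):
        # EMA update: new = alpha * sample + (1 - alpha) * old
        old = self.ema[name]
        self.ema[name] = self.alpha * latency_seconds + (1 - self.alpha) * old

    def weights(self):
        # Assumption: selection probability is proportional to 1 / EMA.
        inverse = {name: 1.0 / max(v, 1e-6) for name, v in self.ema.items()}
        total = sum(inverse.values())
        return {name: v / total for name, v in inverse.items()}

    def pick(self):
        w = self.weights()
        return self.rng.choices(list(w), weights=list(w.values()), k=1)[0]
```

Because selection is probabilistic, slower providers still receive occasional traffic, which keeps their latency estimate fresh.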

IP Hash

Best for: When you need session stickiness (e.g., conversation continuity with a specific provider).
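
Stickiness by client IP can be sketched as a stable hash of the address mapped onto the provider list. The function name is hypothetical and SHA-256 is just one reasonable choice of hash. With simple modulo mapping, adding or removing a provider reassigns many clients.

```python
import hashlib


def pick_by_ip(client_ip, providers):
    """Map a client IP to the same provider on every call."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then wrap onto the list.
    index = int.from_bytes(digest[:8], "big") % len(providers)
    return providers[index]
```

Python's built-in hash() is randomized per process for strings, so a cryptographic or other fixed hash is used here to keep the mapping stable across restarts and instances.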

Health Checks

Nginx → MĀRGA Instances

Path: /health
Interval: none; checks are passive, driven by max_fails and fail_timeout on live traffic
Behavior: 3 failed requests within 30s mark the instance as down for 30s, after which nginx tries it again

MĀRGA → LLM Providers

Method: Minimal API call (1 token)
Interval: 30s (configurable)
Behavior: Updates provider health status; unhealthy providers are skipped during routing

Kubernetes Probes

Startup: /health — 10 attempts × 5s = 50s boot budget
Liveness: /health — every 15s, 3 failures = restart
Readiness: /health — every 10s, 2 failures = stop traffic

Failure Scenarios

Provider API Down

Circuit breaker opens after 5 consecutive failures (or a 50% failure rate within the 60s sampling window)
Subsequent requests skip the failed provider
Failover routes to next-priority healthy provider
After 30s, circuit breaker enters HALF_OPEN
Probe requests (limited by half_open_max_requests: 2) test recovery
3 consecutive successes close the circuit breaker

MĀRGA Instance Crash

Nginx detects failure (3 failed requests within 30s, per max_fails and fail_timeout)
Instance removed from upstream pool
Docker/K8s restart policy brings it back
Nginx re-adds to pool after recovery

All Providers Down

All circuit breakers open
Health endpoint returns {"status": "degraded"}
Requests return 503 Service Unavailable
Monitoring alerts fire (Datadog)
Circuit breakers periodically probe for recovery

Network Partition

Affected instances lose provider connectivity
Nginx routes traffic to healthy instances
Pod anti-affinity ensures instances are on different nodes
Zone spread constraints prevent single-zone failure

Monitoring

Key Metrics

Metric	Description	Alert Threshold
`marga_requests_total`	Total requests	—
`marga_requests_in_flight`	Active requests	> 200 per instance
`marga_request_duration_seconds`	End-to-end latency	p95 > 5s
`marga_provider_requests_total`	Requests per provider	—
`marga_provider_errors_total`	Errors per provider	> 10/min
`marga_provider_health`	Provider health (0/1)	= 0

Circuit Breaker Monitoring

# Check all breakers
curl -s http://localhost:8080/v1/circuit-breakers | jq .
 
# Response:
{
  "circuit_breakers": [
    {
      "name": "openai",
      "state": "CLOSED",
      "total_failures": 2,
      "consecutive_failures": 0,
      "failure_rate": 0.02
    }
  ]
}
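
For scripted checks, the response above can be parsed and filtered. This sketch relies only on the fields shown in the sample response; the function name and the max_failure_rate cutoff (borrowed from failure_rate_threshold: 0.5) are assumptions.

```python
import json


def breakers_needing_attention(payload, max_failure_rate=0.5):
    """Return names of breakers that are not CLOSED or are failing heavily."""
    data = json.loads(payload)
    flagged = []
    for breaker in data["circuit_breakers"]:
        not_closed = breaker["state"] != "CLOSED"
        too_many_failures = breaker["failure_rate"] >= max_failure_rate
        if not_closed or too_many_failures:
            flagged.append(breaker["name"])
    return flagged
```

The payload argument is the raw JSON text, for example the output of the curl command above captured by a monitoring script.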

Configuration Reference

Full HA Config Example

server:
  port: 8080
  host: 0.0.0.0
  timeout: 30s
 
routing:
  strategy: failover
  
  circuit_breaker:
    enabled: true
    failure_threshold: 5
    success_threshold: 3
    open_timeout: 30s
    half_open_max_requests: 2
    sampling_window: 60s
    failure_rate_threshold: 0.5
  
  failover:
    max_retries: 3
    retry_delay: 1s
    health_check_interval: 30s
 
health:
  enabled: true
  path: /health
  check_providers: true
  timeout: 10s

MĀRGA (मार्ग) — The path that never fails.
