MĀRGA (मार्ग) High Availability & Load Balancing Guide
Architecture Overview
                   ┌─────────────┐
                   │   Clients   │
                   └──────┬──────┘
                          │
                   ┌──────▼──────┐
                   │ Nginx / LB  │  ← Layer 7 load balancer
                   │  (port 80)  │    least_conn + health checks
                   └──────┬──────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
      ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐
      │  MĀRGA-1  │ │  MĀRGA-2  │ │  MĀRGA-3  │  ← Stateless instances
      │   :8080   │ │   :8080   │ │   :8080   │    with circuit breakers
      └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
            │             │             │
            └─────────────┼─────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
    ┌────▼────┐     ┌─────▼─────┐    ┌─────▼─────┐
    │ OpenAI  │     │ Anthropic │    │  Ollama   │
    │   API   │     │    API    │    │  (local)  │
    └─────────┘     └───────────┘    └───────────┘

Components
1. Load Balancer (Nginx)
The Nginx reverse proxy distributes traffic across MĀRGA instances using the least-connections algorithm, which suits LLM routing because request durations vary widely.
Key features:
- Least-connections load balancing
- Automatic upstream health checks (`max_fails=3`, `fail_timeout=30s`)
- Request retry on upstream failure (`proxy_next_upstream`)
- Rate limiting (100 req/s per client, burst 50)
- Extended timeouts for LLM generation (300s read timeout for streaming)
- SSE streaming support (buffering disabled for `/v1/chat/completions`)
- Keepalive connections to upstreams (32 connections)
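
To make the rate-limiting numbers concrete, here is a minimal Python sketch of a per-client limiter using 100 req/s and a burst of 50. It is a token-bucket approximation for illustration only; it is not how Nginx implements rate limiting, and the `TokenBucket` class and `allow_request` helper are hypothetical names. Time is passed in explicitly so the behavior is easy to reason about.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenBucket:
    """Per-client limiter: refills `rate` tokens per second, stores at most `burst`."""
    rate: float = 100.0   # sustained requests per second (from the feature list)
    burst: float = 50.0   # extra requests tolerated in a spike
    tokens: float = 50.0  # bucket starts full
    last: float = 0.0     # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client IP (hypothetical in-memory store).
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_ip: str, now: float) -> bool:
    bucket = buckets.setdefault(client_ip, TokenBucket(last=now))
    return bucket.allow(now)
```

A client that sends 60 requests at the same instant gets 50 through and 10 rejected; half a second later the bucket is full again.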
2. Circuit Breakers (Per-Provider)
Each LLM provider has an independent circuit breaker that protects against cascading failures.
States:
| State | Behavior |
|---|---|
| CLOSED | Normal operation. Requests pass through. Failures are counted. |
| OPEN | Provider is failing. Requests are rejected immediately and routed to fallback. |
| HALF_OPEN | Recovery probe. Limited requests (2) are allowed through to test if the provider recovered. |
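
The state transitions in the table can be sketched as a small state machine. This is a minimal Python sketch, not MĀRGA's actual implementation: the `CircuitBreaker` class and its method names are hypothetical, the defaults mirror the thresholds used in this guide (5 consecutive failures, 3 consecutive successes, 30s open timeout), and it assumes that any failure in HALF_OPEN reopens the breaker. The sampling-window failure rate and the limit on HALF_OPEN probe requests are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    failure_threshold: int = 5   # consecutive failures before opening
    success_threshold: int = 3   # consecutive HALF_OPEN successes before closing
    open_timeout: float = 30.0   # seconds to stay OPEN before probing
    state: str = "CLOSED"
    failures: int = 0
    successes: int = 0
    opened_at: float = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent to the provider."""
        if self.state == "OPEN" and now - self.opened_at >= self.open_timeout:
            # Timeout elapsed: start probing the provider.
            self.state, self.successes = "HALF_OPEN", 0
        return self.state != "OPEN"

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of a request that was allowed through."""
        if ok:
            self.failures = 0
            if self.state == "HALF_OPEN":
                self.successes += 1
                if self.successes >= self.success_threshold:
                    self.state = "CLOSED"
        else:
            self.failures += 1
            # Assumption: a single failed probe reopens the breaker.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", now
                self.failures = 0
```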
Configuration:
    routing:
      circuit_breaker:
        failure_threshold: 5         # 5 consecutive failures → OPEN
        success_threshold: 3         # 3 consecutive successes in HALF_OPEN → CLOSED
        open_timeout: 30s            # Stay OPEN for 30s before probing
        sampling_window: 60s         # Track failure rate over 60s
        failure_rate_threshold: 0.5  # 50% failure rate in window → OPEN

API endpoints:

    # View circuit breaker status
    curl http://localhost:8080/v1/circuit-breakers
    # Reset a stuck circuit breaker
    curl -X POST http://localhost:8080/v1/circuit-breakers/openai/reset

3. MĀRGA Instances (Stateless)
Each instance is identical and stateless. They can be added or removed without coordination.
Request flow within each instance:
- Receive request from load balancer
- Select provider via routing strategy
- Check circuit breaker — if OPEN, skip to failover
- Forward to provider
- On failure: record in circuit breaker, attempt failover
- On success: record in circuit breaker, return response
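
The request flow above can be sketched as a loop over providers in priority order. This is an illustrative Python sketch under stated assumptions: `handle_request`, `SimpleBreaker`, `AllProvidersFailed`, and the two example providers are hypothetical names, and the breaker is reduced to an open/closed flag plus a result log so the example stays self-contained.

```python
from typing import Callable, Dict, List, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider was skipped or returned an error."""


class SimpleBreaker:
    """Hypothetical stand-in for a per-provider circuit breaker."""

    def __init__(self, is_open: bool = False) -> None:
        self.is_open = is_open
        self.results: List[bool] = []

    def record(self, ok: bool) -> None:
        self.results.append(ok)


def handle_request(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
    breakers: Dict[str, SimpleBreaker],
) -> Tuple[str, str]:
    """Try providers in routing-priority order and return (provider, response)."""
    for name, call in providers:
        breaker = breakers[name]
        if breaker.is_open:          # OPEN: skip straight to failover
            continue
        try:
            response = call(prompt)  # forward to the provider
        except Exception:
            breaker.record(False)    # record the failure, then fail over
            continue
        breaker.record(True)         # record the success and return
        return name, response
    raise AllProvidersFailed("no healthy provider available")


def failing_provider(prompt: str) -> str:
    raise RuntimeError("provider unavailable")


def echo_provider(prompt: str) -> str:
    return prompt.upper()
```

With `openai` failing and the `anthropic` breaker open, a request falls through to `ollama`, and only the providers that were actually called have an outcome recorded.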
4. Auto-Scaling
Scaling rules based on observed metrics:
| Metric | Scale Up | Scale Down |
|---|---|---|
| CPU Utilization | > 70% for 2 min | < 30% for 5 min |
| Request Rate | > 1000 rps | < 200 rps |
| Active Connections | > 500 | < 50 |
| P95 Latency | > 2s | < 500ms |
| Error Rate | > 5% | < 1% |
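
One way to read the table is as a pair of threshold sets. The Python sketch below is an assumption about how the rules combine, since the table does not say: it scales up when any metric crosses its scale-up threshold and scales down only when every metric is below its scale-down threshold. The metric keys and the `scaling_decision` function are hypothetical, and the duration conditions on CPU (2 min, 5 min) are ignored.

```python
from typing import Dict

# Thresholds copied from the table; CPU and error rate are expressed as fractions.
SCALE_UP = {"cpu": 0.70, "rps": 1000, "connections": 500, "p95_latency_s": 2.0, "error_rate": 0.05}
SCALE_DOWN = {"cpu": 0.30, "rps": 200, "connections": 50, "p95_latency_s": 0.5, "error_rate": 0.01}


def scaling_decision(metrics: Dict[str, float]) -> str:
    """Return "scale_up", "scale_down", or "hold" for one metrics snapshot."""
    # Assumption: any single hot metric is enough to add capacity.
    if any(metrics[name] > limit for name, limit in SCALE_UP.items()):
        return "scale_up"
    # Assumption: remove capacity only when every metric is quiet.
    if all(metrics[name] < limit for name, limit in SCALE_DOWN.items()):
        return "scale_down"
    return "hold"
```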
Deployment Options
Option A: Docker Compose (Development / Small Scale)
    cd deploy/
    docker compose -f docker-compose.ha.yml up -d
    # Set the replica count per service (here: one replica each of marga-1, marga-2, marga-3)
    docker compose -f docker-compose.ha.yml up -d --scale marga-1=1 --scale marga-2=1 --scale marga-3=1

Option B: Kubernetes (Production)
    kubectl apply -f deploy/kubernetes/deployment.yaml
    # The HPA auto-scales between 3 and 15 pods based on:
    # - CPU utilization (target: 70%)
    # - Memory utilization (target: 80%)
    # - In-flight requests (target: 100 per pod)

Kubernetes features:
- HPA — scales pods 3→15 based on CPU, memory, and custom metrics
- PDB — min 2 pods always available during disruptions
- Anti-affinity — pods spread across nodes and zones
- Rolling updates — maxUnavailable: 1, maxSurge: 1
- Probes — startup (50s budget), liveness (15s interval), readiness (10s interval)
Option C: Single Instance with Built-in LB (Lightweight)
For routing across providers (not instances), use MĀRGA’s built-in load balancing:
    routing:
      strategy: load_balance
      load_balance:
        algorithm: latency_weighted
        health_aware: true

Available algorithms:
- `round_robin`: Equal distribution
- `weighted_round_robin`: Weighted by provider priority
- `least_connections`: Route to least-busy provider
- `latency_weighted`: Favor lower-latency providers
- `ip_hash`: Sticky sessions by client IP
- `random`: Random selection
Load Balancing Algorithms
Round Robin
Best for: Uniform provider capabilities with similar latency.
Weighted Round Robin
Best for: Providers with different capacity. Assign higher weights to faster providers.
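
As an illustration, weighted round robin can be sketched by repeating each provider in the rotation according to its weight. This is a deliberately simple Python sketch, not necessarily how MĀRGA schedules requests; the `weighted_round_robin` function and the example weights are hypothetical.

```python
import itertools
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Cycle through providers, repeating each one `weight` times per round."""
    schedule = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(schedule)


picker = weighted_round_robin({"openai": 3, "ollama": 1})
first_round = [next(picker) for _ in range(4)]
# first_round == ["openai", "openai", "openai", "ollama"]
```

This version sends a provider's share in one run; smoother schedulers interleave the picks but keep the same proportions.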
Least Connections
Best for: LLM routing where request durations vary significantly. Routes to the provider with the fewest in-flight requests.
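
A minimal Python sketch of the idea, assuming an in-memory counter of in-flight requests per provider; the `LeastConnections` class and its `acquire`/`release` methods are hypothetical names, not MĀRGA's API.

```python
from typing import Dict, Iterable


class LeastConnections:
    """Pick the provider with the fewest in-flight requests."""

    def __init__(self, providers: Iterable[str]) -> None:
        self.in_flight: Dict[str, int] = {name: 0 for name in providers}

    def acquire(self) -> str:
        # min() returns the first provider among ties, so ordering acts as a tiebreaker.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name: str) -> None:
        # Call when the request finishes, whether it succeeded or failed.
        self.in_flight[name] -= 1
```

Because a long streaming completion holds its slot until `release` is called, slow requests steer new traffic toward less busy providers.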
Latency Weighted
Best for: Multi-region deployments. Automatically favors providers with lower observed latency, tracked with an exponential moving average.
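
The exponential moving average (EMA) update is `ema = alpha * sample + (1 - alpha) * ema`. The Python sketch below shows one plausible way to turn those averages into routing weights by using the inverse of each provider's latency; the `LatencyWeighted` class, the `alpha` default, and the inverse-latency weighting are assumptions for illustration, not MĀRGA's documented behavior.

```python
import random
from typing import Dict, Iterable


class LatencyWeighted:
    """Favor providers with a lower exponentially averaged latency."""

    def __init__(self, providers: Iterable[str], alpha: float = 0.2,
                 initial_latency: float = 1.0) -> None:
        self.alpha = alpha  # weight given to the newest sample
        self.ema: Dict[str, float] = {name: initial_latency for name in providers}

    def observe(self, name: str, latency_s: float) -> None:
        # EMA update: recent samples count more, old ones decay geometrically.
        self.ema[name] = self.alpha * latency_s + (1 - self.alpha) * self.ema[name]

    def weights(self) -> Dict[str, float]:
        # Inverse latency, normalized so the weights sum to 1.
        inverse = {name: 1.0 / max(value, 1e-6) for name, value in self.ema.items()}
        total = sum(inverse.values())
        return {name: value / total for name, value in inverse.items()}

    def pick(self) -> str:
        current = self.weights()
        return random.choices(list(current), weights=list(current.values()), k=1)[0]
```

For example, with `alpha=0.5` and both providers starting at 1.0s, one 3.0s sample moves a provider's average to 2.0s, which halves its weight relative to the other.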
IP Hash
Best for: When you need session stickiness (e.g., conversation continuity with a specific provider).
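
A minimal Python sketch of IP-hash selection, assuming a stable hash of the client address taken modulo the number of providers; `pick_by_ip` is a hypothetical helper, and the choice of SHA-256 is only for illustration.

```python
import hashlib
from typing import Sequence


def pick_by_ip(client_ip: str, providers: Sequence[str]) -> str:
    """Map a client IP to the same provider on every request."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to a list index.
    index = int.from_bytes(digest[:8], "big") % len(providers)
    return providers[index]
```

The mapping stays stable only while the provider list is unchanged; adding or removing a provider reassigns many clients, which is the usual trade-off of plain modulo hashing.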
Health Checks
Nginx → MĀRGA Instances
- Path: `/health`
- Interval: implicit via `max_fails` and `fail_timeout`
- Behavior: 3 failures within 30s mark an instance as down; Nginx retries it automatically after the timeout
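
The `max_fails` / `fail_timeout` behavior can be modeled as passive failure counting: failures are observed on real traffic, and enough of them inside the window take the instance out of rotation for the same period. The Python sketch below is a simplified model for intuition, not Nginx's source; the `PassiveHealth` class is a hypothetical name.

```python
from typing import List


class PassiveHealth:
    """Mark an upstream down after `max_fails` failures within `fail_timeout` seconds."""

    def __init__(self, max_fails: int = 3, fail_timeout: float = 30.0) -> None:
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.failures: List[float] = []  # timestamps of recent failures
        self.down_until = 0.0

    def available(self, now: float) -> bool:
        return now >= self.down_until

    def record_failure(self, now: float) -> None:
        # Keep only failures that are still inside the window.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            # Take the instance out of rotation for fail_timeout seconds.
            self.down_until = now + self.fail_timeout
            self.failures.clear()
```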
MĀRGA → LLM Providers
- Method: Minimal API call (1 token)
- Interval: 30s (configurable)
- Behavior: Updates provider health status; unhealthy providers are skipped during routing
Kubernetes Probes
- Startup: `/health`, 10 attempts × 5s = 50s boot budget
- Liveness: `/health`, every 15s, 3 failures = restart
- Readiness: `/health`, every 10s, 2 failures = stop traffic
Failure Scenarios
Provider API Down
- Circuit breaker opens after 5 consecutive failures (or a 50% failure rate within the 60s sampling window)
- Subsequent requests skip the failed provider
- Failover routes to next-priority healthy provider
- After 30s, circuit breaker enters HALF_OPEN
- 2 probe requests test recovery
- 3 consecutive successes close the circuit breaker
MĀRGA Instance Crash
- Nginx detects failure (3 failed health checks)
- Instance removed from upstream pool
- Docker/K8s restart policy brings it back
- Nginx re-adds to pool after recovery
All Providers Down
- All circuit breakers open
- Health endpoint returns `{"status": "degraded"}`
- Requests return `503 Service Unavailable`
- Monitoring alerts fire (Datadog)
- Circuit breakers periodically probe for recovery
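
A rough Python sketch of how this scenario could be derived from breaker states. Only the `degraded` status and the 503 response come from this guide; the `"ok"` status value, the `available_providers` field, and both function names are assumptions made for the example.

```python
from typing import Dict, List


def health_status(breaker_states: Dict[str, str]) -> Dict[str, object]:
    """Summarize provider availability from per-provider breaker states."""
    available: List[str] = [name for name, state in breaker_states.items() if state != "OPEN"]
    # Assumption: "ok" while at least one provider is reachable, "degraded" otherwise.
    return {"status": "ok" if available else "degraded", "available_providers": available}


def response_code(breaker_states: Dict[str, str]) -> int:
    """Return 503 when no provider can take the request, 200 otherwise."""
    return 200 if any(state != "OPEN" for state in breaker_states.values()) else 503
```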
Network Partition
- Affected instances lose provider connectivity
- Nginx routes traffic to healthy instances
- Pod anti-affinity ensures instances are on different nodes
- Zone spread constraints prevent single-zone failure
Monitoring
Key Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| `marga_requests_total` | Total requests | none |
| `marga_requests_in_flight` | Active requests | > 200 per instance |
| `marga_request_duration_seconds` | End-to-end latency | p95 > 5s |
| `marga_provider_requests_total` | Requests per provider | none |
| `marga_provider_errors_total` | Errors per provider | > 10/min |
| `marga_provider_health` | Provider health (0/1) | = 0 |
Circuit Breaker Monitoring
    # Check all breakers
    curl -s http://localhost:8080/v1/circuit-breakers | jq .
    # Response:
    {
      "circuit_breakers": [
        {
          "name": "openai",
          "state": "CLOSED",
          "total_failures": 2,
          "consecutive_failures": 0,
          "failure_rate": 0.02
        }
      ]
    }

Configuration Reference
Full HA Config Example
    server:
      port: 8080
      host: 0.0.0.0
      timeout: 30s
    routing:
      strategy: failover
      circuit_breaker:
        enabled: true
        failure_threshold: 5
        success_threshold: 3
        open_timeout: 30s
        half_open_max_requests: 2
        sampling_window: 60s
        failure_rate_threshold: 0.5
      failover:
        max_retries: 3
        retry_delay: 1s
        health_check_interval: 30s
    health:
      enabled: true
      path: /health
      check_providers: true
      timeout: 10s

MĀRGA (मार्ग): The path that never fails.