High Availability

HA Building Blocks

  • Gateway: horizontally scalable stateless wire endpoints
  • Storage Engine: replicated storage semantics with leader/replica behavior
  • Coordinator: stateless routing/orchestration replicas
  • etcd: metadata and coordination backend

Validation

To validate cluster behavior locally, run:

make e2e-local-cluster

In production, run failure drills (node termination, network disruption, process restarts) and verify that results stay correct and that latency stays within its SLOs.
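A drill verdict can be reduced to a small check over the samples collected while the failure was injected. The following is a minimal sketch, not part of the project: the function names, the input shape (a list of latencies in milliseconds plus a count of incorrect results), and the SLO thresholds are all assumptions made for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples collected during the drill")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def evaluate_drill(latencies_ms, wrong_results, p95_slo_ms, p99_slo_ms):
    """Summarize one failure drill (hypothetical helper).

    latencies_ms:  request latencies observed during the drill
    wrong_results: number of responses that failed a correctness check
    p95_slo_ms, p99_slo_ms: assumed SLO thresholds, not project values
    """
    p95 = percentile(latencies_ms, 95)
    p99 = percentile(latencies_ms, 99)
    return {
        "p95_ms": p95,
        "p99_ms": p99,
        "correct": wrong_results == 0,
        "within_slo": p95 <= p95_slo_ms and p99 <= p99_slo_ms,
    }
```

A drill would count as passed only when both "correct" and "within_slo" are true. Keeping the two flags separate makes it clear whether a failed drill broke correctness or only the latency target.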

Operational Signals

Track:

  • command error rate by service
  • leader/replica health indicators
  • request latency p95/p99 under failover
  • cursor/query retries and backpressure trends
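As an illustration of the first signal, command error rate by service can be derived from a stream of per-command outcomes. This is a hedged sketch only: the (service, succeeded) event shape and the function name are assumptions, and a real deployment would more likely compute this in its metrics system than in application code.

```python
from collections import defaultdict


def error_rate_by_service(events):
    """Return {service: fraction of failed commands}.

    events: iterable of (service, succeeded) pairs; this shape is a
    hypothetical stand-in for whatever the metrics pipeline emits.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for service, succeeded in events:
        totals[service] += 1
        if not succeeded:
            errors[service] += 1
    # Every service in totals has at least one event, so no division by zero.
    return {service: errors[service] / totals[service] for service in totals}
```

For example, four gateway commands with one failure yield a rate of 0.25 for that service. Grouping by service helps separate a failing Storage Engine leader from a Gateway that is retrying correctly.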

Source of Truth

  • services/storage-engine/src/raft/mod.rs
  • services/coordinator/internal/region/failover.go
  • services/gateway/src/retry/mod.rs
  • tests/e2e/matrix/raft/test_failover.py