Scaling Ceph RGW: The Power of Keepalived for High Availability
When deploying Ceph Object Gateway (RGW), the gateway itself is essentially a stateless proxy. While you can scale horizontally by adding more RGW instances, you face a critical challenge: How do you present a single, reliable endpoint to your clients?
If your clients are hardcoded to a single RGW IP address and that instance goes down, your storage becomes inaccessible. This is where Keepalived becomes an essential component of your infrastructure.
The Architecture: Keepalived + RGW
Keepalived implements the Virtual Router Redundancy Protocol (VRRP). It allows you to configure a Virtual IP (VIP) that floats between your physical RGW nodes.
If the primary node hosting the VIP fails, Keepalived on the standby node notices the missing VRRP advertisements and takes over the VIP, typically within a few seconds at the default one-second advertisement interval.
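As a rough guide to how long detection takes, VRRPv2 (RFC 3768, the version Keepalived uses by default) has a backup declare the master down after the interval below. The backup priority of 100 and the one-second advertisement interval are illustrative assumptions, not values taken from this setup.

$$
T_{\text{down}} = 3 \times \text{advert\_int} + \frac{256 - \text{priority}}{256}\ \text{s} = 3 \times 1 + \frac{156}{256} \approx 3.6\ \text{s}
$$

Lowering advert_int shortens this window, at the cost of more VRRP traffic and greater sensitivity to brief network hiccups.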
Benefits: With vs. Without Keepalived
| Feature | Without Keepalived | With Keepalived |
|---|---|---|
| Availability | Single point of failure. | High availability via failover. |
| Client Configuration | Hardcoded to specific nodes. | Points to a single, stable VIP. |
| Maintenance | Requires client-side changes. | Transparent; move VIP to perform updates. |
| Complexity | Low initially, high during outage. | Moderate setup, high operational resilience. |
Implementation Example
In this setup, we assume two nodes (node-1 and node-2) both running RGW. We want a shared VIP: 192.168.1.100.
Keepalived Configuration (keepalived.conf)
Keepalived reads its configuration from /etc/keepalived/keepalived.conf; create this file on both nodes.
Note: Give the preferred master the higher priority, and set the interface name to match each node's network interface.
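A minimal sketch of what such a file might look like on node-1 follows. The interface name (eth0), the /24 prefix, the virtual_router_id, the priority values, and the pgrep-based health check are all assumptions for illustration; adapt them to your environment.

```
# /etc/keepalived/keepalived.conf on node-1 (sketch, not a drop-in file)

# Health check: the script exits 0 while a radosgw process exists.
# The pgrep path and the process name are assumptions.
vrrp_script chk_rgw {
    script "/usr/bin/pgrep radosgw"
    # Run the check every 2 seconds
    interval 2
    # Require 2 consecutive failures before marking the check as failed
    fall 2
    # Require 2 consecutive successes before marking it healthy again
    rise 2
}

vrrp_instance RGW_VIP {
    # Use BACKUP on node-2
    state MASTER
    # Assumed NIC name
    interface eth0
    # Arbitrary ID; must be identical on both nodes
    virtual_router_id 51
    # Use a lower value (for example 100) on node-2
    priority 150
    # Seconds between VRRP advertisements
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
    # With no weight set on the script, a failed check puts this
    # instance into FAULT state and the node releases the VIP
    track_script {
        chk_rgw
    }
}
```

On node-2 the same file would use state BACKUP and the lower priority. After restarting Keepalived on both nodes, `ip addr show` on the master should list 192.168.1.100 on the chosen interface.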
Why this is the "Gold Standard" for Ceph RGW
Seamless Failover: Because the VIP persists, clients (S3 browsers, SDKs, backup tools) do not need to be reconfigured when a node is rebooted or suffers a kernel panic. Requests in flight at the moment of failover are dropped, so clients should be prepared to retry.
Health Awareness: With a vrrp_script health check, Keepalived doesn't just check whether the server is alive; it checks whether the RGW service is actually running. If RGW crashes but the OS stays up, the failed check makes the node release the VIP (or lowers its priority, if the script is given a weight), so traffic moves to a working gateway.
Cost-Effective: Unlike hardware load balancers (F5/Citrix) which can be incredibly expensive, Keepalived is open-source, lightweight, and runs directly on your existing RGW Linux nodes.
Final Considerations
Load Balancing: Keepalived handles high availability, but it does not perform load balancing across all RGWs. If you have 10+ RGW nodes, consider putting HAProxy or Nginx in front of your RGWs, and have Keepalived manage the VIP for those load balancers instead.
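Networking: Ensure that your network allows VRRP advertisements between the nodes (IP protocol 112, sent to multicast address 224.0.0.18 by default) as well as gratuitous ARP, which is how Keepalived announces the VIP movement to the rest of the network. Some cloud and virtualized networks restrict one or both.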
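If switches or clients are slow to pick up the new location of the VIP, Keepalived's gratuitous ARP behaviour can be tuned in global_defs. The values below are illustrative assumptions rather than recommendations; check the keepalived.conf man page for your version before relying on them.

```
global_defs {
    # Delay in seconds before the second set of gratuitous ARPs
    # is sent after the transition to MASTER
    vrrp_garp_master_delay 5
    # Number of gratuitous ARP messages sent at a time after the transition
    vrrp_garp_master_repeat 5
    # Re-announce the VIP at least every 60 seconds while MASTER
    # (refreshing is disabled by default)
    vrrp_garp_master_refresh 60
}
```

Periodic refreshes can help in networks where ARP caches are flushed or the first announcements are occasionally lost.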