Failover IPs do not work during failure
Question

by
tumml
Created on 2023-11-28 15:03:53 (edited on 2024-09-04 14:25:41) in Dedicated Servers

I just want the community to be aware of this. OVH still hasn't explained why it happened.

Yesterday they had a switch failure:

https://bare-metal-servers.status-ovhcloud.com/incidents/ljc3sd037jxm

One full rack (or two) went offline.

No worries, you would think, just migrate your failover IPs. Well, apparently, if the failure is on OVH's side, you can't. Here is what happened:

We migrated all the ranges. The good news: most of them actually ended up pointing to the other instance. But one got stuck in the status todo, so that range couldn't be migrated for the full length of the failure (and longer).
OVH support didn't really answer much and was very difficult to reach at all.
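
For anyone who wants to catch a stuck migration early instead of noticing it by hand, here is a minimal sketch of a watchdog that polls a task status and flags it if it never leaves a pending state. Everything in it is an assumption for illustration: `wait_for_move` and `get_status` are hypothetical names, only the `todo` status comes from what we saw, and `init`, `doing` and `done` are guesses at what other states may exist. It does not call the OVH API itself; you would plug in your own status lookup.

```python
import time
from typing import Callable


def wait_for_move(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a move task until it leaves a pending state or the timeout expires.

    get_status is a hypothetical callable returning the current task status
    as a string. Returns the final status, or "stuck:<status>" on timeout.
    """
    # Assumed set of pending states; only "todo" was actually observed.
    pending = {"init", "todo", "doing"}
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status not in pending:
            return status
        if clock() >= deadline:
            # Still pending after the timeout: report it so someone can escalate.
            return f"stuck:{status}"
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    fake = iter(["todo", "doing", "done"])
    print(wait_for_move(lambda: next(fake), timeout_s=10, poll_s=0))
```

In our case this would not have fixed anything, but it would have told us within minutes that one range was never going to move on its own.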

When the switch finally got replaced and the instance behind it became available again, the missing range was still in the todo state, but at least it was reachable again because its original instance was reachable. But things got ***worse*** from here!

Now we noticed that the IP ranges that had been migrated had difficulties communicating with the internet. The reason: both instances, the old one and the target of the migration, were each receiving part of the packets. So it appears that OVH was announcing these IPs from two switches instead of one.

OVH didn't fix this at all for the next 10 hours. Then, silently, after a good 10 hours, the stuck migration finally left the todo state, and we could move all IPs back to their origin. All issues are resolved as of now.

But the lesson is: failover IPs don't actually do what they're supposed to. They can't fail over. Maybe they can if your own instance fails, but definitely not for something as simple as a dead ToR switch.

Still awaiting an explanation from OVH support.