On 10/18/19 9:25 PM, Bryan Davis wrote:
Back to Arturo's question, I think I agree that if the IP in use is from the "wan-transport-eqiad" pool (which is a great name for a network and a horrible name for a pool), then the FQDN used for that IP should be in the wmcloud.org zone (or another zone dedicated to public IPs) and not the wikimedia.cloud zone.
On second thought, Brooke suggested we use a floating IP for haproxy in front of the API server + ingress. But the floating IP by itself doesn't eliminate the single point of failure. We would need to implement what Jason suggested.
Moreover, I wonder if we care about this SPOF at all. We could use a cold-standby approach: create another VM with the same setup and switch between the two by means of DNS. This should be enough.
We have 3 options:

* DNS failover
* Floating IP failover
* VRRP or other HA mechanisms
None of these mechanisms saves clients from having to re-establish TCP connections after a failover (because the TCP session state does not exist on the newly active node). The simplest option is DNS failover, so I would stick to that.
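
To illustrate the client side of this, here is a rough sketch of a reconnect loop, not something we have deployed. The function name, the port 6443, and the retry/delay values are my assumptions. The only point is that after a failover the client opens a brand new TCP connection, and `socket.create_connection` resolves the name again on each attempt (local resolver caches may still return the old address until the record's TTL expires).

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=2.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a new TCP connection, retrying after failures.

    The hostname is passed to `connect` on every attempt, so it is
    resolved again each time instead of reusing a stale address.
    `connect` and `sleep` are injectable only to make testing easy.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # A fresh connection: the old TCP session cannot be resumed
            # on the newly active node.
            return connect((host, port), timeout=5)
        except OSError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

Usage would look like `connect_with_retry("k8s.toolsbeta.eqiad1.wikimedia.cloud", 6443)`, with 6443 being a guess at the API port.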
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.10 (active VM)
                                     --> 172.16.x.20 (cold-standby VM)

In case of manual failover:

                                     --> 172.16.x.10 (cold-standby VM)
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.20 (active VM)
Honestly, I think this should be enough for this setup.
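
The manual failover above boils down to repointing one A record and swapping the roles of the two VMs. A minimal sketch of that logic follows, assuming a hypothetical `ARecord` type and `failover` helper (neither is a real API of our DNS tooling). The addresses in the usage note are placeholders from the documentation range, standing in for the two VMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARecord:
    """Hypothetical in-memory stand-in for a DNS A record."""
    name: str
    address: str
    ttl: int  # seconds; a short TTL makes DNS failover take effect sooner


def failover(record, active, standby):
    """Repoint `record` from the active VM to the cold-standby VM.

    Returns (new_record, new_active, new_standby). Refuses to act if the
    record does not currently point at the VM believed to be active.
    """
    if record.address != active:
        raise ValueError(
            f"{record.name} points at {record.address}, expected {active}"
        )
    new_record = ARecord(record.name, standby, record.ttl)
    # The roles are swapped: the old active VM becomes the standby.
    return new_record, standby, active
```

For example, `failover(ARecord("k8s.toolsbeta.eqiad1.wikimedia.cloud", "192.0.2.10", 300), "192.0.2.10", "192.0.2.20")` returns a record pointing at `192.0.2.20`. The TTL of 300 is only an example value; clients and resolvers may keep using the old address until the TTL runs out.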
The topic would be different if we wanted to allow connecting to the k8s API directly from the internet, but I don't think that's the case. We should only allow connecting to that FQDN from dynamicproxy, because SSL termination is there.