On 10/18/19 9:25 PM, Bryan Davis wrote:
Back to Arturo's question, I think I agree that if the IP in use is from
the "wan-transport-eqiad" pool (which is a great name for a network and a
horrible name for a pool), then the FQDN used for that IP should be in the
wmcloud.org zone (or another zone dedicated to public IPs) and not the
wikimedia.cloud zone.
On second thought, Brooke suggested we use a floating IP for haproxy in front
of the API server + ingress.
But the floating IP itself doesn't eliminate the single point of failure. We
would need to implement what Jason suggested.
Moreover, I wonder if we care about this SPOF at all. We could use a
cold-standby approach: create another VM with the same setup and switch
between the two by means of DNS. This should be enough.
We have 3 options:
* DNS failover
* Floating IP failover
* VRRP or other HA mechanisms
None of these mechanisms prevents clients from having to re-establish TCP
connections in case of failover (because the TCP session information is not on
the now-active node). The simplest option is DNS failover, so I would stick
to that.
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.10 (active VM)
                                     --> 172.16.x.20 (cold-standby VM)

In case of manual failover:

                                     --> 172.16.x.10 (cold-standby VM)
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.20 (active VM)
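
To make the manual failover concrete, here is a minimal Python sketch of the
record swap. Everything in it is hypothetical and only for illustration: the
FailoverPair class, the ACTIVE_VM_IP / STANDBY_VM_IP placeholders and the 60
second TTL are assumptions, not something we have deployed. It only models
which VM the name points at; it does not talk to any DNS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPair:
    """Hypothetical model of one FQDN backed by an active and a standby VM."""

    fqdn: str
    active: str   # address the DNS record currently points at
    standby: str  # address of the cold-standby VM

    def fail_over(self) -> "FailoverPair":
        # Manual failover is just a swap: the standby becomes the target.
        return FailoverPair(self.fqdn, active=self.standby, standby=self.active)

    def a_record(self, ttl: int = 60) -> str:
        # Zone-file style line; the TTL value is an assumption. A short TTL
        # limits how long resolvers keep handing out the old address.
        return f"{self.fqdn}. {ttl} IN A {self.active}"


pair = FailoverPair(
    fqdn="k8s.toolsbeta.eqiad1.wikimedia.cloud",
    active="ACTIVE_VM_IP",    # placeholder, not a real address
    standby="STANDBY_VM_IP",  # placeholder, not a real address
)
after = pair.fail_over()
print(pair.a_record())
print(after.a_record())
```

Failing over twice returns the original state, so failing back is the same
operation.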
Honestly, I think this should be enough for this setup.
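
Since none of the failover options keeps TCP sessions alive, whatever connects
to the FQDN has to open a new connection after a switch. A rough Python sketch
of that client-side behaviour follows. The function name, the retry count and
the delay are made up for illustration, and the `connect` / `sleep` parameters
exist only so the logic can be exercised without a network.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay=1.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a fresh TCP connection, retrying while a failover settles.

    Each attempt calls `connect` again with the hostname, so the name is
    looked up again and an updated DNS record can be picked up (subject to
    resolver caching and the record TTL).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=3)
        except OSError as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < attempts - 1:
                sleep(delay)
    raise ConnectionError(
        f"could not connect to {host}:{port} after {attempts} attempts"
    ) from last_error
```

A caller would use it as `connect_with_retry("k8s.toolsbeta.eqiad1.wikimedia.cloud", 6443)`,
where port 6443 is an assumption (the usual Kubernetes API port), not something
decided in this thread.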
The topic would be different if we wanted to allow connecting to the k8s API
directly from the internet, but I don't think that's the case. We should only
allow connecting to that FQDN from dynamicproxy, because SSL termination is there.
--
Arturo Borrero Gonzalez
SRE / Wikimedia Cloud Services
Wikimedia Foundation