Hi,
I have a proposal.
The new k8s haproxy is in front of the api-server and the ingress [0]. In toolsbeta we have been using the following:
toolsbeta-k8s-master.toolsbeta.wmflabs.org:6443 (api-server)
toolsbeta-k8s-master.toolsbeta.wmflabs.org:30000 (ingress)
This haproxy knows which k8s nodes/controllers are UP and proxies the queries to them. Right now, this FQDN is not using a floating IP; it is a simple A record pointing to the haproxy VM. This record is in the 'toolsbeta' Cloud VPS project.
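As a rough illustration of what "knows which nodes are UP and proxies to them" means, here is a minimal Python sketch of health-checked round-robin backend selection. It is not how HAProxy is implemented: HAProxy runs its health checks periodically in the background, whereas this sketch checks inline on every pick. The backend names and ports in the usage example are hypothetical.

```python
import itertools
import socket
from typing import Callable, Iterator, Sequence, Tuple

Backend = Tuple[str, int]  # (host, port)


def tcp_is_up(backend: Backend, timeout: float = 1.0) -> bool:
    """Simple TCP check: the backend counts as UP if a connection succeeds."""
    host, port = backend
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_round_robin(
    backends: Sequence[Backend],
    is_up: Callable[[Backend], bool] = tcp_is_up,
) -> Iterator[Backend]:
    """Yield backends in round-robin order, skipping those that fail `is_up`.

    Raises RuntimeError when a full pass over the list finds nothing healthy.
    """
    if not backends:
        raise ValueError("no backends configured")
    ring = itertools.cycle(backends)
    while True:
        for _ in range(len(backends)):
            candidate = next(ring)
            if is_up(candidate):
                yield candidate
                break
        else:
            raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    # Hypothetical controllers; ctrl-2 is pretended to be DOWN.
    state = {("ctrl-1", 6443): True, ("ctrl-2", 6443): False, ("ctrl-3", 6443): True}
    picker = healthy_round_robin(list(state), is_up=lambda b: state[b])
    print([next(picker)[0] for _ in range(4)])  # ['ctrl-1', 'ctrl-3', 'ctrl-1', 'ctrl-3']
```

Injecting `is_up` keeps the selection logic testable without real controllers; in a real deployment the check would run on a timer and the proxy would consult the cached result.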
I've been wondering which FQDN would be nice to have in the final deployment. We have 'toolforge.org', but `whatever.toolforge.org` is intended to be a tool webservice, so I've been re-reading our DNS domain plans [1], and my proposal is to introduce a new FQDN like this:
k8s.toolforge.wmcloud.org
Then we can use it this way:
k8s.toolforge.wmcloud.org:6443 (api-server)
k8s.toolforge.wmcloud.org:30000 (ingress)
This is because 'wmcloud.org' is set to become the replacement for 'wmflabs.org', which is what we are currently using for 'toolsbeta-k8s-master.toolsbeta.wmflabs.org'. We could also create k8s.toolsbeta.wmcloud.org (or whatever) in case we want to keep the toolsbeta setup online.
I hope this proposal is not increasing our naming confusion and complexity. Ideally we would use something like `k8s.toolforge.org` but that seems even more confusing in the long term.
I have already requested that the wmcloud.org domains be pointed to Designate [2].
Let me know!
[0] https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Networking_and_in...
[1] https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/Enhancemen...
[2] https://phabricator.wikimedia.org/T235630
On Wed, Oct 16, 2019 at 6:30 AM Arturo Borrero Gonzalez aborrero@wikimedia.org wrote:
Using *.tools.wmcloud.org (or *.tools.eqiad1.wmcloud.org?) for these names might be a better match for the future plans for wmcloud.org. 'tools' is the name of the Cloud VPS project for Toolforge. It is not likely that we will grant a new Cloud VPS project named 'toolforge' due to the very real confusion it would cause to us and others, but breaking from the project name == subdomain convention seems like it would also be confusing.
Another option would be *.tools.wikimedia.cloud (or *.tools.eqiad1.wikimedia.cloud?) which I believe is the naming convention we have agreed on for replacing *.wmflabs internal DNS.
Bryan
Since the k8s haproxy is the replacement for the current k8s control plane, I think it should follow the same pattern. The current one is k8s-master.tools.wmflabs.org in tools. That suggests the next one would be k8s-control.tools.wmflabs.org (or the same under wmcloud.org). In toolsbeta it would follow the same pattern, which is what we were doing, except that we haven't changed "master" to "control" yet. That seems to fit the rest of our tooling fine?
Brooke Storm Senior SRE Wikimedia Cloud Services bstorm@wikimedia.org IRC: bstorm_
On Oct 16, 2019, at 7:56 AM, Bryan Davis bd808@wikimedia.org wrote:
Bryan Davis Technical Engagement Wikimedia Foundation Principal Software Engineer Boise, ID USA [[m:User:BDavis_(WMF)]] irc: bd808
On 10/16/19 7:01 PM, Brooke Storm wrote:
So we have several options:
* decide on the prefix: either "k8s" or "k8s-control".
* decide on the subdomain: either ".tools.eqiad1." or ".tools."
* decide on the domain: "wikimedia.cloud" or "wmcloud.org"
The complete FQDN options would be:
* k8s.eqiad1.tools.wikimedia.cloud
* k8s-control.tools.eqiad1.wikimedia.cloud
* k8s.tools.wikimedia.cloud
* k8s-control.tools.wikimedia.cloud
* k8s.eqiad1.tools.wmcloud.org
* k8s-control.tools.eqiad1.wmcloud.org
* k8s.tools.wmcloud.org
* k8s-control.tools.wmcloud.org
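To make the option space explicit, the three decisions above (prefix, optional deployment label, domain) can be enumerated mechanically. This is a small Python sketch, assuming the label order <prefix>.<project>[.<deployment>].<domain>, which matches the <hostname>.<project>.<deployment> scheme Bryan describes later in the thread; generating names this way avoids mixing up the project and deployment labels by hand.

```python
from itertools import product
from typing import List, Optional

PREFIXES = ["k8s", "k8s-control"]
DEPLOYMENTS: List[Optional[str]] = ["eqiad1", None]  # None: no deployment label
DOMAINS = ["wikimedia.cloud", "wmcloud.org"]


def candidate_fqdns(project: str = "tools") -> List[str]:
    """Build every <prefix>.<project>[.<deployment>].<domain> combination."""
    names = []
    for domain, deployment, prefix in product(DOMAINS, DEPLOYMENTS, PREFIXES):
        labels = [prefix, project]
        if deployment is not None:
            labels.append(deployment)
        labels.append(domain)
        names.append(".".join(labels))
    return names


if __name__ == "__main__":
    for name in candidate_fqdns():
        print(name)
```

Calling `candidate_fqdns("toolsbeta")` gives the toolsbeta counterparts with the same structure.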
It is not clear to me whether this should be wmcloud.org or wikimedia.cloud. Since this was previously wmflabs.org, I would bet on wmcloud.org. On the other hand, this is not a public IP, and from that point of view "wikimedia.cloud" seems a better fit. Another thought is that "wikimedia.cloud" is only for instances (as we do with eqiad.wmflabs), while <project>.wmcloud.org is something we can create/delegate per project.
At first glance, I don't see any special value in having "eqiad1" in the wmcloud.org domain. So I guess I'm voting for either of these options:
* k8s.tools.wmcloud.org
* k8s-control.tools.wmcloud.org
@bstorm, the only thing I don't like about "k8s-control" is that we serve other stuff, not just the API (the ingress too). So my final +1 is for:
* k8s.tools.wmcloud.org
thoughts?
I think we should update our DNS document to mention whatever pattern we decide on.
On 10/17/19 10:39 AM, Arturo Borrero Gonzalez wrote:
It seems I'm trying to make this more confusing :-P I just realized many of the options are wrong.
Hope you get my point anyway.
@Bryan, we just had a conversation on this and I've already forgotten what our conclusions were for this particular case.
On Thu, Oct 17, 2019 at 9:33 AM Arturo Borrero Gonzalez aborrero@wikimedia.org wrote:
<hostname>.<project>.<deployment>.wikimedia.cloud is the FQDN scheme I would expect for instances in a project under the new naming system. The current equivalent is <hostname>.<project>.<datacenter>.wmflabs. In both cases these are Designate-managed DNS entries, and service aliases can be managed in Horizon as either CNAME or A records in the project's zone.
I think this would mean that the "right" service name for a load balancer in front of the new k8s API would be one of:
* <hostname>.tools.eqiad1.wikimedia.cloud
* <hostname>.tools.eqiad.wmflabs
I have no strong opinion about the <hostname> to use here.
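The naming scheme above can be checked mechanically. Below is a minimal Python sketch that parses <hostname>.<project>.<location>.<suffix> names. The set of known deployments and datacenters is an assumption taken only from the examples in this thread (eqiad1, codfw1dev, eqiad), so a real checker would need the authoritative list. Rejecting unknown locations is also what catches names with the project and deployment labels swapped.

```python
from typing import NamedTuple, Optional

# Assumed locations per suffix, taken from examples in this thread only.
KNOWN_LOCATIONS = {
    "wikimedia.cloud": {"eqiad1", "codfw1dev"},  # <deployment>
    "wmflabs": {"eqiad"},                         # <datacenter>
}


class InstanceName(NamedTuple):
    hostname: str
    project: str
    location: str
    suffix: str


def parse_instance_fqdn(fqdn: str) -> Optional[InstanceName]:
    """Parse <hostname>.<project>.<location>.<suffix>; return None if it doesn't fit."""
    name = fqdn.rstrip(".")
    for suffix, locations in KNOWN_LOCATIONS.items():
        if not name.endswith("." + suffix):
            continue
        labels = name[: -(len(suffix) + 1)].split(".")
        if len(labels) != 3 or not all(labels):
            return None
        hostname, project, location = labels
        if location not in locations:
            return None  # e.g. project and location labels swapped
        return InstanceName(hostname, project, location, suffix)
    return None


if __name__ == "__main__":
    print(parse_instance_fqdn("k8s.tools.eqiad1.wikimedia.cloud"))
    print(parse_instance_fqdn("k8s.eqiad1.tools.wikimedia.cloud"))  # None
```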
I'm all for starting to introduce the wikimedia.cloud domain, but really only if we have the time and energy right now to get it set up. The whois for wikimedia.cloud shows the current top-level NS records pointing to ns{1,2,3}.wikimedia.org, so getting the basics going in Designate should be something like:
* Create the 'eqiad1.wikimedia.cloud.' zone in eqiad1's Designate
* Create the 'wikimedia.cloud.' zone in operations/dns.git
* Delegate NS for 'eqiad1.wikimedia.cloud.' to cloud-ns{0,1}.wikimedia.org in operations/dns.git
* Repeat the Designate and delegation steps for 'codfw1dev.wikimedia.cloud.'
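The delegation step in the list above amounts to adding NS records for the child zone inside the parent zone. The following Python sketch renders such records in standard zone-file syntax and expands the ns{...} brace shorthand used above. The 3600 s TTL and the assumption that operations/dns.git accepts plain `<owner> <ttl> IN NS <target>` lines are assumptions, not taken from the thread; treat this as illustrative only.

```python
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand shell-style brace groups: 'cloud-ns{0,1}.x' -> ['cloud-ns0.x', 'cloud-ns1.x'].

    Nested braces are not supported.
    """
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    return [
        head + item + rest
        for item in match.group(1).split(",")
        for rest in expand_braces(tail)
    ]


def delegation_records(parent: str, child: str, ns_pattern: str, ttl: int = 3600) -> List[str]:
    """Render the NS lines that delegate `child` from the `parent` zone."""
    parent = parent.rstrip(".") + "."
    child = child.rstrip(".") + "."
    if not child.endswith("." + parent):
        raise ValueError(f"{child} is not below {parent}")
    owner = child[: -(len(parent) + 1)]  # owner name relative to the parent origin
    return [f"{owner} {ttl} IN NS {ns.rstrip('.')}." for ns in expand_braces(ns_pattern)]


if __name__ == "__main__":
    for line in delegation_records("wikimedia.cloud.", "eqiad1.wikimedia.cloud.", "cloud-ns{0,1}.wikimedia.org"):
        print(line)
```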
What I am not sure about is how much confusion it will cause for us and others to have mixed usage of *.<datacenter>.wmflabs and *.<deployment>.wikimedia.cloud without a bigger plan to completely remove (or at least deprecate) *.<datacenter>.wmflabs. Does anyone have strong feelings about that?
Bryan
The only thing I would caution is that a floating IP might be ideal for being able to quickly fail over to a new load balancer, if needed (and it is a bit nicer than DNS in general). I believe that is the whole rationale behind the current name. It's a standard name added in OpenStack with a floating IP, which makes it fairly easy to work with and reason about for any Toolforge admin (when it's documented… obviously, when we didn't know about it, we had a lovely outage when trying to move to the new region).
I don't care too much about the name per se. I do care about whether it is straightforward. That's the only reason I was thinking of the wmcloud.org domain. I don't know if that's doable with the other one or not.
Brooke Storm Senior SRE Wikimedia Cloud Services bstorm@wikimedia.org IRC: bstorm_
On Oct 17, 2019, at 1:21 PM, Bryan Davis bd808@wikimedia.org wrote:
On 10/18/19 12:53 AM, Brooke Storm wrote:
Ok for the floating IP!
For the domain name, I would use what Bryan suggests:
* k8s.tools.eqiad1.wikimedia.cloud
* k8s.tools.eqiad.wmflabs
And the counterparts:
* k8s.toolsbeta.eqiad1.wikimedia.cloud
* k8s.toolsbeta.eqiad.wmflabs
regards.
On 10/18/19 2:19 PM, Arturo Borrero Gonzalez wrote:
One final question: I may have found a contradiction.
If we are going to use a floating IP for this (we just agreed on that), then using 'tools.eqiad1.wikimedia.cloud' may be wrong, since that zone is meant to hold private IP addresses. Shall we step back and consider the 'wmcloud.org' domain?
I'm missing information on how to handle floating IPs with respect to the new domains. I'm just asking for additional clarification.
I think it is important that we are having this debate (and clarifying the examples in the wiki); this will make things easier in the future.
regards.
On Fri, Oct 18, 2019 at 10:49 AM Arturo Borrero Gonzalez aborrero@wikimedia.org wrote:
Today the only floating IP pool we have in eqiad1 is the publicly routable IPv4 address space labeled "wan-transport-eqiad" (185.15.56.0/25)[0]. An address drawn from this pool would typically be associated with a *.wmflabs.org FQDN. One example: 185.15.56.18 is a floating IP associated with the mx-out01 instance in the cloudinfra project, with an A record of mx-out01.wmflabs.org and PTR records of mx-out01.wmflabs.org and instance-mx-out01.cloudinfra.wmflabs.org.
The agreed replacement for *.wmflabs.org is *.wmcloud.org.
I believe these are statements of fact, and they lead me to ask two new questions as we look to a future where HAProxy or some other LBaaS solution is used more commonly inside Cloud VPS projects:
* It seems reasonable that DNS records for *.wmcloud.org should all relate to publicly routable addresses. Does it also seem reasonable that *.wikimedia.cloud DNS records should all relate to non-publicly routable addresses? If an address from a publicly routable IPv4 space is being used only to provide a service IP for a service that is 100% internal to Cloud VPS (like the example of the HAProxy cluster fronting a Kubernetes cluster inside the tools project), is it acceptable to create an A record in the *.wikimedia.cloud zone for that address?
* Should we have a floating IP pool of non-publicly routable IPv4 addresses for use cases where the service that is being provided is only intended to be internal to a single project or the Cloud VPS tenant network? Routable IPv4 addresses are a limited commodity, and currently Cloud VPS has a very small number of them available.
One of the reasons that we are spending time discussing these things is that we hope to decide on a set of standards and practices which will make it easier to reason about use and maintenance of Cloud VPS. Towards that end, I think I would propose these answers to the new questions:
* The *.wikimedia.cloud zone should only contain A records pointing to non-publicly routable IPv4 addresses. My reasoning is that this makes it easier to quickly think about the general threat model for an FQDN in this zone: if the FQDN ends in wikimedia.cloud, then the IP associated with it is not publicly routable.
* We should create a pool of floating IPs using non-publicly routable IPv4 addresses for the explicit purpose of being used as service IPs for HA/LB systems within the Cloud VPS internal network. These service IPs would then be given FQDNs in the *.wikimedia.cloud zone without breaking the rule that all FQDNs ending in wikimedia.cloud are not publicly routable.
We may have to adjust a lot of security groups if the new floating pool of private IPs is outside of the 172.16.0.0/21 subnet. If we carve it out as a subset of that CIDR, then I *think* it requires careful changes to the existing "cloud-instances2-b-eqiad" subnet to cordon off a /24 or /25 block into a new neutron subnet that would be the source of the new pool and not part of the subnet that nova/neutron use to give out fixed IPs to instances. Hopefully Arturo or Jason can reason about the work and impact of this better than I can.
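The proposed rule (wikimedia.cloud names only for non-publicly routable addresses, wmcloud.org names only for public ones) is simple enough to check automatically, for example before creating a record. This is a minimal Python sketch using the standard `ipaddress` module. Note that `IPv4Address.is_private` is only an approximation of "not publicly routable": it is also true for loopback, link-local and some reserved ranges. In the example, 172.16.0.10 is a made-up instance address and 185.15.56.18 (the mx-out01 floating IP mentioned above) is reused purely for illustration.

```python
import ipaddress

# Proposed policy from this thread: zone suffix -> must the address be private?
ZONE_REQUIRES_PRIVATE = {
    "wikimedia.cloud": True,   # internal names only
    "wmcloud.org": False,      # publicly routable addresses only
}


def record_allowed(fqdn: str, address: str) -> bool:
    """Return True if an A record `fqdn -> address` follows the proposed zone rule."""
    ip = ipaddress.IPv4Address(address)
    name = fqdn.rstrip(".")
    for zone, must_be_private in ZONE_REQUIRES_PRIVATE.items():
        if name == zone or name.endswith("." + zone):
            return ip.is_private == must_be_private
    raise ValueError(f"{fqdn} is not in a zone covered by this rule")


if __name__ == "__main__":
    print(record_allowed("k8s.tools.eqiad1.wikimedia.cloud", "172.16.0.10"))   # True
    print(record_allowed("k8s.tools.eqiad1.wikimedia.cloud", "185.15.56.18"))  # False
    print(record_allowed("k8s.tools.wmcloud.org", "185.15.56.18"))             # True
```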
Back to Arturo's question, I think I agree that if the IP in use is from the "wan-transport-eqiad" pool (which is a great name for a network and a horrible name for a pool), then the FQDN used for that IP should be in the wmcloud.org zone (or another zone dedicated to public IPs) and not the wikimedia.cloud zone.
I am also starting to feel like this is all too complicated somehow. Maybe it is just that using valid, registered TLDs is somehow adding cognitive burden for me that the legacy 'fake' TLDs did not?
[0]: It seems we have a buggy version of python-openstackclient that makes digging this information out of the system more difficult than it should be. https://bugs.launchpad.net/python-openstackclient/+bug/1616129
Bryan
On Fri, Oct 18, 2019 at 2:25 PM Bryan Davis bd808@wikimedia.org wrote:
- It seems reasonable that DNS records for *.wmcloud.org should all relate to publicly routable addresses. Does it also seem reasonable that *.wikimedia.cloud DNS records should all relate to non-publicly routable addresses? If an address from a publicly routable IPv4 space is being used only to provide a service IP for a service that is 100% internal to Cloud VPS (like the example of the HAProxy cluster fronting a Kubernetes cluster inside the tools project), is it acceptable to create an A record in the *.wikimedia.cloud zone for that address?
Yeah, that all sounds good to me.
- Should we have a floating IP pool of non-publicly routable IPv4 addresses for use cases where the service that is being provided is only intended to be internal to a single project or the Cloud VPS tenant network? Routable IPv4 addresses are a limited commodity, and currently Cloud VPS has a very small number of them available.
One of the reasons that we are spending time discussing these things is that we hope to decide on a set of standards and practices which will make it easier to reason about use and maintenance of Cloud VPS. Towards that end, I think I would propose these answers to the new questions:
- The *.wikimedia.cloud zone should only contain A records pointing to non-publicly routable IPv4 addresses. My reasoning is that this makes it easier to quickly think about the general threat model for an FQDN in this zone: if the FQDN ends in wikimedia.cloud, then the IP associated with it is not publicly routable.
- We should create a pool of floating IPs using non-publicly routable IPv4 addresses for the explicit purpose of being used as service IPs for HA/LB systems within the Cloud VPS internal network. These service IPs would then be given FQDNs in the *.wikimedia.cloud zone without breaking the rule that all FQDNs ending in wikimedia.cloud are not publicly routable.
We may have to adjust a lot of security groups if the new floating pool of private IPs is outside of the 172.16.0.0/21 subnet. If we carve it out as a subset of that CIDR, then I *think* it requires careful changes to the existing "cloud-instances2-b-eqiad" subnet to cordon off a /24 or /25 block into a new neutron subnet that would be the source of the new pool and not part of the subnet that nova/neutron use to give out fixed IPs to instances. Hopefully Arturo or Jason can reason about the work and impact of this better than I can.
I wouldn't create another floating IP pool or make any changes to `cloud-instances2-b-eqiad` `172.16.0.0/21`. Reconfiguring this as an external network (floating IP requirement) would bring a lot of complexity with host routing, NAT rules, subnets and security groups.
Instead of using floating IPs for non-publicly routed subnets, I'd pre-allocate an IP address from that subnet, configure the front-end load balancer host's neutron ports with allowed address pairs, and use VRRP + keepalived to manage which host has the active virtual IP (VIP).
I can put together a simple prototype of this if there's any interest going down that path.
Regards, Jason
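For readers less familiar with VRRP, the core of what keepalived decides is which node currently owns the virtual IP. The following Python sketch models only the basic election rule from VRRP (RFC 5798): among nodes still sending advertisements, the highest priority wins, and a tie goes to the higher primary IP address. It deliberately ignores timers, preemption settings (keepalived's `nopreempt`) and authentication; Jason's keepalived notes on wikitech show an actual configuration. Node names, priorities and addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Router:
    name: str
    priority: int        # 1-254; 255 is reserved for the router that owns the VIP
    primary_ip: str
    alive: bool = True   # False: the node stopped sending advertisements


def elect_master(routers: Iterable[Router]) -> Optional[Router]:
    """Pick the node that should hold the VIP under the basic VRRP rule.

    Highest priority wins; on equal priority the higher primary IP wins.
    Returns None when no node is alive.
    """
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, ipaddress.ip_address(r.primary_ip)))


if __name__ == "__main__":
    lb1 = Router("lb-1", 150, "172.16.0.11")
    lb2 = Router("lb-2", 100, "172.16.0.12")
    print(elect_master([lb1, lb2]).name)  # lb-1
    # lb-1 dies: lb-2 takes over the VIP.
    print(elect_master([Router("lb-1", 150, "172.16.0.11", alive=False), lb2]).name)  # lb-2
```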
I put together an example and some notes at https://wikitech.wikimedia.org/wiki/User:Jhedden/notes/keepalived
Feel free to login to these instances and try things out.
Regards, Jason
On Fri, Oct 18, 2019 at 4:12 PM Jason Hedden jhedden@wikimedia.org wrote:
On Fri, Oct 18, 2019 at 4:15 PM Jason Hedden jhedden@wikimedia.org wrote:
Amazing OpenStack ninja wizardry! The process of getting neutron setup correctly looks like something we could make some script to help with if it ends up being a common thing.
Bryan
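If this becomes a common pattern, the Neutron part of the setup is mostly about the port's `allowed_address_pairs` attribute: Neutron port security drops traffic for addresses not assigned to the port, so each load balancer port needs the VIP listed there. As far as I know, an update through the Neutron API replaces the whole list, so a helper script talking to the API would first compute the new list from the existing one (the `openstack port set --allowed-address ip-address=<VIP> <port>` CLI can be used instead). A minimal Python sketch of that computation follows; the VIP and MAC addresses are hypothetical, and actually applying the list is left out.

```python
import copy
from typing import Dict, List, Optional

AddressPair = Dict[str, str]  # Neutron format: {"ip_address": ..., "mac_address": ...}


def with_vip(pairs: List[AddressPair], vip: str, mac: Optional[str] = None) -> List[AddressPair]:
    """Return a new allowed_address_pairs list that also permits `vip`.

    Existing entries are preserved and the VIP is only added once, so running
    the helper repeatedly is idempotent. The input list is not modified.
    """
    result = copy.deepcopy(pairs)
    if any(p.get("ip_address") == vip for p in result):
        return result  # VIP already allowed (regardless of MAC)
    entry: AddressPair = {"ip_address": vip}
    if mac is not None:
        entry["mac_address"] = mac
    result.append(entry)
    return result


if __name__ == "__main__":
    current = [{"ip_address": "10.0.0.5", "mac_address": "fa:16:3e:00:00:01"}]
    print(with_vip(current, "172.16.0.100"))
```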
On 10/18/19 9:25 PM, Bryan Davis wrote:
On second thought: Brooke suggested we use a floating IP for the haproxy in front of the API server + ingress. But the floating IP by itself doesn't eliminate the single point of failure; we would need to implement what Jason suggested.
Moreover, I wonder if we care about this SPOF at all. We could use a cold-standby approach: create another VM with the same setup and switch between them by means of DNS. This should be enough.
We have 3 options:
* DNS failover
* Floating IP failover
* VRRP or other HA mechanisms
None of these mechanisms prevents clients from having to re-establish TCP connections in case of failover (because the TCP session information is not on the now-active node). The simplest option is DNS failover, so I would stick to that.
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.10 (active VM)
                                         172.16.x.20 (cold-standby VM)
In case of manual failover:
                                         172.16.x.10 (cold-standby VM)
k8s.toolsbeta.eqiad1.wikimedia.cloud --> 172.16.x.20 (active VM)
Honestly, I think that should be enough for this setup.
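The DNS failover described above comes down to changing one A record, with the caveat that clients keep using a cached answer until its TTL expires. A toy Python model of that behaviour follows; the zone dictionary stands in for the Designate-managed record, and the concrete 172.16.0.10/172.16.0.20 addresses and the 300 s TTL are hypothetical.

```python
from typing import Dict, Tuple

Zone = Dict[str, str]  # FQDN -> A record address (toy stand-in for Designate)


def manual_failover(zone: Zone, fqdn: str, pair: Tuple[str, str]) -> str:
    """Repoint `fqdn` to the other member of an (active, standby) address pair."""
    current = zone[fqdn]
    if current not in pair:
        raise ValueError(f"{fqdn} -> {current} is not one of {pair}")
    zone[fqdn] = pair[1] if current == pair[0] else pair[0]
    return zone[fqdn]


class CachingResolver:
    """Client-side cache that keeps each answer for `ttl` seconds."""

    def __init__(self, zone: Zone, ttl: int) -> None:
        self.zone = zone
        self.ttl = ttl
        self.cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, fqdn: str, now: float) -> str:
        cached = self.cache.get(fqdn)
        if cached is not None and now < cached[1]:
            return cached[0]  # still within TTL: the old answer is reused
        answer = self.zone[fqdn]
        self.cache[fqdn] = (answer, now + self.ttl)
        return answer


if __name__ == "__main__":
    name = "k8s.toolsbeta.eqiad1.wikimedia.cloud"
    zone = {name: "172.16.0.10"}
    client = CachingResolver(zone, ttl=300)
    print(client.resolve(name, now=0))    # 172.16.0.10
    manual_failover(zone, name, ("172.16.0.10", "172.16.0.20"))
    print(client.resolve(name, now=60))   # 172.16.0.10 (cached)
    print(client.resolve(name, now=301))  # 172.16.0.20
```

A short TTL on the record keeps this window small; as noted above, existing TCP connections still have to be re-established whichever mechanism is used.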
The topic would be different if we wanted to allow connecting to the k8s API directly from the internet, but I don't think that's the case. We should only allow connecting to that FQDN from dynamicproxy, because SSL termination is there.
cloud-admin@lists.wikimedia.org