Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

19 Nov 2020

Hey MA,

I personally think that, given your knowledge and experience and what GUC
and XTools Global Contribs do, your approach of using those implementation
details to get better performance makes sense. The outline you present is
very clear and seems reasonable to me. You also mention programmatically
reading the sections and db-lists, which will make the implementation more
resilient to changes.

Still, most tools shouldn't care about these details, and it is better if
they do not rely on them, to avoid future headaches. As a rule of thumb,
most developers should avoid relying on implementation details.

Does that make sense?

On Thu, Nov 19, 2020 at 6:44 AM MusikAnimal <musikanimal(a)gmail.com> wrote:

  Hello Joaquin!

 Hey MA, I've checked, and while not explicitly disallowed, the fact that
 this could work is more of an implementation detail that shouldn't really
 be relied on.

 The sections and where the instances are on them are organized to
 maintain the service, and are not supposed to be depended on since they
 could change.

 Even if the mappings are public and fairly stable, there could be a point
 where a change in the implementation/organization is made (as with this
 new architecture) and those in-section cross-db joins would stop working.

 I'm not saying I will blindly construct cross-wiki queries. Rather, I will
 only do it after fetching from the db-lists to confirm which ones can be
 queried cross-wiki. In the case of GUC and XTools Global Contribs, this
 could mean massive performance improvements. Allow me to paint a picture:
 we have an account attached to 100 wikis, and I want a list of all global edits
 ordered chronologically. For day-to-day steward life, this is essential so
 I'd like to find the most efficient route possible, even if it's a little
 hacky :) So going off of what we're doing now, my high-level vision would
 be:

 1) Check the db-lists (or a cached result).
 2) Check CentralAuth to see which wikis the user has edits on. Here we
 find there are 100 wikis.
 3) Cross-referencing the db-lists, I now know that 75 of the wikis I want
 to query are on s1, and 25 on s2.
 4) For each wiki, I have a subquery to grab *all* edits by that user on
 that specific wiki within that slice (I may also add WHERE clauses for
 rev_timestamp, etc.)
 5) Take each of those subqueries and wrap it like: SELECT * FROM ( (
 [subquery1] ) UNION ( [subquery2] ) … ) a ORDER BY rev_timestamp DESC LIMIT
 50
 6) Do the same for each of the other slices
 7) Combine the results from each slice and resequence the edits
 chronologically, stopping at 50 (the first page of edits to show to the
 user).
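
 The grouping and wrapping steps above could be sketched roughly as follows.
 This is only an illustration under assumptions, not code from GUC or
 XTools: the function names, the `section_of` mapping (assumed to be built
 from the db-lists), the `<wiki>_p` database naming, and the
 revision_userindex/actor columns are all assumptions to check against the
 real replicas, and the wiki-to-section mapping in the example is made up.

```python
import re
from collections import defaultdict

# Hypothetical guard: database names are interpolated into the SQL text,
# so only plain identifiers such as "enwiki" are accepted.
_DB_NAME = re.compile(r"^[a-z0-9_]+$")


def group_by_section(wikis, section_of):
    """Group wiki database names by the section (slice) that hosts them.

    `section_of` is assumed to be a dict built from the db-lists,
    e.g. {"enwiki": "s1", "dewiki": "s2"}.
    """
    groups = defaultdict(list)
    for wiki in wikis:
        groups[section_of[wiki]].append(wiki)
    return dict(groups)


def build_section_query(wikis, limit=50):
    """Build one UNION query covering every wiki on a single section."""
    if not wikis:
        raise ValueError("need at least one wiki")
    subqueries = []
    for wiki in wikis:
        if not _DB_NAME.match(wiki):
            raise ValueError(f"unexpected database name: {wiki!r}")
        # Assumed schema: <wiki>_p.revision_userindex joined to <wiki>_p.actor.
        # The username is left as a driver-side parameter, not interpolated.
        subqueries.append(
            f"(SELECT '{wiki}' AS wiki, rev_id, rev_timestamp "
            f"FROM {wiki}_p.revision_userindex "
            f"JOIN {wiki}_p.actor ON actor_id = rev_actor "
            f"WHERE actor_name = %(username)s)"
        )
    union = " UNION ".join(subqueries)
    return (
        f"SELECT * FROM ({union}) a "
        f"ORDER BY rev_timestamp DESC LIMIT {int(limit)}"
    )


# Made-up mapping for illustration; the real one comes from the db-lists.
sections = group_by_section(
    ["wiki_a", "wiki_b", "wiki_c"],
    {"wiki_a": "s1", "wiki_b": "s2", "wiki_c": "s1"},
)
queries = {name: build_section_query(dbs) for name, dbs in sections.items()}
```

 Each entry in `queries` would then be run over a single connection to its
 section, with the username passed as a query parameter.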

 That doesn't sound like the most fun, but I think it would work. With the
 current 8 slices, it shouldn't slow it down too terribly (some slices will
 be faster than others).
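
 The final combining step could look something like the sketch below. It
 assumes each slice's query has already returned its rows sorted newest
 first, and that every row is a dict with a `rev_timestamp` string in
 MediaWiki's YYYYMMDDHHMMSS form, so plain string comparison orders them
 chronologically. The function name and row shape are hypothetical.

```python
import heapq
from itertools import islice


def merge_slices(slice_results, limit=50):
    """Merge per-slice result lists (each sorted newest first) into one page."""
    merged = heapq.merge(
        *slice_results,
        key=lambda row: row["rev_timestamp"],
        reverse=True,  # inputs are in descending order, so merge descending
    )
    # heapq.merge is lazy, so only the first `limit` rows are ever pulled.
    return list(islice(merged, limit))


# Illustrative rows only; real rows would come from the per-slice queries.
s1_rows = [
    {"wiki": "wiki_a", "rev_timestamp": "20201119064400"},
    {"wiki": "wiki_a", "rev_timestamp": "20201101000000"},
]
s2_rows = [
    {"wiki": "wiki_b", "rev_timestamp": "20201110120000"},
]
page = merge_slices([s1_rows, s2_rows], limit=2)
```

 Because every slice already applied its own LIMIT 50, the merge never has
 to look at more than 50 rows per slice.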

 Are you discouraging this approach? If I *have* to open and use a separate
 connection to each of those 100 databases, regardless of the slice, the
 processing may become much slower. Let's move on to IPs, where we have to
 check *every* wiki. 900+ separate connections. Again, I'm not sure how
 I'd get this even set up on my local, as presumably I'd need 900+ open SSH
 tunnels. Maybe a bash script?

 I just want to make sure I've got this right before I start coding. In
 the end hopefully I'll have a working strategy that I can share with others.

 Thanks,

 ~ MA
 _______________________________________________
 Wikimedia Cloud Services mailing list
 Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
 https://lists.wikimedia.org/mailman/listinfo/cloud

-- 
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation
