Re: [Cloud] [Cloud-announce] Wiki Replicas 2020 Redesign

15 Nov 2020

...
   The Innodb buffer pool efficiency for labs dbs is around 99% (two
   nines), while production databases (similar hardware but split into
   eight different sections) is 99.99% (four nines), these two orders of
   magnitude difference is mostly because of cache locality which I hope we
   would achieve if these changes get done

It of course helps if there is more hardware and memory per database, but
the same level of speed as in production is not really realistic if the
major part of the query complexity comes from the data sanitization views,
which will still be there.

...
    It's not just speed though, the updates coming in to replicas would be
    split too so it wouldn't saturate the network and less heavy I/O in
    memory and disk meaning better scalability (adding commons/wikidata on
    each section would be the exact opposite of that and even if we do it
    now, we eventually have to pull the plug as wikis are growing and we are
    not the same size or growth speed we used to be years ago).

Yes, replicating a 1:1 clone of commons/wikidata on each section is not
useful. I would still like to see the ability to run queries over
multiple databases treated as an important feature.
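
For what it's worth, the application-level join discussed in this thread
can be sketched roughly as follows. This is only an illustration under
stated assumptions: two in-memory SQLite databases stand in for two
separate replica sections, and the table names, column names and rows
(page_items, item_labels) are made up for the example, not the real
MediaWiki schema.

```python
import sqlite3


def make_db(schema, insert, rows):
    """Create an in-memory database standing in for one replica section."""
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.executemany(insert, rows)
    return conn


def app_level_join(wiki_conn, wikidata_conn, batch_size=500):
    """Emulate an inner join across two connections in application code."""
    # Step 1: read the join keys from the first database.
    pairs = wiki_conn.execute(
        "SELECT page_title, item_id FROM page_items ORDER BY page_title"
    ).fetchall()

    # Step 2: look the keys up in the second database, in batches.
    ids = sorted({item_id for _, item_id in pairs})
    labels = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        placeholders = ",".join("?" * len(batch))
        rows = wikidata_conn.execute(
            "SELECT item_id, label FROM item_labels "
            f"WHERE item_id IN ({placeholders})",
            batch,
        )
        labels.update(rows)

    # Step 3: combine the two result sets in memory (inner-join semantics).
    return [(title, labels[item_id]) for title, item_id in pairs
            if item_id in labels]


# Hypothetical data: the third page has no matching row on the other side.
wiki = make_db(
    "CREATE TABLE page_items (page_title TEXT, item_id TEXT)",
    "INSERT INTO page_items VALUES (?, ?)",
    [("Page_A", "Q1001"), ("Page_B", "Q1002"), ("Page_C", "Q1003")],
)
wikidata = make_db(
    "CREATE TABLE item_labels (item_id TEXT, label TEXT)",
    "INSERT INTO item_labels VALUES (?, ?)",
    [("Q1001", "label one"), ("Q1002", "label two")],
)

result = app_level_join(wiki, wikidata)
print(result)
```

Even in this toy form, one SQL statement turns into two queries plus
batching and merging code, and the whole key set has to fit in the
memory of the tool.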

Br,
-- Kimmo Virtanen, Zache

On Sat, Nov 14, 2020 at 9:52 PM Amir Sarabadani <ladsgroup(a)gmail.com> wrote:

...
  Hello,
 I actually welcome the change and am quite happy about it. It might break
 several tools (including some of mine) but as a database nerd, I can see
 the benefits outweighing the problems (and I wish benefits would have been
 communicated in the announcement).

 The short version is that this change would make labs replicas blazing
 fast.

 The long version: the databases of all wikis are currently replicated
 to a set of giant "cloud" or "labs" replicas. IIRC, these dbs have 512GB
 of memory (which, while massive, is not big enough to hold everything).
 The space left for the InnoDB buffer pool should be around 350GB (the rest
 is for temporary tables and other critical functions), so storing
 everything in there is impossible. I therefore assume that when you query
 quarry (sorry, I had to make the pun), most of it actually comes from
 reading disk, which is ten times slower. Looking at graphs

<https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=labsdb1011&var-port=9104&from=now-7d&to=now>,
 the InnoDB buffer pool efficiency for labs dbs is around 99% (two nines),
 while for production databases (similar hardware but split into eight
 different sections) it is 99.99% (four nines). This two-orders-of-magnitude
 difference is mostly because of cache locality, which I hope we would
 achieve if these changes get done (unless the new hardware will be
 commodity hardware instead of beefy servers, but I doubt that; correct me
 if I'm wrong). That means fewer timeouts, fewer slow apps and tools, etc.
 It's not just speed though: the updates coming in to replicas would be
 split too, so they wouldn't saturate the network, and there would be less
 heavy I/O in memory and on disk, meaning better scalability (adding
 commons/wikidata on each section would be the exact opposite of that, and
 even if we do it now, we eventually have to pull the plug, as wikis are
 growing and we are not the same size or growth speed we used to be years
 ago).
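
 To put the two hit ratios side by side, here is a small back-of-the-
 envelope sketch. It uses only the 99% and 99.99% figures quoted above;
 the function name and the "per million page requests" framing are my own
 choices for illustration, not something taken from the graphs.

```python
from decimal import Decimal


def misses_per_million(hit_ratio: str) -> int:
    """Page requests, out of one million, that miss the buffer pool."""
    miss_rate = Decimal(1) - Decimal(hit_ratio)
    return int(miss_rate * 1_000_000)


labs = misses_per_million("0.99")          # labs replicas, about two nines
production = misses_per_million("0.9999")  # production, about four nines
ratio = labs // production

print(labs, production, ratio)  # prints "10000 100 100"
```

 So at the same request volume, roughly a hundred times more page reads
 fall through to disk on the labs replicas than on a production section.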

 I understand it would break tools and queries, but I have a feeling that
 lots of them should already be split into multiple queries, or should read
 dumps instead, or sometimes it's more of an XY problem
 <https://en.wikipedia.org/wiki/XY_problem>.

 I think this is great and a big thank you for doing it.

 On Fri, Nov 13, 2020 at 11:39 AM Kimmo Virtanen <kimmo.virtanen(a)gmail.com>
 wrote:

  As a follow up comment.

 If I understand correctly the main problems are a) databases are growing
 too big to be stored in single instances and b) query complexity is
 growing.

 a) The growth of the data is not going away, as the major drivers for the
 growth are automated edits from Wikidata and Structured Data on Commons.
 They are generating new data at an increasing rate, faster than humans ever
 could. So the longer-term answer is to store the data in separate instances
 and use something like federated queries. This is how access to the
 commonswiki replica was originally done when Toolserver moved to Tool Labs
 in 2014.[1] Another long-term solution to make databases smaller is to
 replicate only the current state of wikidata/commonswiki and leave, for
 example, the revision history out.

 b) A major factor in query complexity, which affects query execution
 times, is afaik the actor migration and the data sanitization, which runs
 the queries through multiple views.[2,3] I have no idea how bad the
 problem currently is, but one could think that replication could be
 implemented with lighter sanitization by leaving some of the problematic
 data out of replication altogether.

 Anyway, my question is: are there more detailed plans for the *Wiki
 Replicas 2020 Redesign* than what is on the wiki page[4] or in the tickets
 linked from it? I guess there are, if the plan was to buy new hardware in
 October and we are now in the implementation phase? Also, is there
 information on the actual bottlenecks at table level? I.e., which tables
 (in which databases) are too big, hard to keep up with in replication,
 and slow in terms of query time?

 [1] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolse…
 [2] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_…
 [3] https://phabricator.wikimedia.org/T215445
 [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign

 Br,
 -- Kimmo Virtanen, Zache

 On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen <kimmo.virtanen(a)gmail.com>
 wrote:

     Maarten: Having 6 servers with each one having a slice + s4 (Commons)
     + s8 (Wikidata) might be a good compromise.

     Martin: Another idea is to have the database structured as-planned, but
     add a server with *all* databases that would be slower/less stable, but
     will provide a solution for those who really need cross database joins

 From the point of view of a person who uses cross database joins in
 both tools and analysis queries, I would say that both ideas would be
 suitable. I think that 90% of my crosswiki queries are written against
 *wiki + wikidata/commons. However, I would not say that it is only for
 those who really need it; rather, cross database joins are an awesome
 feature for everybody, and it would be a loss if they were gone.

 In the past we also had the ability to do joins between user databases
 and replica databases, which was removed in 2017 if I googled correctly.[1]
 My guess is that one reason for the increasing query complexity is that
 there is no possibility of creating tmp tables or joining to preselected
 data, so everything is done in single queries. In any case, if the solution
 is, as Martin suggests, to move cross-joinable databases to a single server,
 and the original problem was that it was hard to keep multiple servers in
 sync, then we could reintroduce the user database joins as well.

 [1] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_serve…

 Br,
 -- Kimmo Virtanen, Zache

 On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec
 <martin.urbanec(a)wikimedia.cz> wrote:

  +1 to Maarten

 Another idea is to have the database structured as-planned, but add a
 server with *all* databases that would be slower/less stable, but will
 provide a solution for those who really need cross database joins

 Martin

 On Fri, Nov 13, 2020 at 12:31 AM Maarten Dammers <maarten(a)mdammers.nl>
 wrote:

> I recall some point in time (Toolserver maybe?) when all the slices
> (overview at https://tools-info.toolforge.org/?listmetap ) were on
> different servers, but the Commons slice (s4) was on every server.
> At some point fancy new database servers were introduced with all the
> slices on all servers. Having 6 servers with each one having a slice + s4
> (Commons) + s8 (Wikidata) might be a good compromise.
> On 12-11-2020 00:58, John wrote:
>
> I’ll throw my hat in the ring too. Moving it to the application layer will
> make a number of queries just not feasible any longer. It might make sense
> from the administration side, but from the user perspective it breaks one
> of the biggest features that Toolforge has.
>
> On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec
> <martin.urbanec(a)wikimedia.cz> wrote:
>
>> MusikAnimal is right, however, Wikidata and Commons either have a sui
>> generis slice, or they share it with a few very large wikis. Tools that do
>> any kind of crosswiki analysis would instantly break, as most of them
>> utilise joining by Wikidata items at the very least.
>>
>> I second Maarten here. This would mean a lot of things that currently
>> require a (relatively simple) SQL query would need a full script, which
>> would do the join at the application level.
>>
>> I fully understand the reasoning, but there needs to be some
>> replacement. Intentionally introducing breaking changes while providing
>> no "new standard" is a bad pattern in a community environment.
>>
>> Martin
>>
>> On Wed, Nov 11, 2020, 10:31 PM MusikAnimal <musikanimal(a)gmail.com>
>> wrote:
>>
>>> Technically, cross-wiki joins aren't completely disallowed; you just
>>> have to make sure all of the db names are on the same slice/section,
>>> right?
>>>
>>> ~ MA
>>>
>>> On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers <maarten(a)mdammers.nl>
>>> wrote:
>>>
>>>> Hi Joaquin,
>>>> On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:
>>>>
>>>> TLDR: Wiki Replicas' architecture is being redesigned for stability
>>>> and performance. Cross database JOINs will not be available and a host
>>>> connection will only allow querying its associated DB. See [1]
>>>> <https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign>
>>>> for more details.
>>>>
>>>> If you only think of Wikipedia, not a lot will break probably, but
>>>> if you take into account Commons and Wikidata a lot will break. A quick
>>>> grep in my folder with Commons queries returns 123 lines with cross
>>>> database joins. So yes, stuff will break and tools will be abandoned.
>>>> This follows the practice that seems to have become standard for the
>>>> WMF these days: Decisions are made with a small group within the WMF
>>>> without any community involved. Only after the decision has been made,
>>>> it's announced.
>>>>
>>>> Unhappy and disappointed,
>>>>
>>>> Maarten
>>>> _______________________________________________
>>>> Wikimedia Cloud Services mailing list
>>>> Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
>>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>>>

 --
 Amir (he/him)

