Hello,

I actually welcome the change and am quite happy about it. It might break several tools (including some of mine) but as a database nerd, I can see the benefits outweighing the problems (and I wish benefits would have been communicated in the announcement).

The short version is that this change would make labs replicas blazing fast.

The long version: Database of all of wikis is currently being replicated to a set of giant "cloud" or "labs" replicas. IIRC correctly, these dbs have 512GB memory (while being massive is not big enough to hold everything), the space left for InnoDB Buffer pool should be around 350GB and storing everything in there is impossible (the rest would be for temporary tables and other critical functions), so I assume when you query quarry (sorry, I had to make the pun), most it is actually coming from reading disk which is ten times slower. Looking at graphs, The Innodb buffer pool efficiency for labs dbs is around 99% (two nines), while production databases (similar hardware but split into eight different sections) is 99.99% (four nines), these two orders of magnitude difference is mostly because of cache locality which I hope we would achieve if these changes get done (unless the new hardware will be commodity hardware instead of beefy servers but I doubt that, correct me if I'm wrong). Meaning less timeouts, less slow apps and tools, etc. It's not just speed though, the updates coming in to replicas would be split too so it wouldn't saturate the network and less heavy I/O in memory and disk meaning better scalability (adding commons/wikidata on each section would be the exact opposite of that and even if we do it now, we eventually have to pull the plug as wikis are growing and we are not the same size or growth speed we used to be years ago).

I understand it would break tools and queries but I have a feeling that lots of them should be already split into multiple queries, or should read dumps instead or sometimes it's more of an x/y problem

I think this is great and a big thank you for doing it.

On Fri, Nov 13, 2020 at 11:39 AM Kimmo Virtanen <kimmo.virtanen@gmail.com> wrote:

As a follow up comment.

If I understand correctly the main problems are a) databases are growing too big to be stored in single instances and b) query complexity is growing.

a) the growth of the data is not going away as the major drivers for the growth are automated edits from Wikidata and Structured data on Commons. They are generating new data with increasing speed faster than humans ever could. So the longer term answer is to store the data to separate instances and use something like federated queries. This is how the access to the commonwiki replica was originally done when toolserver moved to toollabs in 2014.[1] Another long term solution to make databases smaller is to replicate only the current state of the wikidata/commonswiki and leave for example the revision history out.

b) a major factor for query complexity which affects the query execution times is afaik the actor migration and the data sanitization which executes the queries through the multiple views.[2,3] I have no idea how bad the problem currently is, but one could think that replication could be implemented with lighter sanitation by leaving some of the problematic data out altogether from replication.

Anyway, my question is, are there more detailed plans for the Wiki Replicas 2020 Redesign than what is on the wikipage[4] or tickets linked from it? I guess there is if the plan is to buy new hardware in October and now we are in the implementation phase? Also is there information on the actual bottlenecks at table level? I.e., which tables (in which databases) are the too big ones, hard to keep up in replication and slow in terms of query time?

[1] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolserver_tools#Will_the_commons_database_be_replicated_to_all_clusters,_like_it_is_on_the_Toolserver?
[2] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas
[3] https://phabricator.wikimedia.org/T215445
[4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign

Br,
-- Kimmo Virtanen, Zache
On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen <kimmo.virtanen@gmail.com> wrote:
> Maarten: Having 6 servers with each one having a slice + s4 (Commons) + s8 (Wikidata) might be a good compromise.
> Martin: Another idea is to have the database structured as-planned, but add a server with all databases that would be slower/less stable, but will provide a solution for those who really need cross database joins

From the point of view of a person who is using cross database joins on both tools and analysis queries I would say that both ideas would be suitable. I think that 90% of my crosswiki queries are written against *wiki + wikidata/commons. However, I would not say that it is only for those who really need it but more like that cross database joins are an awesome feature for everybody and it is a loss if it will be gone.

In older times we had also ability to do joins between user databases and replica databases, which was removed in 2017 if I googled correctly.[1] My guess is that one reason for the increasing query complexity is that there is no possibility for creating tmp tables or joining to preselected data so everything is done in single queries. In any case, if the solution is what Martin suggests to move cross joinable databases to a single server and the original problem was that it was hard to keep in sync multiple servers then we could reintroduce the user database joins as well.

[1] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/

Br,
-- Kimmo Virtanen, Zache
On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec <martin.urbanec@wikimedia.cz> wrote:
+1 to Marteen

Another idea is to have the database structured as-planned, but add a server with all databases that would be slower/less stable, but will provide a solution for those who really need cross database joins

Martin
pá 13. 11. 2020 v 0:31 odesílatel Maarten Dammers <maarten@mdammers.nl> napsal:
I recall some point in time (Toolserver maybe?) when all the slices (overview at https://tools-info.toolforge.org/?listmetap ) were at different servers, but the Commons slice (s4) was on every server.
At some point new fancy database servers were introduced with all the slices on all servers. Having 6 servers with each one having a slice + s4 (Commons) + s8 (Wikidata) might be a good compromise.

On 12-11-2020 00:58, John wrote:
I’ll throw my hat in this too. Moving it to the application layer will make a number of queries just not feasible any longer. It might make sense from the administration side, but from the user perspective it beaks one of the biggest features that toolforge has.

On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec <martin.urbanec@wikimedia.cz> wrote:

MusikAnimal is right, however, Wikidata and Commons either have a sui generis slice, or they share it with a few very large wikis. Tools that do any kind of crosswiki analysis would instantly break, as most of them utilise joining by Wikidata items at the very least.

I second Maarten here. This would mean a lot of things that currently require a (relatively simple) SQL query would need a full script, which would do the join at the application level.

I fully understand the reasoning, but there needs to be some replacement. Intentionally introduce breaking changes while providing no "new standard" is a bad pattern in a community environment.

Martin

On Wed, Nov 11, 2020, 10:31 PM MusikAnimal <musikanimal@gmail.com> wrote:

Technically, cross-wiki joins aren't completely disallowed, you just have to make sure each of the db names are on the same slice/section, right?

~ MA

On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers <maarten@mdammers.nl> wrote:

Hi Joaquin,

On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:

TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] for more details.

If you only think of Wikipedia, not a lot will break probably, but if you take into account Commons and Wikidata a lot will break. A quick grep in my folder with Commons queries returns 123 lines with cross database joins. So yes, stuff will break and tools will be abandoned. This follows the practice that seems to have become standard for the WMF these days: Decisions are made with a small group within the WMF without any community involved. Only after the decision has been made, it's announced.

Unhappy and disappointed,

Maarten

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud