TLDR:
- Instead of `*.db.svc.eqiad.wmflabs` use `*.db.svc.wikimedia.cloud` to use the new replicas
- Quarry will migrate to use the new cluster on March 23
- In about a month (April 15) the old cluster will start retiring. See https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timelin... for more details
Hi everyone,
I'm happy to announce that the new replicas cluster is available for use, after being open for testing for a few weeks.
To use the new cluster, you will only need to change the hostname when connecting to the databases. Instead of `*.db.svc.eqiad.wmflabs` you can use `*.db.svc.wikimedia.cloud`. See:
- https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#New_hos...
- https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#How_can...
- https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database
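For tools that connect programmatically, the switch is just the hostname. Here is a minimal connection sketch (an illustration, not official documentation), assuming a Toolforge account with credentials in `~/replica.my.cnf`, the `pymysql` library, and `enwiki.analytics.db.svc.wikimedia.cloud` as one concrete instance of the new `*.db.svc.wikimedia.cloud` pattern; check the linked docs for the right service name for your use case:

```python
# Hedged sketch: connect to the new Wiki Replicas cluster from Toolforge.
# Assumes ~/replica.my.cnf exists (the standard Toolforge credentials file)
# and that pymysql is installed.
import configparser
import os

import pymysql

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/replica.my.cnf"))

conn = pymysql.connect(
    # Old cluster: enwiki.analytics.db.svc.eqiad.wmflabs
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    user=cfg["client"]["user"].strip("'"),
    password=cfg["client"]["password"].strip("'"),
    database="enwiki_p",
)
with conn.cursor() as cur:
    cur.execute("SELECT page_title FROM page WHERE page_namespace = 0 LIMIT 5")
    for (title,) in cur.fetchall():
        print(title)
conn.close()
```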
This brings us to the timeline updates: https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign#Timelin...
- March 2021
  - Announce new cluster domains for all users - March 12
  - Migrate Quarry to use the new cluster - March 23
- April 2021
  - PAWS migration - Estimated first week of April
  - Migrate the old cluster to utilize new replication hosts. Replication may stop. - April 15
  - Redirect old hostnames to the new cluster - April 28
Quarry is ready, so it will be the first to transition to the new replicas. This will happen on March 23, at which point we will deploy the changes and update the docs. We are publishing a note in Tech News, but if there are other venues where this information would be useful, please help us spread the word.
PAWS is in the process of being migrated. We estimate it could be ready to use the new cluster at the beginning of April, and will publish more details as soon as we know.
On April 15 the old cluster will be moved under the new replication hosts, at which point there may be side effects and replication may stop. We recommend updating your code to use the new hostnames before then, so you can verify that things are working normally.
Not long after, the old hostnames (`*.db.svc.eqiad.wmflabs`) will be redirected to the new ones (`*.db.svc.wikimedia.cloud`), at which point the old cluster will effectively be inaccessible.
If you have questions or need help please reply to cloud@lists.wikimedia.org, join the #wikimedia-cloud IRC channel, or open a ticket in Phabricator with the Data-Services tag.
Thanks,
Hi Joaquin,
Nothing at all was done with the community input about no longer being able to do cross-wiki joins [1]. In the past, the WMF would do something with community input. I guess the new strategy is to just give the appearance of community input to legitimize a decision. Nice way to alienate the volunteers.
Maarten
[1] https://lists.wikimedia.org/pipermail/cloud/2020-November/thread.html#1309
On 12-03-2021 19:16, Joaquin Oltra Hernandez wrote:
I completely agree with Maarten. It would be a step backward to stop supporting cross-database joins on wiki replicas. This is a breaking change and should not be applied unless a feasible solution to the problem is found.
On Saturday, March 13, 2021, 8:17:39 PM GMT+3, Maarten Dammers maarten@mdammers.nl wrote:
This is the next step after disallowing user databases on replicas. It broke some of my tools, but I recently rewrote them to move the joining logic into my application. I also replicate small amounts of data (e.g. page titles for a subset of pages) into my user db for joins. I found it quite off-putting at first. Volunteers are effectively forced to do substantial amounts of extra work after creating tools that "work perfectly fine". But I can understand the need to provide more scalable infrastructure for the ever-growing projects.
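For readers unfamiliar with that pattern, here is a rough sketch of what "replicate small amounts of data into my user db" can look like (an illustration under assumptions, not Daniel's actual code): `commons_conn` and `tools_conn` are assumed pymysql-style connections to `commonswiki_p` and to the tool's own ToolsDB database, and the `commons_titles` table name is made up.

```python
# Hedged sketch: copy a bounded slice of replica data into the tool's own
# user database, so later joins can happen there instead of across wikis.
# commons_conn / tools_conn are assumed connections; commons_titles is a
# hypothetical table name.
with commons_conn.cursor() as cur:
    cur.execute(
        "SELECT page_title FROM page WHERE page_namespace = 6 LIMIT 10000"
    )
    titles = [row[0] for row in cur.fetchall()]

with tools_conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS commons_titles ("
        "  title VARBINARY(255) NOT NULL PRIMARY KEY)"
    )
    cur.executemany(
        "INSERT IGNORE INTO commons_titles (title) VALUES (%s)",
        [(t,) for t in titles],
    )
tools_conn.commit()
# Later queries inside the user database can JOIN the tool's own tables
# against commons_titles, instead of doing a cross-wiki join on the replicas.
```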
On Sat, Mar 13, 2021, 3:46 PM Yetkin Sakal via Cloud < cloud@lists.wikimedia.org> wrote:
Hi,
These changes are not arbitrary; they are a necessity. They are happening because the clusters are out of capacity: they keep having problems with replication lag and crashes (e.g. [1]), and restoring servers takes days, during which the rest of the cluster remains under increased load. Additionally, they can't be upgraded to the new MariaDB versions that we need in order to stay in sync with the production databases and ensure the security of the cluster [2].
There is a set of trade-offs, as outlined in [3], and the new architecture manages to keep the replicas with almost all of their features: a complete data set updated in real time, with an SQL interface, in a stable and performant cluster that can be upgraded and maintained in sync with the production database clusters. With the growth experienced in recent years (Commons, Wikidata), the technology currently used by the Wiki Replicas is not feasible.
Not being able to do cross-wiki joins with SQL is a jump in complexity, sometimes so big that it is not easily overcome. Most cross-join queries can be recreated using two separate queries and filtering the results with code, which can already be daunting for some, but at least there is a solution. However, some queries are impractical to recreate that way because of the large amount of data involved. For example, one we have identified from the feedback is querying for overlapping local and Commons images.
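As a concrete (hypothetical) illustration of the "two separate queries plus filtering in code" pattern, here is a sketch for a bounded version of the enwiki/Commons file-name overlap case; `enwiki_conn` and `commons_conn` are assumed pymysql-style connections, and the naive full-table version of this is exactly the kind of query that stays impractical:

```python
# Hedged sketch: replace a cross-wiki JOIN with two queries and in-code filtering.
# Query 1 pulls a bounded list of names from one wiki; query 2 checks them on the
# other wiki with chunked IN lists; the old JOIN condition becomes a set lookup.
def chunks(seq, size=1000):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Query 1 (enwiki_p): a bounded sample of local file names.
with enwiki_conn.cursor() as cur:
    cur.execute("SELECT img_name FROM image LIMIT 10000")
    local_names = [row[0] for row in cur.fetchall()]

# Query 2 (commonswiki_p): which of those names also exist on Commons.
shadowed = set()
for batch in chunks(local_names):
    placeholders = ",".join(["%s"] * len(batch))
    with commons_conn.cursor() as cur:
        cur.execute(
            "SELECT img_name FROM image WHERE img_name IN (%s)" % placeholders,
            batch,
        )
        shadowed.update(row[0] for row in cur.fetchall())

print(len(shadowed), "of the sampled local file names also exist on Commons")
```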
We have been keeping track of all the email responses and Phabricator comments with specific examples and code using cross-wiki joins, and we have been working on a system to gather data from Quarry so we can analyze cross-join queries in depth. [4] was made to figure out common use cases and possibilities to enable them again.
There is no easy answer. Some things are becoming harder to do, and others much harder, and we need to figure out together which ones become practically impossible because of the size of the datasets, and find better solutions for those.
[1] Latest crash: https://phabricator.wikimedia.org/T276980 - Replication lagging: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=labsd...
[2] https://lists.wikimedia.org/pipermail/cloud/2020-November/001322.html
[3] https://lists.wikimedia.org/pipermail/cloud/2020-November/001326.html
[4] https://phabricator.wikimedia.org/T215858
On Sun, Mar 14, 2021 at 12:15 AM Daniel Schwen lists@schwen.de wrote:
This might be a trivial suggestion (for me it was a game changer):
Segment your large queries based on an indexed column. By that I mean: add an additional WHERE clause to process only a small subset of the entire DB (e.g. page_id >= 130000 AND page_id < 140000), and then loop in your application over many of those small, fast-running queries, moving the window from MIN(page_id) to MAX(page_id). This made it possible to handle the amounts of data returned in my application logic and got rid of query timeouts. I'm sure this reduces pressure on the DB servers too, as the memory requirement for these smaller queries is a lot smaller.
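For illustration, a rough sketch of that loop (assuming a pymysql-style `conn` to a replica and the standard page table; the window size is arbitrary):

```python
# Hedged sketch of the windowing tip: iterate over indexed page_id ranges so
# each query touches only a small slice and returns quickly.
WINDOW = 10_000

with conn.cursor() as cur:
    cur.execute("SELECT MIN(page_id), MAX(page_id) FROM page")
    lo, hi = cur.fetchone()

total = 0
start = lo
while start <= hi:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_id, page_title FROM page "
            "WHERE page_namespace = 0 AND page_id >= %s AND page_id < %s",
            (start, start + WINDOW),
        )
        rows = cur.fetchall()
        total += len(rows)  # process each small, fast-running chunk here
    start += WINDOW

print(total, "rows processed across", (hi - lo) // WINDOW + 1, "windows")
```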
Either way, I'd be interested in hearing other devs' experiences and tips on how to cope with the new restrictions. It's the new reality and there is little point in complaining about it IMO (not that I haven't done that already :-D)
On Mon, Mar 15, 2021 at 11:58 AM Joaquin Oltra Hernandez < jhernandez@wikimedia.org> wrote:
[4] was made to figure out common use cases and possibilities to enable them again.
...
I just want to highlight this ^ thing Joaquin said and mention that our team (Data Engineering) is also participating in brainstorming ways to bring back not just cross-wiki joins but better datasets to run these queries. We have some good ideas, so please do participate in the task and give us more input so we can pick the best solution quickly.
A little late to the party, I just learned about this change today.
I maintain a number of bot tasks (https://en.wikipedia.org/wiki/User:FastilyBot) and database reports (https://fastilybot-reports.toolforge.org, https://en.wikipedia.org/wiki/Wikipedia:Database_reports) on enwp that rely on cross-wiki joins (mostly page title joins between enwp and Commons) to function properly. I didn't find the migration instructions (https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today?) very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly impractical for me to perform a "join" in the bot's code.
Is there going to be a replacement for this functionality?
Fastily
On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu dandreescu@wikimedia.org wrote:
I said it before, and I will say it again: *some* databases should be available for cross-wiki JOINs everywhere. This would at least include commons_p and centralauth_p, but perhaps also enwiki_p and meta_p.
I know that we discussed it before and better long-term solutions can be imagined (such as a data lake, etc.) but we need a solution *now*.
Sorry for coming at it a bit passionately. It just feels like we are choosing a path of pain for end users (let their tools break, then offer alternatives that they could adopt) over a path of effort for the Cloud team (let them create robust solutions for the end users and give them ample time to transition to the new method before turning off cross-wiki joins).
On Wed, Mar 31, 2021 at 4:57 AM Fastily fastilywp@gmail.com wrote:
over a path of effort for the Cloud team
It seems to me that the Cloud team is putting in all of the effort they can. I'm not sure where they would find more time and energy to implement a better solution. I imagine any better solution wouldn't be a matter of a few extra hours, but rather finding thousands and thousands of hours to build something bespoke.
This is painful. I think you raised some really good points about cross-joins with CentralAuth and Commons, as those are *designed* to be cross-referenced from other wikis. But ultimately, if there's no reasonable way to do it in the software we have available (MariaDB), implementing our own solution would take several orders of magnitude more time, and then we'd need even more time to maintain it.
The real source of pain here is the success of Wikimedia projects. Once things get so big that you can't really use *one bigass server* to solve the problem anymore, scaling involves taking on new complexities in downstream code. In my experience, everyone is struggling with the limits that the "big data" age has put on our ability to query and analyze. No matter what, this type of transition, and likely the ones that will follow, will put a big burden on tool developers. I just don't see a good way around that, even though it is a very bad situation and many volunteers won't be able to handle the burden.
I guess all I'm trying to say is, don't lay this on the Cloud team being lazy. They aren't. They are one of the most volunteer focused teams at the Wikimedia Foundation and they do quite a lot to support us with the little resources they have. Your frustrations are perfectly valid though. This is a very frustrating situation.
On Wed, Mar 31, 2021 at 7:01 AM Huji Lee huji.huji@gmail.com wrote:
I am not being critical of people (namely, the amazing Cloud team) here. I am being critical of decisions. That could even involve much higher level decisions e.g. should WMF have spent more money and hired more resources for this? It could very well be that I am uninformed, and these decisions were made very carefully and are the best option for WMF, the Cloud team and the users; but I did not get that impression from our previous conversations on this topic.
If you were to look at this internally as the Cloud team (and I realize it is not my place to impersonate them), the trade-off could also be viewed as: do we delay turning off cross-joins and make our employer mad (because we are delaying something we promised), or do we move forward and make some of our users mad (because we are breaking their tools without offering a solid alternative)? This is a classic middle-manager dilemma, and I feel sorry for anyone stuck between a rock and a hard place like that.
By the way, I cannot stop myself from highlighting how funny the oxymoron in this sentence is: *The real source of pain here is the success of Wikimedia projects*. I might start quoting that going forward!
On Wed, Mar 31, 2021 at 10:59 AM Aaron Halfaker aaron.halfaker@gmail.com wrote:
Hi,
This is painful. I think you raised some really good points about cross-joins with CentralAuth and Commons, as those are *designed* to be cross-referenced from other wikis. But ultimately, if there's no reasonable way to do it in the software we have available (MariaDB), implementing our own solution would take several orders of magnitude more time, and then we'd need even more time to maintain it.
There are things like the Spider storage engine in the MariaDB realm, for example [1,2]. It may not scale, or there may be other reasons not to use it, but I haven't really heard or seen comments on why that would be so. Also, I think there is no technical reason why just some tables couldn't be replicated to all servers so that the most common use cases could be fulfilled, though there could be other reasons not to do it. For example, it could just be an idea that is clearly bad for reasons we don't see right now. However, there is currently no visibility into the underlying limitations that are ultimately defining our choices.
[1] https://mariadb.com/kb/en/spider-storage-engine-overview/ [2] https://dzone.com/articles/uses-for-mariadb-and-the-spider-storage-engine
I guess all I'm trying to say is, don't lay this on the Cloud team being lazy. They aren't.
Also, AFAIK, based on Cloud team members' comments, they seem to understand why people feel cross-wiki joins are useful and important, even though they don't have any solutions currently.
Br, --Kimmo Virtanen, Zache
On Wed, Mar 31, 2021 at 5:59 PM Aaron Halfaker aaron.halfaker@gmail.com wrote:
Hi Fastily, we are aware of the use case for matching commons pages/images/sha1s between commons/big wikis and other wikis, as it has come up many times. I'm cataloging all the comments and examples that have come up in the last 5 months in order to provide categorized input to the parent task https://phabricator.wikimedia.org/T215858 so that the engineering teams can think of solutions. I'll share it publicly once it is in a presentable state.
We did some exploration a while ago (from Huji's examples); you can see some notebooks with Python approaches here: https://phabricator.wikimedia.org/T267992#6637250. However, there is too much data, so doing the same thing takes a very long time and can be impractical. If you want to give it a try, have a look at the notebooks. I don't think the code is too memory intensive, especially in bd808's notebook using the API, and a Raspberry Pi could maybe handle it.
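For readers who have not opened the notebooks, the API-based idea looks roughly like this (a hedged illustration of the general approach, not a copy of bd808's notebook): ask the Commons Action API, in batches, which of a list of local file titles also exist there.

```python
# Hedged sketch: check which "File:..." titles exist on Commons via the Action
# API, 50 titles per request, instead of a cross-wiki SQL join.
import requests

API = "https://commons.wikimedia.org/w/api.php"

def exists_on_commons(file_titles):
    found = set()
    for i in range(0, len(file_titles), 50):
        batch = file_titles[i:i + 50]
        data = requests.get(API, params={
            "action": "query",
            "titles": "|".join(batch),
            "format": "json",
            "formatversion": "2",
        }).json()
        for page in data["query"]["pages"]:
            if not page.get("missing"):
                found.add(page["title"])
    return found

print(exists_on_commons(["File:Example.jpg", "File:Surely missing 1234567.png"]))
```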
It is more complex and error-prone, for sure, so disabling those reports and waiting is sadly the option right now, until a suitable solution for this is found.
So, to answer your question:
Is there going to be a replacement for this functionality?
I can't promise anything yet but I can assure you the teams involved in these systems are aware of the need for this functionality and will be looking into how to provide it to make these reports/bots/queries viable.
We will send updates or new info to the cloud lists, and you can subscribe to these tasks if you want to follow more closely:
- Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema: https://phabricator.wikimedia.org/T215858
- Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's: https://phabricator.wikimedia.org/T267992
- Provide a mechanism for detecting duplicate files in commons and a local wiki: https://phabricator.wikimedia.org/T268240
- Provide a mechanism for detecting duplicate files in enwiki and another wikipedia: https://phabricator.wikimedia.org/T268242
- Provide a mechanism for accessing the names of image files on Commons when querying another wiki: https://phabricator.wikimedia.org/T268244
On Wed, Mar 31, 2021 at 10:57 AM Fastily fastilywp@gmail.com wrote:
Is it feasible to do a log analysis of the database servers to find out what tools are (were?) using cross-wiki joins? At least that would ensure that all the tool owners could be contacted directly to make sure they know this is happening.
On Mar 31, 2021, at 3:46 PM, Joaquin Oltra Hernandez jhernandez@wikimedia.org wrote:
Hi Fastily, we are aware of the use case for matching commons pages/images/sha1s between commons/big wikis and other wikis, as it has come up many times. I'm cataloging all the comments and examples that have come up in the last 5 months in order to provide categorized input to the parent task https://phabricator.wikimedia.org/T215858 so that the engineering teams can think of solutions. I'll share it publicly once it is in a presentable state.
We did some exploration a while ago (from Huji's examples), you can see some notebooks with python approaches here https://phabricator.wikimedia.org/T267992#6637250, but there is too much data and doing the same takes a very long time and can be impractical. If you want to give it a try have a look at the notebooks, I don't think the code is too memory intensive, specially in bd808s notebook using the API, and Raspberry Pis could maybe handle it.
It is more complex and error-prone, for sure, so disabling those reports and waiting is sadly the option right now, until a suitable solution for this is found.
So, to answer your question:
Is there going to be a replacement for this functionality? I can't promise anything yet but I can assure you the teams involved in these systems are aware of the need for this functionality and will be looking into how to provide it to make these reports/bots/queries viable.
We will send updates or new info to the cloud lists, and you can subscribe to these tasks if you want to follow more closely: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema https://phabricator.wikimedia.org/T215858 Provide mechanism to detect name clashed media between Commons and a Local project, without needing to join tables across wiki-db's https://phabricator.wikimedia.org/T267992 Provide a mechanism for detecting duplicate files in commons and a local wiki https://phabricator.wikimedia.org/T268240 Provide a mechanism for detecting duplicate files in enwiki and another wikipedia https://phabricator.wikimedia.org/T268242 Provide a mechanism for accessing the names of image files on Commons when querying another wiki https://phabricator.wikimedia.org/T268244
On Wed, Mar 31, 2021 at 10:57 AM Fastily <fastilywp@gmail.com mailto:fastilywp@gmail.com> wrote: A little late to the party, I just learned about this change today.
I maintain a number of bot tasks https://en.wikipedia.org/wiki/User:FastilyBot and database https://fastilybot-reports.toolforge.org/ reports https://en.wikipedia.org/wiki/Wikipedia:Database_reports on enwp that rely on cross-wiki joins (mostly page title joins between enwp and Commons) to function properly. I didn't find the migration instructions https://wikitech.wikimedia.org/w/index.php?title=News/Wiki_Replicas_2020_Redesign&oldid=1905818#How_do_I_cross_reference_data_between_wikis_like_I_do_with_cross_joins_today? very helpful; I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly impractical for me to perform a "join" in the bot's code.
Is there going to be a replacement for this functionality?
Fastily
On Mon, Mar 15, 2021 at 3:09 PM Dan Andreescu <dandreescu@wikimedia.org> wrote:

[4] was made to figure out common use cases and possibilities to enable them again. ... [4] https://phabricator.wikimedia.org/T215858
--
Joaquin Oltra Hernandez
Developer Advocate - Wikimedia Foundation
On Mar 31, 2021, at 2:20 PM, Roy Smith roy@panix.com wrote:
Is it feasible to do a log analysis of the database servers to find out what tools are (were?) using cross-wiki joins? At least that would ensure that all the tool owners could be contacted directly to make sure they know this is happening.
It’s not feasible to log all the queries, unfortunately. The log grows too fast to keep on disk (we tried it). However, I am running a service that samples the queries at random intervals. It doesn’t produce a complete list, but it’s been running for something like a month now, so it should be a pretty good one. I just did a pull from that tool and can try to script out who is doing cross-wiki joins and let you know. It may well be quite doable once I’ve got the list parsed out.
Brooke Storm Staff SRE Wikimedia Cloud Services bstorm@wikimedia.org
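(As an aside: once such a sample is sitting in a plain-text file with one query per line, a first rough pass at flagging cross-wiki joins could look something like the sketch below. The file layout and the database-name pattern are assumptions; the actual sampling tooling may look nothing like this.)

import re
import sys
from collections import Counter

# Database qualifiers such as "enwiki_p." or "commonswiki_p." inside a query.
DB_REF = re.compile(r"\b([a-z0-9_]+_p)\s*\.", re.IGNORECASE)

cross_wiki = Counter()
with open(sys.argv[1]) as f:               # assumed: one sampled query per line
    for line in f:
        dbs = {m.group(1).lower() for m in DB_REF.finditer(line)}
        if len(dbs) > 1:                   # query references more than one wiki database
            cross_wiki[frozenset(dbs)] += 1

for dbs, count in cross_wiki.most_common(20):
    print(count, ", ".join(sorted(dbs)))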
I run FastilyBot on a Raspberry Pi, and needless to say it would be grossly impractical for me to perform a "join" in the bot's code.
Why not run it on WMF Cloud? In-code joins will very likely work there, and Cloud is supported. You are effectively asking for a second way of doing this to be supported here.
And I think it is by no means "needless to say" that it would be "grossly impractical" to do this in the bot code. Joining page titles from Commons and en.wp could, for example, be done by having the DB servers return a sorted stream of page titles from each wiki (incrementally, with LIMIT). Your app code would advance whichever stream has the lexicographically smaller title, emitting the titles that are equal. This would require little effort on the Pi; however, it would transfer two big data streams (2-3 GB maybe), so feasibility would depend on your connection, but again this would be cheap on Cloud!
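A sketch of that merge in Python, to make the idea concrete (this streams each side with an unbuffered cursor rather than paging with LIMIT, and the hostnames and exact queries are assumptions rather than a recommendation; the point is the two-pointer walk over two sorted streams):

import os
import pymysql

def sorted_titles(host, db, query):
    """Stream one sorted column from a replica without buffering it all client-side."""
    conn = pymysql.connect(
        host=host,
        database=db,
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        cursorclass=pymysql.cursors.SSCursor,  # unbuffered: rows arrive as they are read
    )
    with conn.cursor() as cur:
        cur.execute(query)
        for (title,) in cur:
            yield title

def merge_join(a, b):
    """Yield values present in both sorted streams, advancing whichever is behind."""
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)

commons = sorted_titles("commonswiki.web.db.svc.wikimedia.cloud", "commonswiki_p",
                        "SELECT img_name FROM image ORDER BY img_name")
enwiki = sorted_titles("enwiki.web.db.svc.wikimedia.cloud", "enwiki_p",
                       "SELECT page_title FROM page WHERE page_namespace = 6 ORDER BY page_title")

for title in merge_join(commons, enwiki):
    print(title.decode("utf-8", "replace"))  # replica title columns come back as bytes

Run from Toolforge or Cloud VPS, both streams stay inside the data center, which is the cost point above about doing it over a home connection versus on Cloud.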
I'm just playing around on tools-sgebastion-08. I can dump the first 10 million image names in about half a minute:
tools.spi-tools-dev:xw-join$ time mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.web.db.svc.wikimedia.cloud commonswiki_p -N -e 'select img_name from image limit 10000000 ' > /dev/null
real    0m36.586s
user    0m9.678s
sys     0m1.324s
but if I try 100 million, it fails:
tools.spi-tools-dev:xw-join$ time mysql --defaults-file=$HOME/replica.my.cnf -h commonswiki.web.db.svc.wikimedia.cloud commonswiki_p -N -e 'select img_name from image limit 100000000 ' > /dev/null
Killed
real    0m9.875s
user    0m1.417s
sys     0m1.561s
Is there some maximum result size configured by default? The full image table on Commons is about 70M rows, so extrapolating from the first example, moving all that data should only take a few minutes.
That could be RAM limits on the bastion. Actually, scratch that: I’ve confirmed the mysql client was killed by the OOM killer on that bastion:

Mar 31 23:29:17 tools-sgebastion-08 kernel: [2860588.199138] mysql invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
-Brooke
Thanks for looking into this. I tried this again a little later, and it ran fine. Odd that the amount of memory used depends on the number of rows; I would expect it to stream results to stdout as they come in, but apparently not.
Even weirder that the 100M example runs OOM in 10s, while the 10M example runs to completion in 36s. Could it be pre-allocating buffer space for the number of rows it expects to ultimately get? Ugh, that would be a crazy design, but it does seem like that's what's happening.
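(The usual explanation here is that the stock mysql client buffers the entire result set in memory before printing anything, which is why memory scales with row count; passing --quick tells it to print rows as they arrive instead. The Python connectors have an unbuffered equivalent; a small sketch, reusing the host and query from the experiment above:)

import os
import pymysql

# Stream img_name rows without holding the whole result set in memory,
# roughly the unbuffered-cursor analogue of `mysql --quick`.
conn = pymysql.connect(
    host="commonswiki.web.db.svc.wikimedia.cloud",
    database="commonswiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    cursorclass=pymysql.cursors.SSCursor,
)
with conn.cursor() as cur:
    cur.execute("SELECT img_name FROM image LIMIT 100000000")
    for (name,) in cur:
        print(name.decode("utf-8", "replace"))
conn.close()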
I'd like to continue exploring this, just not quite sure of the appropriate way forward. I gather doing work like this on the Toolforge bastion hosts is frowned upon, so I guess what I should be doing is spinning up a VPS instance via https://horizon.wikimedia.org/? I've been reading through https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_Instances, from which I gather I need to be a project admin. Is there some existing project I can join to do this exploratory work, or should I create a new project?
Hello Roy, depending on what you want to experiment with, https://paws.wmcloud.org/ might be a good choice. You are correct about Cloud VPS instances. If your work becomes its own project, you are more than welcome to request a new project on Cloud VPS.