TL;DR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross-database JOINs will not be available, and a host connection will only allow querying its associated DB. See [1] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign for more details.
Hi!
In the interest of making and keeping Wiki Replicas a stable and performant service, a new backend architecture is needed. There is some impact on features and usage patterns.
What should I do? To avoid breaking changes, you can start making the following changes *now*:
- Update existing tools to ensure queries are executed against the proper database connection
  - E.g.: If you want to query the `eswiki_p` DB, you must connect to the `eswiki.analytics.db.svc.eqiad.wmflabs` host and the `eswiki_p` DB, and not to enwiki or other hosts
- Check your existing tools' and services' queries for cross-database JOINs, and rewrite the joins in application code (see the sketch after this list)
  - E.g.: If you are doing a join across databases, for example joining `enwiki_p` and `eswiki_p`, you will need to query them separately and filter the results of the separate queries in the code
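For illustration, a minimal sketch of what "query them separately and filter in code" could look like, assuming PyMySQL, the usual ~/replica.my.cnf credentials file on Toolforge, and the per-wiki analytics hostnames mentioned above (the helper name and the example query are made up for this sketch):

    # Query each wiki on its own host and do the "join" in application code.
    import os
    import pymysql

    def fetch_names(dbname, query):
        conn = pymysql.connect(
            host=f"{dbname}.analytics.db.svc.eqiad.wmflabs",
            database=f"{dbname}_p",
            read_default_file=os.path.expanduser("~/replica.my.cnf"),
            charset="utf8mb4",
        )
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                return {row[0] for row in cur.fetchall()}
        finally:
            conn.close()

    # Instead of a cross-database JOIN, intersect the two result sets in Python.
    enwiki_files = fetch_names("enwiki", "SELECT img_name FROM image")
    eswiki_files = fetch_names("eswiki", "SELECT img_name FROM image")
    shared = enwiki_files & eswiki_files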
Timeline:
- November - December: Early adopter testing
- January 2021: Existing and new systems online, transition period starts
- February 2021: Old hardware is decommissioned
We need your help:
- If you would like to beta test the new architecture, please let us know and we will reach out to you soon
- Share examples / descriptions of how a tool or service was updated, write a common solution or some example code others can utilize and reference, and help others on IRC and the mailing lists
If you have questions or need help adapting your code or queries, please contact us [2] https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication, or write on the talk page [3] https://wikitech.wikimedia.org/wiki/Talk:News/Wiki_Replicas_2020_Redesign.
We will be sending reminders, and more specific examples of the changes via email and on the wiki page. For more information see [1] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign.
[1]: https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
[2]: https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_communication
[3]: https://wikitech.wikimedia.org/wiki/Talk:News/Wiki_Replicas_2020_Redesign
Hi! Most tools query just a single DB at a time, so I don't think this will be a massive problem. However, some such as Global Contribs[0] and GUC[1] can theoretically query all of them from a single request. Creating new connections on the fly seems doable in production; the issue is how to work on these tools in a local environment. Currently the recommendation is to use an SSH tunnel to the desired host,[2] such as enwiki.analytics.db.svc.eqiad.wmflabs. Surely we can't do this same port forwarding for 900+ connections.
Any ideas? Perhaps there's some way to make a host that automatically forwards to the correct one, solely for developer use? Or will development of such global tools need to happen in the Cloud Services environment?
~ MA
[0] https://xtools.wmflabs.org/globalcontribs [1] https://guc.toolforge.org/ [2] https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#SSH_tunneling_for_local_testing_which_makes_use_of_Wiki_Replica_databases
Hi MA, You could still accomplish the local environment you are describing by using 8 SSH tunnels. All the database-name DNS aliases eventually reference the section names (s1, s2, s3, s4, etc., in the form of s1.analytics.db.svc.eqiad.wmflabs). An app could be written to connect to the correct section instead of the database if you are doing that kind of thing, but you'll either need to make requests to https://noc.wikimedia.org/conf/dblists/s<correct-number>.dblist (like https://noc.wikimedia.org/conf/dblists/s4.dblist) and map things out from there, or perhaps check DNS for the database name and look up the "s#" record from there (which is currently possible in Lua, and I can provide an example of how I did it in that language).
A MediaWiki config checkout would also work, besides what can be gleaned from noc.wikimedia.org.
We can try to document some examples of how you might do it either way. I’m sure it is non-trivial, but 8 tunnels is more workable than 900, at least.
Routing by reading the queries on the fly is quite tricky. The closest I’ve seen ready-made tools come to that is ProxySQL, and that focuses on sharding, which is not exactly the same thing.
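For illustration, a rough sketch of the dblist-mapping approach described above, assuming sections s1-s8 and that each dblist file is plain text with one database name per line (comment lines starting with "#"):

    # Build a dbname -> section map from the noc.wikimedia.org dblist files,
    # then derive the replica host for a given wiki.
    import urllib.request

    SECTIONS = [f"s{i}" for i in range(1, 9)]

    def build_section_map():
        mapping = {}
        for section in SECTIONS:
            url = f"https://noc.wikimedia.org/conf/dblists/{section}.dblist"
            with urllib.request.urlopen(url) as resp:
                for line in resp.read().decode("utf-8").splitlines():
                    line = line.strip()
                    if line and not line.startswith("#"):
                        mapping[line] = section
        return mapping

    sections = build_section_map()
    host = f"{sections['eswiki']}.analytics.db.svc.eqiad.wmflabs"

Caching the resulting map avoids refetching the lists on every request.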
Brooke Storm Staff SRE Wikimedia Cloud Services bstorm@wikimedia.org IRC: bstorm
Ah yes, 8 tunnels is more than manageable. The `slice` column in the meta_p.wiki table is the one we need to connect to for said wiki, right? So in theory, I always have SSH tunnels open for every slice, and the first thing I do is check meta_p.wiki for the given wiki, then I know which of those s1-s8 connections to use? So I really only need 8 connections (even in production). Maybe not what you would recommend for every tool, rather just the "global" ones facing this specific issue.
Can't you just tunnel to the login server and connect by hostname from there?
Hmm, I'm not sure I follow. Right now I SSH to login.toolforge.org, but with "-L 4711:enwiki.analytics.db.svc.eqiad.wmflabs:3306" for port forwarding from my local MySQL to the remote. It sounds like instead I need to tunnel to s1-s8 and use the correct one based on the desired database.
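For what it's worth, a hypothetical sketch of that local setup, assuming one tunnel per section is already open on local ports 4711-4718 (the port numbers, the s7 location of meta_p, and the format of the `slice` column are assumptions, not an official recommendation):

    # Hypothetical local routing: one SSH tunnel per section, e.g.
    #   ssh -N login.toolforge.org -L 4711:s1.analytics.db.svc.eqiad.wmflabs:3306 ... (up to s8)
    # then pick the right local port based on meta_p.wiki's `slice` column.
    import os
    import pymysql

    LOCAL_PORTS = {f"s{i}": 4710 + i for i in range(1, 9)}  # s1 -> 4711, ..., s8 -> 4718
    DEFAULTS = dict(host="127.0.0.1",
                    read_default_file=os.path.expanduser("~/replica.my.cnf"))

    def section_for(dbname):
        # meta_p assumed reachable through the s7 tunnel; `slice` may look like "s3.labsdb".
        meta = pymysql.connect(port=LOCAL_PORTS["s7"], database="meta_p", **DEFAULTS)
        try:
            with meta.cursor() as cur:
                cur.execute("SELECT `slice` FROM wiki WHERE dbname = %s", (dbname,))
                return cur.fetchone()[0].split(".")[0]
        finally:
            meta.close()

    def connect(dbname):
        return pymysql.connect(port=LOCAL_PORTS[section_for(dbname)],
                               database=f"{dbname}_p", **DEFAULTS)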
~ MA
Yes, you might be able to use the meta_p.wiki table. However, when wikis are moved between sections, nothing updates the meta_p.wiki table at this time. Requests to noc.wikimedia.org are accurate and up to date, as far as I know. We only update meta_p when we add the wiki (at least that's how it is now). Also, the DNS gets synced and updated every time we run the script, so it is usually up to date. You could try meta_p.wiki and fall back to DNS or noc.wikimedia.org if that fails, perhaps? meta_p is expected to be on s7 in the new design.
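A hedged sketch of that DNS fallback, assuming each <dbname>.analytics.db.svc.eqiad.wmflabs alias is a CNAME pointing at its section host and that the third-party dnspython package is available (the exact record layout is an assumption):

    # Resolve the wiki's replica alias and extract the section ("s#") from the
    # canonical name it points at.
    import dns.resolver

    def section_from_dns(dbname):
        answer = dns.resolver.resolve(f"{dbname}.analytics.db.svc.eqiad.wmflabs", "CNAME")
        target = str(answer[0].target)  # e.g. "s7.analytics.db.svc.eqiad.wmflabs."
        return target.split(".")[0]     # -> "s7"

    print(section_from_dns("eswiki"))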
Brooke Storm Staff SRE Wikimedia Cloud Services bstorm@wikimedia.org IRC: bstorm
Got it. The https://noc.wikimedia.org/conf/dblists/ lists are plenty fast and easy enough to parse. I'll just cache that. It would be neat if we could rely on the slice specified in meta_p in the future, as in my case we have to query meta_p.wiki regardless, but not a big deal :)
Thank you! I think I have enough information to move forward.
~ MA
On a bit of a side note, for forwarding many IPs/ports through SSH, a tool like sshuttle [1] might be interesting; I've used it in the past with success.
It's a bit complex: it uses an SSH tunnel plus iptables/pf/... rules to move the traffic through that tunnel, and a process running on the other side that is copied and started through that same SSH tunnel, so if you have a complex network setup it might break some things (e.g. it does not work yet with systemd-resolved, the default on Fedora 33).
[1] https://github.com/sshuttle/sshuttle
Cheers,
On Tue, Nov 10, 2020 at 1:15 PM MusikAnimal musikanimal@gmail.com wrote:
Hi! Most tools query just a single db at a time, so I don't think this will be a massive problem. However some such as Global Contribs[0] and GUC[1] can theoretically query all of them from a single request. Creating new connections on-the-fly seems doable in production, the issue is how to work on these tools in a local environment. Currently the recommendation is to use a SSH tunnel to the desired host,[2] such as enwiki.analytics.db.svc.eqiad.wmflabs. Surely we can't do this same port forwarding for 900+ connections.
Any ideas? Perhaps there's some way to make a host that automatically forwards to the correct one, solely for developer use? Or will development of such global tools need to happen in the Cloud Services environment?
Can't you just tunnel to the login server and connect by hostname from there?
Cross-wiki JOINs are used by some of the queries we run regularly for fawiki. One of those queries looks for articles that don't have an image in their infobox on fawiki, but do have one on enwiki, so that we can use/import that image. Another one JOINs fawiki data with Commons data to look for redundant images. Yet another one looks for articles that use an image that doesn't exist (for cleanup purposes) but needs to join with the Commons DB because the referenced file might exist there. Lastly, we have a report that looks for fair-use images on fawiki that had the same name as an image on enwiki where the enwiki copy was deleted; this usually indicates an improper application of fair use, and enwiki -- due to its larger community -- finds and deletes these faster than we could on fawiki.
There may be other cases I am unaware of. The point is, losing the cross-wiki JOIN capability can make some of the above tasks really difficult or completely impossible.
Most cross-db JOINs can be recreated using two queries and an external tool to filter the results. However, there are some queries that would be simply impractical due to the large amount of data involved, and the query for overlapping local and Commons images is one of them. There are basically two ways to recreate the query: re-implement the inner join or re-implement a semi-join subquery.
Recreating a JOIN is conceptually very simple: get two lists and compare them. However, there are 67,034 files on fawiki, 891,286 files on enwiki, and 65,559,375 files on Commons. Simply joining by name would be impossible -- MariaDB would time out a few hundred times before returning all that data, and even if it did, storing those lists even as efficiently as possible would be quite the memory hog. So the query would have to be paginated. The only common identifier we have is the file name, and since the letters in the names aren't evenly distributed, paginating wouldn't exactly be fun. The other option is implementing the Commons lookup as a semi-join subquery: iterate over the local data, paginating any way you want, and then, for every item, query the Commons database for that title. Of course, we're now making a million requests to the database, which isn't going to be very fast simply due to network delays. We could be a little nicer and group a bunch of titles together in each query, which will probably get us down from a million queries to fifty thousand or so. Of course, this all gets more complicated if you want a query more complex than SELECT enwiki_p.image.img_name FROM enwiki_p.image JOIN commonswiki_p.image ON enwiki_p.image.img_name = commonswiki_p.image.img_name;
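To make the batching idea concrete, here is a rough sketch of such a batched semi-join (hostnames, the batch size, and using image.img_name as the pagination key are assumptions based on the discussion above, not a tested recipe):

    # Page through enwiki file names and, for each batch, ask Commons which of
    # them also exist there; collect the overlap in application code.
    import os
    import pymysql

    def connect(dbname):
        return pymysql.connect(host=f"{dbname}.analytics.db.svc.eqiad.wmflabs",
                               database=f"{dbname}_p",
                               read_default_file=os.path.expanduser("~/replica.my.cnf"))

    BATCH = 1000
    overlap = []
    enwiki, commons = connect("enwiki"), connect("commonswiki")
    last = ""
    while True:
        with enwiki.cursor() as cur:
            cur.execute("SELECT img_name FROM image WHERE img_name > %s "
                        "ORDER BY img_name LIMIT %s", (last, BATCH))
            names = [row[0] for row in cur.fetchall()]
        if not names:
            break
        last = names[-1]
        with commons.cursor() as cur:
            placeholders = ",".join(["%s"] * len(names))
            cur.execute(f"SELECT img_name FROM image WHERE img_name IN ({placeholders})", names)
            overlap.extend(row[0] for row in cur.fetchall())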
I understand the system engineering reasons for this change, but I think it's worth underscoring exactly how disruptive it will be for the queries that depended on this functionality. I'm certainly no expert, but I'm willing to help wrap queries in Python until they start working again.
ACN
On Wed, Nov 11, 2020 at 5:26 AM AntiCompositeNumber anticompositenumber@gmail.com wrote:
I understand the system engineering reasons for this change, but I think it's worth underscoring exactly how disruptive it will be for the queries that depended on this functionality.
The use cases seem to be relatively few and relatively limited. Could this perhaps be a good case for a data mart (ETL) or meta-index-style approach? I'm thinking of things like CloverDX and Jaspersoft ETL, or even Apache Solr or another non-SQL solution.
Moving JOINs up the stack from the SQL layer to the application layer does not sound like an architecturally sound approach.
Cheers, Xover
Moving the joins to the application layer definitely makes things quite complex compared to an SQL query.
Having a data lake or other solutions like the ones you mention makes it more feasible to do these kinds of joins with big data, but it also usually requires careful schema and index design when moving the data over for the queries to be performant. In these cases you would also lose the flexibility of arbitrarily querying the DB that the replicas currently provide, so in the end there would be a different set of tradeoffs. It is important to understand what things are truly not doable with existing tools and services, so that something like this can be considered for filling the gaps if necessary.
Currently the focus is keeping the replicas stable, maintainable and performant, so this work must happen soon.
I have incorporated some of the suggestions and info from the threads into the wiki page.
I would like to document specific code examples, especially of SQL cross-database joins and what migrating away from them would look like. If you have looked at this and made changes to your queries and code, it would be super helpful if you could point me to the repo/code so we can show real examples in the documentation.
If you can share real use cases like Huji Lee and MusikAnimal did, that is also very useful for discussion and getting help; thanks Brooke and ACN.
Huji, how do you run the queries? Do you have a Toolforge project with code where you query the DB? Do you use Quarry or maybe PAWS? The migration path is different depending on your workflow and skills, so that background information helps us provide suggestions. Like ACN mentioned, the answer varies a lot depending on context and use case; we can try to help you make the changes bit by bit. If you want to start a new thread with the specifics of your code, we should be able to help come to solutions.
On Wed, Nov 11, 2020 at 5:25 AM AntiCompositeNumber < anticompositenumber@gmail.com> wrote:
Most cross-db JOINs can be recreated using two queries and an external tool to filter the results. However, there are some queries that would be simply impractical due to the large amount of data involved, and the query for overlapping local and Commons images is one of them. There are basically two ways to recreate the query: re-implement the inner join or re-implement a semi-join subquery.
Recreating a JOIN is conceptually very simple: get two lists and compare them. However, there are 67,034 files on fawiki, 891,286 files on enwiki, and 65,559,375 files on Commons. Simply joining by name would be impossible -- MariaDB would time out a few hundred times before returning all that data, and even if it did, storing those lists even as efficiently as possible would be quite the memory hog. So the query would have to be paginated. The only common identifier we have is the file name, and since the letters in the names aren't evenly distributed, paginating wouldn't exactly be fun. The other option is implementing the Commons lookup like a semi-join subquery. Iterate over the local data, paginating any way you want. Then, for every item, query the Commons database for that title. Of course, we're now making a million requests to the database, which isn't going to be very fast simply due to network delays. We could be a little nicer and group a bunch of titles together in the query, which will probably get us down from a million queries to fifty thousand or so. Of course, this all gets more complicated if you want a query more complex than SELECT enwiki_p.img_title FROM enwiki_p.image JOIN commonswiki_p.image ON enwiki_p.img_title = commonswiki_p.img_title;
I understand the system engineering reasons for this change, but I think it's worth underscoring exactly how disruptive it will be for the queries that depended on this functionality. I'm certainly no expert, but I'm willing to help wrap queries in Python until they start working again.
ACN
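To make the second strategy above concrete, here is a minimal Python sketch of the paginated, batched semi-join that ACN describes, re-implemented in application code. It assumes the `toolforge` Python library's `connect()` helper for opening per-wiki replica connections; the page and batch sizes are hypothetical, and `img_name` is the title column of the image table. This is an illustration of the pattern under those assumptions, not a drop-in replacement for every cross-database join.

```python
import toolforge  # assumption: the Toolforge Python helper that opens pymysql connections to the replicas

PAGE = 10000   # hypothetical page size for reading the local wiki
BATCH = 500    # hypothetical number of titles per Commons lookup

def overlapping_titles(local_db="enwiki", shared_db="commonswiki"):
    """Re-implement a cross-database JOIN as paginated local reads plus batched lookups on Commons."""
    local = toolforge.connect(local_db)    # e.g. enwiki.analytics.db.svc.eqiad.wmflabs / enwiki_p
    shared = toolforge.connect(shared_db)  # separate connection for commonswiki_p

    last_seen = ""
    while True:
        with local.cursor() as cur:
            # Keyset pagination over the local image table; img_name is its title column.
            cur.execute(
                "SELECT img_name FROM image WHERE img_name > %s ORDER BY img_name LIMIT %s",
                (last_seen, PAGE),
            )
            titles = [row[0] for row in cur.fetchall()]
        if not titles:
            break
        last_seen = titles[-1]

        with shared.cursor() as cur:
            for i in range(0, len(titles), BATCH):
                chunk = titles[i:i + BATCH]
                placeholders = ",".join(["%s"] * len(chunk))
                cur.execute(
                    "SELECT img_name FROM image WHERE img_name IN (%s)" % placeholders,
                    chunk,
                )
                for (name,) in cur.fetchall():
                    yield name  # the title exists on both wikis
```

The same pattern works starting from whichever wiki has the smaller file list, which is usually cheaper; in the fawiki example above, iterating over fawiki's ~67,000 files and batch-checking Commons is far less work than the reverse.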
On Tue, Nov 10, 2020 at 8:48 PM Huji Lee huji.huji@gmail.com wrote:
Cross-wiki JOINS are used by some of the queries we run regularly for
fawiki. One of those queries looks for articles that don't have an image in their infobox on fawiki, but do have one on enwiki, so that we can use/import that image. Another one JOINs fawiki data with Commons data to look for redundant images. Yet another one looks for articles that use an image that doesn't exist (for cleanup purposes) but needs to join with the Commons db because the referenced file might exist there. Lastly, we have a report that looks for fair use images on fawiki that had the same name as an image on enwiki where the enwiki copy was deleted; this usually indicates an improper application of fair use, and enwiki -- due to its larger community -- finds and deletes these faster than we could on fawiki.
There may be other cases I am unaware of. The point is, losing the
cross-wiki JOIN capability can make some of the above tasks really difficult or completely impossible.
Yes, I have a Toolforge account and there are a bunch of cronjobs that run weekly (and a few that run daily).
The code can be found at https://github.com/PersianWikipedia/fawikibot/tree/master/HujiBot where stats.py is the program that actually connects to the DB, but weekly.py and weekly-slow.py are wrapper scripts in which you can find the SQL queries themselves. All queries initiate on fawiki_p so by simply searching those two python files for keywords like "commonswiki" or "enwiki" you will find all cross-wiki joins we currently have.
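For scripts like these, the first thing that changes is the connection setup: under the new architecture each database has to be opened against its own host instead of reusing a single fawiki connection for cross-wiki joins. Below is a minimal sketch, assuming plain PyMySQL and the standard Toolforge replica.my.cnf credentials file; the host naming follows the `<dbname>.analytics.db.svc.eqiad.wmflabs` pattern from the announcement.

```python
import os
import pymysql  # assumption: plain PyMySQL with the tool's replica.my.cnf credentials

def replica_connection(dbname):
    """Open a connection to one wiki's replica on its own host, as the redesign requires."""
    return pymysql.connect(
        host=f"{dbname}.analytics.db.svc.eqiad.wmflabs",          # one host per wiki
        db=f"{dbname}_p",                                         # only this wiki's DB is queryable here
        read_default_file=os.path.expanduser("~/replica.my.cnf"), # Toolforge credentials file
        charset="utf8mb4",
    )

# Instead of one fawiki_p connection serving cross-database JOINs,
# each wiki involved in a report gets its own connection:
fawiki = replica_connection("fawiki")
commons = replica_connection("commonswiki")
```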
Hi Joaquin,
On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:
TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign for more details.
If you only think of Wikipedia, not a lot will break probably, but if you take into account Commons and Wikidata a lot will break. A quick grep in my folder with Commons queries returns 123 lines with cross database joins. So yes, stuff will break and tools will be abandoned. This follows the practice that seems to have become standard for the WMF these days: Decisions are made with a small group within the WMF without any community involved. Only after the decision has been made, it's announced.
Unhappy and disappointed,
Maarten
Technically, cross-wiki joins aren't completely disallowed; you just have to make sure each of the db names is on the same slice/section, right?
~ MA
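For what it's worth, whether two wikis share a section can be checked from the replicas themselves: the meta_p.wiki table lists each database together with its slice. A minimal sketch, assuming the `toolforge` library's `connect()` helper, that `"meta"` resolves to the host serving meta_p, and the documented `dbname`/`slice` columns:

```python
import toolforge  # assumption: the Toolforge Python helper for replica connections

def same_section(db1="enwiki", db2="wikidatawiki"):
    """Return True if both wikis are served from the same replica section (slice),
    which is the remaining case where a cross-database JOIN can still work."""
    conn = toolforge.connect("meta")  # assumption: 'meta' resolves to the host that serves meta_p
    with conn.cursor() as cur:
        cur.execute(
            "SELECT dbname, slice FROM meta_p.wiki WHERE dbname IN (%s, %s)",
            (db1, db2),
        )
        slices = dict(cur.fetchall())
    return db1 in slices and db2 in slices and slices[db1] == slices[db2]
```

If the two wikis do not share a section, the query has to be split along the lines of the application-level approaches discussed earlier in the thread.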
On some level, the real issue here is that different wikis live on different slices (s1, s2, s3).
One possible solution is to replicate "shared" wikis (Wikidata and Commons) and possibly a few other "mother" wikis (at least En WP) into *every* slice. The use cases that need to join enwiktionary with fawikibooks are rare or nonexistent; we don't need every wiki database to be accessible on every server. But we certainly want some of them to be universally accessible.
Can we do that?
MusikAnimal is right, however, Wikidata and Commons either have a sui generis slice, or they share it with a few very large wikis. Tools that do any kind of crosswiki analysis would instantly break, as most of them utilise joining by Wikidata items at the very least.
I second Maarten here. This would mean a lot of things that currently require a (relatively simple) SQL query would need a full script, which would do the join at the application level.
I fully understand the reasoning, but there needs to be some replacement. Intentionally introducing breaking changes while providing no "new standard" is a bad pattern in a community environment.
Martin
I'll throw my hat in this too. Moving it to the application layer will make a number of queries just not feasible any longer. It might make sense from the administration side, but from the user perspective it breaks one of the biggest features that Toolforge has.
I recall some point in time (Toolserver maybe?) when all the slices (overview at https://tools-info.toolforge.org/?listmetap ) were on different servers, but the Commons slice (s4) was on every server. At some point new fancy database servers were introduced with all the slices on all servers. Having 6 servers, each one having a slice + s4 (Commons) + s8 (Wikidata), might be a good compromise.
+1 to Maarten
Another idea is to have the databases structured as planned, but add a server with *all* databases that would be slower/less stable, but would provide a solution for those who really need cross database joins.
Martin
Maarten: Having 6 servers with each one having a slice + s4 (Commons) +
s8 (Wikidata) might be a good compromise.
Martin: Another idea is to have the database structured as-planned, but
add a server with *all* databases that would be slower/less stable, but will provide a solution for those who really need cross database joins
From the point of view of a person who is using cross database joins on
both tools and analysis queries, I would say that both ideas would be suitable. I think that 90% of my crosswiki queries are written against *wiki + wikidata/commons. However, I would not say that it is only for those who really need it; rather, cross database joins are an awesome feature for everybody, and it is a loss if they are gone.
In older times we also had the ability to do joins between user databases and replica databases, which was removed in 2017 if I googled correctly.[1] My guess is that one reason for the increasing query complexity is that there is no possibility of creating tmp tables or joining to preselected data, so everything is done in single queries. In any case, if the solution is what Martin suggests, moving cross-joinable databases to a single server, and the original problem was that it was hard to keep multiple servers in sync, then we could reintroduce the user database joins as well.
[1] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server...
Br, -- Kimmo Virtanen, Zache
As a follow-up comment.
If I understand correctly the main problems are a) databases are growing too big to be stored in single instances and b) query complexity is growing.
a) The growth of the data is not going away, as the major drivers for the growth are automated edits from Wikidata and Structured Data on Commons. They are generating new data with increasing speed, faster than humans ever could. So the longer term answer is to store the data in separate instances and use something like federated queries. This is how access to the commonswiki replica was originally done when Toolserver moved to Tool Labs in 2014.[1] Another long term solution to make databases smaller is to replicate only the current state of the wikidata/commonswiki databases and leave out, for example, the revision history.
b) A major factor in query complexity, which affects the query execution times, is afaik the actor migration and the data sanitization which routes the queries through multiple views.[2,3] I have no idea how bad the problem currently is, but one could think that replication could be implemented with lighter sanitization by leaving some of the problematic data out of replication altogether.
Anyway, my question is: are there more detailed plans for the *Wiki Replicas 2020 Redesign* than what is on the wiki page[4] or the tickets linked from it? I guess there are, if the plan was to buy new hardware in October and we are now in the implementation phase? Also, is there information on the actual bottlenecks at the table level? I.e., which tables (in which databases) are too big, hard to keep up with in replication, and slow in terms of query time?
[1] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolser... ? [2] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R... [3] https://phabricator.wikimedia.org/T215445 [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
Br, -- Kimmo Virtanen, Zache
Hello, I actually welcome the change and am quite happy about it. It might break several tools (including some of mine), but as a database nerd, I can see the benefits outweighing the problems (and I wish the benefits had been communicated in the announcement).
The short version is that this change would make labs replicas blazing fast.
The long version: The database of every wiki is currently replicated to a set of giant "cloud" or "labs" replicas. IIRC, these dbs have 512GB of memory (which, while massive, is not big enough to hold everything); the space left for the InnoDB buffer pool should be around 350GB, and storing everything in there is impossible (the rest is needed for temporary tables and other critical functions), so I assume that when you query Quarry (sorry, I had to make the pun), most of it is actually coming from disk reads, which are ten times slower. Looking at the graphs (https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=labsdb1011&var-port=9104&from=now-7d&to=now), the InnoDB buffer pool efficiency for the labs dbs is around 99% (two nines), while for the production databases (similar hardware but split into eight different sections) it is 99.99% (four nines). This two-orders-of-magnitude difference is mostly because of cache locality, which I hope we would gain if these changes get done (unless the new hardware is commodity hardware instead of beefy servers, but I doubt that; correct me if I'm wrong). That means fewer timeouts, fewer slow apps and tools, etc. It's not just speed though: the updates coming in to the replicas would be split as well, so they wouldn't saturate the network, and there would be less heavy I/O in memory and on disk, meaning better scalability (adding commons/wikidata to each section would be the exact opposite of that, and even if we do it now, we would eventually have to pull the plug, as the wikis are growing and we are not at the size or growth speed we were years ago).
I understand it would break tools and queries, but I have a feeling that lots of them should already be split into multiple queries, or should read dumps instead, or are sometimes more of an x/y problem https://en.wikipedia.org/wiki/XY_problem
I think this is great and a big thank you for doing it.
I like the idea of dumps as an alternative too. But I think this should be a service that is offered via the WM Clouds. Some might remember me asking related questions on this very mailing list several months ago.
Having a DB called "latest_dump" which actually has the latest dump of all wikis would be tremendously helpful. Many cross-wiki queries can work off of several-days-old data.
On Sat, Nov 14, 2020 at 2:52 PM Amir Sarabadani ladsgroup@gmail.com wrote:
Hello, I actually welcome the change and am quite happy about it. It might break several tools (including some of mine) but as a database nerd, I can see the benefits outweighing the problems (and I wish benefits would have been communicated in the announcement).
The short version is that this change would make labs replicas blazing fast.
The long version: Database of all of wikis is currently being replicated to a set of giant "cloud" or "labs" replicas. IIRC correctly, these dbs have 512GB memory (while being massive is not big enough to hold everything), the space left for InnoDB Buffer pool should be around 350GB and storing everything in there is impossible (the rest would be for temporary tables and other critical functions), so I assume when you query quarry (sorry, I had to make the pun), most it is actually coming from reading disk which is ten times slower. Looking at graphs https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=13&orgId=1&var-server=labsdb1011&var-port=9104&from=now-7d&to=now, The Innodb buffer pool efficiency for labs dbs is around 99% (two nines), while production databases (similar hardware but split into eight different sections) is 99.99% (four nines), these two orders of magnitude difference is mostly because of cache locality which I hope we would achieve if these changes get done (unless the new hardware will be commodity hardware instead of beefy servers but I doubt that, correct me if I'm wrong). Meaning less timeouts, less slow apps and tools, etc. It's not just speed though, the updates coming in to replicas would be split too so it wouldn't saturate the network and less heavy I/O in memory and disk meaning better scalability (adding commons/wikidata on each section would be the exact opposite of that and even if we do it now, we eventually have to pull the plug as wikis are growing and we are not the same size or growth speed we used to be years ago).
I understand it would break tools and queries but I have a feeling that lots of them should be already split into multiple queries, or should read dumps instead or sometimes it's more of an x/y problem https://en.wikipedia.org/wiki/XY_problem
I think this is great and a big thank you for doing it.
On Fri, Nov 13, 2020 at 11:39 AM Kimmo Virtanen kimmo.virtanen@gmail.com wrote:
As a follow up comment.
If I understand correctly the main problems are a) databases are growing too big to be stored in single instances and b) query complexity is growing.
a) the growth of the data is not going away as the major drivers for the growth are automated edits from Wikidata and Structured data on Commons. They are generating new data with increasing speed faster than humans ever could. So the longer term answer is to store the data to separate instances and use something like federated queries. This is how the access to the commonwiki replica was originally done when toolserver moved to toollabs in 2014.[1] Another long term solution to make databases smaller is to replicate only the current state of the wikidata/commonswiki and leave for example the revision history out.
b) a major factor for query complexity which affects the query execution times is afaik the actor migration and the data sanitization which executes the queries through the multiple views.[2,3] I have no idea how bad the problem currently is, but one could think that replication could be implemented with lighter sanitation by leaving some of the problematic data out altogether from replication.
Anyway, my question is, are there more detailed plans for the *Wiki Replicas 2020 Redesign *than what is on the wikipage[4] or tickets linked from it? I guess there is if the plan is to buy new hardware in October and now we are in the implementation phase? Also is there information on the actual bottlenecks at table level? I.e., which tables (in which databases) are the too big ones, hard to keep up in replication and slow in terms of query time?
[1] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolser... ? [2] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R... [3] https://phabricator.wikimedia.org/T215445 [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
Br, -- Kimmo Virtanen, Zache
On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen kimmo.virtanen@gmail.com wrote:
Maarten: Having 6 servers with each one having a slice + s4 (Commons)
- s8 (Wikidata) might be a good compromise.
Martin: Another idea is to have the database structured as-planned,
but add a server with *all* databases that would be slower/less stable, but will provide a solution for those who really need cross database joins
From the point of view of a person who is using cross database joins on both tools and analysis queries I would say that both ideas would be suitable. I think that 90% of my crosswiki queries are written against *wiki + wikidata/commons. However, I would not say that it is only for those who really need it but more like that cross database joins are an awesome feature for everybody and it is a loss if it will be gone.
In older times we also had the ability to do joins between user databases and replica databases, which was removed in 2017 if I googled correctly.[1] My guess is that one reason for the increasing query complexity is that there is no way to create temporary tables or to join against preselected data, so everything has to be done in single queries. In any case, if the solution is what Martin suggests, moving the cross-joinable databases to a single server, and the original problem was that it was hard to keep multiple servers in sync, then we could reintroduce the user-database joins as well.
[1] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server...
Br, -- Kimmo Virtanen, Zache
On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec < martin.urbanec@wikimedia.cz> wrote:
+1 to Maarten
Another idea is to have the databases structured as planned, but add a server with *all* databases that would be slower/less stable, yet would provide a solution for those who really need cross-database joins.
Martin
On Fri, 13 Nov 2020 at 0:31, Maarten Dammers maarten@mdammers.nl wrote:
I recall some point in time (Toolserver maybe?) when all the slices (overview at https://tools-info.toolforge.org/?listmetap ) were on different servers, but the Commons slice (s4) was on every server. At some point new fancy database servers were introduced with all the slices on all servers. Having 6 servers, each one having a slice + s4 (Commons) + s8 (Wikidata), might be a good compromise.

On 12-11-2020 00:58, John wrote:
I’ll throw my hat in on this too. Moving it to the application layer will make a number of queries simply not feasible any longer. It might make sense from the administration side, but from the user perspective it breaks one of the biggest features that Toolforge has.
On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec < martin.urbanec@wikimedia.cz> wrote:
MusikAnimal is right; however, Wikidata and Commons either have a sui generis slice, or they share it with a few very large wikis. Tools that do any kind of cross-wiki analysis would instantly break, as most of them rely on joining by Wikidata items at the very least.

I second Maarten here. This would mean that a lot of things that currently require a (relatively simple) SQL query would need a full script, which would do the join at the application level.

I fully understand the reasoning, but there needs to be some replacement. Intentionally introducing breaking changes while providing no "new standard" is a bad pattern in a community environment.
Martin
On Wed, Nov 11, 2020, 10:31 PM MusikAnimal musikanimal@gmail.com wrote:
Technically, cross-wiki joins aren't completely disallowed, you just have to make sure each of the db names are on the same slice/section, right?

~ MA
-- Amir (he/him)
The InnoDB buffer pool efficiency for the labs DBs is around 99% (two nines), while for the production databases (similar hardware, but split into eight different sections) it is 99.99% (four nines). That difference of two orders of magnitude in the miss rate is mostly due to cache locality, which I hope we would gain if these changes get done.

It of course helps if there is more hardware and memory per database, but the same level of speed as in production is not really realistic if a major part of the query complexity comes from the data sanitization views, which will still be there.

It's not just speed, though: the updates coming in to the replicas would be split as well, so they wouldn't saturate the network, and there would be less heavy I/O in memory and on disk, meaning better scalability (adding commons/wikidata to each section would be the exact opposite of that).

Yes, replicating a 1:1 clone of commons/wikidata on each section is not useful. I would still, however, like to see the ability to run queries over multiple databases treated as an important feature.
Br, -- Kimmo Virtanen, Zache
On Sat, Nov 14, 2020 at 9:52 PM Amir Sarabadani ladsgroup@gmail.com wrote:
Hello, I actually welcome the change and am quite happy about it. It might break several tools (including some of mine) but as a database nerd, I can see the benefits outweighing the problems (and I wish benefits would have been communicated in the announcement).
The short version is that this change would make labs replicas blazing fast.
Kimmo, while I can't directly answer your question on bottlenecks, I will try and provide a little background information on existing issues for those who are new (like myself!).
Here's a recent example of replication issues with the current setup: https://lists.wikimedia.org/pipermail/cloud-admin/2020-September/000409.html https://lists.wikimedia.org/pipermail/cloud-admin/2020-October/000413.html
Replication lagged hours behind, and it's not the first instance of this occurring.
As per https://phabricator.wikimedia.org/T249188#6204681 capacity is full and it's not currently possible to upgrade as-is, despite the fact that Wiki Replicas are being affected by bugs in the current version. In addition, the current setup means any error recovery can take many days. See https://lists.wikimedia.org/pipermail/cloud-admin/2020-March/000387.html for further background information on historical issues.
If you'd rather see it in graphical form, you can look at the metrics directly:
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=labsd...
I hope this helps!
Hey MA, I've checked, and while not explicitly disallowed, the fact that this could work is more of an implementation detail that shouldn't really be relied on.
The sections, and which wiki databases are on them, are organized to keep the service maintainable, and are not supposed to be depended on, since they could change.

Even if the mappings are public and fairly stable, there could be a point where a change in the implementation/organization is made (like with this new architecture) and those in-section cross-DB joins would stop working.
On Wed, Nov 11, 2020 at 10:31 PM MusikAnimal musikanimal@gmail.com wrote:
Technically, cross-wiki joins aren't completely disallowed, you just have to make sure each of the db names are on the same slice/section, right?
~ MA
On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers maarten@mdammers.nl wrote:
Hi Joaquin,
On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:
TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign for more details.
If you only think of Wikipedia, not a lot will break probably, but if you take into account Commons and Wikidata a lot will break. A quick grep in my folder with Commons queries returns 123 lines with cross database joins. So yes, stuff will break and tools will be abandoned. This follows the practice that seems to have become standard for the WMF these days: Decisions are made with a small group within the WMF without any community involved. Only after the decision has been made, it's announced.
Unhappy and disappointed,
Maarten
Hello Joaquin!
I'm not saying I will blindly construct cross-wiki queries. Rather, I will only do it after fetching the db-lists to confirm which wikis can be queried cross-wiki. In the case of GUC and XTools Global Contribs, this could mean massive performance improvements. Allow me to paint a picture: we have an account attached to 100 wikis, and I want a list of all of its global edits, ordered chronologically. For day-to-day steward life this is essential, so I'd like to find the most efficient route possible, even if it's a little hacky :) So, going off of what we're doing now, my high-level vision would be:
1) Check the db-lists (or use a cached result).
2) Check CentralAuth to see which wikis the user has edits on. Here we find there are 100 wikis.
3) Cross-referencing the db-lists, I now know that 75 of the wikis I want to query are on s1, and 25 on s2.
4) For each wiki, I have a subquery to grab *all* edits by that user on that specific wiki within that slice (maybe also adding WHERE clauses for rev_timestamp, etc.).
5) Take each of those subqueries and wrap it like: (SELECT * FROM ( [subquery1] ) UNION ( [subquery2] ) … ) a ORDER BY rev_timestamp DESC LIMIT 50
6) Do the same for each of the other slices.
7) Combine the results from each slice and resequence the edits chronologically, stopping at 50 (the first page of edits to show to the user).
That doesn't sound like the most fun, but I think it would work (a rough sketch of the idea follows below). With the current 8 slices, it shouldn't slow things down too terribly (some slices will be faster than others).
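(To have something concrete to discuss, here is a rough sketch of steps 3-7 above; it is not MusikAnimal's actual code. It assumes the public dblist files on noc.wikimedia.org mirror the replica sections, that connecting to any wiki on a section reaches that section's host, and that the revision_userindex/actor views can be joined as shown; all of that should be verified against the real setup.)

    import heapq
    import itertools
    import urllib.request
    import pymysql

    SECTIONS = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]  # assumed section names

    def section_map():
        """Map dbname -> section by reading the public dblist files (an assumption)."""
        mapping = {}
        for s in SECTIONS:
            url = "https://noc.wikimedia.org/conf/dblists/%s.dblist" % s
            for line in urllib.request.urlopen(url).read().decode().splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    mapping[line] = s
        return mapping

    def section_edits(dbnames, username, limit=50):
        """One connection per section: UNION ALL the per-wiki subqueries, sort server-side."""
        host = "%s.analytics.db.svc.eqiad.wmflabs" % dbnames[0]  # any wiki on the section
        subqueries = [
            "(SELECT rev_timestamp, rev_id, '%s' AS dbname"
            " FROM %s_p.revision_userindex"
            " JOIN %s_p.actor ON rev_actor = actor_id"
            " WHERE actor_name = %%s ORDER BY rev_timestamp DESC LIMIT %d)" % (db, db, db, limit)
            for db in dbnames  # db names come from the dblists, not from user input
        ]
        sql = " UNION ALL ".join(subqueries) + " ORDER BY rev_timestamp DESC LIMIT %d" % limit
        conn = pymysql.connect(host=host, read_default_file="~/replica.my.cnf", charset="utf8mb4")
        try:
            with conn.cursor() as cur:
                cur.execute(sql, [username] * len(dbnames))
                return cur.fetchall()  # newest-first within this section
        finally:
            conn.close()

    def global_edits(attached_wikis, username, limit=50):
        smap = section_map()
        by_section = {}
        for db in attached_wikis:  # e.g. the wikis CentralAuth reports edits on
            if db in smap:
                by_section.setdefault(smap[db], []).append(db)
        batches = [section_edits(dbs, username, limit) for dbs in by_section.values()]
        # Merge the already-sorted per-section batches and keep only the newest `limit` rows.
        merged = heapq.merge(*batches, key=lambda row: row[0], reverse=True)
        return list(itertools.islice(merged, limit))

(The application only ever has to merge one already-sorted batch per section, so the client-side work stays small no matter how many wikis the account is attached to.)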
Are you discouraging this approach? If I *have* to open and use a separate connection to each of those 100 databases, regardless of the slice, the processing may become much slower. Let's move on to IPs, where we have to check *every* wiki: 900+ separate connections. Again, I'm not sure how I'd even get this set up locally, as presumably I'd need 900+ open SSH tunnels. Maybe a bash script? (One possible per-section workaround is sketched below.)
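(On the local-development point, a purely hypothetical sketch: open one tunnel per section instead of one per wiki, since every database on a section sits behind the same host. The representative wiki chosen for each section, the bastion host name and the local ports below are assumptions to adjust; the same thing could be a few lines of shell instead of Python.)

    import subprocess

    # Assumed mapping of sections to one representative wiki on each; adjust as needed.
    REPRESENTATIVE = {
        "s1": "enwiki", "s2": "bgwiki", "s3": "aawiki", "s4": "commonswiki",
        "s5": "dewiki", "s6": "frwiki", "s7": "eswiki", "s8": "wikidatawiki",
    }
    BASTION = "login.toolforge.org"  # assumes working SSH config/keys for Toolforge

    tunnels, local_ports = [], {}
    for i, (section, wiki) in enumerate(sorted(REPRESENTATIVE.items())):
        local_port = 3307 + i
        remote = "%s.analytics.db.svc.eqiad.wmflabs:3306" % wiki
        # ssh -N -L <local port>:<remote host>:<remote port> <bastion>
        tunnels.append(subprocess.Popen(["ssh", "-N", "-L", "%d:%s" % (local_port, remote), BASTION]))
        local_ports[section] = local_port

    # ...connect to 127.0.0.1:<local_ports[section]> while the tunnels are up...
    # for p in tunnels: p.terminate()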
I just want to make sure I've got this right before I start coding. In the end, hopefully I'll have a working strategy that I can share with others.
Thanks,
~ MA
Unrelated but important question (sorry to fragment this thread): what about the maximum number of connections imposed on a DB user? That limit counts each open connection, right? For example, we still occasionally hit it in XTools, meaning there are 30 open connections and no new connections can be opened. If this is true, I will surely need a major increase in the number of allowed connections for my DB user. Let's assume I'm able to make do with just 6 connections to access all DBs. That means only 5 people need to be running Global Contribs queries before the next user gets an error (give or take, depending on how fast the queries are and which connections are tied up, also taking into account use by other XTools features such as the Edit Counter). Surely you catch my drift. I suppose I'll find out when I get there; just sharing this thought ahead of time in case it hasn't been considered yet.
~ MA
Here is my understanding of the connections situation:
Currently the limit is 10 concurrent connections available to an account on an instance. Since the move is to a multi-instance architecture, you will have 10 concurrent connections per account per instance (in this case, 1 instance = 1 section).

So that effectively means you have 10 on enwiki (s1), 10 on wikidata (s8), 10 on commons (and testcommonswiki, in s4), 10 on the smaller wikis (s3 or s5), etc. This likely means more parallelism for some activities and fewer connection issues for many applications, if you happen to be querying DBs on separate instances.

In the case of GUC and XTools Global Contribs, since they hit every DB (and hence every instance), I think the effective limit will still be 10, so it will be something to keep in mind (one way of staying under it is sketched below). In the future, if you hit issues with the number of connections, we can think of ways to avoid them.
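(To illustrate staying under that cap, here is a small hypothetical sketch that limits how many replica connections a single tool process holds open per section at any one time. The cap of 5, the host pattern and the helper name are assumptions, not an official recommendation.)

    import threading
    from contextlib import contextmanager
    import pymysql

    PER_INSTANCE_CAP = 5   # stay well under the 10-connection per-account, per-instance limit
    _section_slots = {}    # section -> Semaphore guarding that section's connections
    _slots_lock = threading.Lock()

    @contextmanager
    def replica_connection(section, dbname):
        """Open a connection to dbname, holding one of the section's limited slots."""
        with _slots_lock:
            sem = _section_slots.setdefault(section, threading.Semaphore(PER_INSTANCE_CAP))
        with sem:  # blocks if this process already holds PER_INSTANCE_CAP connections here
            conn = pymysql.connect(
                host="%s.analytics.db.svc.eqiad.wmflabs" % dbname,
                db="%s_p" % dbname,
                read_default_file="~/replica.my.cnf",
            )
            try:
                yield conn
            finally:
                conn.close()

    # Usage: with replica_connection("s1", "enwiki") as conn: ...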
Hey MA,
I personally think that, given your knowledge and experience and what GUC and XTools Global Contribs do, your approach of using those implementation details to get better performance makes sense. The outline you present is very clear and seems reasonable to me. You also mention programmatically reading the sections and db-lists, which will make the implementation more resilient to changes.
Still though, most tools shouldn't care about these and it is better if they do not rely on them to avoid future headaches. I think as a rule of thumb, relying on implementation details should be avoided by most developers.
Does that make sense?
Hi Maarten,
I believe this work started many years ago, was paused, and was recently restarted because of the stability and performance problems of the last few years. Breaking changes are always painful; in the case of the replicas, I think the changes follow the recommendations laid out years ago. As far as I could see in the docs, connection reuse and cross-DB joins are not documented or advertised. The fact that they work is an implementation detail that has been useful, but with the amount of data we now have it makes the service unstable, slow, and very hard to maintain. For example, people often report issues when looking at replag https://replag.toolforge.org/, and here https://phabricator.wikimedia.org/search/query/vzOgtuG0eo.n/#R are some examples of recent instability and crashes due to the current architecture and usage.
I'm sorry about the extra work this will cause. I hope the improved stability and performance will make it worth it for you, and that you will reconsider and migrate your code to work on the new architecture (or reach out for specific help if you need it). Your experience and examples would be very helpful for other developers in the community.
... As far as I could see in the docs, connection reusing and cross-DB joins are not documented or advertised ...
Not sure what you are talking about. Cross-DB joins were key features of Toolserver,[1] which Wikimedia Labs replaced in 2012-2014, and they were not just initial features of the Wikimedia Labs DB replicas but mandatory features for the transition, as tools at the time depended on them.[2] There also used to be some documentation written by WMF tech staff on the wiki, based on the initial configuration ticket.[3]
[1] https://meta.wikimedia.org/wiki/Toolserver [2] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolser... ? [3] https://static-bugzilla.wikimedia.org/show_bug.cgi?id=57876
Br, -- Kimmo Virtanen, Zache
Hi Joaquin,
On 16-11-2020 21:42, Joaquin Oltra Hernandez wrote:
Hi Maarten,
I believe this work started many years ago, and it was paused, and recently restarted because of the stability and performance problems in the last years.
You do realize the current setup was announced as new 3 years ago? See https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_server... .
I'm sorry about the extra work this will cause, I hope the improved stability and performance will make it worth it for you, and that you will reconsider and migrate your code to work on the new architecture (or reach out for specific help if you need it).
No, saying sorry won't make it right and no, it won't make it worth it for me. If I want very stable access to a single wiki, I'll use the API of that wiki.
-- Joaquin Oltra Hernandez Developer Advocate - Wikimedia Foundation
It currently doesn't really feel to me that you're advocating for the developers; it feels more like you're the unlucky person having to sell bad WMF management decisions to angry developers.
Maarten
So I think there is something here. Different people have different needs; so far, the number one need for the wiki replicas has been from those who want underlying access to an "almost real time" copy of the internal database structure, as is. This is based on the fact that latency is the most common complaint regarding the wiki replicas.

The issue is that there are 4 properties we can play with:
#1 Having a complete dataset
#2 Having the data updated as promptly as in production
#3 Continuing to use the same API and SQL syntax, for backwards compatibility
#4 Being able to query everything at the same time (data lake)

With the current technology used by the wiki replicas, and the growth experienced in the last years, one has to sacrifice one of the above. Since 2013, Wikidata's and Commons' popularity has exploded, in addition to each edit carrying more features and more data. The natural decision is to keep #1, #2 and #3 and sacrifice #4, especially because doing so will also reduce latency as an unintended consequence.

That doesn't mean that #4 is impossible, but it would need one (probably more than one) of the following:
a) being precise about what subset of the data needs to be consolidated (e.g. only some tables exposed)
b) loading static dumps that are not updated in real time (e.g. only once a month)
c) stopping using MySQL/InnoDB and using an OLAP engine, such as column-based storage or something more analytic-y

Keeping the current technology is the easiest path to achieving #1, #2 and #3 in the short term, but the data size and load make #4 impossible: it no longer "fits" on a single DB with MySQL/MariaDB. But I think that if someone had a concrete proposal to achieve #4 on a separate service and provided feedback, people would listen. For example, I have thought about proposing to set up an analytics engine, loaded every week or every month from backups with a subset of the data, but it would need people providing feedback on what data would be useful to expose (e.g. the previous email about fawiki and enwiki image usage).

I propose opening a ticket on Phabricator to discuss the architecture and technical solutions, if you think that would be productive, where more people (and not just me) can express interest in moving it forward.

PS: The federated approach of the old tools DB didn't work well back then, and won't work well now, especially with such large tables, and it has big security implications.
I took a look at converting the query used for GreenC Bot's Job 10, which tracks enwiki files that "shadow" a different file on Commons. It is currently run daily, and the query executes in about 60-90 seconds. I tried three methods to recreate that query without a SQL cross-database join. The naive method of "just give me all the files" didn't work because it timed out somewhere. The paginated version of that query was on track to take over 5 hours to complete. A similar method that emulates a subquery instead of a join was projected to take about 6 hours. Both stopped early because I got bored of watching them and PAWS doesn't work unattended. I also wasn't able to properly test them because people kept fixing the shadowed files before the script got to them. The code is at https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ipynb.
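(For anyone else experimenting with the same conversion, the "emulate a subquery" variant might look roughly like the sketch below: pull the local enwiki file names once, then probe the Commons image table in batches with IN (...). This is not the PAWS notebook linked above; the batch size, host pattern and credentials file are assumptions.)

    import pymysql

    def connect(dbname):
        return pymysql.connect(
            host="%s.analytics.db.svc.eqiad.wmflabs" % dbname,
            db="%s_p" % dbname,
            read_default_file="~/replica.my.cnf",
            charset="utf8mb4",
        )

    def enwiki_file_names():
        conn = connect("enwiki")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT img_name FROM image")  # all locally uploaded enwiki files
                return [row[0] for row in cur.fetchall()]
        finally:
            conn.close()

    def shadowed(names, batch_size=500):
        """Return the enwiki file names that also exist on Commons, checked in batches."""
        conn = connect("commonswiki")
        hits = []
        try:
            with conn.cursor() as cur:
                for i in range(0, len(names), batch_size):
                    batch = names[i:i + batch_size]
                    placeholders = ",".join(["%s"] * len(batch))
                    cur.execute(
                        "SELECT img_name FROM image WHERE img_name IN (%s)" % placeholders,
                        batch,
                    )
                    hits.extend(row[0] for row in cur.fetchall())
        finally:
            conn.close()
        return hits

    print(len(shadowed(enwiki_file_names())), "enwiki files shadow a Commons file")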
ACN
ACN: Thanks! We’ve created a ticket for that one to help collaborate and surface the process here: https://phabricator.wikimedia.org/T267992 Anybody working on that, please add info there.
Brooke Storm, Staff SRE, Wikimedia Cloud Services, bstorm@wikimedia.org, IRC: bstorm
Hello, Actually, Jaime's email gave me an idea. Why not have a separate, actual data lake? Something like a Hadoop cluster; it could even take its data from the Analytics cluster (after being sanitized, of course). I remember there were some discussions about having a Hadoop or Presto cluster in WM Cloud.
Has this been considered?
Thanks.
Amir, in case you hadn't seen it, your memory is correct. This was considered in the past. See https://phabricator.wikimedia.org/T215858#6631859.
Hello again! Thinking about this more, I'm wondering if it makes sense to have a tool to assist with parsing the dblists at noc.wikimedia.org. I know the official recommendation is not to connect to slices, but the issue is how to work locally. I alone maintain many tools that are capable of connecting to any database. I have a single bash alias I use to set up my SSH tunnel. When I start a new tool, I just give it 127.0.0.1 as the host and 4711 as the port number. Easy peasy. I can't imagine trying to instruct a newbie how to contribute to tool Foo (which requires a tunnel to enwiki on port 1234), and tool Bar (tunnel to frwiki on port 5678), etc., etc. Perhaps it's best to establish a standard system for developers working locally? For the truly "global" tools like I talked about before, we have to use slices, and though they may not change much, it's a lot of work to check the dblists manually.
So my thought is that this tool could do two things:
1) A webservice with a form where you enter your username, local MySQL port, and whether you want to use the analytics or web replicas. After submitting, it prints the necessary command, something like: ssh -L 4711:s1.web.db.svc.eqiad.wmflabs:3306 -L 4712:s2.web.db.svc.eqiad.wmflabs:3306 … username@login.toolforge.org
2) A public API for tools to use to get the slice, given a database name.
For both, it goes by the dblists at noc.wikimedia.org (with some caching to improve response time).
So in the README for *my* tool, I tell the developer to go to the above to get the command they should use to set up the local SSH tunnel. The README file could even link to a pre-filled form to ensure the port numbers align with what that tool expects. This way, the developer doesn't even need to add port numbers and whatnot to a .env file or what have you, since my tool would go by what that service outputs (though you could provide a means to override this, in the event the developer has other things running on those ports). Hopefully what I'm saying makes sense.
Is this a stupid idea? I might go ahead and build a tool for the #2 use case, at least, because right now I will have to reinvent the wheel for at least three "global" tools that I maintain. We could also consider adding this logic to libraries, such as ToolforgeBundle (https://github.com/wikimedia/ToolforgeBundle), which is for PHP/Symfony apps running on Cloud Services.
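For what it's worth, the #2 use case (plus the command generation from #1) could look something like the rough Python sketch below. It assumes the per-slice dblists are published at noc.wikimedia.org under conf/dblists/ (e.g. s1.dblist through s8.dblist) and that the slice/host naming matches the examples above; a real service would add caching and error handling.

    import sys
    import requests

    # Assumed location and format of the dblists on noc (one db name per line,
    # "#" for comments); verify before relying on it.
    DBLIST_URL = "https://noc.wikimedia.org/conf/dblists/{name}.dblist"
    SLICES = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]

    def db_to_slice():
        """Build a {dbname: slice} map from the published dblists."""
        mapping = {}
        for s in SLICES:
            resp = requests.get(DBLIST_URL.format(name=s))
            resp.raise_for_status()
            for line in resp.text.splitlines():
                line = line.strip()
                if line and not line.startswith("#"):
                    mapping[line] = s
        return mapping

    def tunnel_command(user, dbs, cluster="analytics", base_port=4711):
        """Print one ssh command forwarding a local port for each slice needed."""
        mapping = db_to_slice()
        slices = sorted({mapping[db] for db in dbs})
        forwards = [
            "-L {}:{}.{}.db.svc.eqiad.wmflabs:3306".format(base_port + i, s, cluster)
            for i, s in enumerate(slices)
        ]
        print("ssh " + " ".join(forwards) + " " + user + "@login.toolforge.org")

    if __name__ == "__main__":
        # Example: python tunnel.py musikanimal enwiki frwiki wikidatawiki
        tunnel_command(sys.argv[1], sys.argv[2:])

The same db_to_slice() logic is what a public API (or a library like ToolforgeBundle) could wrap, so tools and READMEs could point at it instead of hard-coding ports.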
~ MA
FYI, this isn't for Cloud Services, but we've got something sorta similar for the internal analytics replicas.
https://github.com/wikimedia/analytics-refinery/blob/master/bin/analytics-my... https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/...
Hi everyone, here is an update before Christmas season:
We are diligently working on T260389: Redesign and rebuild the wikireplicas service using a multi-instance architecture (https://phabricator.wikimedia.org/T260389). We are running a bit behind the timeline, so early adopter testing will happen in January, and we expect to have the new cluster ready in January as well.
We have collected, and continue to discuss, all the feedback, and are trying to get a clear idea of the use cases that won't be supported by the new architecture. We have T215858 (https://phabricator.wikimedia.org/T215858) as a follow-up task to the new architecture, and some tasks for impacted tools and use cases: T267992 (https://phabricator.wikimedia.org/T267992), T268240 (https://phabricator.wikimedia.org/T268240), T268242 (https://phabricator.wikimedia.org/T268242), T268244 (https://phabricator.wikimedia.org/T268244).
If you know about use cases or tools that require cross-joins please comment on the task, make a subtask, or reach out to me.
We published an update in Tech News 2020/49 (https://meta.wikimedia.org/wiki/Tech/News/2020/49) at the end of November, and are looking for feedback in T268498: Feedback from Quarry and PAWS users, and other wiki editors affected by the new architecture (https://phabricator.wikimedia.org/T268498). If you know of Quarry or PAWS users who use the replicas with cross-database joins, please direct them to the task to post about how they use them.
Thanks to everyone who discussed, created tasks with their use cases, and helped try out migration paths.
On Tue, Nov 10, 2020 at 9:26 PM Joaquin Oltra Hernandez <jhernandez@wikimedia.org> wrote:
TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign for more details.