Toolforge SQL performance issue with new comment table

List overview All Threads
Download

newer

older

[Cloud-announce] New views on Wiki...

excessive lagging

Jesse Plamondon-Willard

3 Jun 2019 3 Jun '19

1:42 a.m.

Hello, Some Toolforge SQL queries have much worse performance when updated for the new comment table. For example, see one previous SQL query <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT%0D%0A++++++++++++++++++++user_id%2C%0D%0A++++++++++++++++++++user_registration%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28user_registration%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+registration%2C%0D%0A++++++++++++++++++++user_editcount%2C%0D%0A++++++++++++++++++++GROUP_CONCAT%28ug_group+SEPARATOR+%22%2C+%22%29+AS+user_groups%2C%0D%0A++++++++++++++++++++ipb_by_text%2C%0D%0A++++++++++++++++++++%27%27+AS+ipb_reason%2C+--+column+deleted%0D%0A++++++++++++++++++++DATE_FORMAT%28ipb_timestamp%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+ipb_timestamp%2C%0D%0A++++++++++++++++++++ipb_deleted%2C%0D%0A++++++++++++++++++++COALESCE%28DATE_FORMAT%28ipb_expiry%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29%2C+ipb_expiry%29+AS+ipb_expiry%0D%0A++++++++++++++++FROM%0D%0A++++++++++++++++++++user%0D%0A++++++++++++++++++++LEFT+JOIN+user_groups+ON+user_id+%3D+ug_user%0D%0A++++++++++++++++++++LEFT+JOIN+ipblocks_ipindex+ON+user_id+%3D+ipb_user%0D%0A++++++++++++++++WHERE+user_name+%3D+%27Pathoschild%27%0D%0A++++++++++++++++LIMIT+1> and its updated version <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT%0D%0A++++++++++++++++++++user_id%2C%0D%0A++++++++++++++++++++user_registration%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28user_registration%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+registration%2C%0D%0A++++++++++++++++++++user_editcount%2C%0D%0A++++++++++++++++++++GROUP_CONCAT%28ug_group+SEPARATOR+%22%2C+%22%29+AS+user_groups%2C%0D%0A++++++++++++++++++++ipb_by_text%2C%0D%0A++++++++++++++++++++comment_text+AS+ipb_reason%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28ipb_timestamp%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+ipb_timestamp%2C%0D%0A++++++++++++++++++++ipb_deleted%2C%0D%0A++++++++++++++++++++COALESCE%28DATE_FORMAT%28ipb_expiry%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29%2C+ipb_expiry%29+AS+ipb_expiry%0D%0A++++++++++++++++FROM%0D%0A++++++++++++++++++++user%0D%0A++++++++++++++++++++LEFT+JOIN+user_groups+ON+user_id+%3D+ug_user%0D%0A++++++++++++++++++++LEFT+JOIN+ipblocks_ipindex+ON+user_id+%3D+ipb_user%0D%0A++++++++++++++++++++LEFT+JOIN+comment+ON+ipb_reason_id+%3D+comment_id%0D%0A++++++++++++++++WHERE+user_name+%3D+%27Pathoschild%27%0D%0A++++++++++++++++LIMIT+1> which just adds a join on the comment table. The change adds ten dependent subqueries and increases Stalktoy load times from ≈15 seconds to ≈245 seconds, and the same change for another query was enough to have my tools temporarily rate-limited <https://phabricator.wikimedia.org/T217853>. I haven't found a way to update those queries efficiently. Is there an optimisation I'm missing? Why does the comment view need subqueries on ten other tables, and is there an alternate version without the subqueries for cases where we just need to join by ID? -- Jesse Plamondon-Willard (Pathoschild)

Attachments:

attachment.htm (text/html — 3.2 KB)

Show replies by date

Amir Sarabadani

3 Jun 3 Jun

1:54 a.m.

Hey, One important optimization you can use and it's often missed out (and it's going to be needed more as we normalize more tables) is join decomposition. It basically means you don't join and query but do two (or several) queries separately in your code. This might seem counter intuitive but it's pretty useful for several reasons (like MySQL will cache the normalizing table and answer faster or you yourself can cache some parts). You can read more about those in "High Performance MySQL" book. HTH On Mon, Jun 3, 2019 at 3:43 AM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

...

Hello, Some Toolforge SQL queries have much worse performance when updated for the new comment table. For example, see one previous SQL query <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT%0D%0A++++++++++++++++++++user_id%2C%0D%0A++++++++++++++++++++user_registration%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28user_registration%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+registration%2C%0D%0A++++++++++++++++++++user_editcount%2C%0D%0A++++++++++++++++++++GROUP_CONCAT%28ug_group+SEPARATOR+%22%2C+%22%29+AS+user_groups%2C%0D%0A++++++++++++++++++++ipb_by_text%2C%0D%0A++++++++++++++++++++%27%27+AS+ipb_reason%2C+--+column+deleted%0D%0A++++++++++++++++++++DATE_FORMAT%28ipb_timestamp%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+ipb_timestamp%2C%0D%0A++++++++++++++++++++ipb_deleted%2C%0D%0A++++++++++++++++++++COALESCE%28DATE_FORMAT%28ipb_expiry%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29%2C+ipb_expiry%29+AS+ipb_expiry%0D%0A++++++++++++++++FROM%0D%0A++++++++++++++++++++user%0D%0A++++++++++++++++++++LEFT+JOIN+user_groups+ON+user_id+%3D+ug_user%0D%0A++++++++++++++++++++LEFT+JOIN+ipblocks_ipindex+ON+user_id+%3D+ipb_user%0D%0A++++++++++++++++WHERE+user_name+%3D+%27Pathoschild%27%0D%0A++++++++++++++++LIMIT+1> and its updated version <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT%0D%0A++++++++++++++++++++user_id%2C%0D%0A++++++++++++++++++++user_registration%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28user_registration%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+registration%2C%0D%0A++++++++++++++++++++user_editcount%2C%0D%0A++++++++++++++++++++GROUP_CONCAT%28ug_group+SEPARATOR+%22%2C+%22%29+AS+user_groups%2C%0D%0A++++++++++++++++++++ipb_by_text%2C%0D%0A++++++++++++++++++++comment_text+AS+ipb_reason%2C%0D%0A++++++++++++++++++++DATE_FORMAT%28ipb_timestamp%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+ipb_timestamp%2C%0D%0A++++++++++++++++++++ipb_deleted%2C%0D%0A++++++++++++++++++++COALESCE%28DATE_FORMAT%28ipb_expiry%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29%2C+ipb_expiry%29+AS+ipb_expiry%0D%0A++++++++++++++++FROM%0D%0A++++++++++++++++++++user%0D%0A++++++++++++++++++++LEFT+JOIN+user_groups+ON+user_id+%3D+ug_user%0D%0A++++++++++++++++++++LEFT+JOIN+ipblocks_ipindex+ON+user_id+%3D+ipb_user%0D%0A++++++++++++++++++++LEFT+JOIN+comment+ON+ipb_reason_id+%3D+comment_id%0D%0A++++++++++++++++WHERE+user_name+%3D+%27Pathoschild%27%0D%0A++++++++++++++++LIMIT+1> which just adds a join on the comment table. The change adds ten dependent subqueries and increases Stalktoy load times from ≈15 seconds to ≈245 seconds, and the same change for another query was enough to have my tools temporarily rate-limited <https://phabricator.wikimedia.org/T217853> . I haven't found a way to update those queries efficiently. Is there an optimisation I'm missing? Why does the comment view need subqueries on ten other tables, and is there an alternate version without the subqueries for cases where we just need to join by ID? -- Jesse Plamondon-Willard (Pathoschild) _______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

-- Amir (he/him)

Jesse Plamondon-Willard

3:09 a.m.

Thanks! Performance is much better with that change. On Sun, 2 Jun 2019 at 21:54, Amir Sarabadani <ladsgroup(a)gmail.com> wrote:

...

Huji Lee

6:50 p.m.

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference? On Sun, Jun 2, 2019 at 11:09 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

...

Thanks! Performance is much better with that change. On Sun, 2 Jun 2019 at 21:54, Amir Sarabadani <ladsgroup(a)gmail.com> wrote:

_______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

Jesse Plamondon-Willard

7:38 p.m.

Sure! I just fetch the comment ID in the original query ([1]), and fetch the comment text with a separate query ([2]). You need *n*+1 queries based on the number of records, but the performance is adequate for my use cases. [1]: main query <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT%0D%0A++++user_id%2C%0D%0A++++user_registration%2C%0D%0A++++DATE_FORMAT%28user_registration%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+registration%2C%0D%0A++++user_editcount%2C%0D%0A++++GROUP_CONCAT%28ug_group+SEPARATOR+%22%2C+%22%29+AS+user_groups%2C%0D%0A++++ipb_by_text%2C%0D%0A++++ipb_reason_id%2C%0D%0A++++DATE_FORMAT%28ipb_timestamp%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29+AS+ipb_timestamp%2C%0D%0A++++ipb_deleted%2C%0D%0A++++COALESCE%28DATE_FORMAT%28ipb_expiry%2C+%22%25Y-%25m-%25d+%25H%3A%25i%22%29%2C+ipb_expiry%29+AS+ipb_expiry%0D%0AFROM%0D%0A++++user%0D%0A++++LEFT+JOIN+user_groups+ON+user_id+%3D+ug_user%0D%0A++++LEFT+JOIN+ipblocks_ipindex+ON+user_id+%3D+ipb_user%0D%0AWHERE+user_name+%3D+%27Pathoschild%27%0D%0ALIMIT+1> [2]: secondary query to fetch comment text <https://tools.wmflabs.org/sql-optimizer?use=enwiki_p&sql=SELECT+comment_text%0D%0AFROM+comment%0D%0AWHERE+comment_id+%3D+1000%0D%0ALIMIT+1> On Mon, 3 Jun 2019 at 14:51, Huji Lee <huji.huji(a)gmail.com> wrote:

...

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

John

7:58 p.m.

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

...

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

Chico Venancio

8:43 p.m.

Hey John, What is the specific normalization you believe is unneeded? Chico Venancio Em seg, 3 de jun de 2019 16:58, John <phoenixoverride(a)gmail.com> escreveu:

...

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

John

8:50 p.m.

With the introduction of the actor and comment tables, I have seen a notable spike in query times. Take a look at Jessie’s query. A query that used to take a couple of seconds now takes almost 5 minutes. It also spanned what 8-9 different joins now? In database design you should almost never have perfect normalization due to performance issues. I know this point will be ignored, but it’s at least worth mentioning. On Mon, Jun 3, 2019 at 4:44 PM Chico Venancio <chicocvenancio(a)gmail.com> wrote:

...

Hey John, What is the specific normalization you believe is unneeded? Chico Venancio Em seg, 3 de jun de 2019 16:58, John <phoenixoverride(a)gmail.com> escreveu:

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

Daniel Kinzler

4 Jun 4 Jun

2:27 p.m.

Am 03.06.19 um 22:50 schrieb John:

...

The MediaWiki schema is full of denormalization for performance reasons. Factoring comments and actors out of the revision table adds quite a bit of complexity, especially for the migration phase - you can be sure that this was not done because it looks nicer in an ER diagram. This change was driven by an urgent need to reduce storage requirements of the revision table, as well as a need to store additional information (machine readable records along with human readable comments, representations of different kinds of anonymous or remote users, etc). Prior discussion, for reference: * https://phabricator.wikimedia.org/T161671 * https://phabricator.wikimedia.org/T167246 * https://phabricator.wikimedia.org/T153333 * https://phabricator.wikimedia.org/T6714 * https://phabricator.wikimedia.org/T95144 Wikimedia is running one of the top ten websites on about 1% of the budget other companies "up there" have for the job. We are *very* keen on optimizing for performance in terms of request time, but also have to optimize for performance in terms of replication bandwidth, the time it takes to apply schema changes to existing tables, how much hardware we can maintain, etc. This normalization was done because after long discussion, it was found to be the optimal solution in terms of overall performance of the website, considering all factors. Otherwise, it would not have been done. Of course, what is optimal for the Wikimedia sites may not be optimal for things on ToolForge. It's impossible to optimize a schema for arbitrary queries, we can only optimize for the ones that have most impact on most requests. If you are interested in how such decisions are made, I invite you all to participate in the RFC process <https://www.mediawiki.org/wiki/Requests_for_comment/Process>. New RFCs, IRC discussions, last calls for comments, and similar information is published in the weekly TechCom radar email on wikitech-l. -- Daniel Kinzler Principal Software Engineer, Core Platform Wikimedia Foundation

Alex Monk

3 Jun 3 Jun

8:54 p.m.

I believe John is trying to make the implication that the comment table was created purely for the sake of normalisation, but I believe it came out of the concerns around the size of the revision table: https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_ta… On Mon, 3 Jun 2019 at 21:44, Chico Venancio <chicocvenancio(a)gmail.com> wrote:

...

Hey John, What is the specific normalization you believe is unneeded? Chico Venancio Em seg, 3 de jun de 2019 16:58, John <phoenixoverride(a)gmail.com> escreveu:

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

Amir Sarabadani

8:50 p.m.

I guarantee you, mediawiki core is everything but "over-normalized". We haven't done anything yet. In WMCS it's slower for reasons mentioned above, in production it's fast. Also, as I mentioned about "join decomposition", please read. Best regards On Mon, Jun 3, 2019, 21:58 John <phoenixoverride(a)gmail.com> wrote:

...

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

John

8:55 p.m.

Please don’t be patronizing insulting. What did read. What used to be one fairly quick query will now be thousands of decoupled queries. I know a number of cases where the decoupled query count can get into the hundreds of thousands. So what used to take 1 query and <2 minutes is now going to hammer the database with ~300k queries and additional logic on the programming side. On Mon, Jun 3, 2019 at 4:50 PM Amir Sarabadani <ladsgroup(a)gmail.com> wrote:

...

This is a prime example of over normalization. We are normalizing and making things slower. On Mon, Jun 3, 2019 at 3:38 PM Jesse Plamondon-Willard < pathoschild(a)gmail.com> wrote:

Out of curiosity, Jesse, can you share the query after that change, so it is all stored in this thread for future reference?

Bryan Davis

11:23 p.m.

On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverride(a)gmail.com> wrote:

...

The Wiki Replica databases are going though some tough changes for "analytic" queries right now. By analytic, I mean the sort of queries that tools want/need to do where the data generated by the wikis is being combed through to find correlations that are not needed to support the "operational" queries that are done by the MediaWiki software. The database industry terms OLTP (online transactional processing) [0] and OLAP (online analytic processing) [1] are used to describe these related, but very different use cases. Folks involved in MediaWiki core work have been making a lot of changes to the OLTP schema used by all wikis to make it more efficient in storage size and to reduce data duplication. These changes are very much needed to keep the wikis the size of the larger Wikimedia wikis performant as the amount of pages grows. Unfortunately these same changes are making many OLAP queries more expensive. The DBA and Cloud Services teams will try to do what they can to minimize the amount of additional slowness that is introduced, but there are going to be limits to what we can accomplish. Today the best advice I can give is to ask for help looking at queries which are too slow to work at all for the tools that need them. This mailing list, the #wikimedia-cloud Freenode IRC channel, or Phabricator are the best places to ask today. Some queries might be easy to make faster, others sadly might be as good as they can get and need other more complicated changes like splitting the query into smaller batches of work in various ways. The Technical Engagement team is interested in starting a project sometime in the next fiscal year (July 2019 - June 2020) to gather requirements for an OLAP database schema that would be easier for tools to use. Once we have that we would move into a phase of trying to design a ELT (extract, transform, load) pipeline that would convert the OLTP schema that MediaWiki has (what you see on the Wiki Replica hosts today) into that OLAP schema. This part is likely to be tricky to do while still keeping the lag from the production databases reasonably low and the security of the data high. This project is all speculative at this point however. I would like to see it happen, but it will take help from a lot of folks and a fairly large investment in time and probably hardware as well. I can't give any timeline for completion or honestly even promise that it will eventually happen yet. I can tell you however that we know this is a growing problem, and that there are people who would like to help make it better. [0]: https://en.wikipedia.org/wiki/Online_transaction_processing [1]: https://en.wikipedia.org/wiki/Online_analytical_processing [2]: https://en.wikipedia.org/wiki/Extract,_transform,_load Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855

Huji Lee

4 Jun 4 Jun

1:10 a.m.

It may help to put things in context and remind everyone that among the most visited websites in the world (of which, Wikipedia is #5 per Alexa [0]) the vast majority do not even use a relational database, making real-time large-scale analytic queries something that is not easily available to the uninitiated or the third-party query writer. The fact that we can run a lot of these analytical queries using near-real-time data is an exceptional and amazing feat. PS: Bryan misspelled ETL by accident :) [0]: https://www.alexa.com/topsites On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis <bd808(a)wikimedia.org> wrote:

...

On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverride(a)gmail.com> wrote:

Please don’t be patronizing insulting. What did read. What used to be

one fairly quick query will now be thousands of decoupled queries. I know a number of cases where the decoupled query count can get into the hundreds of thousands. So what used to take 1 query and <2 minutes is now going to hammer the database with ~300k queries and additional logic on the programming side. The Wiki Replica databases are going though some tough changes for "analytic" queries right now. By analytic, I mean the sort of queries that tools want/need to do where the data generated by the wikis is being combed through to find correlations that are not needed to support the "operational" queries that are done by the MediaWiki software. The database industry terms OLTP (online transactional processing) [0] and OLAP (online analytic processing) [1] are used to describe these related, but very different use cases. Folks involved in MediaWiki core work have been making a lot of changes to the OLTP schema used by all wikis to make it more efficient in storage size and to reduce data duplication. These changes are very much needed to keep the wikis the size of the larger Wikimedia wikis performant as the amount of pages grows. Unfortunately these same changes are making many OLAP queries more expensive. The DBA and Cloud Services teams will try to do what they can to minimize the amount of additional slowness that is introduced, but there are going to be limits to what we can accomplish. Today the best advice I can give is to ask for help looking at queries which are too slow to work at all for the tools that need them. This mailing list, the #wikimedia-cloud Freenode IRC channel, or Phabricator are the best places to ask today. Some queries might be easy to make faster, others sadly might be as good as they can get and need other more complicated changes like splitting the query into smaller batches of work in various ways. The Technical Engagement team is interested in starting a project sometime in the next fiscal year (July 2019 - June 2020) to gather requirements for an OLAP database schema that would be easier for tools to use. Once we have that we would move into a phase of trying to design a ELT (extract, transform, load) pipeline that would convert the OLTP schema that MediaWiki has (what you see on the Wiki Replica hosts today) into that OLAP schema. This part is likely to be tricky to do while still keeping the lag from the production databases reasonably low and the security of the data high. This project is all speculative at this point however. I would like to see it happen, but it will take help from a lot of folks and a fairly large investment in time and probably hardware as well. I can't give any timeline for completion or honestly even promise that it will eventually happen yet. I can tell you however that we know this is a growing problem, and that there are people who would like to help make it better. [0]: https://en.wikipedia.org/wiki/Online_transaction_processing [1]: https://en.wikipedia.org/wiki/Online_analytical_processing [2]: https://en.wikipedia.org/wiki/Extract,_transform,_load Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855 _______________________________________________ Wikimedia Cloud Services mailing list Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud

Jaime Crespo

7 a.m.

I think there is a misunderstanding that normalization reduces performance (even for analytic queries)- all the opposite, due to a lower amount of actual data stored, the memory to disk ratio increases and while joins may increase cpu usage, the usage, even on labs, is normally negligible. In particular, the addition of the comment table will reduce storage footprint by 2/3ds, allow for better compression, allowed and easy path to increase of comment size limit (very necessary for non-English languages) and will save around 30TB of storage and iops overally in the infrastructure (in an environment where scaling is king). The problem with wikireplicas is that for the comment table, special queries have to be done, like the ones Amir suggested, due to data sanitization (not because normalization). If there was access to base (private) tables, speed would actually be better. That would be fixed with Bryan's work- which was the original idea behind analytics vs web separation (analytics potentially in the future being a separate store or the same, with analytics-friendly schema, but maybe not real time). Right now, wikireplica databases can only follow production without any kind of stable api (and yes, the focus in production is OLTP not OLAP. Actor refactoring, MCR, comment, link* refactoring,... all wikireplicas can do at the moment is follow production (because the data model is changing all the time). One thing that I have offered in the past, and that could be done much sooner, is to maintain through triggers or events pre-computed queries of frequently used public data- but I never got wikireplica users to coordinate and decide on which would be the most valuable pregenerated reports (but I am open to tickets about that). On Tue, Jun 4, 2019 at 3:10 AM Huji Lee <huji.huji(a)gmail.com> wrote:

...

On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverride(a)gmail.com> wrote:

-- Jaime Crespo <http://wikimedia.org>

Jaime Crespo

9:23 a.m.

There is another factor that contributes both to the thinking of too much normalization and performance issues: There is now multiple "*_temp" tables that are required to join things like comment and revision, or actor and revision. These tables are temporary and only necessary during the migration, so it is fully online (we cannot just stop wiki edits to do maintenance). Once the migration is over, those tables will be dropped, indexes effectiveness will improve (to some point) and the number of joins needed will decrease. For example, to check the contributions of a user by name one has to do now (on production; on wikireplicas is slightly easier): SELECT * FROM revision_actor_temp JOIN actor ON actor_id = revactor_actor JOIN revision ON revactor_rev = rev_id JOIN revision_comment_temp ON rev_id = revcomment_rev JOIN comment ON revcomment_comment_id = comment_id WHERE actor_name = '<user_id_or_ip>'; This is indeed silly, and will soon be summarized as: SELECT * FROM actor JOIN revision ON actor_id = rev_actor JOIN comment ON rev_comment_id = comment_id WHERE actor_name = '<user_id_or_ip>'; On Tue, Jun 4, 2019 at 9:00 AM Jaime Crespo <jcrespo(a)wikimedia.org> wrote:

...

On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverride(a)gmail.com> wrote:

-- Jaime Crespo <http://wikimedia.org>

Bryan Davis

3 Jun 3 Jun

3:30 a.m.

On Sun, Jun 2, 2019 at 7:43 PM Jesse Plamondon-Willard <pathoschild(a)gmail.com> wrote:

...

Hello, Some Toolforge SQL queries have much worse performance when updated for the new comment table. For example, see one previous SQL query and its updated version which just adds a join on the comment table. The change adds ten dependent subqueries and increases Stalktoy load times from ≈15 seconds to ≈245 seconds, and the same change for another query was enough to have my tools temporarily rate-limited. I haven't found a way to update those queries efficiently. Is there an optimisation I'm missing? Why does the comment view need subqueries on ten other tables, and is there an alternate version without the subqueries for cases where we just need to join by ID?

I don't have a good answer for the performance question, but I can provide some information about where the large number of subqueries come from. They are an artifact of the way that the comment table is exposed by the Wiki Replicas view layer. Comments can be suppressed and this suppression can be toggled back and forth over time. The flag for suppression lives in the related entity table (ipblocks in this case) rather than the comment table itself. Our current data sanitization system deals with this by deciding at runtime if a given comment should be visible or not to Wiki Replica users. This is done via a view that is defined in our Puppet repository [0]. This view is not aware of the context that it being used for in a given query, so it tests all of the entity table flags that could indicate suppression for each comment row. One "fix" I could imagine for this would be for the Wiki Replicas to also provide per-entity views of the comment table that only return rows related to a single entity table. For example a "comment_ipblocks" view could use a where clause specific to the ipblocks entity table like "WHERE exists( select 1 from ipblocks where ipb_reason_id = comment_id and ipb_deleted = 0)" and exclude the tests for other entities (image, filearchive, revision, etc). [0]: https://phabricator.wikimedia.org/source/operations-puppet/browse/productio… Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855

Brad Jorsch (Anomie)

2:28 p.m.

On Sun, Jun 2, 2019 at 11:30 PM Bryan Davis <bd808(a)wikimedia.org> wrote:

...

One "fix" I could imagine for this would be for the Wiki Replicas to also provide per-entity views of the comment table that only return rows related to a single entity table. For example a "comment_ipblocks" view could use a where clause specific to the ipblocks entity table like "WHERE exists( select 1 from ipblocks where ipb_reason_id = comment_id and ipb_deleted = 0)" and exclude the tests for other entities (image, filearchive, revision, etc).

See also https://phabricator.wikimedia.org/T215445. -- Brad Jorsch (Anomie) Senior Software Engineer Wikimedia Foundation

Bryan Davis

10 Jun 10 Jun

3:57 p.m.

On Mon, Jun 3, 2019 at 4:30 AM Bryan Davis <bd808(a)wikimedia.org> wrote:

...

We have created new views to do this single entity table filtering for both the actor and comment tables. See <https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#The_actor_table_seems_really_slow--so_does_comment> for more information. TL;DR: * actor_filearchive * actor_image * actor_ipblocks * actor_logging * actor_oldimage * actor_protected_titles * actor_recentchanges * actor_revision * comment_filearchive * comment_image * comment_ipblocks * comment_logging * comment_oldimage * comment_protected_titles * comment_recentchanges * comment_revision Bryan -- Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org> [[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA irc: bd808 v:415.839.6885 x6855

1775

days inactive

1782

days old

cloud@lists.wikimedia.org

Manage subscription

18 comments

10 participants

tags (0)

participants (10)

Alex Monk
Amir Sarabadani
Brad Jorsch (Anomie)
Bryan Davis
Chico Venancio
Daniel Kinzler
Huji Lee
Jaime Crespo
Jesse Plamondon-Willard
John