There is another factor that contributes both to the perception of excessive normalization and to the performance issues:
There are now multiple "*_temp" tables that are required to join things like comment and revision, or actor and revision. These tables are temporary and only needed while the migration is in progress, so that it can be done fully online (we cannot just stop wiki edits to do maintenance). Once the migration is over, those tables will be dropped, index effectiveness will improve (to some degree) and the number of joins needed will decrease.
For example, to check the contributions of a user by name, one currently has to do the following (on production; on the wikireplicas it is slightly easier):
SELECT *
FROM revision_actor_temp
JOIN actor ON actor_id = revactor_actor
JOIN revision ON revactor_rev = rev_id
JOIN revision_comment_temp ON rev_id = revcomment_rev
JOIN comment ON revcomment_comment_id = comment_id
WHERE actor_name = '<user_id_or_ip>';
This is indeed silly, and will soon be simplified to:

SELECT *
FROM actor
JOIN revision ON actor_id = rev_actor
JOIN comment ON rev_comment_id = comment_id
WHERE actor_name = '<user_id_or_ip>';
On Tue, Jun 4, 2019 at 9:00 AM Jaime Crespo jcrespo@wikimedia.org wrote:
I think there is a misunderstanding that normalization reduces performance (even for analytic queries) - quite the opposite: due to the lower amount of actual data stored, the memory-to-disk ratio increases, and while joins may increase CPU usage, that usage, even on labs, is normally negligible. In particular, the addition of the comment table will reduce the storage footprint by two thirds, allow for better compression, provide an easy path to increasing the comment size limit (very necessary for non-English languages) and save around 30TB of storage and iops overall in the infrastructure (in an environment where scaling is king).
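To make the saving concrete, the normalization is roughly this (a simplified sketch; the production comment table has additional columns):

-- Simplified sketch only; real table definitions differ.
CREATE TABLE comment (
  comment_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  comment_text BLOB NOT NULL
);
-- Instead of every revision row carrying its own copy of the edit summary,
-- revision stores a small rev_comment_id pointing at comment.comment_id,
-- so identical summaries are stored only once.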
The problem with the wikireplicas is that, for the comment table, special queries have to be done, like the ones Amir suggested, due to data sanitization (not because of normalization). If there were access to the base (private) tables, speed would actually be better. That would be fixed with Bryan's work, which was the original idea behind the analytics vs. web separation (analytics potentially in the future being a separate store, or the same one with an analytics-friendly schema, but maybe not real time). Right now, the wikireplica databases can only follow production without any kind of stable API (and yes, the focus in production is OLTP, not OLAP). With the actor refactoring, MCR, comment and link* refactoring, ... all the wikireplicas can do at the moment is follow production, because the data model is changing all the time.
One thing that I have offered in the past, and that could be done much sooner, is to maintain, through triggers or events, pre-computed results of frequently used public data - but I never got the wikireplica users to coordinate and decide on which would be the most valuable pregenerated reports (but I am open to tickets about that).
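For anyone wondering what such a pregenerated report could look like, here is a hypothetical sketch (table, column and event names are invented for illustration, and it assumes the event scheduler is enabled): a MariaDB event that periodically refreshes a small pre-computed table of per-user edit counts.

-- Hypothetical example only: names are invented for illustration.
CREATE TABLE IF NOT EXISTS precomputed_user_editcount (
  pue_actor_name VARBINARY(255) NOT NULL PRIMARY KEY,
  pue_edits      BIGINT UNSIGNED NOT NULL
);

CREATE EVENT IF NOT EXISTS refresh_user_editcount
ON SCHEDULE EVERY 1 HOUR
DO
  REPLACE INTO precomputed_user_editcount (pue_actor_name, pue_edits)
    SELECT actor_name, COUNT(*)
    FROM actor
    JOIN revision_actor_temp ON actor_id = revactor_actor
    GROUP BY actor_name;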
On Tue, Jun 4, 2019 at 3:10 AM Huji Lee huji.huji@gmail.com wrote:
It may help to put things in context and remind everyone that, among the most visited websites in the world (of which Wikipedia is #5 per Alexa [0]), the vast majority do not even use a relational database, making real-time, large-scale analytic queries something that is not easily available to the uninitiated or to third-party query writers. The fact that we can run a lot of these analytical queries using near-real-time data is an exceptional and amazing feat.
PS: Bryan misspelled ETL by accident :)
On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis bd808@wikimedia.org wrote:
On Mon, Jun 3, 2019 at 2:56 PM John phoenixoverride@gmail.com wrote:
Please don't be patronizing or insulting. I did read it. What used to be one fairly quick query will now be thousands of decoupled queries. I know a number of cases where the decoupled query count can get into the hundreds of thousands. So what used to take 1 query and <2 minutes is now going to hammer the database with ~300k queries plus additional logic on the programming side.
The Wiki Replica databases are going through some tough changes for "analytic" queries right now. By analytic, I mean the sort of queries that tools want/need to do, where the data generated by the wikis is combed through to find correlations that are not needed by the "operational" queries done by the MediaWiki software. The database industry terms OLTP (online transactional processing) [0] and OLAP (online analytic processing) [1] are used to describe these related, but very different, use cases.
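To make the distinction concrete, here are two simplified illustrations (not taken from any particular tool): an OLTP-style query touches a handful of rows by key, while an OLAP-style query combs through large parts of a table.

-- OLTP-style: fetch a single row by key (the kind of lookup done while serving pages)
SELECT page_latest FROM page WHERE page_namespace = 0 AND page_title = 'Example';

-- OLAP-style: aggregate over every revision on the wiki
SELECT LEFT(rev_timestamp, 4) AS year, COUNT(*) FROM revision GROUP BY year;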
Folks involved in MediaWiki core work have been making a lot of changes to the OLTP schema used by all wikis to make it more efficient in storage size and to reduce data duplication. These changes are very much needed to keep wikis the size of the larger Wikimedia wikis performant as the number of pages grows. Unfortunately, these same changes are making many OLAP queries more expensive. The DBA and Cloud Services teams will try to do what they can to minimize the amount of additional slowness that is introduced, but there are going to be limits to what we can accomplish.
Today the best advice I can give is to ask for help looking at queries which are too slow to work at all for the tools that need them. This mailing list, the #wikimedia-cloud Freenode IRC channel, or Phabricator are the best places to ask today. Some queries might be easy to make faster; others, sadly, might already be as good as they can get and instead need more complicated changes, like splitting the query into smaller batches of work in various ways.
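One common pattern for that kind of batching looks roughly like this (the chunk size and the choice of rev_id as the paging key are illustrative only): instead of scanning everything in one query, a tool walks the table in primary-key order and remembers where the previous batch stopped.

-- Sketch of keyset batching: start from 0, then repeat with the highest rev_id
-- returned by the previous batch, until the query returns no rows.
SELECT rev_id, rev_timestamp
FROM revision
WHERE rev_id > 0          -- replace 0 with the last rev_id already processed
ORDER BY rev_id
LIMIT 50000;              -- chunk size is illustrative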
The Technical Engagement team is interested in starting a project sometime in the next fiscal year (July 2019 - June 2020) to gather requirements for an OLAP database schema that would be easier for tools to use. Once we have that, we would move into a phase of trying to design an ELT (extract, transform, load) pipeline that would convert the OLTP schema that MediaWiki has (what you see on the Wiki Replica hosts today) into that OLAP schema. This part is likely to be tricky to do while still keeping the lag from the production databases reasonably low and the security of the data high. This project is all speculative at this point, however. I would like to see it happen, but it will take help from a lot of folks and a fairly large investment in time and probably hardware as well. I can't give any timeline for completion or honestly even promise that it will eventually happen yet. I can tell you, however, that we know this is a growing problem, and that there are people who would like to help make it better.
Bryan
Bryan Davis <bd808@wikimedia.org>
Wikimedia Foundation, Manager, Technical Engagement
[[m:User:BDavis_(WMF)]]
Boise, ID USA
irc: bd808 | v:415.839.6885 x6855
-- Jaime Crespo http://wikimedia.org