It may help to put things in context and remind everyone that among the most visited websites in the world (of which Wikipedia is #5, per Alexa [0]), the vast majority do not even use a relational database, making real-time, large-scale analytic queries something that is not easily available to the uninitiated or to third-party query writers. The fact that we can run a lot of these analytical queries against near-real-time data is an exceptional and amazing feat.
PS: Bryan misspelled ETL by accident :)
[0]: https://www.alexa.com/topsites
On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis bd808@wikimedia.org wrote:
On Mon, Jun 3, 2019 at 2:56 PM John phoenixoverride@gmail.com wrote:
Please don't be patronizing or insulting; I did read it. What used to be one fairly quick query will now be thousands of decoupled queries. I know a number of cases where the decoupled query count can get into the hundreds of thousands. So what used to take 1 query and <2 minutes is now going to hammer the database with ~300k queries, plus additional logic on the programming side.
The Wiki Replica databases are going through some tough changes for "analytic" queries right now. By analytic, I mean the sort of queries that tools want/need to do where the data generated by the wikis is being combed through to find correlations that are not needed to support the "operational" queries that are done by the MediaWiki software. The database industry terms OLTP (online transactional processing) [0] and OLAP (online analytic processing) [1] are used to describe these related, but very different, use cases.
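For illustration only, here is a minimal sketch of the difference as seen from a tool. The replica host name and connection details are placeholders, and the queries are just examples against the familiar page and revision tables:

    # Illustrative only: contrasting OLTP- and OLAP-style queries against a
    # wiki replica. The host name and credentials file are placeholders.
    import pymysql

    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",  # example replica host
        database="enwiki_p",
        read_default_file="~/replica.my.cnf",          # Toolforge credentials
        charset="utf8mb4",
    )

    with conn.cursor() as cur:
        # OLTP-style: a narrow, indexed lookup of the kind MediaWiki itself runs.
        cur.execute(
            "SELECT page_id, page_latest FROM page "
            "WHERE page_namespace = 0 AND page_title = %s",
            ("Main_Page",),
        )
        print(cur.fetchone())

        # OLAP-style: combing through big tables to find correlations, e.g. the
        # most-edited main-namespace pages since the start of the year.
        cur.execute(
            "SELECT page_title, COUNT(*) AS edits "
            "FROM revision JOIN page ON rev_page = page_id "
            "WHERE page_namespace = 0 AND rev_timestamp >= %s "
            "GROUP BY page_title ORDER BY edits DESC LIMIT 20",
            ("20190101000000",),
        )
        for row in cur.fetchall():
            print(row)

The first query touches a handful of rows through an index; the second has to scan and aggregate a very large slice of the revision table, which is exactly the kind of work the replicas struggle with.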
Folks involved in MediaWiki core work have been making a lot of changes to the OLTP schema used by all wikis to make it more efficient in storage size and to reduce data duplication. These changes are very much needed to keep wikis the size of the larger Wikimedia wikis performant as the number of pages grows. Unfortunately, these same changes are making many OLAP queries more expensive. The DBA and Cloud Services teams will try to do what they can to minimize the amount of additional slowness that is introduced, but there are going to be limits to what we can accomplish.
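As a concrete (and hedged) example, the ongoing comment and actor migrations normalize data that used to live directly on the revision table. The table and column names below reflect the MediaWiki schema as it stands today; treat the queries themselves as illustrations rather than recommendations:

    # Why the schema changes make analytic queries heavier: fetching the same
    # revision metadata now requires extra joins through the normalized tables.

    # Before the migration: author and edit summary lived on revision itself.
    OLD_STYLE = """
    SELECT rev_timestamp, rev_user_text, rev_comment
    FROM revision
    WHERE rev_page = %s
    """

    # After the migration: the same data requires joins through the actor and
    # comment tables (via their association tables).
    NEW_STYLE = """
    SELECT rev_timestamp, actor_name, comment_text
    FROM revision
    JOIN revision_actor_temp   ON revactor_rev = rev_id
    JOIN actor                 ON actor_id = revactor_actor
    JOIN revision_comment_temp ON revcomment_rev = rev_id
    JOIN comment               ON comment_id = revcomment_comment_id
    WHERE rev_page = %s
    """

Multiply those extra joins across millions of rows and a query that used to read a single table becomes noticeably more expensive.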
Today the best advice I can give is to ask for help looking at queries that are too slow to work at all for the tools that need them. This mailing list, the #wikimedia-cloud Freenode IRC channel, or Phabricator are the best places to ask today. Some queries might be easy to make faster; others, sadly, might be as good as they can get and will need other, more complicated changes, like splitting the query into smaller batches of work in various ways.
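To make the "smaller batches" idea concrete, here is a minimal sketch of one batching pattern (keyset pagination on rev_id). The batch size, timestamp format, and function name are all illustrative:

    # A hedged sketch of splitting one heavy query into smaller batches using
    # keyset pagination on rev_id so that no single query runs too long.
    BATCH_SIZE = 50000

    def count_edits_since(conn, start_ts):
        """Count revisions made since start_ts, fetching rev_ids in batches."""
        total = 0
        last_rev_id = 0
        with conn.cursor() as cur:
            while True:
                cur.execute(
                    "SELECT rev_id FROM revision "
                    "WHERE rev_id > %s AND rev_timestamp >= %s "
                    "ORDER BY rev_id LIMIT %s",
                    (last_rev_id, start_ts, BATCH_SIZE),
                )
                rows = cur.fetchall()
                if not rows:
                    break
                total += len(rows)
                last_rev_id = rows[-1][0]  # resume from the last id seen
        return total

Each batch is small enough to finish quickly, at the cost of more round trips and more client-side logic, which is the trade-off John describes above.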
The Technical Engagement team is interested in starting a project sometime in the next fiscal year (July 2019 - June 2020) to gather requirements for an OLAP database schema that would be easier for tools to use. Once we have that, we would move into a phase of trying to design an ELT (extract, transform, load) pipeline that would convert the OLTP schema that MediaWiki has (what you see on the Wiki Replica hosts today) into that OLAP schema. This part is likely to be tricky to do while still keeping the lag from the production databases reasonably low and the security of the data high. This project is all speculative at this point, however. I would like to see it happen, but it will take help from a lot of folks and a fairly large investment in time and probably hardware as well. I can't give any timeline for completion, or honestly even promise yet that it will eventually happen. I can tell you, however, that we know this is a growing problem, and that there are people who would like to help make it better.
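To give a feel for what one step of such a pipeline might look like, here is a purely speculative sketch that extracts revision metadata from the normalized OLTP schema and loads it into a flat, denormalized table. The target table (revision_flat) and the function name are hypothetical; nothing like this exists today:

    # Speculative sketch of one extract/transform/load step: denormalizing
    # revision metadata into a flat table that is cheaper to query.
    # The revision_flat table and this function are hypothetical.
    def load_revision_facts(src_conn, dst_conn, since_ts):
        with src_conn.cursor() as read_cur, dst_conn.cursor() as write_cur:
            # Extract: pull normalized rows from the OLTP schema.
            read_cur.execute(
                "SELECT rev_id, rev_page, rev_timestamp, actor_name "
                "FROM revision "
                "JOIN revision_actor_temp ON revactor_rev = rev_id "
                "JOIN actor ON actor_id = revactor_actor "
                "WHERE rev_timestamp >= %s",
                (since_ts,),
            )
            for rev_id, rev_page, rev_timestamp, actor_name in read_cur:
                # Transform + load: one wide, pre-joined row per revision.
                write_cur.execute(
                    "REPLACE INTO revision_flat "
                    "(rev_id, page_id, ts, editor_name) "
                    "VALUES (%s, %s, %s, %s)",
                    (rev_id, rev_page, rev_timestamp, actor_name),
                )
        dst_conn.commit()

Keeping something like this both low-lag and safe with respect to suppressed or private data is where most of the real difficulty would be.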
Bryan
Bryan Davis              Wikimedia Foundation           bd808@wikimedia.org
[[m:User:BDavis_(WMF)]]  Manager, Technical Engagement  Boise, ID USA
irc: bd808               v:415.839.6885 x6855
Wikimedia Cloud Services mailing list Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/cloud