It may help to put things in context and remind everyone that among the most visited websites in the world (of which Wikipedia is #5, per Alexa [0]), the vast majority do not even use a relational database, making real-time, large-scale analytic queries something that is not easily available to the uninitiated or to the third-party query writer. The fact that we can run a lot of these analytical queries against near-real-time data is an exceptional and amazing feat.

PS: Bryan misspelled ETL by accident :) 


On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis <> wrote:
On Mon, Jun 3, 2019 at 2:56 PM John <> wrote:
> Please don't be patronizing and insulting. I did read it. What used to be one fairly quick query will now be thousands of decoupled queries. I know a number of cases where the decoupled query count can get into the hundreds of thousands. So what used to take 1 query and <2 minutes is now going to hammer the database with ~300k queries plus additional logic on the programming side.

The Wiki Replica databases are going through some tough changes for
"analytic" queries right now. By analytic, I mean the sort of queries
that tools want/need to do where the data generated by the wikis is
being combed through to find correlations that are not needed to
support the "operational" queries that are done by the MediaWiki
software. The database industry terms OLTP (online transactional
processing) [0] and OLAP (online analytic processing) [1] are used to
describe these related, but very different use cases.
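As a toy illustration of the OLTP/OLAP difference: the same table can serve a point lookup by primary key (the shape MediaWiki needs to render a page) or a full-table aggregation (the shape analytic tools need). The schema and data below are invented for the example, using SQLite, and are only loosely modeled on MediaWiki's revision table:

```python
import sqlite3

# In-memory toy database; a simplified stand-in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER, rev_actor INTEGER)"
)
conn.executemany(
    "INSERT INTO revision VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 11, 101), (4, 11, 100)],
)

# OLTP-style query: fetch one row by key; cheap, index-friendly.
oltp = conn.execute(
    "SELECT rev_page FROM revision WHERE rev_id = ?", (3,)
).fetchone()

# OLAP-style query: scan and aggregate across the whole table.
olap = conn.execute(
    "SELECT rev_actor, COUNT(*) FROM revision GROUP BY rev_actor ORDER BY 2 DESC"
).fetchall()

print(oltp)  # (11,)
print(olap)  # [(100, 3), (101, 1)]
```

The OLTP query touches one row via its key; the OLAP query must read every row, which is why schema changes that are neutral for the first shape can be expensive for the second.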

Folks involved in MediaWiki core work have been making a lot of
changes to the OLTP schema used by all wikis to make it more efficient
in storage size and to reduce data duplication. These changes are very
much needed to keep wikis the size of the larger Wikimedia wikis
performant as the number of pages grows. Unfortunately, these same
changes are making many OLAP queries more expensive. The DBA and Cloud
Services teams will try to do what they can to minimize the amount of
additional slowness that is introduced, but there are going to be
limits to what we can accomplish.

Today the best advice I can give is to ask for help looking at queries
which are too slow to work at all for the tools that need them. This
mailing list, the #wikimedia-cloud Freenode IRC channel, or
Phabricator are the best places to ask today. Some queries might be
easy to make faster, others sadly might be as good as they can get and
need other more complicated changes like splitting the query into
smaller batches of work in various ways.
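One common shape for the "smaller batches" approach is to chunk a large list of keys into bounded IN () queries, rather than one unbounded query or one query per row. This is a rough sketch, not tuned advice; the table, columns, and batch size are placeholders:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def fetch_in_batches(cursor, ids, batch_size=500):
    """Run one parameterized IN () query per chunk of ids, instead of
    one query per id or a single huge query over all of them."""
    rows = []
    for batch in chunked(ids, batch_size):
        placeholders = ",".join("?" * len(batch))
        # Placeholder query; substitute the real table/columns you need.
        cursor.execute(
            f"SELECT page_id, page_title FROM page "
            f"WHERE page_id IN ({placeholders})",
            batch,
        )
        rows.extend(cursor.fetchall())
    return rows
```

Batching like this keeps each query small enough to finish within query-time limits while still amortizing round trips across many keys.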

The Technical Engagement team is interested in starting a project
sometime in the next fiscal year (July 2019 - June 2020) to gather
requirements for an OLAP database schema that would be easier for
tools to use. Once we have that we would move into a phase of trying
to design a ELT (extract, transform, load) pipeline that would convert
the OLTP schema that MediaWiki has (what you see on the Wiki Replica
hosts today) into that OLAP schema. This part is likely to be tricky
to do while still keeping the lag from the production databases
reasonably low and the security of the data high. This project is all
speculative at this point however. I would like to see it happen, but
it will take help from a lot of folks and a fairly large investment in
time and probably hardware as well. I can't give any timeline for
completion or honestly even promise that it will eventually happen
yet. I can tell you however that we know this is a growing problem,
and that there are people who would like to help make it better.
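To make the pipeline idea concrete, the general shape is three stages wired together. Everything below is a hypothetical sketch: the record shapes, field names, and transform are invented for illustration and are not a design for the actual project:

```python
def extract(source_rows):
    """Extract: pull raw rows from the OLTP-shaped source
    (here just an iterable standing in for a replica connection)."""
    yield from source_rows

def transform(row):
    """Transform: denormalize into an analytics-friendly record;
    the (page_id, rev_count) shape here is made up for the example."""
    page_id, rev_count = row
    return {"page_id": page_id, "rev_count": rev_count,
            "is_active": rev_count > 0}

def load(records, sink):
    """Load: append transformed records into the OLAP store
    (here just a list standing in for a warehouse table)."""
    for rec in records:
        sink.append(rec)

# Wire the stages together over some fake source data.
source = [(10, 4), (11, 0)]
warehouse = []
load((transform(r) for r in extract(source)), warehouse)
print(warehouse[0]["is_active"])  # True
```

The hard parts mentioned above, keeping replication lag low and the data properly redacted, would live in the extract and transform stages respectively, which is what makes this trickier than the skeleton suggests.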


Bryan Davis              Wikimedia Foundation    <>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement    Boise, ID USA
irc: bd808                                        v:415.839.6885 x6855
