There is another factor that contributes both to the perception of too
much normalization and to the performance issues:
There are now multiple "*_temp" tables that are required to join things
like comment and revision, or actor and revision. These tables are
temporary and only needed while the migration is in progress, so that
it can be done fully online (we cannot just stop wiki edits to do
maintenance). Once the migration is over, those tables will be dropped,
index effectiveness will improve (to some extent) and the number of
joins needed will decrease.
For example, to check the contributions of a user by name, one now has
to do the following (on production; on the wikireplicas it is slightly
easier):
SELECT *
FROM revision_actor_temp
JOIN actor ON actor_id = revactor_actor
JOIN revision ON revactor_rev = rev_id
JOIN revision_comment_temp ON rev_id = revcomment_rev
JOIN comment ON revcomment_comment_id = comment_id
WHERE actor_name = '<user_name_or_ip>';
This is indeed silly, and will soon be simplified to:
SELECT *
FROM actor
JOIN revision ON actor_id = rev_actor
JOIN comment ON rev_comment_id = comment_id
WHERE actor_name = '<user_name_or_ip>';
On Tue, Jun 4, 2019 at 9:00 AM Jaime Crespo <jcrespo(a)wikimedia.org> wrote:
I think there is a misunderstanding that normalization reduces
performance (even for analytic queries). It is quite the opposite: due
to a lower amount of actual data stored, the memory-to-disk ratio
increases, and while joins may increase CPU usage, that usage, even on
labs, is normally negligible. In particular, the addition of the
comment table will reduce the storage footprint by two thirds, allow
for better compression, allow an easy path to increasing the comment
size limit (very necessary for non-English languages), and will save
around 30TB of storage and IOPS overall in the infrastructure (in an
environment where scaling is king).
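As a rough illustration of where the savings come from (this is only a
sketch against the schema shown above, not a query anyone needs to
run): a popular edit summary is stored once in comment and merely
referenced by every revision that uses it.
SELECT comment_id, comment_text, COUNT(*) AS revisions_sharing_it
FROM comment
JOIN revision_comment_temp ON revcomment_comment_id = comment_id
GROUP BY comment_id, comment_text
ORDER BY revisions_sharing_it DESC
LIMIT 10;
-- each comment row is stored exactly once, no matter how many
-- revisions point at it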
The problem with the wikireplicas is that, for the comment table,
special queries have to be done, like the ones Amir suggested, due to
data sanitization (not because of normalization). If there were access
to the base (private) tables, speed would actually be better. That
would be fixed with Bryan's work, which was the original idea behind
the analytics vs web separation (analytics potentially in the future
being a separate store, or the same one with an analytics-friendly
schema, but maybe not real time).
Right now, the wikireplica databases can only follow production,
without any kind of stable API (and yes, the focus in production is
OLTP, not OLAP). With actor refactoring, MCR, comment and link*
refactoring, ... all the wikireplicas can do at the moment is follow
production, because the data model is changing all the time.
One thing that I have offered in the past, and that could be done much
sooner, is to maintain, through triggers or events, pre-computed
results for frequently used public data. However, I never got
wikireplica users to coordinate and decide on which would be the most
valuable pregenerated reports (but I am open to tickets about that).
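To make the idea concrete, here is a minimal sketch of what such a
pregenerated report could look like; the table and event names are
invented for illustration, and the actual reports would depend on what
users agree is most valuable:
-- hypothetical summary table, kept up to date by the database itself
-- instead of being recomputed by every tool that needs it
CREATE TABLE report_edits_per_user (
    rep_actor_name  VARBINARY(255) NOT NULL PRIMARY KEY,
    rep_edit_count  BIGINT UNSIGNED NOT NULL,
    rep_updated_at  BINARY(14) NOT NULL
);
CREATE EVENT refresh_report_edits_per_user
    ON SCHEDULE EVERY 1 HOUR
DO
    REPLACE INTO report_edits_per_user
    SELECT actor_name, COUNT(*),
           DATE_FORMAT(UTC_TIMESTAMP(), '%Y%m%d%H%i%s')
    FROM actor
    JOIN revision_actor_temp ON revactor_actor = actor_id
    GROUP BY actor_name;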
On Tue, Jun 4, 2019 at 3:10 AM Huji Lee <huji.huji(a)gmail.com> wrote:
It may help to put things in context and remind everyone that, among the most visited
websites in the world (of which Wikipedia is #5, per Alexa [0]), the vast majority do not
even use a relational database, making real-time, large-scale analytic queries something
that is not easily available to the uninitiated or to third-party query writers. The fact
that we can run a lot of these analytical queries using near-real-time data is an
exceptional and amazing feat.
PS: Bryan misspelled ETL by accident :)
[0]: https://www.alexa.com/topsites
On Mon, Jun 3, 2019 at 7:24 PM Bryan Davis <bd808(a)wikimedia.org> wrote:
On Mon, Jun 3, 2019 at 2:56 PM John <phoenixoverride(a)gmail.com> wrote:
Please don’t be patronizing or insulting; I did read it. What used to be one fairly quick
query will now be thousands of decoupled queries. I know a number of cases where the
decoupled query count can get into the hundreds of thousands. So what used to take one query
and <2 minutes is now going to hammer the database with ~300k queries and additional
logic on the programming side.
The Wiki Replica databases are going through some tough changes for
"analytic" queries right now. By analytic, I mean the sort of queries
that tools want/need to do where the data generated by the wikis is
being combed through to find correlations that are not needed to
support the "operational" queries that are done by the MediaWiki
software. The database industry terms OLTP (online transactional
processing) [0] and OLAP (online analytic processing) [1] are used to
describe these related, but very different use cases.
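As a rough illustration of the difference (both queries are invented
for this email, written against the current MediaWiki schema): an OLTP
query touches a handful of rows through an index, while an OLAP query
has to scan and aggregate a large part of a table.
-- OLTP-style: fetch the current revision of one page (a few indexed rows)
SELECT rev_id, rev_timestamp
FROM page
JOIN revision ON rev_id = page_latest
WHERE page_namespace = 0 AND page_title = 'Example';
-- OLAP-style: rank the most edited pages on a wiki (scans millions of rows)
SELECT rev_page, COUNT(*) AS edits
FROM revision
GROUP BY rev_page
ORDER BY edits DESC
LIMIT 100;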
Folks involved in MediaWiki core work have been making a lot of
changes to the OLTP schema used by all wikis to make it more efficient
in storage size and to reduce data duplication. These changes are very
much needed to keep wikis the size of the larger Wikimedia wikis
performant as the number of pages grows. Unfortunately these same
changes are making many OLAP queries more expensive. The DBA and Cloud
Services teams will try to do what they can to minimize the amount of
additional slowness that is introduced, but there are going to be
limits to what we can accomplish.
Today the best advice I can give is to ask for help looking at queries
which are too slow to work at all for the tools that need them. This
mailing list, the #wikimedia-cloud Freenode IRC channel, or
Phabricator are the best places to ask today. Some queries might be
easy to make faster; others, sadly, might already be as good as they
can get and need other, more complicated changes, like splitting the
query into smaller batches of work in various ways.
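As a sketch of what that batching can look like (the ranges and batch
size below are arbitrary, purely for illustration), a large scan can be
walked in bounded chunks rather than issued as one huge query:
-- batch 1: only revisions in a bounded rev_id range
SELECT rev_id, rev_timestamp, comment_text
FROM revision
JOIN revision_comment_temp ON revcomment_rev = rev_id
JOIN comment ON comment_id = revcomment_comment_id
WHERE rev_id BETWEEN 1 AND 100000;
-- the tool then repeats the query for 100001-200000, and so on,
-- keeping each individual query small and predictable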
The Technical Engagement team is interested in starting a project
sometime in the next fiscal year (July 2019 - June 2020) to gather
requirements for an OLAP database schema that would be easier for
tools to use. Once we have that, we would move into a phase of trying
to design an ELT (extract, transform, load) pipeline that would convert
the OLTP schema that MediaWiki has (what you see on the Wiki Replica
hosts today) into that OLAP schema. This part is likely to be tricky
to do while still keeping the lag from the production databases
reasonably low and the security of the data high. This project is all
speculative at this point however. I would like to see it happen, but
it will take help from a lot of folks and a fairly large investment in
time and probably hardware as well. I can't give any timeline for
completion or honestly even promise that it will eventually happen
yet. I can tell you however that we know this is a growing problem,
and that there are people who would like to help make it better.
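Purely as a speculative sketch of what an analytics-friendly target
might look like (the table and every column below are invented; no such
schema exists or has been agreed on), the pipeline could produce
pre-joined rows along these lines:
-- one denormalized row per revision, trading storage for simpler and
-- cheaper analytic queries
CREATE TABLE olap_revision (
    rev_id         BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    rev_timestamp  BINARY(14) NOT NULL,
    page_namespace INT NOT NULL,
    page_title     VARBINARY(255) NOT NULL,
    actor_name     VARBINARY(255) NOT NULL,
    comment_text   BLOB NOT NULL,
    KEY (actor_name, rev_timestamp)
);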
[0]: https://en.wikipedia.org/wiki/Online_transaction_processing
[1]: https://en.wikipedia.org/wiki/Online_analytical_processing
[2]: https://en.wikipedia.org/wiki/Extract,_transform,_load
Bryan
--
Bryan Davis Wikimedia Foundation <bd808(a)wikimedia.org>
[[m:User:BDavis_(WMF)]] Manager, Technical Engagement Boise, ID USA
irc: bd808 v:415.839.6885 x6855
_______________________________________________
Wikimedia Cloud Services mailing list
Cloud(a)lists.wikimedia.org (formerly labs-l(a)lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud
--
Jaime Crespo
<http://wikimedia.org>