Re: [Cloud] Loading wikipedia dump onto Clouds

14 Apr 2020

On Mon, Apr 13, 2020 at 3:03 PM Huji Lee &lt;huji.huji(a)gmail.com&gt; wrote:
...

 On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis &lt;bd808(a)wikimedia.org&gt; wrote:

 On Sun, Apr 12, 2020 at 7:48 AM Huji Lee &lt;huji.huji(a)gmail.com&gt; wrote:

 One possible solution is to create a script which is scheduled to run once a month; the
script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB,
create some additional indexes that would make our desired queries run faster, and
generate the reports using this database. A separate script can then purge the data a few
days later. 
 If I am understanding your proposal here, I think the main difference
 from the current Wiki Replicas would be "create some additional
 indexes that would make our desired queries run faster". We do have
 some indexes and views in the Wiki Replicas which are specifically
 designed to make common things faster today. If possible, adding to
 these rather than building a one-off process of moving lots of data
 round for your tool would be nice.

 I say this not because what you are proposing is a ridiculous
 solution, but because it is a unique solution for your current problem
 that will not help others who are having similar problems. Having 1
 tool use ToolsDB or a custom Cloud VPS project like this is possible,
 but having 100 tools try to follow that pattern themselves is not. 
 I understand. However, I think that the use case we are looking at is relatively unique.
I also think that indexes we need may not be desirable for all the Wiki Replicas (they
would often be multi-column indexes geared towards a specific set of queries) and I
honestly don't want to go through the several weeks (months?) of discussion to justify
them. 
How about hours or minutes?

...
  Note that if we open the can of "more indexes on
Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion.
I'm not sure if there are more than a handful wikis that do this level of page-level
and user-level analytics as fawiki does, which means for most wikis (and for most Wiki
Replica databases) those additional indexes may not even be justified. 
It is less of an "all-wiki" discussion and more of a discussion with
the Cloud Services folks (/me waves) and the DBA team managing the
Wiki Replicas. We do not have a well documented procedure for these
discussions, but a Phabricator task tagged with #data-services would
be a good place to start.

Ideally there you would also show us the SQL in question; it may turn
out that there are already better ways to get that data you are
looking for. The actor and comment table refactors make many queries
that have not been adjusted to the new normal much, much slower than
before. There is unfortunately no easy magic we can add on the backend
to fix that. The shape of the data is different now and searching it
has to adjust to that new shape.

With the amount of information you have provided up to now I honestly
can't say one way or the other if indexes you are thinking of will be
acceptable or not. I can say with a reasonable amount of confidence
that if your other alternative is requesting a Cloud VPS project,
installing and managing MySQL/MaridDB yourself, creating a repeatable
and reliable way to populate your database from the sql dumps, and
then running your scripts that is going to be a lot more work than
explaining your difficult/slow use case at the SQL level and the
desired outcome level.

In the end you may end up needing to go back and actually take the
longer, harder route, but you will know why. More importantly the
Cloud Services and DBA folks will know why too and we can use that
knowledge to try and find better general solutions.

...
  Even if we were to generalize parts of this approach
and bring it to Wiki Replicas, I would still argue that doing it at a smaller extent (one
wiki DB for now) would be a reasonable starting point, no? 
Maybe? Again I believe you are arguing to create and maintain an
entirely parallel system rather than trying to improve the current
one. For the Wiki Replicas the only material difference between
enwiki, fawiki, and nvwiki is the number of rows in the tables. The
same tables exist for each and things are managed across all of them
in the same manner. We add custom indexes with a python script that
reads from a YAML config file [0].

Choose your own path from here. You can request some help making your
SQL run faster on the Wiki Replicas by filing a Phabricator ticket
tagged with #data-services, and that discussion might find some index
additions to be useful for you and others. You can request a Cloud VPS
project by following the process documented in Phabricator [1]. Or you
can find a 3rd option on your own. I'm a part of the group that
reviews and approves requests for new Cloud VPS projects, so you may
have to deal with me asking "why" again there if you go that
direction. :)

[0]:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/producti…
[1]: https://phabricator.wikimedia.org/project/view/2875/

Bryan
-- 
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808

2024

2023

2022

2021

2020

2019

2018

2017

Re: [Cloud] Loading wikipedia dump onto Clouds