Actually, as an idea, I don't think it is a bad one. In fact, a ticket for it already exists:
https://phabricator.wikimedia.org/T59617
It is basically one where a summary of the data could be shared, but not the individual rows: the summary would be computed once and written into tables as a sort of "materialized view", because otherwise querying it live would be too slow.
I wouldn't pursue the idea of importing data from dumps, but I would pursue creating this sort of "materialized view" based on requested user queries. However, there are many reasons why this hasn't been done before:
* This is not a small project. It is not just a cron job and a script: it requires following production security patterns and best practices, creating infrastructure code to manage everything, and modifying our public data checks to make it work. I would estimate this is at least six months of work for a full-time position.
* The technical solution is not easy. While single-row edits (e.g. updates coming from replication) are a non-issue, bulk imports into tables with long-running selects would make it impossible to "load" a table (even if it were loaded under another name and then renamed; see the sketch after this list). We would run into metadata locking, which means we would have to kill connections (which ones, all wikireplica connections?) every time a table is loaded. In any case, for some time every day or every week, those tables would be unavailable to query.
* The tables would not have real-time data. Most complaints we get about the wikireplica service happen when the replicas start lagging due to maintenance; we have had complaints after only 5 minutes of lag. There is an expectation of real-timeness that these tables would violate, probably causing complaints no matter how well documented or announced that was. I would prefer these to live on a separate service, like the tools user dbs, to prevent that.
* Extra write load. The more write load we have, the less well replication works. There are only so many writes a wikireplica host can handle before it starts lagging behind, resulting either in slow imports or in lag on the wikireplicas, both of which also cause annoyances to users.
* Extra disk space. The new tables would either be large enough to genuinely need to be materialized as tables, or too small to justify their existence. Who would decide what gets included and what doesn't (given that everybody should have the same chance to get their request done), with only a finite amount of disk space available?
* This is not a priority project. While we have seen a lot of suggestions on how to improve the wikireplica service, we have only seen a couple of people asking for this possibility, so those other projects got priority.
* This would need continuous maintenance. Aside from the initial, non-trivial work, this project would need continuous updates: every time the OS is updated, every time the MariaDB server version slightly changes the syntax, every time the underlying data structures get an alter, plus reacting promptly to security and data issues.
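To make the metadata-locking point above more concrete, here is a minimal sketch of how such a "materialized view" would typically be refreshed on MariaDB, and where it gets stuck. All names (the reports database, the pages_per_namespace tables) are made up for illustration; this is not an existing production process.

```sql
-- Hypothetical refresh of a summary table on a wikireplica-like host.

-- 1. Build the new snapshot under a temporary name, so readers keep
--    using the old copy while the slow aggregation runs.
CREATE TABLE reports.pages_per_namespace_new LIKE reports.pages_per_namespace;

INSERT INTO reports.pages_per_namespace_new (page_namespace, pages)
SELECT page_namespace, COUNT(*)
FROM page
GROUP BY page_namespace;

-- 2. Atomically swap the tables. This is where the problem described above
--    appears: RENAME TABLE needs an exclusive metadata lock, so it waits
--    behind any long-running SELECT on the old table, and every query that
--    arrives afterwards queues up behind the RENAME.
RENAME TABLE reports.pages_per_namespace     TO reports.pages_per_namespace_old,
             reports.pages_per_namespace_new TO reports.pages_per_namespace;

DROP TABLE reports.pages_per_namespace_old;
```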
Having said that, you seem to be volunteering for this work, which I would be happy to support. Please send puppet patches for T59617 and we can talk about setting up the foundations of the infrastructure for user-defined materialized views/reports on the wikireplicas (which must live in production, not on cloud, as it would have access to production data). I would prefer, however, for this to be a separate service, where everybody understands they are not working with real-time data. In my mind, the analytics service would be based on unchanged weekly imports, and the web one would be closer to real time.
I would, however, strongly suggest you share your queries and try to exhaust the "extra indexes" path Bryan and MA suggest, rather than trying to maintain a parallel setup, which carries so much overhead that we DBAs haven't even been able to get started on it.
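For context, the "extra indexes" path usually boils down to something like the following on the replica copies. The table and column choice here is purely a hypothetical example; the real index would be derived from the actual report queries once they are shared.

```sql
-- Purely illustrative: an additional index on the replica copy of a table,
-- of the kind that can turn a per-user report query from a full scan into
-- an index scan. Column choice is hypothetical and would be decided case by
-- case by the DBAs based on the shared queries.
ALTER TABLE logging
    ADD INDEX log_actor_type_time (log_actor, log_type, log_timestamp);
```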
On Sun, Apr 12, 2020 at 3:48 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
We have a set of database reports (on users, articles, etc.) that we used to generate on a weekly basis.[1] Ever since the introduction of the actor table,[2] many of the reports that have to do with users have become so slow that the SQL query cannot finish within a reasonable time and is killed. Some other reports have also become slower over time; all of these are shown in red in [1].
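As an illustration of the kind of query that regressed (the actual report queries are not reproduced here), a per-user report that used to read rev_user directly now has to go through the actor table, roughly like this:

```sql
-- Hypothetical simplified report query after the actor migration (T223406):
-- counting recent edits per user now requires a join from revision to actor,
-- which for report-style queries may no longer finish within the replica's
-- query time limit. The time window and LIMIT are arbitrary example values.
SELECT actor_name, COUNT(*) AS edits
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE rev_timestamp >= '20200301000000'
GROUP BY actor_name
ORDER BY edits DESC
LIMIT 100;
```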
One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
We can use the current-version-only DB dumps for this purpose. I am guessing that this process would take several hours to run (somewhere between 2 and 10) and would require about 2 GB of storage just to download and decompress the dump file, and some additional space on the DB side (for data, indexes, etc.)
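A minimal sketch of that workflow, assuming the decompressed SQL table dumps are already on disk; the database name, the set of tables, and the extra index are made-up examples, since the real ones would depend on the reports:

```sql
-- Rough sketch of the proposed monthly workflow, run through the mysql
-- client. Assumes the fawiki 20200401 per-table dumps have already been
-- downloaded and decompressed outside the database.

CREATE DATABASE IF NOT EXISTS fawiki_reports;
USE fawiki_reports;

-- Load the needed table dumps (SOURCE is a mysql client command; file names
-- follow the dumps.wikimedia.org naming scheme).
SOURCE fawiki-20200401-page.sql;
-- ... plus whichever other table dumps the reports need ...

-- Add the extra indexes the report queries need (example columns only).
ALTER TABLE page ADD INDEX page_ns_len (page_namespace, page_len);

-- Run the report queries against this copy, then purge it a few days later:
-- DROP DATABASE fawiki_reports;
```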
Out of an abundance of caution, I thought I should ask for permission now rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?
Please advise! Huji
[1] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%...
[2] https://phabricator.wikimedia.org/T223406
[3] https://dumps.wikimedia.org/fawiki/20200401/