Actually, as an idea, I don't think it is a bad one. In fact, a ticket for it already exists:
https://phabricator.wikimedia.org/T59617
It is basically one where a summary of the data could be shared, but not the individual rows: the summary would be computed once and written into tables as a sort of "materialized view", because otherwise querying it live would be too slow.
I wouldn't pursue the idea of importing data from dumps, but I would pursue creating this sort of "materialized view" based on requested user queries. However, there are many reasons why this hasn't been done before:
* This is not a small project. It is not just a cron job and a script: it requires following production security patterns and best practices, creating infrastructure code to manage everything, and modifying our public data checks to make it work. I would estimate this is at least six months of work for a full-time position.
* The technical solution is not easy. While single-row edits (e.g. updates coming from replication) are a non-issue, bulk imports into tables with long-running selects would make it impossible to "load" a table (even if it were loaded under another name and then renamed; see the sketch after this list). We would run into metadata locking, which means we would have to kill connections (which ones, all wikireplica connections?) every time a table is loaded. In any case, for some time every day or every week, those tables would be unavailable to query.
* The tables would not have real-time data. Most complaints we get about the wikireplica service happen when the replicas start lagging due to maintenance; we have had complaints after only 5 minutes of lag. There is an expectation of real-timeness that these tables would violate, probably causing complaints no matter how well documented or announced that was. I would prefer these to live on a separate service, like the tools user dbs, to prevent that.
* Extra write load. The more write load we have, the less well replication works. There are only so many writes a wikireplica host can handle before it starts lagging behind, resulting either in slow imports or in lag on the wikireplicas, both of which also cause annoyances to users.
* Extra disk space. The new tables would either be large enough to genuinely need to be materialized as tables, or too small to justify their existence. Who would decide what gets included and what doesn't (given that everybody should have the same chance to get their request done), with only a finite amount of disk space available?
* This is not a priority project. While we have seen a lot of suggestions on how to improve the wikireplica service, we have only seen a couple of people asking for this possibility, so those other projects got priority.
* This would need continuous maintenance. Aside from the initial, non-trivial work, this project would need continuous updates: every time the OS is updated, every time the MariaDB server version slightly changes the syntax, every time the underlying data structures get an alter, plus reacting promptly to security and data issues.
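To make the metadata-locking point above more concrete, here is a minimal sketch of how such a "materialized view" would typically be refreshed on MariaDB, and where it gets stuck. All names (the reports database, the pages_per_namespace tables) are made up for illustration; this is not an existing production process.

```sql
-- Hypothetical refresh of a summary table on a wikireplica-like host.

-- 1. Build the new snapshot under a temporary name, so readers keep
--    using the old copy while the slow aggregation runs.
CREATE TABLE reports.pages_per_namespace_new LIKE reports.pages_per_namespace;

INSERT INTO reports.pages_per_namespace_new (page_namespace, pages)
SELECT page_namespace, COUNT(*)
FROM page
GROUP BY page_namespace;

-- 2. Atomically swap the tables. This is where the problem described above
--    appears: RENAME TABLE needs an exclusive metadata lock, so it waits
--    behind any long-running SELECT on the old table, and every query that
--    arrives afterwards queues up behind the RENAME.
RENAME TABLE reports.pages_per_namespace     TO reports.pages_per_namespace_old,
             reports.pages_per_namespace_new TO reports.pages_per_namespace;

DROP TABLE reports.pages_per_namespace_old;
```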
Having said that, you seem to be volunteering for this work, which I would be happy to support. Please send puppet patches for T59617 and we can talk about setting up the foundations of the infrastructure for user-defined materialized views/reports on the wikireplicas (which must live in production, not on cloud, as it would have access to production data). I would prefer, however, for this to be a separate service, where everybody understands they are not working with real-time data. In my mind, the analytics service would be based on unchanged weekly imports, and the web one would be closer to real time.
I would, however, strongly suggest you share your queries and try to exhaust the "extra indexes" path Bryan and MA suggest, rather than trying to maintain a parallel setup, which carries so much overhead that we DBAs haven't even been able to get started on it.
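For context, the "extra indexes" path usually boils down to something like the following on the replica copies. The table and column choice here is purely a hypothetical example; the real index would be derived from the actual report queries once they are shared.

```sql
-- Purely illustrative: an additional index on the replica copy of a table,
-- of the kind that can turn a per-user report query from a full scan into
-- an index scan. Column choice is hypothetical and would be decided case by
-- case by the DBAs based on the shared queries.
ALTER TABLE logging
    ADD INDEX log_actor_type_time (log_actor, log_type, log_timestamp);
```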
On Sun, Apr 12, 2020 at 3:48 PM Huji Lee huji.huji@gmail.com wrote:
Hi all,
We have a set of database reports (on users, articles, etc.) that we used to generate on a weekly basis.[1] Ever since the introduction of the actor table,[2] many of the reports that have to do with users have become so slow that the SQL query cannot finish within a reasonable time and is killed. Some other reports have also become slower over time; all of these are shown in red in [1].
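As an illustration of the kind of query that regressed (the actual report queries are not reproduced here), a per-user report that used to read rev_user directly now has to go through the actor table, roughly like this:

```sql
-- Hypothetical simplified report query after the actor migration (T223406):
-- counting recent edits per user now requires a join from revision to actor,
-- which for report-style queries may no longer finish within the replica's
-- query time limit. The time window and LIMIT are arbitrary example values.
SELECT actor_name, COUNT(*) AS edits
FROM revision
JOIN actor ON actor_id = rev_actor
WHERE rev_timestamp >= '20200301000000'
GROUP BY actor_name
ORDER BY edits DESC
LIMIT 100;
```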
One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
We can use the current-version-only DB dumps for this purpose. I am guessing that this process would take several hours to run (somewhere between 2 and 10) and would require about 2 GB of storage just to download and decompress the dump file, and some additional space on the DB side (for data, indexes, etc.)
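A minimal sketch of that workflow, assuming the decompressed SQL table dumps are already on disk; the database name, the set of tables, and the extra index are made-up examples, since the real ones would depend on the reports:

```sql
-- Rough sketch of the proposed monthly workflow, run through the mysql
-- client. Assumes the fawiki 20200401 per-table dumps have already been
-- downloaded and decompressed outside the database.

CREATE DATABASE IF NOT EXISTS fawiki_reports;
USE fawiki_reports;

-- Load the needed table dumps (SOURCE is a mysql client command; file names
-- follow the dumps.wikimedia.org naming scheme).
SOURCE fawiki-20200401-page.sql;
-- ... plus whichever other table dumps the reports need ...

-- Add the extra indexes the report queries need (example columns only).
ALTER TABLE page ADD INDEX page_ns_len (page_namespace, page_len);

-- Run the report queries against this copy, then purge it a few days later:
-- DROP DATABASE fawiki_reports;
```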
Out of an abundance of caution, I thought I should ask for permission now rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?
Please advise! Huji
[1] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%...
[2] https://phabricator.wikimedia.org/T223406
[3] https://dumps.wikimedia.org/fawiki/20200401/