Hi all,

We have a set of database reports (on users, articles, etc.) that we used to generate on a weekly basis.[1] Ever since the introduction of the actor table,[2] many of the user-related reports have become so slow that their SQL queries cannot finish within a reasonable time and are killed. Some other reports have also become slower over time; all of these are shown in red in [1].

One possible solution is a script, scheduled to run once a month, that would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes to make our desired queries run faster, and generate the reports from that database. A separate script could then purge the data a few days later.

We can use the current-versions-only DB dumps for this purpose. I am guessing that this process would take several hours to run (somewhere between 2 and 10) and would require about 2 GB of storage just to download and decompress the dump file, plus some additional space on the DB side (for data, indexes, etc.).
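To make the proposal concrete, here is a dry-run sketch of the monthly pipeline. It only prints the steps it would take; the dump filename pattern, the `reports_db` database, the index name, and the `generate_reports.sql` file are all illustrative assumptions, and the load step in particular is a placeholder (in practice something like mwdumper or the per-table *.sql.gz dumps would be needed):

```shell
#!/bin/sh
# Dry-run sketch of the proposed monthly report pipeline. Names and the
# dump filename pattern are assumptions, not a tested implementation.

# Build the download URL for a current-versions-only dump.
dump_url() {
  wiki=$1; date=$2
  echo "https://dumps.wikimedia.org/${wiki}/${date}/${wiki}-${date}-pages-meta-current.xml.bz2"
}

run() { echo "+ $*"; }  # print each step instead of executing it

URL=$(dump_url fawiki 20200401)   # dump date taken from [3]
DUMP=${URL##*/}

run wget "$URL"                   # fetch (roughly 2 GB once decompressed)
run bunzip2 "$DUMP"
# Placeholder load step: an XML dump cannot be piped into mysql directly;
# a real script would use mwdumper or the per-table SQL dumps instead.
run mysql reports_db "< ${DUMP%.bz2}"
run mysql reports_db -e "CREATE INDEX ix_actor_name ON actor (actor_name)"
run mysql reports_db "< generate_reports.sql"   # hypothetical report queries
# A separate cron job a few days later would purge the data, e.g.:
# run mysql -e "DROP DATABASE reports_db"
```

Scheduling it is then a single crontab entry on our end; the purge script would be a second, similar entry a few days later.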

Out of an abundance of caution, I thought I should ask for permission now rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?

Please advise!
Huji


  [1] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3
  [2] https://phabricator.wikimedia.org/T223406
  [3] https://dumps.wikimedia.org/fawiki/20200401/