On Mon, Apr 13, 2020 at 3:03 PM Huji Lee <huji.huji@gmail.com> wrote:
On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd808@wikimedia.org> wrote:
On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.huji@gmail.com> wrote:
One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
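In rough terms, the monthly job could look something like the sketch below (the dump tables, database name, and example index here are placeholders I made up, not the actual reports):

#!/usr/bin/env python3
# Rough sketch of the proposed monthly job, assuming the per-table SQL
# dumps published at dumps.wikimedia.org. The table list, database name,
# and index below are placeholders, not the real report definitions.
import subprocess

DB = "fawiki_reports"               # hypothetical local database name
TABLES = ["page", "categorylinks"]  # whichever tables the reports need

for table in TABLES:
    url = ("https://dumps.wikimedia.org/fawiki/latest/"
           f"fawiki-latest-{table}.sql.gz")
    # Download one table's SQL dump and load it into the local MariaDB
    # (assumes credentials are picked up from ~/.my.cnf).
    subprocess.run(f"curl -sL {url} | zcat | mysql {DB}",
                   shell=True, check=True)

# Create the report-specific multi-column indexes (example index only).
subprocess.run(
    ["mysql", DB, "-e",
     "CREATE INDEX page_ns_len ON page (page_namespace, page_len);"],
    check=True)

# Generate the reports ("reports.sql" is a hypothetical file holding the
# report queries); a separate job would purge the data a few days later.
subprocess.run(["mysql", DB, "-e", "source reports.sql"], check=True)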
If I am understanding your proposal here, I think the main difference from the current Wiki Replicas would be "create some additional indexes that would make our desired queries run faster". We do have some indexes and views in the Wiki Replicas which are specifically designed to make common things faster today. If possible, adding to these rather than building a one-off process of moving lots of data around for your tool would be nice.
I say this not because what you are proposing is a ridiculous solution, but because it is a unique solution for your current problem that will not help others who are having similar problems. Having 1 tool use ToolsDB or a custom Cloud VPS project like this is possible, but having 100 tools try to follow that pattern themselves is not.
I understand. However, I think that the use case we are looking at is relatively unique. I also think that the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries), and I honestly don't want to go through the several weeks (months?) of discussion to justify them.
How about hours or minutes?
Note that if we open the can of "more indexes on Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion. I'm not sure there are more than a handful of wikis that do the level of page-level and user-level analytics that fawiki does, which means for most wikis (and for most Wiki Replica databases) those additional indexes may not even be justified.
It is less of an "all-wiki" discussion and more of a discussion with the Cloud Services folks (/me waves) and the DBA team managing the Wiki Replicas. We do not have a well documented procedure for these discussions, but a Phabricator task tagged with #data-services would be a good place to start.
Ideally you would also show us the SQL in question there; it may turn out that there are already better ways to get the data you are looking for. The actor and comment table refactors make many queries that have not been adjusted to the new normal much, much slower than before. There is unfortunately no easy magic we can add on the backend to fix that. The shape of the data is different now, and searching it has to adjust to that new shape.
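As a generic illustration (these are not your report queries, and depending on the wiki and migration stage the join may still go through the revision_actor_temp compatibility table), the shape of that adjustment is roughly:

# Generic illustration of the actor/comment refactor; not the fawiki report
# queries, just the shape of the change.

# Old shape: the user name and the edit summary lived directly on revision.
OLD_QUERY = """
SELECT rev_id, rev_timestamp, rev_comment
FROM revision
WHERE rev_user_text = 'Example user';
"""

# New shape: the same data now needs joins through the actor and comment
# tables, which changes which indexes the query can use.
NEW_QUERY = """
SELECT rev_id, rev_timestamp, comment_text
FROM revision
JOIN actor ON rev_actor = actor_id
JOIN comment ON rev_comment_id = comment_id
WHERE actor_name = 'Example user';
"""

print(OLD_QUERY)
print(NEW_QUERY)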
With the amount of information you have provided up to now, I honestly can't say one way or the other if the indexes you are thinking of will be acceptable or not. I can say with a reasonable amount of confidence that if your other alternative is requesting a Cloud VPS project, installing and managing MySQL/MariaDB yourself, creating a repeatable and reliable way to populate your database from the SQL dumps, and then running your scripts, that is going to be a lot more work than explaining your difficult/slow use case at the SQL level and the desired outcome level.
In the end you may still end up needing to go back and take the longer, harder route, but you will know why. More importantly, the Cloud Services and DBA folks will know why too, and we can use that knowledge to try to find better general solutions.
Even if we were to generalize parts of this approach and bring it to the Wiki Replicas, I would still argue that doing it on a smaller scale (one wiki DB for now) would be a reasonable starting point, no?
Maybe? Again, I believe you are arguing to create and maintain an entirely parallel system rather than trying to improve the current one. For the Wiki Replicas, the only material difference between enwiki, fawiki, and nvwiki is the number of rows in the tables. The same tables exist for each, and things are managed across all of them in the same manner. We add custom indexes with a Python script that reads from a YAML config file [0].
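The pattern is roughly the sketch below; the field names and the example index are made up for illustration, and the real config format in the puppet repo differs:

#!/usr/bin/env python3
# Rough sketch of the "extra indexes from a YAML config" pattern. The field
# names and the example index are invented; see the puppet repo [0] for the
# real thing.
import yaml  # PyYAML

EXAMPLE_CONFIG = """
fawiki:
  - table: logging
    name: log_type_actor_time
    columns: [log_type, log_actor, log_timestamp]
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for db, indexes in config.items():
    for idx in indexes:
        cols = ", ".join(idx["columns"])
        # The real script applies these on each replica host; here we just
        # print the DDL that would run.
        print(f"CREATE INDEX {idx['name']} ON {db}.{idx['table']} ({cols});")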
Choose your own path from here. You can request some help making your SQL run faster on the Wiki Replicas by filing a Phabricator ticket tagged with #data-services, and that discussion might find some index additions to be useful for you and others. You can request a Cloud VPS project by following the process documented in Phabricator [1]. Or you can find a 3rd option on your own. I'm a part of the group that reviews and approves requests for new Cloud VPS projects, so you may have to deal with me asking "why" again there if you go that direction. :)
[0]: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/productio...
[1]: https://phabricator.wikimedia.org/project/view/2875/
Bryan