After a delay of several weeks, I finally had a chance to submit three Phab tasks on this topic. I have at least ten more queries that I could add to the list, but I don't want to overwhelm people, so I will wait for those three to be resolved first. Thanks!
Huji
On Tue, Apr 14, 2020 at 8:53 AM Huji Lee huji.huji@gmail.com wrote:
I completely appreciate the points you are making, Bryan and Jaime. And I would very much enjoy "dealing with you" if we end up going the "Cloud VPS project" route! If anything, I keep learning new things from you all.
Let's start where you suggested. I will create Phab tickets on which I will seek advice about how to optimize those super-slow queries (if at all possible).
Thank you for your attention!
On Tue, Apr 14, 2020 at 8:49 AM Huji Lee huji.huji@gmail.com wrote:
Yes. If you view the source of any of those pages, there is a hidden HTML comment (the <!-- --> kind) that contains the SQL code for that report.
Here is one example: [1]
[1] https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C...
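Roughly, the hidden comment looks like this (the query below is only an illustrative sketch of the pattern, not the actual SQL of that report):

  <!--
  SELECT page_title, COUNT(*) AS edit_count
  FROM revision
  JOIN page ON page_id = rev_page
  GROUP BY page_title
  ORDER BY edit_count DESC
  LIMIT 100;
  -->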
On Mon, Apr 13, 2020 at 7:19 PM MusikAnimal musikanimal@gmail.com wrote:
Is the source code public? Maybe the queries could be improved. I ran into many such issues too after the actor migration, but after taking advantage of specialized views[0] and join decomposition (get just the actor IDs, i.e. rev_actor, then fetch the actor names in a separate query), my tools are seemingly as fast as they were before.
~ MA
[0] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R...
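For a rough sketch of that join decomposition (table and column names follow the MediaWiki schema; the filter values are placeholders):

  -- step 1: fetch the revisions with just the actor IDs
  SELECT rev_id, rev_actor
  FROM revision
  WHERE rev_page = 12345;

  -- step 2: resolve the names in a separate query
  -- (or go through the specialized actor views described in [0])
  SELECT actor_id, actor_name
  FROM actor
  WHERE actor_id IN (1, 2, 3);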
On Mon, Apr 13, 2020 at 5:03 PM Huji Lee huji.huji@gmail.com wrote:
I understand. However, I think the use case we are looking at is relatively unique. I also think the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries), and I honestly don't want to go through the several weeks (or months?) of discussion needed to justify them.
Note that if we open the can of "more indexes on Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion. I'm not sure there are more than a handful of wikis that do the level of page-level and user-level analytics that fawiki does, which means that for most wikis (and for most Wiki Replica databases) those additional indexes may not even be justified.
Even if we were to generalize parts of this approach and bring it to the Wiki Replicas, I would still argue that doing it on a smaller scale (one wiki DB for now) would be a reasonable starting point, no?
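To make that concrete, here is a hypothetical example of the kind of multi-column index I mean (the column choice is a placeholder, not one of the actual indexes our reports would need):

  -- e.g. a report listing a user's edits per page, newest first,
  -- could be served almost entirely by an index like this:
  CREATE INDEX ix_rev_actor_page_time
      ON revision (rev_actor, rev_page, rev_timestamp);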
On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis bd808@wikimedia.org wrote:
On Sun, Apr 12, 2020 at 7:48 AM Huji Lee huji.huji@gmail.com wrote:
One possible solution is to create a script which is scheduled to
run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
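As a rough sketch of that monthly cycle (all names are placeholders, and the dump import itself would happen outside SQL, e.g. via the mysql client):

  -- scratch copy of the wiki database, rebuilt once a month
  CREATE DATABASE IF NOT EXISTS fawiki_scratch;
  -- ... import the dump tables into fawiki_scratch here ...
  -- add the report-specific indexes that the replicas do not carry
  ALTER TABLE fawiki_scratch.revision
      ADD INDEX ix_rev_actor_time (rev_actor, rev_timestamp);
  -- a few days after the reports have run, a second script purges the copy
  DROP DATABASE fawiki_scratch;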
If I am understanding your proposal here, I think the main difference from the current Wiki Replicas would be "create some additional indexes that would make our desired queries run faster". We do have some indexes and views in the Wiki Replicas which are specifically designed to make common things faster today. If possible, adding to these rather than building a one-off process for moving lots of data around for your tool would be nice.
I say this not because what you are proposing is a ridiculous solution, but because it is a unique solution for your current problem that will not help others who are having similar problems. Having 1 tool use ToolsDB or a custom Cloud VPS project like this is possible, but having 100 tools try to follow that pattern themselves is not.
Out of an abundance of caution, I thought I should ask for permission
now, rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?
https://phabricator.wikimedia.org/project/view/2875/
Bryan
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud