Is the source code public? Maybe the queries could be improved. I ran into many such issues too after the actor migration, but after taking advantage of specialized views[0] and join decomposition (get just the actor IDs, i.e. rev_actor, then the actor_names in a separate query), my tools are seemingly as fast as they were before.
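
A minimal sketch of the join decomposition described above, for illustration only. It assumes stand-in `revision` and `actor` tables in an in-memory SQLite database so that it runs anywhere; the function names `top_editors` and `demo_connection` and the sample rows are hypothetical. On the Wiki Replicas you would point the same two queries at the appropriate views from [0] and use your MariaDB driver's placeholder style instead of `?`.

```python
import sqlite3


def top_editors(conn, page_id, limit=10):
    """Count revisions per actor for one page, then resolve names separately."""
    cur = conn.cursor()

    # Step 1: query only the revision table and keep just the actor IDs.
    cur.execute(
        "SELECT rev_actor, COUNT(*) AS edits FROM revision "
        "WHERE rev_page = ? GROUP BY rev_actor "
        "ORDER BY edits DESC, rev_actor LIMIT ?",
        (page_id, limit),
    )
    counts = cur.fetchall()
    if not counts:
        return []

    # Step 2: look up names for that small set of IDs in a second query.
    ids = [actor_id for actor_id, _ in counts]
    placeholders = ",".join("?" * len(ids))
    cur.execute(
        f"SELECT actor_id, actor_name FROM actor WHERE actor_id IN ({placeholders})",
        ids,
    )
    names = dict(cur.fetchall())

    # Step 3: stitch the two result sets together in application code.
    return [(names.get(actor_id), edits) for actor_id, edits in counts]


def demo_connection():
    """Build a tiny in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, actor_name TEXT);
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY, rev_page INTEGER, rev_actor INTEGER
        );
        INSERT INTO actor VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO revision VALUES
            (10, 100, 1), (11, 100, 1), (12, 100, 2), (13, 200, 2);
        """
    )
    return conn


if __name__ == "__main__":
    print(top_editors(demo_connection(), page_id=100))
```

The point of the split is that the second query only touches the handful of actor rows that actually appear in the first result, rather than joining the two tables across every revision.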

~ MA

[0] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#Advanced_use_cases_of_specialized_views

On Mon, Apr 13, 2020 at 5:03 PM Huji Lee <huji.huji@gmail.com> wrote:
I understand. However, I think that the use case we are looking at is relatively unique. I also think that the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries), and I honestly don't want to go through several weeks (months?) of discussion to justify them.

Note that if we open the can of "more indexes on Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion. I'm not sure there are more than a handful of wikis that do the level of page-level and user-level analytics that fawiki does, which means that for most wikis (and for most Wiki Replica databases) those additional indexes may not even be justified.

Even if we were to generalize parts of this approach and bring it to Wiki Replicas, I would still argue that doing it on a smaller scale (one wiki DB for now) would be a reasonable starting point, no?

On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <bd808@wikimedia.org> wrote:
On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <huji.huji@gmail.com> wrote:
>
> One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.

If I am understanding your proposal here, I think the main difference
from the current Wiki Replicas would be "create some additional
indexes that would make our desired queries run faster". We do have
some indexes and views in the Wiki Replicas which are specifically
designed to make common things faster today. If possible, adding to
these rather than building a one-off process of moving lots of data
around for your tool would be nice.

I say this not because what you are proposing is a ridiculous
solution, but because it is a unique solution for your current problem
that will not help others who are having similar problems. Having 1
tool use ToolsDB or a custom Cloud VPS project like this is possible,
but having 100 tools try to follow that pattern themselves is not.

> Out of an abundance of caution, I thought I should ask for permission now, rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?

<https://phabricator.wikimedia.org/project/view/2875/>

Bryan
--
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808

_______________________________________________
Wikimedia Cloud Services mailing list
Cloud@lists.wikimedia.org (formerly labs-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/cloud