Hi all,
We have a set of database reports (on users, articles, etc.) that we used to generate on a weekly basis.[1] Ever since the introduction of the *actor* table,[2] many of the reports that have to do with users have become so slow that the SQL query cannot finish within a reasonable time and is killed. Some other reports have also become slower over time; all of these are shown in red in [1].
One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
We can use the current-version-only DB dumps for this purpose. I am guessing that this process would take several hours to run (somewhere between 2 and 10) and would require about 2 GB of storage just to download and decompress the dump file, and some additional space on the DB side (for data, indexes, etc.)
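To make the "additional indexes" part concrete, here is a rough sketch of the kind of post-import SQL I have in mind; the database name, index names, and column choices below are only hypothetical examples, and the download, decompression, and import of the dump would happen in a wrapper script outside of SQL:

    -- Rough sketch only: assumes a wrapper script has already downloaded,
    -- decompressed, and imported the dump into a scratch database.
    USE fawiki_reports;  -- hypothetical scratch database name

    -- Example multi-column index geared towards "edits per user over a
    -- time window" style reports:
    CREATE INDEX sketch_rev_actor_ts ON revision (rev_actor, rev_timestamp);

    -- Example index geared towards per-page history reports:
    CREATE INDEX sketch_rev_page_ts ON revision (rev_page, rev_timestamp);

    -- A few days later, a second script would purge everything:
    -- DROP DATABASE fawiki_reports;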
Out of an abundance of caution, I thought I should ask for permission now, rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?
Please advise! Huji
[1] https://fa.wikipedia.org/wiki/%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%... [2] https://phabricator.wikimedia.org/T223406 [3] https://dumps.wikimedia.org/fawiki/20200401/
On Sun, Apr 12, 2020 at 7:48 AM Huji Lee huji.huji@gmail.com wrote:
One possible solution is to create a script which is scheduled to run once a month; the script would download the latest dump of the wiki database,[3] load it into MySQL/MariaDB, create some additional indexes that would make our desired queries run faster, and generate the reports using this database. A separate script can then purge the data a few days later.
If I am understanding your proposal here, I think the main difference from the current Wiki Replicas would be "create some additional indexes that would make our desired queries run faster". We do have some indexes and views in the Wiki Replicas which are specifically designed to make common things faster today. If possible, adding to these rather than building a one-off process of moving lots of data around for your tool would be nice.
I say this not because what you are proposing is a ridiculous solution, but because it is a unique solution for your current problem that will not help others who are having similar problems. Having 1 tool use ToolsDB or a custom Cloud VPS project like this is possible, but having 100 tools try to follow that pattern themselves is not.
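For illustration only, a hedged sketch (the user name and time window are made up, and this is not one of the actual fawiki reports): a per-user count that is slow against the plain revision view can often complete through the replica-maintained revision_userindex view, which is set up for filtering by actor.

    -- Hedged example: counting one user's 2020 edits via the user-indexed
    -- replica view instead of the plain revision view.
    SELECT COUNT(*) AS edit_count
    FROM revision_userindex
    JOIN actor ON actor_id = rev_actor
    WHERE actor_name = 'ExampleUser'
      AND rev_timestamp >= '20200101000000';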
Out of an abundance of caution, I thought I should ask for permission now, rather than forgiveness later. Do we have a process for getting approval for projects that require gigabytes of storage and hours of computation, or is what I proposed not even remotely considered a "large" project, meaning I am being overly cautious?
https://phabricator.wikimedia.org/project/view/2875/
Bryan
I understand. However, I think that the use case we are looking at is relatively unique. I also think that the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries), and I honestly don't want to go through the several weeks (months?) of discussion to justify them.
Note that if we open the can of "more indexes on Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion. I'm not sure there are more than a handful of wikis that do the level of page-level and user-level analytics that fawiki does, which means that for most wikis (and for most Wiki Replica databases) those additional indexes may not even be justified.
Even if we were to generalize parts of this approach and bring it to Wiki Replicas, I would still argue that doing it on a smaller scale (one wiki DB for now) would be a reasonable starting point, no?
Is the source code public? Maybe the queries could be improved. I ran into many such issues too after the actor migration, but after taking advantage of specialized views[0] and join decomposition (get just the actor IDs, i.e. rev_actor, then fetch the actor names in a separate query), my tools are seemingly as fast as they were before.
~ MA
[0] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_R...
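To make the join-decomposition idea concrete, a hedged sketch (the report itself is made up): aggregate over rev_actor IDs first, then resolve only the surviving IDs to names in a second, much cheaper query.

    -- Query 1: work only with actor IDs; no join against actor or comment.
    SELECT rev_actor, COUNT(*) AS edits
    FROM revision_userindex
    WHERE rev_timestamp >= '20200301000000'
    GROUP BY rev_actor
    ORDER BY edits DESC
    LIMIT 50;

    -- Query 2: resolve just those 50 actor IDs to names.
    SELECT actor_id, actor_name
    FROM actor
    WHERE actor_id IN (12345, 67890);  -- placeholder IDs taken from query 1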
Yes. If you go to the source of all those pages, there is a hidden HTML comment (the <!-- --> kind) that contains the SQL code for that report.
Here is one example: [1]
[1] https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C...
I completely appreciate the points you are making, Bryan and Jaime. And I would very much enjoy "dealing with you" if we end up going the "Cloud VPS project" route! If anything, I keep learning new things from you all.
Let's start where you suggested. I will create Phab tickets on which I will seek advice about how to optimize those super-slow queries (if at all possible).
Thank you for your attention!
With several weeks of delay, I finally had a chance to submit three Phab tasks around this topic. I have at least 10 more queries that I could add to the list, but I don't want to overwhelm people, so I will wait for those three to be resolved first. Thanks! Huji
On Mon, Apr 13, 2020 at 3:03 PM Huji Lee huji.huji@gmail.com wrote:
I understand. However, I think that the use case we are looking at is relatively unique. I also think that the indexes we need may not be desirable for all the Wiki Replicas (they would often be multi-column indexes geared towards a specific set of queries), and I honestly don't want to go through the several weeks (months?) of discussion to justify them.
How about hours or minutes?
Note that if we open the can of "more indexes on Wiki Replicas" worms, this would all of a sudden become an all-wiki discussion. I'm not sure there are more than a handful of wikis that do the level of page-level and user-level analytics that fawiki does, which means that for most wikis (and for most Wiki Replica databases) those additional indexes may not even be justified.
It is less of an "all-wiki" discussion and more of a discussion with the Cloud Services folks (/me waves) and the DBA team managing the Wiki Replicas. We do not have a well documented procedure for these discussions, but a Phabricator task tagged with #data-services would be a good place to start.
Ideally, there you would also show us the SQL in question; it may turn out that there are already better ways to get the data you are looking for. The actor and comment table refactors make many queries that have not been adjusted to the new normal much, much slower than before. There is unfortunately no easy magic we can add on the backend to fix that. The shape of the data is different now, and searching it has to adjust to that new shape.
With the amount of information you have provided up to now, I honestly can't say one way or the other whether the indexes you are thinking of would be acceptable. I can say with a reasonable amount of confidence that if your alternative is requesting a Cloud VPS project, installing and managing MySQL/MariaDB yourself, creating a repeatable and reliable way to populate your database from the SQL dumps, and then running your scripts, that is going to be a lot more work than explaining your difficult/slow use case at the SQL level and at the desired-outcome level.
In the end you may end up needing to go back and actually take the longer, harder route, but you will know why. More importantly the Cloud Services and DBA folks will know why too and we can use that knowledge to try and find better general solutions.
Even if we were to generalize parts of this approach and bring it to Wiki Replicas, I would still argue that doing it on a smaller scale (one wiki DB for now) would be a reasonable starting point, no?
Maybe? Again, I believe you are arguing for creating and maintaining an entirely parallel system rather than trying to improve the current one. For the Wiki Replicas, the only material difference between enwiki, fawiki, and nvwiki is the number of rows in the tables. The same tables exist for each, and things are managed across all of them in the same manner. We add custom indexes with a Python script that reads from a YAML config file [0].
Choose your own path from here. You can request some help making your SQL run faster on the Wiki Replicas by filing a Phabricator ticket tagged with #data-services, and that discussion might find some index additions to be useful for you and others. You can request a Cloud VPS project by following the process documented in Phabricator [1]. Or you can find a 3rd option on your own. I'm a part of the group that reviews and approves requests for new Cloud VPS projects, so you may have to deal with me asking "why" again there if you go that direction. :)
[0]: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/productio... [1]: https://phabricator.wikimedia.org/project/view/2875/
Bryan
Actually, as an idea, I don't think it is a bad one. In fact, there is already a ticket for it:
https://phabricator.wikimedia.org/T59617
That task is basically about cases where a summary of the data can be shared, but not the individual rows; the summary would be created and written into tables as a sort of "materialized view", because otherwise querying the data directly would be too slow.
I wouldn't go down the route of importing data from dumps, but I would consider creating this sort of "materialized view" based on requested user queries. However, there are many reasons why this hasn't been done before:
* This is not a small project. It is not just a cron job and a script: it requires following production security patterns and best practices, creating infrastructure code to manage everything, as well as modifying our public data checks to make it work. I would estimate this is at least six months of work for a full-time position.
* The technical solution is not easy. While single-row edits (e.g. updates coming from replication) are a non-issue, imports into tables with long-running SELECTs would make it impossible to "load" a table (even if it was loaded under another name and then renamed; see the sketch after this list). We would run into metadata locking, which means we would have to kill connections (which ones? all wikireplica connections?) every time a table is loaded. In any case, for some time every day or every week, those tables would be unavailable to query.
* The tables would not have real-time data. Most complaints we get about the wikireplica service happen when it starts lagging due to maintenance; we have had complaints after only 5 minutes of lag. There is an expectation of real-timeness that these tables would violate, probably causing complaints no matter how well documented or announced that was. I would prefer these to be on a separate service, like the tools user DBs, to prevent that.
* Extra write load: the more write load we have, the less well replication works. There are only so many writes a wikireplica host can handle before it starts lagging behind, resulting either in slow imports or in lag on the wikireplicas, both of which also cause annoyance for users.
* Extra disk space: the new tables would either be large enough to genuinely need to be stored as tables, or too small to justify their existence. Who would decide what gets included and what does not (given that everybody should get the same chance to have their request done), with only a finite amount of disk space available?
* This is not a priority project: while we have seen a lot of suggestions on how to improve the wikireplica service, we have only seen a couple of people asking for this possibility, so those other projects got priority.
* This would need continuous maintenance. Aside from the initial, non-trivial work, the project would need continuous updates: every time the OS is updated, every time the MariaDB server version slightly changes the syntax, every time the underlying data structures get an ALTER, and whenever security and data issues need a prompt response.
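To illustrate the load-under-another-name-and-rename pattern mentioned in the second point, a minimal sketch, assuming a hypothetical summary table (report_active_users) and columns:

    -- Build the refreshed summary under a temporary name.
    CREATE TABLE report_active_users_new LIKE report_active_users;

    INSERT INTO report_active_users_new (actor_name, edit_count)
    SELECT actor_name, COUNT(*) AS edit_count
    FROM revision_userindex
    JOIN actor ON actor_id = rev_actor
    WHERE rev_timestamp >= '20200301000000'
    GROUP BY actor_name;

    -- Atomic swap: readers see either the old or the new table, but the
    -- RENAME waits on metadata locks behind any long-running SELECT that
    -- still holds the old table; that is exactly the locking issue above.
    RENAME TABLE report_active_users TO report_active_users_old,
                 report_active_users_new TO report_active_users;
    DROP TABLE report_active_users_old;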
Having said that, you seem to be volunteering for this work, which I would be happy to support. Please send puppet patches for T59617 and we can talk about setting up the foundations of an infrastructure for user-defined materialized views/reports on the wikireplicas (which must live in production, not on Cloud, as it would have access to production data). I would prefer, however, for this to be a separate service, where everybody understands they are not working with real-time data. In my mind, the analytics service would be based on unchanged weekly imports, and the web one would be more real-time-y.
I would, however, strongly suggest you share your queries and try to exhaust the "extra indexes" path Bryan and MA suggest, rather than trying to maintain a parallel setup, which would carry so much overhead that we DBAs haven't even been able to get started on it.