I wrote this query to find all page moves performed on fawiki in 2017, and to determine how many edits the performing user had made prior to each page move. The query tries to use indexes wherever I could think to, and yet it runs for a very long time (more than 20 minutes, at which point it gets killed).
Any ideas on how to further optimize this query are appreciated!
Thanks, Huji
use fawiki_p;

select
  log_id, log_timestamp, log_user, log_user_text, log_title, log_comment,
  log_page, page_namespace,
  case when ug_group = 'bot' then 1 else 0 end as user_is_bot,
  (
    select count(*)
    from revision_userindex
    where rev_user = log_user and rev_timestamp < log_timestamp
  ) as rev_count_before_move
from logging
join page on page_id = log_page
left join user_groups on log_user = ug_user and ug_group = 'bot'
where log_action = 'move' and log_timestamp > '20170101000000'
Use the logging_userindex table instead of logging
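A minimal sketch of that change, applied to the query above (only the table name in the FROM clause changes; everything else stays the same):

from logging_userindex  -- instead of: from logging
join page on page_id = log_page
left join user_groups on log_user = ug_user and ug_group = 'bot'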
Replace count(*) with count(1) in the subquery. It might (?) help improve performance.
Dennis Tobar Calderón (Sent from a mobile phone)
On Dec 30, 2017, 3:10 PM, "John" phoenixoverride@gmail.com wrote:
I would also find the first log of 2017 and use that instead of the timestamp
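A minimal sketch of how that first log entry of 2017 could be looked up (this query is not from the thread; its result would then be used in place of the timestamp comparison):

select min(log_id)
from logging
where log_timestamp >= '20170101000000';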
All fantastic suggestions. The EXPLAIN results showed that changing to logging_userindex potentially had the largest impact (reducing the estimated rows to be retrieved from 10 million to 5 million). I am rerunning the query with all of the suggested changes and will report back.
On Sat, Dec 30, 2017 at 1:07 PM, John phoenixoverride@gmail.com wrote:
Use the logging_userindex table instead of logging
That won't make much difference, since the select on the logging table isn't targeting any user columns.
On Sat, Dec 30, 2017 at 1:09 PM, John phoenixoverride@gmail.com wrote:
I would also find the first log of 2017 and use that instead of the timestamp
That would make it worse, since there's no index on (log_type, log_id). It'll either have to use the primary key and filter out all rows with a different log_type, or use one of the indexes that begins with log_type and filter out all the rows with an earlier log_id.
On Sat, Dec 30, 2017 at 1:32 PM, Dennis Tobar dennis.tobar@gmail.com wrote:
Replace count(*) with count(1) in the subquery. It could help (?) to improve the performance.
"count(*)" and "count(1)" should be treated equivalently. The "*" in "count(*)" does not cause the database to fetch all fields.
If anything, "count(*)" might be ever so slightly faster since it's literally saying "count the number of rows" rather than "count the number of rows where the constant 1 is not null". But the DB probably optimizes counting of a constant to make them identical.
Isn’t there a logging_logindex table to use that should optimize this?
Cyberpower678 English Wikipedia Account Creation Team English Wikipedia Administrator Global User Renamer
There is a logging_logindex table. I incorporated all suggested changes into https://quarry.wmflabs.org/query/23829
Since I still think the query will take more than 20 minutes, I think Quarry is not the way to go. I saved the query on the grid and tried to run it using this command:
jsub -once -N "custom_sql" -mem 1g -o /data/project/huji/err/custom_sql.out -e /data/project/huji/err/custom_sql.err sql fawiki_p < /data/project/huji/query.sql
Even though I get a response saying my job was submitted when I run this, running qstat immediately afterwards shows no results. Is that okay, or am I running the job incorrectly? (Note that both the .out and the .err files are empty.)
On Sat, Dec 30, 2017 at 2:10 PM, Maximilian Doerr <maximilian.doerr@gmail.com> wrote:
Isn’t there a logging_logindex table to use that should optimize this?
It looks like logging_logindex is useful for queries that are targeting indexes page_time, log_page_id_time, and type_action.
It looks like logging_userindex is useful for queries that are targeting indexes user_time, log_user_type_time, log_user_text_type_time, and log_user_text_time.
Any of the three views should work for the primary key and the type_time and times indexes.
The query took more than an hour, so I killed it.
I think a better strategy is to first run the query on logging to get a list of log entries, and then run the subquery that retrieves the edit counts.
My new query is shown below; however, I cannot create temporary tables with my user on Toolforge. I tried creating a database on the "tools" server, but then I cannot access it from the server that hosts fawiki_p.
What is the proper strategy for temporary (physical or in-memory) tables on Toolforge? Huji
create temporary table logs
select
  log_id, log_timestamp, log_user, log_user_text, log_title, log_comment,
  log_page, page_namespace,
  case when ug_group = 'bot' then 1 else 0 end as user_is_bot
from logging_userindex
join page on page_id = log_page
left join user_groups on log_user = ug_user and ug_group = 'bot'
where log_type = 'move' and log_id > 9406768;

select
  logs.*,
  (
    select count(1)
    from revision_userindex
    where rev_user = log_user and rev_timestamp < log_timestamp
  ) as rev_count_before_move
from logs;
On Sat, Dec 30, 2017 at 10:55 PM, Huji Lee huji.huji@gmail.com wrote:
and log_id > 9406768;
As I said earlier, it was bad advice to use log_id rather than log_timestamp for this query.
Sorry Brad, I had missed your email.
Putting log_timestamp back in and restricting the query to a shorter time period made it work under 20 minutes: https://quarry.wmflabs.org/query/23829
Thank you all for your assistance!
On Sat, Dec 30, 2017 at 8:55 PM, Huji Lee huji.huji@gmail.com wrote:
What is the proper strategy for temporary (physical or in-memory) tables on Toolforge?
Allowing temp tables on the new Wiki Replica servers was investigated in https://phabricator.wikimedia.org/T179628. The resulting decision was that `CREATE TEMPORARY TABLE...` can cause too many issues with replication to be allowed at this time.
For this particular query, I think you could simulate having a temporary table by "joining" in your application by selecting counts of revision_userindex table rows using batches of (log_user, log_timestamp) values. It would depend on the size of the inner query dataset whether you are better off trying to cache all of the results in your application memory or instead using two database connections and only reading in a "reasonable" number of results at a time (e.g. 1000).
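A minimal sketch of what one such batch could look like in SQL, assuming the application builds the list of (log_user, log_timestamp) pairs from the results of the main logging query (the literal values below are placeholders, not data from the thread):

select b.log_user, b.log_timestamp, count(1) as rev_count_before_move
from (
  select 12345 as log_user, '20170102030405' as log_timestamp
  union all select 67890, '20170203040506'
  -- ... up to a "reasonable" number of pairs per batch
) as b
join revision_userindex
  on rev_user = b.log_user and rev_timestamp < b.log_timestamp
group by b.log_user, b.log_timestamp;

Pairs with no earlier revisions will be absent from the inner join's result, so the application would treat a missing pair as a count of 0.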
Bryan
On Sat, Dec 30, 2017 at 1:00 PM, Huji Lee huji.huji@gmail.com wrote:
The query tries to use indexes, as much as I could think of
You missed using effective indexes.
Let's start by getting the EXPLAIN output for your query. https://tools.wmflabs.org/tools-info/optimizer.py? doesn't seem to like the subquery for some reason, but doing it manually by running the query on one connection and doing SHOW EXPLAIN FROM on another tells us
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+----------+------------------------------------+
| id   | select_type        | table       | type   | possible_keys                | key            | key_len | ref        | rows     | Extra                              |
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+----------+------------------------------------+
|    1 | PRIMARY            | logging     | ALL    | type_time,times,type_action  | NULL           | NULL    | NULL       | 10809607 | Using where                        |
|    1 | PRIMARY            | page        | eq_ref | PRIMARY                      | PRIMARY        | 4       | func       |        1 | Using where                        |
|    1 | PRIMARY            | user_groups | eq_ref | PRIMARY,ug_group             | PRIMARY        | 261     | func,const |        1 | Using where; Using index           |
|    2 | DEPENDENT SUBQUERY | revision    | ref    | rev_timestamp,user_timestamp | user_timestamp | 4       | func       |       46 | Using index condition; Using where |
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+----------+------------------------------------+
That "ALL" on the first line isn't very encouraging, nor is the estimate that it's going to have to touch 10 million rows.
Your WHERE clause is selecting on log_action and log_timestamp, for which there isn't an index in https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance.... You probably meant to use log_type rather than log_action. Making that change gives a much nicer looking plan:
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+-------+------------------------------------+
| id   | select_type        | table       | type   | possible_keys                | key            | key_len | ref        | rows  | Extra                              |
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+-------+------------------------------------+
|    1 | PRIMARY            | logging     | range  | type_time,times,type_action  | type_time      | 50      | NULL       | 76090 | Using index condition              |
|    1 | PRIMARY            | user_groups | eq_ref | PRIMARY,ug_group             | PRIMARY        | 261     | func,const |     1 | Using where; Using index           |
|    1 | PRIMARY            | page        | eq_ref | PRIMARY                      | PRIMARY        | 4       | func       |     1 | Using where                        |
|    2 | DEPENDENT SUBQUERY | revision    | ref    | rev_timestamp,user_timestamp | user_timestamp | 4       | func       |    46 | Using index condition; Using where |
+------+--------------------+-------------+--------+------------------------------+----------------+---------+------------+-------+------------------------------------+
It's still going to be slow, though, just because it still has to touch and return tens of thousands of rows. And that slow subquery isn't going to help matters either, since it's having to do it individually for each row in the result set. You might wind up having to do the main query alone and then do the subquery in batches to merge in your client code.
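For illustration, a sketch of the "main query alone" with the log_type fix applied and the per-row subquery dropped (this is just the original query rearranged, not a tested rewrite):

select
  log_id, log_timestamp, log_user, log_user_text, log_title, log_comment,
  log_page, page_namespace,
  case when ug_group = 'bot' then 1 else 0 end as user_is_bot
from logging
join page on page_id = log_page
left join user_groups on log_user = ug_user and ug_group = 'bot'
where log_type = 'move' and log_timestamp > '20170101000000';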