I was investigating a better version of Special:Lonelypages (ignore disambig pages, exclude the already tagged, identify 1- and 2- link orphans as well as no-link orphans, etc.) but when I run the original Lonelypages SQL query on nightshade against enwiki_p it times out after 10 minutes.
SELECT page_namespace, page_title FROM page LEFT JOIN pagelinks ON page_namespace=pl_namespace AND page_title=pl_title WHERE pl_namespace IS NULL AND page_namespace=".NS_MAIN." AND page_is_redirect=0;
I know pagelinks is HUGE (like 265,000,000+ entries), but I thought it would work since it's what we use for Lonelypages. I'm looking at LIMIT queries, etc., but before I go further, is this just too much to ask of nightshade?
- Jason
Ja Ga wrote:
I was investigating a better version of Special:Lonelypages (ignore disambig pages, exclude the already tagged, identify 1- and 2- link orphans as well as no-link orphans, etc.) but when I run the original Lonelypages SQL query on nightshade against enwiki_p it times out after 10 minutes.
SELECT page_namespace, page_title FROM page LEFT JOIN pagelinks ON page_namespace=pl_namespace AND page_title=pl_title WHERE pl_namespace IS NULL AND page_namespace=".NS_MAIN." AND page_is_redirect=0;
I think the problem is simply that, no matter how you optimize it, this is a slow query: it has to go through every page in the main namespace and check if it has incoming links or not. Adding a limit might help, but not much: unlinked pages are fairly rare, so it still has to search through a _lot_ of pages just to find a few dozen unlinked ones.
Anyway, I just tried it with "LIMIT 50", and it finished in two minutes. So it certainly works, it just takes time. I'd suggest just running it with /* SLOW_OK */ and session transaction isolation level set to "READ UNCOMMITTED".
I've implemented a bot, which for Ruwiki solves the problem of lonelypages and even isolated articles. The query you've just cited (an analogue to be honest) is the key point there as well and anyway there is a possibility to make it faster than one you cited on the TS. The problem here is that we deal with views, not the db itself, so this blocks further optimization somehow.
If you are interested in the implementation, refer to https://fisheye.toolserver.org/browse/golem/isolated/namespacer.sql, especially the place where pl table is created and where p0 table is created (cache_namespace_pages function).
Mashiah
2008/11/26 Ilmari Karonen nospam@vyznev.net
Ja Ga wrote:
I was investigating a better version of Special:Lonelypages (ignore disambig pages, exclude the already tagged, identify 1- and 2- link orphans as well as no-link orphans, etc.) but when I run the original Lonelypages SQL query on nightshade against enwiki_p it times out after 10 minutes.
SELECT page_namespace, page_title FROM page LEFT JOIN pagelinks ON page_namespace=pl_namespace AND page_title=pl_title WHERE pl_namespace IS NULL AND page_namespace=".NS_MAIN." AND page_is_redirect=0;
I think the problem is simply that, no matter how you optimize it, this is a slow query: it has to go through every page in the main namespace and check if it has incoming links or not. Adding a limit might help, but not much: unlinked pages are fairly rare, so it still has to search through a _lot_ of pages just to find a few dozen unlinked ones.
Anyway, I just tried it with "LIMIT 50", and it finished in two minutes. So it certainly works, it just takes time. I'd suggest just running it with /* SLOW_OK */ and session transaction isolation level set to "READ UNCOMMITTED".
-- Ilmari Karonen
Toolserver-l mailing list Toolserver-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Mashiah Davidson wrote:
I've implemented a bot, which for Ruwiki solves the problem of lonelypages and even isolated articles. The query you've just cited (an analogue to be honest) is the key point there as well and anyway there is a possibility to make it faster than one you cited on the TS. The problem here is that we deal with views, not the db itself, so this blocks further optimization somehow.
I was going to disagree with you, but it does seem there's something weird going on. Compare these queries:
mysql> SELECT page_namespace, page_title FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.63 sec)
mysql> SELECT page_namespace, page_title, page_id FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.90 sec)
mysql> SELECT page_namespace, page_title, page_is_redirect FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (30.56 sec)
mysql> SELECT * FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (41.23 sec)
This isn't just random variation either, but seems completely repeatable: including the page_is_redirect field (or, apparently, any field other than page_namespace, page_title or page_id) in the query, whether in the field list or in the WHERE clause, makes it run much more slowly. WTF?
Ilmari Karonen wrote:
Mashiah Davidson wrote:
I've implemented a bot, which for Ruwiki solves the problem of lonelypages and even isolated articles. The query you've just cited (an analogue to be honest) is the key point there as well and anyway there is a possibility to make it faster than one you cited on the TS. The problem here is that we deal with views, not the db itself, so this blocks further optimization somehow.
I was going to disagree with you, but it does seem there's something weird going on. Compare these queries:
mysql> SELECT page_namespace, page_title FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.63 sec)
mysql> SELECT page_namespace, page_title, page_id FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.90 sec)
mysql> SELECT page_namespace, page_title, page_is_redirect FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (30.56 sec)
mysql> SELECT * FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (41.23 sec)
This isn't just random variation either, but seems completely repeatable: including the page_is_redirect field (or, apparently, any field other than page_namespace, page_title or page_id) in the query, whether in the field list or in the WHERE clause, makes it run much more slowly. WTF?
That behavior is consistent with lack of indices, or the wrong *type* of indices, on the culpable columns forcing a seq scan. I should point out that it's not always a good idea to index less frequently selected columns because it increases the cost of insertions and updates significantly, so the solution isn't necessarily to add some either. Sometimes, if you are going to be doing a lot of selects on those columns, the creation of a temporary table with the needed index is, in fact, the most efficient solution despite the setup cost.
-- Marc
Marc A. Pelletier wrote:
Ilmari Karonen wrote:
mysql> SELECT page_namespace, page_title, page_id FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.90 sec)
mysql> SELECT page_namespace, page_title, page_is_redirect FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (30.56 sec)
That behavior is consistent with lack of indices, or the wrong *type* of indices, on the culpable columns forcing a seq scan.
That may well be what's happening, but if so, it's a bug in MySQL. Note that both of the queries have identical WHERE clauses. Given that said WHERE clauses match no rows, there's really no reason why any other part of the SELECT should make any difference. No rows is no rows.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Ilmari Karonen:
That may well be what's happening, but if so, it's a bug in MySQL. Note that both of the queries have identical WHERE clauses. Given that said WHERE clauses match no rows, there's really no reason why any other part of the SELECT should make any difference. No rows is no rows.
while i spent no time investigating this, a possible explanation is that page_namespace and page_title can be loaded from the index, without touching the data. but once you ask for page_is_redirect, a column not in the index being used, it has to touch the data for each row to retrieve that field.
to put that another way: when your query's results can be retrieved using the data already stored in the index, InnoDB will do that, and not touch the "real" data at all.
it's usually faster, as the index is already loaded to do the query in the first place.
- river.
On Thu, Nov 27, 2008 at 10:22 AM, Ilmari Karonen nospam@vyznev.net wrote:
I was going to disagree with you, but it does seem there's something weird going on. Compare these queries:
mysql> SELECT page_namespace, page_title FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.63 sec)
mysql> SELECT page_namespace, page_title, page_id FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (4.90 sec)
mysql> SELECT page_namespace, page_title, page_is_redirect FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (30.56 sec)
mysql> SELECT * FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0; Empty set (41.23 sec)
This isn't just random variation either, but seems completely repeatable: including the page_is_redirect field (or, apparently, any field other than page_namespace, page_title or page_id) in the query, whether in the field list or in the WHERE clause, makes it run much more slowly. WTF?
The EXPLAINs indicate what the difference is, although why it should make a difference with no rows returned is a separate issue.
mysql> EXPLAIN SELECT page_namespace FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: page type: ref possible_keys: name_title key: name_title key_len: 4 ref: const rows: 7863987 Extra: Using where; Using index 1 row in set (0.01 sec)
mysql> EXPLAIN SELECT page_is_redirect FROM page WHERE page_title LIKE '%fnord%' AND page_namespace=0\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: page type: ref possible_keys: name_title key: name_title key_len: 4 ref: const rows: 7863988 Extra: Using where 1 row in set (0.00 sec)
"Using index" is the critical bit here. The name_title index is on (page_namespace, page_title); it also includes (page_id), the primary key. Therefore a query for any of those three columns can be satisfied by examining the index alone, as long as the name_title index is being used for retrieval. Any other columns require examination of the data.
The question is why this should make any difference when no rows are returned. If you have to do something so close to a table scan, the best execution strategy might be to just read through the data and ignore the index altogether. So if MySQL determines that it can use the index for the whole scan, it might use that; if it thinks it will need to read the table data anyway, it might ignore the index and scan the table itself, on the theory that this is going to be more efficient if a large percentage of rows are returned. The latter strategy would turn out to be ineffective here, because no rows are actually returned and the first strategy would have resulted in scanning less data, but MySQL doesn't know that in advance.
The problem with that theory is that the EXPLAINs indicate identical execution strategies. If I were right, then the second query should be of type ALL, not ref. It could be EXPLAIN is lying, or that the reason for the discrepancy is something different. Running SHOW PROFILE (http://dev.mysql.com/doc/refman/5.0/en/show-profiles.html) on localhost for a similar query on an old simplewiki database, I determine that the significant difference is in the "sending data" phase of the query, at least on my machine with that database (the effect is still visible, they're 0.04s vs. 0.14s or so). That doesn't tell me much, unfortunately.
The fact that the effect occurs on my local database indicates that this particular difference has absolutely nothing to do with views, though.
toolserver-l@lists.wikimedia.org