In r46845 [1], the issue raised in bug 11430 [2] a year and a half ago was finally addressed: when the API was asked to produce huge amounts of data (for instance the content of 500 revisions at 280 KB each), it would run out of memory trying to store and process it. To prevent this from happening, the amount of data the API can return is now limited. This means that the behavior of requests that used to run out of memory has changed: they will return fewer results than the limit, even though there are more results available (they'll still set query-continue right, though). For instance, the aforementioned request would return about 300 revisions and set a query-continue for the rest.
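For client authors, the practical upshot is that you should always follow query-continue instead of assuming a full batch. A minimal sketch of such a loop (the endpoint URL, module and parameter values below are purely illustrative assumptions, not a prescription):

    <?php
    // Illustrative client loop: keep requesting until the API no longer
    // returns a query-continue element. Adjust the parameters to your query.
    $baseParams = array(
        'action'  => 'query',
        'prop'    => 'revisions',
        'rvprop'  => 'content',
        'titles'  => 'Some page',
        'rvlimit' => 500,
        'format'  => 'json',
    );
    $continue = array();
    do {
        $url = 'http://example.org/w/api.php?' . http_build_query( $baseParams + $continue );
        $data = json_decode( file_get_contents( $url ), true );

        // ... process $data['query'] here ...

        // Carry over whatever continue parameters came back (e.g. rvcontinue);
        // no query-continue element means we have everything.
        $continue = array();
        if ( isset( $data['query-continue'] ) ) {
            foreach ( $data['query-continue'] as $module => $params ) {
                $continue += $params;
            }
        }
    } while ( $continue );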
Roan Kattouw (Catrope)
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/46845 [2] https://bugzilla.wikimedia.org/show_bug.cgi?id=11430
Roan Kattouw schreef:
In r46845 [1], the issue raised in bug 11430 [2] a year and a half ago was finally addressed: when the API was asked to produce huge amounts of data (for instance the content of 500 revisions at 280 KB each), it would run out of memory trying to store and process it. To prevent this from happening, the amount of data the API can return is now limited. This means that the behavior of requests that used to run out of memory has changed: they will return fewer results than the limit, even though there are more results available (they'll still set query-continue right, though). For instance, the aforementioned request would return about 300 revisions and set a query-continue for the rest.
Something I forgot to mention here is that continue parameters have been added to most modules that didn't have one: iicontinue for prop=imageinfo, rvcontinue for prop=revisions, cicontinue for prop=categoryinfo, incontinue for prop=info and amfrom for meta=allmessages; meta=siteinfo abuses siprop as query-continue parameter.
Roan Kattouw (Catrope)
On Fri, Feb 06, 2009 at 01:08:02PM +0100, Roan Kattouw wrote:
Roan Kattouw schreef:
In r46845 [1], the issue raised in bug 11430 [2] a year and a half ago was finally addressed: when the API was asked to produce huge amounts of data (for instance the content of 500 revisions at 280 KB each), it would run out of memory trying to store and process it. To prevent this from happening, the amount of data the API can return is now limited.
I see a possible bug: if the size of the first result is over the size limit, you'll never be able to get the result at all. For example, create a page larger than $wgAPIMaxResultSize and try to fetch the revision contents.
Yes, the 8 MiB default is unlikely to be hit by a single revision, but what if someone lowers it to 250 KiB and tries to fetch those 280 KB revisions? Or what if someone uploads an image that has a ridiculous amount of metadata?
Something I forgot to mention here is that continue parameters have been added to most modules that didn't have one: iicontinue for prop=imageinfo, rvcontinue for prop=revisions, cicontinue for prop=categoryinfo, incontinue for prop=info and amfrom for meta=allmessages; meta=siteinfo abuses siprop as query-continue parameter.
Looking at r46923, I see a number of bugs, including one or two existing bugs that turned up when I was looking at all these continue parameters.
rvcontinue doesn't work, it seems that $continue is never extracted from $params.
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison. For example, I tried ammessages=aboutsite|aboutpage with a ridiculously low $wgAPIMaxResultSize on a local installation, and it returned aboutsite with amfrom=aboutpage. When I then did ammessages=aboutsite|aboutpage&amfrom=aboutpage, it returned aboutsite and amfrom=aboutpage again.
alfrom should probably be used instead of alcontinue if alunique is specified.
drcontinue must be used instead of drstart in "all" mode.
prop=duplicatefiles may have an issue similar to the allmessages problem mentioned above: it looks like the results aren't sorted by i2.img_name, which could cause repeats or skips if the database decides to return them in an order other than the one it would use if i2.img_name were included in the ORDER BY clause.
All of these continues that are based on a count (which includes the existing eloffset, euoffset, and sroffset BTW) are potentially broken. There are two issues:
1. If the query isn't strictly ordered, the database is under no obligation to return the rows in the same order the second time as it did the first. That means that titles=A|B|C&prop=info could return A and B with incontinue=2, but titles=A|B|C&prop=info&incontinue=2 could then return A or B again instead of C if the database orders the rows differently for whatever reason.
2. Even if the results are sorted, something like titles=A|B|C|D|E&prop=info&incontinue=2 could skip C if A or B were deleted between the first and second queries, or repeat B if "A2" were inserted.
Brad Jorsch schreef:
On Fri, Feb 06, 2009 at 01:08:02PM +0100, Roan Kattouw wrote:
Roan Kattouw schreef:
In r46845 [1], the issue raised in bug 11430 [2] a year and a half ago was finally addressed: when the API was asked to produce huge amounts of data (for instance the content of 500 revisions at 280 KB each), it would run out of memory trying to store and process it. To prevent this from happening, the amount of data the API can return is now limited.
I see a possible bug: if the size of the first result is over the size limit, you'll never be able to get the result at all. For example, create a page larger than $wgAPIMaxResultSize and try to fetch the revision contents.
Yes, the 8 MiB default is unlikely to be hit by a single revision, but what if someone lowers it to 250 KiB and tries to fetch those 280 KB revisions? Or what if someone uploads an image that has a ridiculous amount of metadata?
In the comment above the default value there's a big warning saying you shouldn't set $wgAPIMaxResultSize to something lower than the maximum revision size. Wiki admins who do that anyway are asking for trouble.
The metadata thing is also a non-issue, because it's stored in a database field that holds at most 4 MB.
Something I forgot to mention here is that continue parameters have been added to most modules that didn't have one: iicontinue for prop=imageinfo, rvcontinue for prop=revisions, cicontinue for prop=categoryinfo, incontinue for prop=info and amfrom for meta=allmessages; meta=siteinfo abuses siprop as query-continue parameter.
Looking at r46923, I see a number of bugs, including one or two existing bugs that turned up when I was looking at all these continue parameters.
rvcontinue doesn't work, it seems that $continue is never extracted from $params.
Fixed in r47036; forgot to update the patch I applied for the extract() crusade I did a while ago.
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison. For example, I tried ammessages=aboutsite|aboutpage with a ridiculously low $wgAPIMaxResultSize on a local installation, and it returned aboutsite with amfrom=aboutpage. When I then did ammessages=aboutsite|aboutpage&amfrom=aboutpage, it returned aboutsite and amfrom=aboutpage again.
Fixed in r47036 by checking for inequality (!=) rather than less-than (<), using the fact that the order of message retrieval is pretty much constant (unless of course extensions are installed/removed between two requests).
alfrom should probably be used instead of alcontinue if alunique is specified.
drcontinue must be used instead of drstart in "all" mode.
Fixed in r47036; forgot to update for the recent paging fixes in alllinks and the revamp of deletedrevs.
prop=duplicatefiles may have an issue similar to the allmessages problem mentioned above: it looks like the results aren't sorted by i2.img_name, which could cause repeats or skips if the database decides to return them in an order other than the one it would use if i2.img_name were included in the ORDER BY clause.
I'll look into this; sorting by i2.img_name as well is less efficient according to a quick EXPLAIN, but I'm not sure if it's too inefficient (will ask a database guru). If memory serves, this problem shouldn't occur on MySQL databases, though, because they sort stuff by the primary key (img_name) by default IIRC.
All of these continues that are based on a count (which includes the existing eloffset, euoffset, and sroffset BTW) are potentially broken. There are two issues:
If the query isn't strictly ordered, the database is under no obligation to return the rows in the same order the second time as it did the first. That means that titles=A|B|C&prop=info could return A and B with incontinue=2, but titles=A|B|C&prop=info&incontinue=2 could then return A or B again instead of C if the database orders the rows differently for whatever reason.
Even if the results are sorted, something like titles=A|B|C|D|E&prop=info&incontinue=2 could skip C if A or B were deleted between the first and second queries, or repeat B if "A2" were inserted.
The existing UI for listing external links and search results also uses this offset-based approach, which is pretty much why the API does as well. You're right about these two issues, and I'll look into them deeper later (after I've caught up with what happened this weekend). The same thing about MySQL sorting by primary key (IIRC) applies here.
Thanks for testing and reporting all this stuff (in just 2 days).
Roan Kattouw (Catrope)
On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw wrote:
In the comment above the default value there's a big warning saying you shouldn't set $wgAPIMaxResultSize to something lower than the maximum revision size. Wiki admins who do that anyway are asking for trouble.
Ok, as long as it's documented that should be good enough.
The metadata thing is also a non-issue, because it's stored in a database field that holds at most 4 MB.
It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there is no reason that maximum couldn't be increased in later versions. I don't see any code limiting the size of the stored data on the MediaWiki side of things, either.
Of course, I could be missing something.
rvcontinue doesn't work, it seems that $continue is never extracted from $params.
Fixed in r47036; forgot to update the patch I applied for the extract() crusade I did a while ago.
Looks good to me.
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison.
Fixed in r47036 by checking for inequality (!=) rather than less-than (<), using the fact that the order of message retrieval is pretty much constant (unless of course extensions are installed/removed between two requests).
Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage returns only aboutpage instead of everything starting with aboutpage.
There may be a better way, but one easy way to do it would be like this:

    $skip = !is_null($params['from']);
    foreach( $messages_target as $message ) {
        if($message === $params['from'])
            $skip = false;
        if(!$skip)
            $messages[$message] = wfMsg( $message );
    }
alfrom should probably be used instead of alcontinue if alunique is specified.
drcontinue must be used instead of drstart in "all" mode.
Fixed in r47036; forgot to update for the recent paging fixes in alllinks and the revamp of deletedrevs.
Looks good to me.
prop=duplicatefiles may have an issue similar to the allmessages problem mentioned above: it looks like the results aren't sorted by i2.img_name, which could cause repeats or skips if the database decides to return them in an order other than the one it would use if i2.img_name were included in the ORDER BY clause.
I'll look into this; sorting by i2.img_name as well is less efficient according to a quick EXPLAIN, but I'm not sure if it's too inefficient (will ask a database guru). If memory serves, this problem shouldn't occur on MySQL databases, though, because they sort stuff by the primary key (img_name) by default IIRC.
According to MySQL's documentation, you're right if InnoDB tables are being used and the query is doing a table scan (or presumably using the primary key for an index scan); if it uses some other index, this may not hold true. If MyISAM is used, I believe it's insertion order, although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably other stuff screw that up.
If MySQL really is sorting by i2.img_name by default, then theoretically making that explicit shouldn't make a difference. But knowing MySQL, it wouldn't surprise me if it did.
All of these continues that are based on a count (which includes the existing eloffset, euoffset, and sroffset BTW) are potentially broken. There are two issues:
If the query isn't strictly ordered, the database is under no obligation to return the rows in the same order the second time as it did the first. That means that titles=A|B|C&prop=info could return A and B with incontinue=2, but titles=A|B|C&prop=info&incontinue=2 could then return A or B again instead of C if the database orders the rows differently for whatever reason.
Even if the results are sorted, something like titles=A|B|C|D|E&prop=info&incontinue=2 could skip C if A or B were deleted between the first and second queries, or repeat B if "A2" were inserted.
The existing UI for listing external links and search results also uses this offset-based approach, which is pretty much why the API does as well. You're right about these two issues, and I'll look into them deeper later (after I've caught up with what happened this weekend). The same thing about MySQL sorting by primary key (IIRC) applies here.
I don't know what else could be done for search results, so I wouldn't worry about that.
prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill MySQL. Then you can do the standard thing there using cat_title as the continue parameter.
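(To spell out "the standard thing": roughly the sketch below, with value-based rather than offset-based continuation. The helper calls and $categoryTitles variable are made up to show the shape; this isn't actual ApiQueryCategoryInfo code.)

    // $categoryTitles: the batch of category titles being looked up (assumed).
    $conds = array( 'cat_title' => $categoryTitles );
    if ( !is_null( $params['continue'] ) ) {
        // Resume at the title we stopped at last time; addQuotes() keeps the
        // continue value from being interpreted as SQL.
        $conds[] = 'cat_title >= ' . $db->addQuotes( $params['continue'] );
    }
    $res = $db->select(
        'category',
        array( 'cat_title', 'cat_pages', 'cat_subcats', 'cat_files' ),
        $conds,
        __METHOD__,
        array( 'ORDER BY' => 'cat_title' )
    );
    while ( $row = $db->fetchObject( $res ) ) {
        $fit = addRowToResult( $row ); // made-up helper: returns false when the result is full
        if ( !$fit ) {
            // Result is full: tell the client where to pick up next time.
            setContinue( 'cicontinue', $row->cat_title ); // made-up helper
            break;
        }
    }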
prop=extlinks must already be using the el_from index, so hopefully adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the continue parameter.
list=exturlusage might be tough. The ideal thing would be to order by el_index and page_id (with page_id|el_index as the continue parameter), but I'm not at all confident that wouldn't filesort. If it does, I have no idea what we could do.
prop=imageinfo looks somewhat tricky. I might change "continue" to be the concatenation of the image title and the appropriate start value, and change the logic to something like this:

    $titles = array_keys( $pageIds[NS_FILE] );
    sort($titles); // Unless we can depend on the order to always be the same
    $images = RepoGroup::singleton()->findFiles( $titles );
    if(is_null($params['continue'])) {
        $skip = false;
    } else {
        $skip = true;
        list($continue_title, $continue_start) = explode('|', $params['continue']);
    }
    foreach ( $titles as $title ) {
        $start = $params['start'];
        if($skip) {
            if($title != $continue_title)
                continue;
            $skip = false;
            $start = $continue_start;
        }

        if(!isset($images[$title])) {
            // Add the "missing" record, as in lines 143-148 of r47043,
            // except add a break in the !$fit case
            continue;
        }

        $img = $images[$title];
        // Add the "not missing" record, as in lines 68-134 of r47043,
        // except use $start instead of $params['start']
    }
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
For prop=revisions, I wonder if MySQL would filesort if you order by rev_id/page_id (depending on which is being used in the WHERE clause), since I would guess that's the index being used to fetch the rows anyway. If not, that could easily work for the continue parameter in non-enum mode (enum mode is already OK).
Thanks for testing and reporting all this stuff (in just 2 days).
No problem.
Brad Jorsch schreef:
On Mon, Feb 09, 2009 at 02:27:15PM +0100, Roan Kattouw wrote:
In the comment above the default value there's a big warning saying you shouldn't set $wgAPIMaxResultSize to something lower than the maximum revision size. Wiki admins who do that anyway are asking for trouble.
Ok, as long as it's documented that should be good enough.
The metadata thing is also a non-issue, because it's stored in a database field that holds at most 4 MB.
It looks like img_metadata is a MEDIUMBLOB in MySQL, which can hold up to 2**24 bytes (16 MiB) according to MySQL's documentation. OTOH, in PostgreSQL it uses a BYTEA, which will go up to 512 MiB in 8.1 and there is no reason that maximum couldn't be increased in later versions. I don't see any code limiting the size of the stored data on the MediaWiki side of things, either.
Of course, I could be missing something.
No, just my math going wrong (2^24 bytes is 16 MiB, not 4 MB). Fortunately, the metadata is (currently*) extracted from the image's EXIF data, which is limited to 64 KB.
* There are plans floating around on Commons to store UI-readable and modifiable metadata for images, and using the existing img_metadata field for that is an option
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison.
Fixed in r47036 by checking for inequality (!=) rather than less-than (<), using the fact that the order of message retrieval is pretty much constant (unless of course extensions are installed/removed between two requests).
Now amfrom is broken differently: meta=allmessages&amfrom=aboutpage returns only aboutpage instead of everything starting with aboutpage.
Doh! Of course.
There may be a better way, but one easy way to do it would be like this:

    $skip = !is_null($params['from']);
    foreach( $messages_target as $message ) {
        if($message === $params['from'])
            $skip = false;
        if(!$skip)
            $messages[$message] = wfMsg( $message );
    }
Did exactly that in r47048.
I'll look into this; sorting by i2.img_name as well is less efficient according to a quick EXPLAIN, but I'm not sure if it's too inefficient (will ask a database guru). If memory serves, this problem shouldn't occur on MySQL databases, though, because they sort stuff by the primary key (img_name) by default IIRC.
According to MySQL's documentation, you're right if InnoDB tables are being used and the query is doing a table scan (or presumably using the primary key for an index scan); if it uses some other index, this may not hold true. If MyISAM is used, I believe it's insertion order, although deletes, OPTIMIZE TABLE, ALTER TABLE ORDER BY, and probably other stuff screw that up.
If MySQL really is sorting by i2.img_name by default, then theoretically making that explicit shouldn't make a difference. But knowing MySQL, it wouldn't surprise me if it did.
Even if it's already sorted, that won't always stop MySQL from sorting it again, which takes time. IIRC the i2 table is scanned using the img_sha1 index, though, which means sorting by i2.img_name is probably a real filesort, but because the query sorts by i1.img_name (using the primary key) first, the filesort will only sort small subsets of the result set and not the entire result set at once (which makes a difference, because sorting happens in worse-than-linear time). I'll consult with Aryeh or Domas before adding an ORDER BY on i2.img_name.
I don't know what else could be done for search results, so I wouldn't worry about that.
prop=categoryinfo should be easy if ORDER BY cat_title doesn't kill MySQL. Then you can do the standard thing there using cat_title as the continue parameter.
You're right about that. Done in r47050.
prop=extlinks must already be using the el_from index, so hopefully adding el_to to the ORDER BY won't kill MySQL. Then el_to can be the continue parameter.
Actually, it can't, because the index only covers the first 40 chars of el_to, which means it can't use the index to resolve sorting conflicts past the 40th char of el_to. Whether it's actually smart enough to use the index optimally I don't know (it's MySQL after all); if it does, the filesort would presumably affect small subsets and might be acceptable. Again, I'll have to ask more performance-educated people.
list=exturlusage might be tough. The ideal thing would be to order by el_index and page_id (with page_id|el_index as the continue parameter), but I'm not at all confident that wouldn't filesort. If it does, I have no idea what we could do.
Actually, we'd have to page the other way around (first el_index, then el_from; the latter is equal to page_id) because the query has to use the el_index index to resolve the LIKE query on the el_index field. Again, only the first 60 chars of el_index are covered by the index, and there is no index that covers both el_index and el_from, so we're looking at filesorts again (again, probably on small subsets). I'll run this by the DB gurus as well.
prop=imageinfo looks somewhat tricky. I might change "continue" to be the concatenation of the image title and the appropriate start value, and change the logic to something like this:

    $titles = array_keys( $pageIds[NS_FILE] );
    sort($titles); // Unless we can depend on the order to always be the same
Actually, I believe these things are ordered by page IDs somewhere in ApiPageSet, so the order would be the same every time except when a file is moved/renamed.
    $images = RepoGroup::singleton()->findFiles( $titles );
    if(is_null($params['continue'])) {
        $skip = false;
    } else {
        $skip = true;
        list($continue_title, $continue_start) = explode('|', $params['continue']);
    }
    foreach ( $titles as $title ) {
        $start = $params['start'];
        if($skip) {
            if($title != $continue_title)
                continue;
            $skip = false;
            $start = $continue_start;
        }

        if(!isset($images[$title])) {
            // Add the "missing" record, as in lines 143-148 of r47043,
            // except add a break in the !$fit case
            continue;
        }

        $img = $images[$title];
        // Add the "not missing" record, as in lines 68-134 of r47043,
        // except use $start instead of $params['start']
    }
Looks sane, I'll probably do this tomorrow.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
For prop=revisions, I wonder if MySQL would filesort if you order by rev_id/page_id (depending on which is being used in the WHERE clause), since I would guess that's the index being used to fetch the rows anyway. If not, that could easily work for the continue parameter in non-enum mode (enum mode is already OK).
Apart from enum mode, there are two other modes. In titles= mode, the query fetches rows using the PRIMARY (rev_page, rev_id) index, which can be paged by rev_page (equal to page_id) just fine. In revids= mode, it uses the rev_id (rev_id) index, which we can page on as well. All in all, this should be an easy one; I'll look into it tomorrow as well.
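Roughly, the paging condition in titles= mode could look like the sketch below, using both rev_page and rev_id so one page's revisions can themselves be split across requests. This is only an illustration with made-up variable names (it assumes an rvcontinue value of the form "pageid|revid" inside the module's execute()), not the actual change:

    // Resume from the pageid|revid point, walking the PRIMARY (rev_page, rev_id) index.
    list( $contPage, $contRev ) = explode( '|', $params['continue'] );
    $contPage = intval( $contPage ); // sanitize before interpolating into SQL
    $contRev = intval( $contRev );
    $this->addWhere( "rev_page > $contPage OR " .
        "(rev_page = $contPage AND rev_id >= $contRev)" );
    $this->addOption( 'ORDER BY', 'rev_page, rev_id' );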
Roan Kattouw (Catrope)
Roan Kattouw schreef:
prop=imageinfo looks somewhat tricky. I might change "continue" to be the concatenation of the image title and the appropriate start value, and change the logic to something like this:

    $titles = array_keys( $pageIds[NS_FILE] );
    sort($titles); // Unless we can depend on the order to always be the same
Actually, I believe these things are ordered by page IDs somewhere in ApiPageSet, so the order would be the same every time except when a file is moved/renamed.
    $images = RepoGroup::singleton()->findFiles( $titles );
    if(is_null($params['continue'])) {
        $skip = false;
    } else {
        $skip = true;
        list($continue_title, $continue_start) = explode('|', $params['continue']);
    }
    foreach ( $titles as $title ) {
        $start = $params['start'];
        if($skip) {
            if($title != $continue_title)
                continue;
            $skip = false;
            $start = $continue_start;
        }

        if(!isset($images[$title])) {
            // Add the "missing" record, as in lines 143-148 of r47043,
            // except add a break in the !$fit case
            continue;
        }

        $img = $images[$title];
        // Add the "not missing" record, as in lines 68-134 of r47043,
        // except use $start instead of $params['start']
    }
Looks sane, I'll probably do this tomorrow.
Implemented something similar to your suggestion in r47108.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
On second thought, I don't think we need it: prop=info doesn't have the issues other modules have, because it doesn't return multiple entries per page. Just throwing in an asort() (which I'll do tomorrow) should suffice.
For prop=revisions, I wonder if MySQL would filesort if you order by rev_id/page_id (depending on which is being used in the WHERE clause), since I would guess that's the index being used to fetch the rows anyway. If not, that could easily work for the continue parameter in non-enum mode (enum mode is already OK).
Apart from enum mode, there are two other modes. In titles= mode, the query fetches rows using the PRIMARY (rev_page, rev_id) index, which can be paged by rev_page (equal to page_id) just fine. In revids= mode, it uses the rev_id (rev_id) index, which we can page on as well. All in all, this should be an easy one; I'll look into it tomorrow as well.
Did this as well in r47078.
Roan Kattouw (Catrope)
On Tue, Feb 10, 2009 at 10:14:38PM +0100, Roan Kattouw wrote:
Implemented something similar to your suggestion in r47108.
Note there's no need for asort in this one, asort is only needed when you want to keep the array keys.
I suppose if you want to be tricky, you could do something like this:

    sort($titles);
    for ( $i = count($titles) - 1; $i >= 0 && $titles[$i] >= $fromTitle; $i-- )
        ;
    $titles = array_slice($titles, $i + 1);

but I don't know if it would really gain you much.
BTW, it looks like any missing images could be returned for all queries now. Consider if there are 4 images: three have huge numbers of revisions, and the last one does not exist.
1. The first query will include image1, decide image2 is too big, and then tack on image4 (0 size).
2. The second query will skip image1, include image2, decide image3 is too big, and then tack on image4 again (0 size).
3. The third query will skip image1 and image2, include image3, and tack on image4 yet again (0 size).
I suppose the quick fix is to skip all missing images if continue is set, since all missing images will always be included in the first query.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
On second thought, I don't think we need it: prop=info doesn't have the issues other modules have, because it doesn't return multiple entries per page. Just throwing in an asort() (which I'll do tomorrow) should suffice.
If you're removing the continuation (as you did for missing images above) then you don't even need to asort if you don't want to. If you're keeping continuation, though, then both existing and missing pages could trigger a continuation so you really would need the merge (otherwise consider what happens if titles=A|Bdoesnotexist|C|D|E|F, D triggers a continuation, and Bdoesnotexist also doesn't fit after).
BTW, I just noticed prop=info is broken for missing pages if it decides to continue in the middle of one, in that it will give you partial results that could easily break program logic. It should really be all-or-none like it is for non-missing pages. That should also fix the bug I just discovered where it crashes with a PHP fatal error when trying to get a token for a missing page.
Brad Jorsch schreef:
On Tue, Feb 10, 2009 at 10:14:38PM +0100, Roan Kattouw wrote:
Implemented something similar to your suggestion in r47108.
Note there's no need for asort in this one, asort is only needed when you want to keep the array keys.
Oops, meant ksort(). You can't really sort Title objects :P , but the keys are page IDs.
I suppose if you want to be tricky, you could do something like this:

    sort($titles);
    for ( $i = count($titles) - 1; $i >= 0 && $titles[$i] >= $fromTitle; $i-- )
        ;
    $titles = array_slice($titles, $i + 1);

but I don't know if it would really gain you much.
No, I only did it in imageinfo so the FileRepo wouldn't be queried for stuff we're gonna throw away right after. For prop=info, there's no gain in doing this over something like if($title < $contTitle) continue;
BTW, it looks like any missing images could be returned for all queries now. Consider if there are 4 images: three have huge numbers of revisions, and the last one does not exist.
- The first query will include image1, decide image2 is too big, and then tack on image4 (0 size).
- The second query will skip image1, include image2, decide image3 is too big, and then tack on image4 again (0 size).
- The third query will skip image1 and image2, include image3, and tack on image4 yet again (0 size).
I suppose the quick fix is to skip all missing images if continue is set, since all missing images will always be included in the first query.
I don't see how that's a problem. The <page> element for the missing image is always gonna be there (hell, it's even there for images that are already done and skipped thanks to iicontinue), it just has an extra missing="" filerepo="" thingy on it. Don't see much of a problem there.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
On second thought, I don't think we need it: prop=info doesn't have the issues other modules have, because it doesn't return multiple entries per page. Just throwing in an asort() (which I'll do tomorrow) should suffice.
If you're removing the continuation (as you did for missing images above) then you don't even need to asort if you don't want to. If you're keeping continuation, though, then both existing and missing pages could trigger a continuation so you really would need the merge (otherwise consider what happens if titles=A|Bdoesnotexist|C|D|E|F, D triggers a continuation, and Bdoesnotexist also doesn't fit after).
Actually I wouldn't need to merge them: I'll just do all existing pages first, then all missing ones (this already happens), and sort those arrays separately. The offset continue thing will work just fine.
BTW, I just noticed prop=info is broken for missing pages if it decides to continue in the middle of one, in that it will give you partial results that could easily break program logic. It should really be all-or-none like it is for non-missing pages. That should also fix the bug I just discovered where it crashes with a PHP fatal error when trying to get a token for a missing page.
Yeah, that's probably a good point. Will fix that tomorrow.
Roan Kattouw (Catrope)
On Tue, Feb 10, 2009 at 11:24:51PM +0100, Roan Kattouw wrote:
Brad Jorsch schreef:
On Tue, Feb 10, 2009 at 10:14:38PM +0100, Roan Kattouw wrote:
Implemented something similar to your suggestion in r47108.
Note there's no need for asort in this one, asort is only needed when you want to keep the array keys.
Oops, meant ksort(). You can't really sort Title objects :P , but the keys are page IDs.
$titles isn't title objects there, though. It's the keys off of $pageIds[NS_FILE], which seems to be page title strings.
I don't see how that's a problem. The <page> element for the missing image is always gonna be there (hell, it's even there for images that are already done and skipped thanks to iicontinue), it just has an extra missing="" filerepo="" thingy on it. Don't see much of a problem there.
Good point.
prop=info could be done by merging $titles and $missing (if necessary), doing an asort, and using similar logic. The continue would be just the title.
I'll look into that tomorrow as well.
On second thought, I don't think we need it: prop=info doesn't have the issues other modules have, because it doesn't return multiple entries per page. Just throwing in an asort() (which I'll do tomorrow) should suffice.
If you're removing the continuation (as you did for missing images above) then you don't even need to asort if you don't want to. If you're keeping continuation, though, then both existing and missing pages could trigger a continuation so you really would need the merge (otherwise consider what happens if titles=A|Bdoesnotexist|C|D|E|F, D triggers a continuation, and Bdoesnotexist also doesn't fit after).
Actually I wouldn't need to merge them: I'll just do all existing pages first, then all missing ones (this already happens), and sort those arrays separately. The offset continue thing will work just fine.
Will it really?
** First query:
1. You'll process 'A' for the output.
2. You'll process 'C' for the output.
3. You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
4. You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
** Someone creates Bdoesnotexist.
** Second query: (incontinue=2)
1. You'll skip 'A', because 0<2.
2. You'll skip the newly created 'Bdoesnotexist', because 1<2. (oops!)
3. You'll process 'C' for the output, again. (oops!)
4. You'll process 'D' for the output.
5. You'll process 'E' for the output.
6. You'll process 'F' for the output.
** Alternate second query: (incontinue=0)
1. You'll process 'A' for the output, because 0 is not < 0. (oops!)
2. You'll process 'C' for the output. (oops!)
3. You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
4. You try to process 'Bdoesnotexist'. It fails too. You set incontinue=0 again. Lather, rinse, repeat. (oops!)
** First query:
1. You'll process 'A' for the output.
2. You'll process 'C' for the output.
3. You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
4. You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
** Someone deletes A.
** Second query: (incontinue=2)
1. You'll skip 'C', because 0<2.
2. You'll skip 'D', because 1<2. (oops!)
3. You'll process 'E' for the output.
4. You'll process 'F' for the output.
5. You'll process now-missing page 'A' for the output, again. (oops!) [Or not, because 0<2.]
6. You'll process 'Bdoesnotexist' for the output. [Or not, because 1<2. (oops!)]
** Alternate second query: (incontinue=0)
1. You'll process 'C' for the output, again, because 0 is not < 0. (oops!)
2. You'll process 'D' for the output.
3. You'll process 'E' for the output.
4. You'll process 'F' for the output.
5. You'll process now-missing page 'A' for the output, again. (oops!)
6. You'll process 'Bdoesnotexist' for the output.
Brad Jorsch schreef:
On Tue, Feb 10, 2009 at 11:24:51PM +0100, Roan Kattouw wrote:
Brad Jorsch schreef:
On Tue, Feb 10, 2009 at 10:14:38PM +0100, Roan Kattouw wrote:
Implemented something similar to your suggestion in r47108.
Note there's no need for asort in this one, asort is only needed when you want to keep the array keys.
Oops, meant ksort(). You can't really sort Title objects :P , but the keys are page IDs.
$titles isn't title objects there, though. It's the keys off of $pageIds[NS_FILE], which seems to be page title strings.
Yeah, you can either do array_keys() then sort(), or ksort() then array_keys(). I seem to have thought the other way.
On second thought, I don't think we need it: prop=info doesn't have the issues other modules have, because it doesn't return multiple entries per page. Just throwing in an asort() (which I'll do tomorrow) should suffice.
If you're removing the continuation (as you did for missing images above) then you don't even need to asort if you don't want to. If you're keeping continuation, though, then both existing and missing pages could trigger a continuation so you really would need the merge (otherwise consider what happens if titles=A|Bdoesnotexist|C|D|E|F, D triggers a continuation, and Bdoesnotexist also doesn't fit after).
Actually I wouldn't need to merge them: I'll just do all existing pages first, then all missing ones (this already happens), and sort those arrays separately. The offset continue thing will work just fine.
Will it really?
** First query:
- You'll process 'A' for the output.
- You'll process 'C' for the output.
- You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
- You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
The continue parameter can never be set twice; trying to do that throws a fatal error. In my working copy I added a guard against processing missing pages when the existing pages didn't fit; I'll commit it today.
And you don't set incontinue=0 in this case, but incontinue=3 (because $count isn't reset).
** Someone creates Bdoesnotexist.
Yes, that sucks. I'm still thinking about how to handle this best (because creating a page could cause another page to be skipped entirely).
** Second query: (incontinue=2)
- You'll skip 'A', because 0<2.
- You'll skip the newly created 'Bdoesnotexist', because 1<2. (oops!)
- You'll process 'C' for the output, again. (oops!)
- You'll process 'D' for the output.
- You'll process 'E' for the output.
- You'll process 'F' for the output.
** Alternate second query: (incontinue=0)
- You'll process 'A' for the output, because 0 is not < 0. (oops!)
- You'll process 'C' for the output. (oops!)
- You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
- You try to process 'Bdoesnotexist'. It fails too. You set incontinue=0 again. Lather, rinse, repeat. (oops!)
This one won't happen (see above).
** First query:
- You'll process 'A' for the output.
- You'll process 'C' for the output.
- You'll try to process 'D' for the output, but it will fail for being too big. You'll set incontinue=2.
- You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
** Someone deletes A.
Yes, like I said above, page creations can cause trouble too.
Roan Kattouw (Catrope)
On Wed, Feb 11, 2009 at 05:05:05PM +0100, Roan Kattouw wrote:
Yeah, you can either do array_keys() then sort(), or ksort() then array_keys(). I seem to have thought the other way.
At the moment, you have array_keys() then asort()/ksort().
- You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
The continue parameter can never be set twice; trying to do that throws a fatal error. In my working copy I added a guard against processing missing pages when the existing pages didn't fit; I'll commit it today.
That's good.
And you don't set incontinue=0 in this case, but incontinue=3 (because $count isn't reset).
That would still be broken, but it can't happen as you said above so it doesn't matter.
** Someone creates Bdoesnotexist.
Yes, that sucks. I'm still thinking about how to handle this best (because creating a page could cause another page to be skipped entirely).
[...]
** Someone deletes A.
Yes, like I said above, page creations can cause trouble too.
I don't think there is a way while still keeping the continue parameter offset-based.
Brad Jorsch schreef:
On Wed, Feb 11, 2009 at 05:05:05PM +0100, Roan Kattouw wrote:
Yeah, you can either do array_keys() then sort(), or ksort() then array_keys(). I seem to have thought the other way.
At the moment, you have array_keys() then asort()/ksort().
- You may or may not try to process 'Bdoesnotexist'. If so, it fails too. Do you set incontinue=0 now? Or leave it as 2?
The continue parameter can never be set twice; trying to do that throws a fatal error. In my working copy I added a guard against processing missing pages when the existing pages didn't fit; I'll commit it today.
That's good.
Did that just now in r47189.
And you don't set incontinue=0 in this case, but incontinue=3 (because $count isn't reset).
That would still be broken, but it can't happen as you said above so it doesn't matter.
** Someone creates Bdoesnotexist.
Yes, that sucks. I'm still thinking about how to handle this best (because creating a page could cause another page to be skipped entirely).
[...]
** Someone deletes A.
Yes, like I said above, page creations can cause trouble too.
I don't think there is a way while still keeping the continue parameter offset-based.
Even title-based continue parameters have that problem. I'm thinking I might wanna merge $titles and $missing and use a more generic handler for them. The prop=info module needs some cleanup anyway (mostly splitting stuff into different functions), so I guess that's gonna be my project for the weekend.
Roan Kattouw (Catrope)
Roan Kattouw schreef:
Even title-based continue parameters have that problem. I'm thinking I might wanna merge $titles and $missing and use a more generic handler for them. The prop=info module needs some cleanup anyway (mostly splitting stuff into different functions), so I guess that's gonna be my project for the weekend.
Did that in r47214 (and follow-ups r47215 and r47216). prop=info now merges $titles and $missing, sorts the result by (ns, title) and iterates over that; incontinue now uses an ns|title format. This should fix prop=info's paging issues. Are there any other modules left that you reported paging issues for?
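In outline, the approach looks like the sketch below (this is just an illustration of the idea with made-up variable names, not a verbatim excerpt of r47214):

    // incontinue is "ns|title"; skip pages that sort before that point.
    list( $contNs, $contTitle ) = explode( '|', $params['continue'], 2 );
    $contNs = intval( $contNs );
    // $sortedTitles: the merged $titles + $missing, sorted by (ns, dbkey).
    foreach ( $sortedTitles as $title ) {
        if ( $title->getNamespace() < $contNs ||
            ( $title->getNamespace() == $contNs && $title->getDBkey() < $contTitle ) )
        {
            continue; // already returned by a previous request
        }
        // ... add the info for $title; if it doesn't fit, set incontinue to
        // $title->getNamespace() . '|' . $title->getDBkey() and stop ...
    }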
Roan Kattouw (Catrope)
On Fri, Feb 13, 2009 at 09:29:51PM +0100, Roan Kattouw wrote:
Roan Kattouw schreef:
Even title-based continue parameters have that problem. I'm thinking I might wanna merge $titles and $missing and use a more generic handler for them. The prop=info module needs some cleanup anyway (mostly splitting stuff into different functions), so I guess that's gonna be my project for the weekend.
Did that in r47214 (and followups r47215 and r47216). prop=info now merges $titles and $missing, sorts that by (ns, title) and iterates over that. incontinue now uses an ns|title format. This should fix prop=info's paging issues.
(:
Are there any other modules left that you reported paging issues for?
There's still the question about whether the SQL query for prop=duplicatefiles can have i2.img_name added to the ORDER BY without blowing up MySQL.
And there's extlinks and exturlusage, but I think nothing can be done about them due to the impossibility of using any index for sorting the el_to or el_index columns (MySQL requires an index on a blob column to specify a prefix length, and the total length of the index key can be only 256).
Brad Jorsch schreef:
On Fri, Feb 13, 2009 at 09:29:51PM +0100, Roan Kattouw wrote:
Are there any other modules left that you reported paging issues for?
There's still the question about whether the SQL query for prop=duplicatefiles can have i2.img_name added to the ORDER BY without blowing up MySQL.
I talked to Domas, and it appears MySQL doesn't support the hybrid sorting (first by index, then a filesort to break ties) I alluded to earlier. Maybe an index on (img_sha1, img_name) will make it possible; I'll investigate that tomorrow.
And there's extlinks and exturlusage, but I think nothing can be done about them due to the impossibility of using any index for sorting the el_to or el_index columns (MySQL requires an index on a blob column to specify a prefix length, and the total length of the index key can be only 256).
Yeah, that sucks. Guess we'll just have to accept that for those two modules.
Roan Kattouw (Catrope)
I just noticed that list=users is broken, you forgot to add the "good" users to the result set.
On Mon, Feb 16, 2009 at 11:20:27AM -0500, Brad Jorsch wrote:
$fit = $result->addValue(array('query', $this->getModuleName()),
null, $vals);
That should be $data[$u], not $vals. For some reason it worked when I tested it, I have no idea why.
Brad Jorsch schreef:
On Mon, Feb 16, 2009 at 11:20:27AM -0500, Brad Jorsch wrote:
$fit = $result->addValue(array('query', $this->getModuleName()),
null, $vals);
That should be $data[$u], not $vals. For some reason it worked when I tested it, I have no idea why.
Patch applied in r47324.
Roan Kattouw (Catrope)
I've been thinking about the case where you have two things trying to continue. For example, prop=categories|templates.
Previously, you might have just kept replacing continue parameters until you reach the end:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=1
Q: prop=categories|templates&clcontinue=1&tlcontinue=1
R: clcontinue=2
Q: prop=categories|templates&clcontinue=2&tlcontinue=1
R: -> done!
Or you could have ignored continue parameters that have already been run to completion:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=1
Q: prop=categories|templates&clcontinue=1&tlcontinue=1
R: clcontinue=2
Q: prop=categories|templates&clcontinue=2
R: tlcontinue=1, but ignore it because templates already ran to completion
-> done!
But now, the continue parameters are not independent; depending on the result size, one can restrict the other. So the first algorithm could go like this:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
-> loop forever
And the second like this:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&tlcontinue=0
R: clcontinue=1&tlcontinue=0, but clcontinue is ignored
Q: prop=categories|templates&tlcontinue=0
-> loop forever
To work right now, you need to remove property specifiers once you run out of continues:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
** Remove categories!
Q: prop=templates&tlcontinue=0
R: tlcontinue=1
Q: prop=templates&tlcontinue=1
R: -> done!
I just thought I'd mention this additional aspect of the breaking change. The "new" algorithm is superior anyway, but it's quite possible people's bots are using one of the other two (mine was using the first one, BTW).
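In other words, a correct client now has to do something like the sketch below (illustrative only; doQuery() is a stand-in for whatever your bot uses to perform the request and decode the response):

    <?php
    // Drop a module from prop= as soon as it stops returning a continue value,
    // and keep going until no continues are left.
    $props = array( 'categories', 'templates' );
    $continue = array();
    do {
        $result = doQuery( array(
            'action' => 'query',
            'titles' => 'Some page',
            'prop'   => implode( '|', $props ),
        ) + $continue );

        // ... process $result['query'] ...

        $qc = isset( $result['query-continue'] ) ? $result['query-continue'] : array();
        $continue = array();
        foreach ( $props as $i => $module ) {
            if ( isset( $qc[$module] ) ) {
                $continue += $qc[$module]; // e.g. clcontinue or tlcontinue
            } else {
                unset( $props[$i] ); // this module is done; stop asking for it
            }
        }
    } while ( $continue );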
Brad Jorsch schreef:
I've been thinking about the case where you have two things trying to continue. For example, prop=categories|templates.
Previously, you might have just kept replacing continue parameters until you reach the end:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=1
Q: prop=categories|templates&clcontinue=1&tlcontinue=1
R: clcontinue=2
Q: prop=categories|templates&clcontinue=2&tlcontinue=1
R: -> done!
Or you could have ignored continue parameters that have already been run to completion:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=1
Q: prop=categories|templates&clcontinue=1&tlcontinue=1
R: clcontinue=2
Q: prop=categories|templates&clcontinue=2
R: tlcontinue=1, but ignore it because templates already ran to completion
-> done!
But now, the continue parameters are not independent; depending on the result size, one can restrict the other. So the first algorithm could go like this:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
-> loop forever
And the second like this:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
Q: prop=categories|templates&tlcontinue=0
R: clcontinue=1&tlcontinue=0, but clcontinue is ignored
Q: prop=categories|templates&tlcontinue=0
-> loop forever
To work right now, you need to remove property specifiers once you run out of continues:
Q: prop=categories|templates
R: clcontinue=1&tlcontinue=0
Q: prop=categories|templates&clcontinue=1&tlcontinue=0
R: clcontinue=2&tlcontinue=0
Q: prop=categories|templates&clcontinue=2&tlcontinue=0
R: tlcontinue=0
** Remove categories!
Q: prop=templates&tlcontinue=0
R: tlcontinue=1
Q: prop=templates&tlcontinue=1
R: -> done!
I just thought I'd mention this additional aspect of the breaking change. The "new" algorithm is superior anyway, but it's quite possible people's bots are using one of the other two (mine was using the first one, BTW).
Yes, this is nasty, and should probably be documented. At the moment, this result cut-off thing isn't even documented anywhere (except for the mailing list archives), I'll work on that.
Roan Kattouw (Catrope)