Brad Jorsch wrote:
On Fri, Feb 06, 2009 at 01:08:02PM +0100, Roan Kattouw wrote:
Roan Kattouw wrote:
In r46845 [1], the issue raised in bug 11430 [2] a year and a half ago was finally addressed: when the API was asked to produce huge amounts of data (for instance the content of 500 revisions at 280 KB each), it would run out of memory trying to store and process it. To prevent this from happening, the amount of data the API can return is now limited.
I see a possible bug: if the size of the first result is over the size limit, you'll never be able to get the result at all. For example, create a page larger than $wgAPIMaxResultSize and try to fetch the revision contents.
Yes, the 8 MiB default is unlikely to be hit by a single revision, but what if someone lowers it to 250 KiB and tries to fetch those 280 KB revisions? Or what if someone uploads an image that has a ridiculous amount of metadata?
In the comment above the default value there's a big warning saying you shouldn't set $wgAPIMaxResultSize to something lower than the maximum revision size. Wiki admins who do that anyway are asking for trouble.
The metadata thing is also a non-issue, because it's stored in a database field that holds at most 4 MB.
Something I forgot to mention here is that continue parameters have been added to most modules that didn't have one: iicontinue for prop=imageinfo, rvcontinue for prop=revisions, cicontinue for prop=categoryinfo, incontinue for prop=info and amfrom for meta=allmessages; meta=siteinfo abuses siprop as a query-continue parameter.
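For a client, all of these work the same way: whenever a response carries a query-continue block, its values are fed back into the next request. A minimal sketch of such a loop in Python, with a stubbed fetch function in place of the HTTP request and a made-up xcontinue parameter standing in for any of the real ones:

```python
def fetch_all(fetch, params):
    """Follow query-continue values until the result set is exhausted.

    `fetch` stands in for an HTTP request to api.php; it must return the
    decoded response as a dict. This is a sketch, not a real API client.
    """
    params = dict(params)
    results = []
    while True:
        response = fetch(params)
        results.extend(response["items"])
        cont = response.get("query-continue")
        if cont is None:
            return results
        params.update(cont)  # e.g. {"rvcontinue": "..."} on the next request

# Stub "server" that pages through four items, two per request, using the
# hypothetical xcontinue parameter (an index, purely for illustration).
DATA = ["aboutsite", "aboutpage", "mainpage", "sitetitle"]

def fake_fetch(params):
    start = int(params.get("xcontinue", 0))
    response = {"items": DATA[start:start + 2]}
    if start + 2 < len(DATA):
        response["query-continue"] = {"xcontinue": str(start + 2)}
    return response

print(fetch_all(fake_fetch, {}))  # all four items, over two requests
```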
Looking at r46923, I see a number of bugs, including one or two existing bugs that turned up when I was looking at all these continue parameters.
rvcontinue doesn't work, it seems that $continue is never extracted from $params.
Fixed in r47036; forgot to update the patch I applied for the extract() crusade I did a while ago.
amfrom is broken if ammessages is not sorted in whatever order PHP uses for string comparison. For example, I tried ammessages=aboutsite|aboutpage with a ridiculously low $wgAPIMaxResultSize on a local installation, and it returned aboutsite with amfrom=aboutpage. When I then did ammessages=aboutsite|aboutpage&amfrom=aboutpage, it returned aboutsite and amfrom=aboutpage again.
Fixed in r47036 by checking for inequality (!=) rather than less-than (<), using the fact that the order of message retrieval is pretty much constant (unless of course extensions are installed/removed between two requests).
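To illustrate the difference (a Python sketch of the comparison logic only, not the actual PHP): skipping with `<` assumes the message list is sorted in string-comparison order, while skipping until the stored name matches works for any retrieval order that is stable between the two requests.

```python
def resume_lt(messages, frm):
    # Broken variant: skip messages that compare less than the continue
    # value, which silently assumes `messages` is sorted.
    return [m for m in messages if not m < frm]

def resume_ne(messages, frm):
    # Fixed variant: skip until the exact continue value is seen, relying
    # only on the retrieval order being the same on both requests.
    out = []
    skipping = True
    for m in messages:
        if skipping and m != frm:
            continue
        skipping = False
        out.append(m)
    return out

msgs = ["aboutsite", "aboutpage"]  # not in string-comparison order
print(resume_lt(msgs, "aboutpage"))  # ['aboutsite', 'aboutpage']: aboutsite repeats
print(resume_ne(msgs, "aboutpage"))  # ['aboutpage']: resumes correctly
```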
alfrom should probably be used instead of alcontinue if alunique is specified.
drcontinue must be used instead of drstart in "all" mode.
Fixed in r47036; forgot to update for the recent paging fixes in alllinks and the revamp of deletedrevs.
prop=duplicatefiles may have an issue similar to the allmessages problem mentioned above: it looks like the results aren't sorted by i2.img_name, which could cause repeats or skips if the database decides to return them in an order other than the one it would use if i2.img_name were included in the ORDER BY clause.
I'll look into this; sorting by i2.img_name as well is less efficient according to a quick EXPLAIN, but I'm not sure if it's too inefficient (will ask a database guru). If memory serves, this problem shouldn't occur on MySQL databases, though, because they sort stuff by the primary key (img_name) by default IIRC.
All of these continues that are based on a count (which includes the existing eloffset, euoffset, and sroffset BTW) are potentially broken. There are two issues:
If the query isn't strictly ordered, the database is under no obligation to return the rows in the same order the second time as it did the first. That means that titles=A|B|C&prop=info could return A and B with incontinue=2, but titles=A|B|C&prop=info&incontinue=2 could then return A or B again instead of C if the database orders the rows differently for whatever reason.
Even if the results are sorted, something like titles=A|B|C|D|E&prop=info&incontinue=2 could skip C if A or B were deleted between the first and second queries, or repeat B if "A2" were inserted.
The existing UI for listing external links and search results also uses this offset-based approach, which is pretty much why the API does as well. You're right about these two issues, and I'll look into them deeper later (after I've caught up with what happened this weekend). The same thing about MySQL sorting by primary key (IIRC) applies here.
Thanks for testing and reporting all this stuff (in just 2 days).
Roan Kattouw (Catrope)