On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
In other words, if I find some results for some page on API page n and
no results on API page n+1, can I be sure there will be no results on
pages > n?
Not necessarily. In most cases that assumption should be true, but I see
a few cases offhand where it wouldn't be:
* If you're using prop=revisions&revids=...&rvprop=content with
revisions big enough that the API response size limit comes into play,
you could wind up in a situation where the initial query returns
revision 1 from page A, the second returns revision 2 from page B, and
the third returns revision 3 from page A again.
* Some modules, such as prop=extlinks, cannot use anything sane for the
continue parameter (or else MySQL blows up), so they just use "offset
into the arbitrarily-ordered set of results". It's possible that edits
made to the wiki between your calls could change the result set so
that values are repeated, skipped, or both.
* If you are using multiple modules, it might be the case that one
goes through the pages in order by page_id while the other goes by
title, or something along those lines. In practice it seems that all
modules that commonly continue will order by the page_id, so the only
way you might run into this is if the API response size limit causes
modules like categoryinfo or imageinfo that usually don't continue to
do so.
I haven't checked any of the prop modules provided by extensions, BTW.
Chances are most of those are well-behaved and order by page_id, but
it's possible some of them may do things differently.
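
To make the first case concrete, here is a minimal sketch of why a client should key partial results by page rather than treat a page as finished once it drops out of a batch. The batch shape (a dict mapping a page name to a list of revision ids) and the name merge_by_page are simplifications invented for illustration, not the actual API response format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_by_page(batches: Iterable[Dict[str, List[int]]]) -> Dict[str, List[int]]:
    """Accumulate per-page results across all batches, keyed by page.

    Hypothetical helper: each batch stands in for one API response and
    maps a page name to the revision ids returned for it.
    """
    merged: Dict[str, List[int]] = defaultdict(list)
    for batch in batches:
        for page, revids in batch.items():
            merged[page].extend(revids)
    return dict(merged)


# The A, B, A sequence described above: page A is absent from the second
# batch but shows up again in the third.
batches = [{"A": [1]}, {"B": [2]}, {"A": [3]}]
merged = merge_by_page(batches)
```

A client that closed out page A after the second batch would have missed revision 3; merging by key only reports a page as complete once every batch has been consumed.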
I am writing a library to access the API and every collection in my
library is lazy.
For example, a user requests the categories of the pages in
Category:Query languages.
When he starts iterating over the result, I execute the query:
http://en.wikipedia.org/w/api.php?action=query&generator=categorymember…
When he then asks for the categories of the third page in the
result (Access query language),
I return the categories from the first query. If he
requests more, I execute the query:
http://en.wikipedia.org/w/api.php?action=query&generator=categorymember…
How do you determine that you should look at "Access query language"
first rather than one of the other pages?
In my bot code, I have something that behaves similarly: you give it a
query, and it gives back a series of result pages. But my version will
process clcontinue all the way to the end right away; the laziness is
only in handling gcmcontinue. That way I can be sure that the page nodes
returned by successive calls will have all the necessary data without
worrying about the ordering of the prop module results.
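
A minimal sketch of that strategy, not the actual bot code. It assumes responses shaped like the "query-continue" format, where continuation values are grouped under the module name ("categories" holding clcontinue, "categorymembers" holding gcmcontinue), and that the generator's continue value is present in the last response of each batch. The names iter_complete_pages and fetch are hypothetical; fetch is any callable that takes the request parameters and returns the decoded JSON reply.

```python
from typing import Callable, Dict, Iterator

# Hypothetical transport: takes request parameters, returns decoded JSON.
Fetch = Callable[[Dict[str, str]], dict]


def iter_complete_pages(fetch: Fetch, base_params: Dict[str, str]) -> Iterator[dict]:
    """Yield page nodes whose category lists are complete.

    Lazy over the generator (gcmcontinue), eager over the prop module
    (clcontinue): each generator batch is fully drained before any of its
    pages are handed to the caller.
    """
    gen_continue: Dict[str, str] = {}
    while True:
        pages: Dict[str, dict] = {}
        prop_continue: Dict[str, str] = {}
        while True:
            reply = fetch({**base_params, **gen_continue, **prop_continue})
            for page_id, node in reply.get("query", {}).get("pages", {}).items():
                # Merge by page id, so the order in which the prop module
                # hands back its results does not matter.
                merged = pages.setdefault(
                    page_id, {"title": node.get("title"), "categories": []}
                )
                merged["categories"].extend(node.get("categories", []))
            cont = reply.get("query-continue", {})
            prop_continue = cont.get("categories", {})
            if not prop_continue:
                break  # prop module is exhausted for this generator batch
        yield from pages.values()
        gen_continue = cont.get("categorymembers", {})
        if not gen_continue:
            return  # generator is exhausted too
```

The cost is that asking for the first page of a batch triggers every clcontinue request for that batch, but the caller never sees a page node that is still missing categories.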