On Tue, May 1, 2012 at 10:25 PM, Brad Jorsch
<b-jorsch(a)alum.northwestern.edu> wrote:
On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
In other words, if I find some results for some page on API page n and
no results on API page n+1, can I be sure there will be no results on
pages > n?
Not necessarily. In most cases that assumption should be true, but I see
a few cases offhand where it wouldn't be:
* If you're using prop=revisions&revids=...&rvprop=content with
revisions big enough that the API response size limit comes into play,
you could wind up in a situation where the initial query returns
revision 1 from page A, the second returns revision 2 from page B, and
the third returns revision 3 from page A again.
Interesting, I didn't know there was a limit for the response size.
* Some modules, such as prop=extlinks, cannot use anything sane for the
continue parameter (or else MySQL blows up), so they just use "offset
into the arbitrarily-ordered set of results". It's possible that edits
made to the wiki between your calls could change the result set so
that values are repeated, skipped, or both.
That's exactly what I wanted to know, thanks. This means I won't be
relying on the order of results.
Too bad this module behaves that way.
* If you are using multiple modules, it might be the case that one
goes through the pages in order by page_id while the other goes by
title, or something along those lines. In practice it seems that all
modules that commonly continue will order by the page_id, so the only
way you might run into this is if the API response size limit causes
modules like categoryinfo or imageinfo that usually don't continue to
do so.
That wouldn't matter to me: I consider each module separately, because
each module has its own lazy collection, even if they are paged together.
I haven't checked any of the prop modules provided by extensions, BTW.
Chances are most of those are well-behaved and order by page_id, but
it's possible some of them may do things differently.
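The cases above all come down to the same thing: results for one page may be
split across several responses, not necessarily adjacent ones, and an
offset-based module may repeat items. A minimal Python sketch (not the C#
library discussed in this thread) of collecting results by page id instead of
relying on order follows. The function name `merge_batches` and the sample
data are hypothetical; each batch is assumed to look like the "pages" object
of an action=query JSON response.

```python
import json
from collections import defaultdict


def merge_batches(batches, prop):
    """Merge per-page results from successive responses, keyed by page id.

    Pages that never return any item for `prop` do not appear in the result.
    """
    merged = defaultdict(list)
    seen = defaultdict(set)
    for batch in batches:
        for pageid, page in batch.items():
            for item in page.get(prop, []):
                # Use a stable JSON dump as a hashable fingerprint so that an
                # item repeated by a later response is stored only once.
                fingerprint = json.dumps(item, sort_keys=True)
                if fingerprint not in seen[pageid]:
                    seen[pageid].add(fingerprint)
                    merged[pageid].append(item)
    return dict(merged)


# Hypothetical responses: page 1, then page 2, then page 1 again,
# with one link repeated in the third response.
batches = [
    {"1": {"title": "A", "extlinks": [{"*": "http://a.example/1"}]}},
    {"2": {"title": "B", "extlinks": [{"*": "http://b.example/1"}]}},
    {"1": {"title": "A", "extlinks": [{"*": "http://a.example/1"},
                                      {"*": "http://a.example/2"}]}},
]
merged = merge_batches(batches, "extlinks")
```

This only handles interleaving and repeats. Items that are skipped because the
wiki was edited between calls cannot be recovered on the client side.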
I am writing a library to access the API and every collection in my
library is lazy.
For example, a user requests to know categories of pages in
Category:Query languages.
When he starts iterating over the result, I execute the query:
http://en.wikipedia.org/w/api.php?action=query&generator=categorymember…
When he then requests to know the categories of the third page in the
result (Access query language),
I will return to him the categories from the first query. If he
requests more, I execute the query:
http://en.wikipedia.org/w/api.php?action=query&generator=categorymember…
How do you determine that you should look at "Access query language"
first rather than one of the other pages?
I meant that the user could decide he wants to know categories of that
page and not the ones before it.
Something like (C# code, that's what I'm writing the library in):
    pages.Where(p => p.title == "Access query language")
         .Select(p => new { title = p.title, categories = p.categories.ToArray() })
         .ToArray()
where `pages` represents the result of the API call.
This specific code wouldn't make much sense, but I can imagine wanting
to filter the results by something the API won't let you.
For example, if you wanted to know categories of pages that are both
in Category:Foo and Category:Bar.
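A rough Python sketch of that kind of client-side filter (Python here rather
than C#, only so the example is self-contained). It assumes the members of
Category:Foo were fetched with prop=categories and that each page is a dict
with a "categories" list of {"ns": 14, "title": ...} entries, as in the JSON
output. The function name `pages_also_in` is made up for illustration.

```python
def pages_also_in(pages, category):
    """Keep only the pages whose category list contains `category`."""
    return [
        page for page in pages
        if any(c["title"] == category for c in page.get("categories", []))
    ]


# Hypothetical members of Category:Foo with their categories.
pages = [
    {"title": "P1", "categories": [{"ns": 14, "title": "Category:Foo"},
                                   {"ns": 14, "title": "Category:Bar"}]},
    {"title": "P2", "categories": [{"ns": 14, "title": "Category:Foo"}]},
    {"title": "P3"},  # no categories returned for this page yet
]
both = pages_also_in(pages, "Category:Bar")
```

The filter is only correct once every category of each page has been
fetched. A page whose category list is still incomplete could be dropped
by mistake.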
In my bot code, I have something that behaves similarly: you give it a
query, and it gives back a series of result pages. But my version will
process clcontinue all the way to the end right away; the laziness is
only in handling gcmcontinue. That way I can be sure that the page nodes
returned by successive calls will have all the necessary data without
worrying about the ordering of the prop module results.
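The approach described above could be sketched roughly like this in Python.
This is not the actual bot code, which is not shown in this thread. `fetch`
is a hypothetical callable that takes a dict of request parameters and returns
the decoded JSON response. The response shape is assumed to follow the
"query-continue" format, keyed by module name, with the continue parameter
(clcontinue or gcmcontinue) inside.

```python
def iter_complete_pages(fetch, params, prop="categories",
                        generator="categorymembers"):
    """Yield pages lazily, one generator batch at a time.

    Within a batch, the prop continuation (e.g. clcontinue) is followed to
    the end before any page is yielded, so every yielded page is complete.
    """
    gen_continue = {}
    while True:
        pages = {}          # pageid -> accumulated page data for this batch
        prop_continue = {}
        next_gen = None
        while True:
            result = fetch({**params, **gen_continue, **prop_continue})
            batch = result.get("query", {}).get("pages", {})
            for pageid, page in batch.items():
                entry = pages.setdefault(
                    pageid, {"title": page.get("title"), prop: []})
                entry[prop].extend(page.get(prop, []))
            cont = result.get("query-continue", {})
            # Remember the generator continuation, but do not apply it until
            # the prop module has nothing more to return for this batch.
            next_gen = cont.get(generator, next_gen)
            if prop not in cont:
                break
            prop_continue = cont[prop]
        yield from pages.values()
        if next_gen is None:
            return
        gen_continue = next_gen
```

Because results are merged by page id within each batch, the order in which
the prop module returns them does not matter. The next generator batch is
requested only when the caller asks for more pages than the current batch
holds.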
Thanks for your response, this really helped me.
Petr Onderka
[[en:User:Svick]]