Hi,
if I'm making a query using one of the prop modules, can I rely on all results for a given page being together in the API pages? In other words, if I find some results for some page on API page n and no results on API page n+1, can I be sure there will be no results on pages > n?
Why do I want to know this:
I am writing a library to access the API and every collection in my library is lazy.
For example, a user requests to know categories of pages in Category:Query languages.
When he starts iterating over the result, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
When he then requests to know the categories of the third page in the result (Access query language), I will return to him the categories from the first query. If he requests more, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
This query doesn't give me any new categories for that page. My question is: can I be sure that I won't get any further results for this page, if I continue to increment clcontinue?
I think the answer is yes in this specific case, because the first part of clcontinue is the pageid, but I am interested in the general answer: does the same apply to all prop modules?
Thanks, Petr Onderka [[en:User:Svick]]
On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
In other words, if I find some results for some page on API page n and no results on API page n+1, can I be sure there will be no results on pages > n?
Not necessarily. In most cases that assumption should be true, but I see a few cases offhand where it wouldn't be:
* If you're using prop=revisions&revids=...&rvprop=content with revisions big enough that the API response size limit comes into play, you could wind up in a situation where the initial query returns revision 1 from page A, the second returns revision 2 from page B, and the third returns revision 3 from page A again.
* Some modules, such as prop=extlinks, cannot use anything sane for the continue parameter (or else MySQL blows up), so they just use "offset into the arbitrarily-ordered set of results". It's possible that edits made to the wiki between your calls could change the result set so that values are repeated, skipped, or both.
* If you are using multiple modules, it might be the case that one goes through the pages in order by page_id while the other goes by title, or something along those lines. In practice it seems that all modules that commonly continue will order by the page_id, so the only way you might run into this is if the API response size limit causes modules like categoryinfo or imageinfo that usually don't continue to do so.
I haven't checked any of the prop modules provided by extensions, BTW. Chances are most of those are well-behaved and order by page_id, but it's possible some of them may do things differently.
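One way to stay safe in all of the cases above is to never assume that a page's results are contiguous, and instead merge results by page id over every response. The following is only a minimal Python sketch of that idea: the function name `merge_prop_results` and the hand-written responses are made up for illustration, and the only thing assumed about the API is the usual JSON layout in which results sit under `query.pages.<pageid>`.

```python
from collections import defaultdict


def merge_prop_results(batches, prop):
    """Collect the values of one prop module per page id across all batches.

    `batches` is any iterable of decoded API responses. Nothing is assumed
    about the order in which pages show up, so a page that reappears in a
    later batch simply gets more items appended.
    """
    merged = defaultdict(list)
    for batch in batches:
        pages = batch.get("query", {}).get("pages", {})
        for pageid, page in pages.items():
            for item in page.get(prop, []):
                # Offset-based continuation can repeat values, so skip
                # anything that was already collected for this page.
                if item not in merged[pageid]:
                    merged[pageid].append(item)
    return dict(merged)


# Hand-written responses (not real API output) mimicking the revisions case:
# page A, then page B, then page A again.
batches = [
    {"query": {"pages": {"1": {"title": "A", "revisions": [{"revid": 1}]}}}},
    {"query": {"pages": {"2": {"title": "B", "revisions": [{"revid": 2}]}}}},
    {"query": {"pages": {"1": {"title": "A", "revisions": [{"revid": 3}]}}}},
]
result = merge_prop_results(batches, "revisions")
```

Merging like this handles reordering and repeats, but it cannot recover values that were skipped because the wiki changed between calls.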
I am writing a library to access the API and every collection in my library is lazy.
For example, a user requests to know categories of pages in Category:Query languages.
When he starts iterating over the result, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
When he then requests to know the categories of the third page in the result (Access query language), I will return to him the categories from the first query. If he requests more, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
How do you determine that you should look at "Access query language" first rather than one of the other pages?
In my bot code, I have something that behaves similarly: you give it a query, and it gives back a series of result pages. But my version will process clcontinue all the way to the end right away; the laziness is only in handling gcmcontinue. That way I can be sure that the page nodes returned by successive calls will have all the necessary data without worrying about the ordering of the prop module results.
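The approach described above might look roughly like the following Python sketch. It is not the actual bot code: `iter_pages`, `fake_fetch` and the canned responses are hypothetical, the continue values are invented, and it assumes that continuation data arrives in a `query-continue` element keyed by module name (`categories` carrying `clcontinue`, `categorymembers` carrying `gcmcontinue`).

```python
def iter_pages(fetch, base_params):
    """Yield complete page dicts; lazy only across generator batches.

    `fetch` is any callable that takes a dict of query parameters and
    returns the decoded response (hypothetical; supply your own HTTP code).
    """
    gen_continue = {}
    while True:
        pages = {}
        prop_continue = {}
        next_gen = None
        # Drain clcontinue completely for the current generator batch.
        while True:
            data = fetch({**base_params, **gen_continue, **prop_continue})
            for pageid, page in data.get("query", {}).get("pages", {}).items():
                target = pages.setdefault(
                    pageid, {"title": page.get("title"), "categories": []})
                target["categories"].extend(page.get("categories", []))
            cont = data.get("query-continue", {})
            if "categorymembers" in cont:
                next_gen = cont["categorymembers"]  # used after the batch
            if "categories" not in cont:
                break
            prop_continue = cont["categories"]
        # Every page in this batch now has all of its categories.
        yield from pages.values()
        if next_gen is None:
            return
        gen_continue = next_gen


# Canned responses standing in for the real API, keyed by
# (gcmcontinue, clcontinue). All values are invented.
CANNED = {
    (None, None): {
        "query": {"pages": {
            "1": {"title": "A", "categories": [{"title": "Category:X"}]},
            "2": {"title": "B"}}},
        "query-continue": {
            "categories": {"clcontinue": "2|Y"},
            "categorymembers": {"gcmcontinue": "page|C|3"}},
    },
    (None, "2|Y"): {
        "query": {"pages": {
            "1": {"title": "A"},
            "2": {"title": "B", "categories": [{"title": "Category:Y"}]}}},
        "query-continue": {"categorymembers": {"gcmcontinue": "page|C|3"}},
    },
    ("page|C|3", None): {
        "query": {"pages": {
            "3": {"title": "C", "categories": [{"title": "Category:Z"}]}}},
    },
}


def fake_fetch(params):
    return CANNED[(params.get("gcmcontinue"), params.get("clcontinue"))]


params = {"action": "query", "generator": "categorymembers",
          "gcmtitle": "Category:Query languages", "prop": "categories",
          "format": "json"}
for page in iter_pages(fake_fetch, params):
    print(page["title"], [c["title"] for c in page["categories"]])
```

The trade-off is that asking for the first page of a batch costs as many requests as it takes to exhaust clcontinue for the whole batch, in exchange for never depending on how the prop module orders its results.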
On Tue, May 1, 2012 at 10:25 PM, Brad Jorsch <b-jorsch@alum.northwestern.edu> wrote:
On Mon, Apr 30, 2012 at 11:08:40PM +0200, Petr Onderka wrote:
In other words, if I find some results for some page on API page n and no results on API page n+1, can I be sure there will be no results on pages > n?
Not necessarily. In most cases that assumption should be true, but I see a few cases offhand where it wouldn't be:
- If you're using prop=revisions&revids=...&rvprop=content with revisions big enough that the API response size limit comes into play, you could wind up in a situation where the initial query returns revision 1 from page A, the second returns revision 2 from page B, and the third returns revision 3 from page A again.
Interesting, I didn't know there was a limit for the response size.
- Some modules, such as prop=extlinks, cannot use anything sane for the continue parameter (or else MySQL blows up), so they just use "offset into the arbitrarily-ordered set of results". It's possible that edits made to the wiki between your calls could change the result set so that values are repeated, skipped, or both.
That's exactly what I wanted to know, thanks. This means I won't be relying on the order of results. Too bad this module behaves that way.
- If you are using multiple modules, it might be the case that one goes through the pages in order by page_id while the other goes by title, or something along those lines. In practice it seems that all modules that commonly continue will order by the page_id, so the only way you might run into this is if the API response size limit causes modules like categoryinfo or imageinfo that usually don't continue to do so.
That wouldn't matter to me: I consider each module separately, because each module has its own lazy collection, even if they are paged together.
I haven't checked any of the prop modules provided by extensions, BTW. Chances are most of those are well-behaved and order by page_id, but it's possible some of them may do things differently.
I am writing a library to access the API and every collection in my library is lazy.
For example, a user requests to know categories of pages in Category:Query languages.
When he starts iterating over the result, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
When he then requests to know the categories of the third page in the result (Access query language), I will return to him the categories from the first query. If he requests more, I execute the query: http://en.wikipedia.org/w/api.php?action=query&generator=categorymembers...
How do you determine that you should look at "Access query language" first rather than one of the other pages?
I meant that the user could decide he wants to know categories of that page and not the ones before it. Something like (C# code, that's what I'm writing the library in):
    pages.Where(p => p.title == "Access query language")
         .Select(p => new { title = p.title, categories = p.categories.ToArray() })
         .ToArray()
where `pages` represents the result of the API call.
This specific code wouldn't make much sense, but I can imagine wanting to filter the results by something the API won't let you filter by. For example, if you wanted to know the categories of pages that are in both Category:Foo and Category:Bar.
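As a rough illustration of that kind of client-side filter (in Python rather than C#, and not taken from the library), the intersection could be computed once the members of both categories have been fetched. The helper name `pages_in_both` and the sample data are invented; the only assumption about the API is that each categorymembers entry carries a `pageid` and a `title`.

```python
def pages_in_both(members_a, members_b):
    """Return the entries of members_a whose pageid also occurs in members_b."""
    ids_b = {page["pageid"] for page in members_b}
    return [page for page in members_a if page["pageid"] in ids_b]


# Invented sample data standing in for two list=categorymembers results.
foo = [{"pageid": 10, "title": "Alpha"}, {"pageid": 11, "title": "Beta"}]
bar = [{"pageid": 11, "title": "Beta"}, {"pageid": 12, "title": "Gamma"}]

both = pages_in_both(foo, bar)
```

Only the pages that survive the filter need their categories fetched, which is where a lazy per-page collection would pay off.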
In my bot code, I have something that behaves similarly: you give it a query, and it gives back a series of result pages. But my version will process clcontinue all the way to the end right away; the laziness is only in handling gcmcontinue. That way I can be sure that the page nodes returned by successive calls will have all the necessary data without worrying about the ordering of the prop module results.
Thanks for your response, this really helped me.
Petr Onderka [[en:User:Svick]]
mediawiki-api@lists.wikimedia.org