On Sat, Dec 15, 2012 at 5:37 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
My first idea for this email is "dumb continue":
Continuing *is* confusing. In fact, I think you have made an error in your example:
Now there is even a high bug potential -- if there are no more links, API returns just two continues - clcontinue & gapcontinue - which means that if the client makes the same request with the two additional "continue" parameters, API will return the same result again, possibly producing duplicate errors and consuming extra server resources.
Actually, if the client makes the request with both the clcontinue and gapcontinue parameters, it will wind up skipping some results.
Say gaplimit was 3, so the original query returns pages A, B, and C but manages to include only the categories for A and part of B. A correct continue would return the remaining categories for B and C. But if you include gapcontinue, you'll instead get pages D, E, and F and never see those categories from C.
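To make the skip concrete, here is a toy model in Python. It is not the real API: the page names, the two categories per page, the tuple-shaped clcontinue, and the query() helper are all made up for illustration, and only the gaplimit/cllimit/gapcontinue/clcontinue names come from the discussion.

```python
# Toy model only: NOT the real MediaWiki API, just enough to show the skip.
PAGES = ["A", "B", "C", "D", "E", "F"]
CATS = {p: [p + "1", p + "2"] for p in PAGES}  # two made-up categories per page


def query(gaplimit, cllimit, gapcontinue=None, clcontinue=None):
    """Return (pages in this batch, (page, category) pairs, continue values)."""
    start = PAGES.index(gapcontinue) if gapcontinue else 0
    batch = PAGES[start:start + gaplimit]
    # Every (page, category) pair for the pages in this batch, in order.
    pairs = [(p, c) for p in batch for c in CATS[p]]
    if clcontinue is not None:
        # Resume the category listing at the given pair.
        pairs = [pc for pc in pairs if pc >= clcontinue]
    returned, rest = pairs[:cllimit], pairs[cllimit:]
    cont = {}
    if rest:
        cont["clcontinue"] = rest[0]
    if start + gaplimit < len(PAGES):
        cont["gapcontinue"] = PAGES[start + gaplimit]
    return batch, returned, cont


first = query(3, 3)                                    # pages A, B, C
good = query(3, 3, clcontinue=first[2]["clcontinue"])  # finish B and C first
bad = query(3, 3, **first[2])                          # send both continues

print(first[1])  # categories of A and the first category of B
print(good[1])   # the rest of B, then C
print(bad[0], bad[1])  # pages D, E, F: C's categories are never returned
```

In this model the "bad" request jumps straight to D, E, and F, so nothing from C is ever listed.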
Proposal: Query() method from above should be able to take ALL continue values and append ALL of them to the next query, without knowing anything about them, and without removing or changing any of the original request parameters. Query() will do this until server returns a data block with no more <query-continue> section.
That would be quite a change. It would mean the API wouldn't return gapcontinue at all until plcontinue and clcontinue are both exhausted, and then would keep returning the *old* gapcontinue until plcontinue and clcontinue are both exhausted again.
This would break some possible use cases which I'm not entirely sure we should break. For example, I can imagine a bot that would use generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it finds whichever revision it is looking for, and then ignore rvcontinue in favor of gfoocontinue to move on to the next page. With "dumb continue", it wouldn't be able to do that.
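A rough sketch of that kind of bot, in Python. The fetch callable, the shape of the decoded response (a "query-continue" dict keyed by module name), and the fake server used to exercise it are assumptions for illustration; generator=foo is the same placeholder as above.

```python
def scan_pages(fetch, matches):
    """Yield (title, revision) for the first matching revision of each page.

    `fetch` is a hypothetical callable: request params in, decoded response out.
    """
    base = {"action": "query", "generator": "foo", "gfoolimit": 1,
            "prop": "revisions"}
    gen_cont = {}
    while True:
        rv_cont, next_gen, found = {}, None, None
        while found is None:
            data = fetch({**base, **gen_cont, **rv_cont})
            cont = data.get("query-continue", {})
            # Remember where the generator would go next, if anywhere.
            next_gen = cont.get("foo", next_gen)
            for page in data.get("query", {}).get("pages", {}).values():
                for rev in page.get("revisions", []):
                    if found is None and matches(rev):
                        found = (page["title"], rev)
            if "revisions" not in cont:
                break  # this page has no more revisions
            rv_cont = cont["revisions"]
        if found is not None:
            yield found
        if next_gen is None:
            return  # generator exhausted
        # Drop rvcontinue and move on to the next page.
        gen_cont = next_gen


# Fake server for demonstration: one revision per response.
FAKE = [("P1", [1, 2, 3]), ("P2", [4, 5])]
calls = []


def fake_fetch(params):
    calls.append(params)
    i = params.get("gfoocontinue", 0)
    j = params.get("rvcontinue", 0)
    title, revs = FAKE[i]
    page = {"title": title, "revisions": [{"revid": revs[j]}]}
    data = {"query": {"pages": {str(i): page}}}
    cont = {}
    if j + 1 < len(revs):
        cont["revisions"] = {"rvcontinue": j + 1}
    if i + 1 < len(FAKE):
        cont["foo"] = {"gfoocontinue": i + 1}
    if cont:
        data["query-continue"] = cont
    return data


hits = list(scan_pages(fake_fetch, lambda rev: rev["revid"] in (2, 4)))
print(hits, len(calls))
```

Against the fake server this makes three requests rather than the five needed to exhaust every rvcontinue, which is the saving such a bot would lose.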
If I were to redesign continuing right now, I'd just structure it a little more. Instead of something like what we get now:
<query-continue>
  <links plcontinue="..." />
  <categories clcontinue="..." gclcontinue="..." />
  <watchlist wlstart="..." />
  <allmessages amfrom="..." />
</query-continue>
I'd return something like this:
<query-continue>
  <prop>
    <links plcontinue="..." />
    <categories clcontinue="..." />
  </prop>
  <generator>
    <categories gclcontinue="..." />
  </generator>
  <list>
    <watchlist wlstart="..." />
  </list>
  <meta>
    <allmessages amfrom="..." />
  </meta>
</query-continue>
The client would still have to know how to manipulate list=/meta=/generator=/prop=, particularly when using more than one of these in the same query. But the rules are simpler, it wouldn't have to know that gclcontinue is for generator=categories while clcontinue is for prop=categories, and it would be easy to know what exactly to include in prop= when continuing to avoid repeated results.
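As a sketch of how simple the client rule could become with the structured form, here is one possible next_params() helper in Python. The helper and its policy (finish the prop modules before advancing the generator) are my own illustration, not an existing API, and list= and meta= are left out to keep it short.

```python
import xml.etree.ElementTree as ET


def next_params(original, continue_xml):
    """Build the next request from the *proposed* structured <query-continue>.

    Hypothetical helper: returns None when there is nothing left to continue.
    """
    root = ET.fromstring(continue_xml)
    params = dict(original)
    prop = root.find("prop")
    gen = root.find("generator")
    if prop is not None and len(prop):
        # Only re-request the prop modules that are unfinished, so the
        # finished ones do not repeat their results.
        params["prop"] = "|".join(child.tag for child in prop)
        for child in prop:
            params.update(child.attrib)
    elif gen is not None and len(gen):
        # All props are done: keep the original prop= and advance the generator.
        for child in gen:
            params.update(child.attrib)
    else:
        return None
    return params


original = {"action": "query", "generator": "categories",
            "prop": "links|categories|info"}
xml_props = ('<query-continue><prop><links plcontinue="p1" /></prop>'
             '<generator><categories gclcontinue="g1" /></generator>'
             '</query-continue>')
xml_gen = ('<query-continue><generator><categories gclcontinue="g1" />'
           '</generator></query-continue>')

step1 = next_params(original, xml_props)
step2 = next_params(original, xml_gen)
step3 = next_params(original, "<query-continue />")
print(step1, step2, step3, sep="\n")
```

Note that the helper never has to look at the parameter names themselves; it only cares which section an element sits in.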
API Implementation details: In the example above where we have a generator & two properties, the next continue would be set to the very first item that had any of the properties incomplete. The properties continue will be as before, except that if there are no more categories, clcontinue is set to some magic value like '|' to indicate that it is done and no more SQL requests to categories tables are needed on subsequent calls. The server should not return the maximum number of pages from the generator if the properties enumeration has not reached them yet (e.g. if generatorLimit=max & linksLimit=1 -> will return just the first page with one link on each return)
You can't get away with changing the generator's continue like that and still get correct results, because you can't assume the generator generates pages in the same order every prop module processes them. Nor can you assume each prop module will process pages in the same order. For example, many prop modules order by page_id, but the order may be ASC or DESC depending on their "dir" parameter.
IMO, if a client wants to ensure it has complete results for any page objects in the result, it should just process all of the prop continuation parameters to completion.
Backwards compatibility: This change might impact any client that will use the presence of the "plcontinue" or "clcontinue" fields as a guide to not use the next "gapcontinue".
That at least is easy enough to avoid: when all non-generator continues are whatever magic value is "ignore", then don't output any of them. You have to be able to detect this anyway to know when to output the new value for the generator's continue.
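Roughly, the output rule would look like the following Python sketch. The function name and argument shapes are hypothetical, and '|' is just the example magic value from the proposal.

```python
DONE = "|"  # example "finished" marker taken from the quoted proposal


def continues_to_output(prop_continues, current_gen, next_gen):
    """Decide which continue values a response would carry (illustrative only).

    prop_continues: continue values of the non-generator modules
    current_gen:    generator continue the current request was made with
    next_gen:       generator continue for the next batch ({} if none)
    """
    if all(value == DONE for value in prop_continues.values()):
        # Every prop module is finished: hide the markers and hand out the
        # generator's new continue instead.
        return dict(next_gen)
    # Otherwise keep the generator where it is and continue the props.
    return {**current_gen, **prop_continues}


finished = continues_to_output({"plcontinue": "|", "clcontinue": "|"},
                               {"gapcontinue": "A"}, {"gapcontinue": "D"})
partial = continues_to_output({"plcontinue": "5|X", "clcontinue": "|"},
                              {"gapcontinue": "A"}, {"gapcontinue": "D"})
print(finished)
print(partial)
```

With this rule, an old client that watches for plcontinue or clcontinue never sees them once they are all "done", so it moves on to gapcontinue as before.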
A less solvable problem is the one I raised above.