Hi everyone, there seem to have been many great changes in the API, so I decided to take a look at improving my old bots a bit, together with the rest of the pywiki framework. While looking, a few thoughts and questions have occurred that I hope someone could comment on.
I have been out of the loop for a long time, so do forgive me if I misunderstand some recent changes and how they are supposed to work, or if this is a non-issue. I also apologise if these issues have already been discussed and/or resolved.
My first idea for this email is "*dumb continue*":
Can we change continuation so that clients are not required to understand the continue parameters once the first request has been made? This way a user of a client library can iterate over query results without knowing how to continue, and the library would not need to understand what to do with each parameter (iterator scenario):
for datablock in mwapi.Query( { generator='allpages', prop='links|categories', otherParams=... } ):
    # process the returned data blocks one at a time
The way it is done now, the Query() method must understand how to do the continue in depth: which parameters to look at first, which second, and how to handle the case where there are no more links while there are still categories to enumerate. Right now there is even a high bug potential -- if there are no more links, the API returns just two continues - clcontinue & gapcontinue - which means that if the client makes the same request with the two additional "continue" parameters, the API will return the same result again, possibly producing duplicate results and consuming extra server resources.
*Proposal:* The Query() method from above should be able to take ALL continue values and append ALL of them to the next query, without knowing anything about them, and without removing or changing any of the original request parameters. Query() will do this until the server returns a data block with no more <query-continue> section.
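To make the iterator scenario concrete, here is a minimal sketch of such a Query() loop in Python. The endpoint, helper name and parameter values are placeholders, and the blind merge is only correct under the *proposed* semantics - against today's API it can repeat or skip results, as discussed below.

import requests

API_URL = "https://en.wikipedia.org/w/api.php"   # placeholder endpoint

def query(params):
    # Yield each data block; the caller never inspects continue parameters.
    request = dict(params, action="query", format="json")
    while True:
        result = requests.get(API_URL, params=request).json()
        if "query" in result:
            yield result["query"]
        if "query-continue" not in result:
            break
        # Blindly copy every continue value into the next request, without
        # knowing which module each one belongs to (proposed behavior only).
        for continue_values in result["query-continue"].values():
            request.update(continue_values)

for datablock in query({"generator": "allpages", "prop": "links|categories",
                        "gaplimit": "50", "pllimit": "max", "cllimit": "max"}):
    pass  # process the returned data block

The point is that the loop never needs to know which module a given continue value belongs to.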
Also, because the "page" objects might be incomplete between different data blocks, the user might need to know when a complete "page" object is returned. API should probably introduce an "incomplete" attribute on the page to indicate that the client should merge it with the page from the following data blocks with the same ID until there is no more "incomplete" flag. Page revision number could be used on the client to see if the page has been changed between calls:
for page in mwapi.QueryCompletePages( { same parameters as example above } ):
    # process each page
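A rough sketch of how such a QueryCompletePages() wrapper could merge partial page objects, assuming the proposed "incomplete" attribute and the query() iterator sketched above - none of this is current API behaviour:

def query_complete_pages(params):
    # Merge partial page objects across data blocks, keyed by page id, and
    # yield a page only once it no longer carries the proposed "incomplete"
    # attribute. query() is the dumb-continue iterator sketched earlier.
    pending = {}
    for datablock in query(params):
        for pageid, page in datablock.get("pages", {}).items():
            merged = pending.setdefault(pageid, {})
            for key, value in page.items():
                if isinstance(value, list):
                    merged.setdefault(key, []).extend(value)   # e.g. links, categories
                else:
                    merged[key] = value                        # title, ns, lastrevid, ...
            if "incomplete" not in page:
                merged.pop("incomplete", None)
                yield pending.pop(pageid)
    # anything still pending was never marked complete; yield it anyway
    for page in pending.values():
        yield page

for page in query_complete_pages({"generator": "allpages",
                                  "prop": "links|categories"}):
    pass  # process each complete page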
*API Implementation details:* In the example above, where we have a generator & two properties, the next continue would be set to the very first item that had any of the properties incomplete. The property continues will be as before, except that if there are no more categories, clcontinue is set to some magic value like '|' to indicate that it is done and no more SQL requests to the categories tables are needed on subsequent calls. The server should not return the maximum number of pages from the generator if the properties enumeration has not reached them yet (e.g. if generatorLimit=max & linksLimit=1, it will return just the first page with one link on each return).
*Backwards compatibility:* This change might impact any client that will use the presence of the "plcontinue" or "clcontinue" fields as a guide to not use the next "gapcontinue". The simplest (and long overdue) solution is to add the "version=" parameter.
While we are at it, we might want to expand action=paraminfo to include meaningful version data. Better yet, make a new "moduleinfo" action that returns any requested specifics about each module, e.g.: action=moduleinfo & modules= parse | query | query+allpages & props= version | params
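Purely for illustration - the action does not exist, and every field below is a hypothetical sketch of the proposal, not current API output - a moduleinfo exchange might look roughly like:

api.php?action=moduleinfo&modules=query+allpages&props=version|params&format=json

{ "moduleinfo": { "query+allpages": {
    "version": "...",
    "params": [ { "name": "apfrom",  "type": "string" },
                { "name": "aplimit", "type": "limit", "max": 500, "highmax": 5000 } ] } } }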
Thanks! Please let me know what you think.
--Yuri
On 15/12/12 11:37, Yuri Astrakhan wrote:
Hi everyone, there seem to have been many great changes in the API, so I decided to take a look at improving my old bots a bit, together with the rest of the pywiki framework. While looking, a few thoughts and questions have occurred that I hope someone could comment on.
Hi Yuri! It's nice to see you.
*Proposal:* Query() method from above should be able to take ALL continue values and append ALL of them to the next query, without knowing anything about them, and without removing or changing any of the original request parameters. Query() will do this until server returns a data block with no more <query-continue> section.
+1
I'm not sure what's the case you mention of an incomplete page, can you provide an example?
Also, because the "page" objects might be incomplete between different data blocks, the user might need to know when a complete "page" object is returned. API should probably introduce an "incomplete" attribute on the page to indicate that the client should merge it with the page from the following data blocks with the same ID until there is no more "incomplete" flag. Page revision number could be used on the client to see if the page has been changed between calls:
Hi Platonides! Good to be back :)
By incomplete I meant that when you run an *allpages* generator and request *links*, the *pllimit* is for the total number of links combined from all pages. Which means that if there are 20 pages with 3 links each, and pllimit=10, the first result will have 3 complete pages and one page with just one link out of three.
The next result will return the missing 2 links for that page, two more complete pages, and another page with two links out of three. So my proposal is to mark those last pages "incomplete", so that users can handle this intelligently, and not assume that whatever links are listed are all the links there are.
On Sat, Dec 15, 2012 at 5:37 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
My first idea for this email is "dumb continue":
Continuing *is* confusing. In fact, I think you have made an error in your example:
Now there is even a high bug potential -- if there are no more links, API returns just two continues - clcontinue & gapcontinue - which means that if the client makes the same request with the two additional "continue" parameters, API will return the same result again, possibly producing duplicate errors and consuming extra server resources.
Actually, if the client makes the request with both the clcontinue and gapcontinue parameters, it will wind up skipping some results.
Say gaplimit was 3, so the original query returns pages A, B, and C but manages to include only the categories for A and B. A correct continue would return the remaining categories for B and C. But if you include gapcontinue, you'll instead get pages D, E, and F and never see those categories from C.
Proposal: Query() method from above should be able to take ALL continue values and append ALL of them to the next query, without knowing anything about them, and without removing or changing any of the original request parameters. Query() will do this until server returns a data block with no more <query-continue> section.
That would be quite a change. It would mean the API wouldn't return gapcontinue at all until plcontinue and clcontinue are both exhausted, and then would keep returning the *old* gapcontinue until plcontinue and clcontinue are both exhausted again.
This would break some possible use cases which I'm not entirely sure we should break. For example, I can imagine a bot that would use generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it finds whichever revision it is looking for, and then ignore rvcontinue in favor of gfoocontinue to move on to the next page. With "dumb continue", it wouldn't be able to do that.
If I were to redesign continuing right now, I'd just structure it a little more. Instead of something like this, which we get now:
<query-continue>
  <links plcontinue="..." />
  <categories clcontinue="..." gclcontinue="..." />
  <watchlist wlstart="..." />
  <allmessages amfrom="..." />
</query-continue>
I'd return something like this:
<query-continue>
  <prop>
    <links plcontinue="..." />
    <categories clcontinue="..." />
  </prop>
  <generator>
    <categories gclcontinue="..." />
  </generator>
  <list>
    <watchlist wlstart="..." />
  </list>
  <meta>
    <allmessages amfrom="..." />
  </meta>
</query-continue>
The client would still have to know how to manipulate list=/meta=/generator=/prop=, particularly when using more than one of these in the same query. But the rules are simpler, it wouldn't have to know that gclcontinue is for generator=categories while clcontinue is for prop=categories, and it would be easy to know what exactly to include in prop= when continuing to avoid repeated results.
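Under that grouped format, the continuation step a client needs could be sketched roughly as follows. A hypothetical JSON rendering of the structure above is assumed; nothing here exists today.

def next_request(original, grouped_continue):
    # original: the request that started the current generator block;
    # grouped_continue: parsed from the grouped structure, e.g.
    #   {"prop": {"links": {"plcontinue": "..."}},
    #    "generator": {"categories": {"gclcontinue": "..."}}}
    req = dict(original)
    prop_cont = grouped_continue.get("prop", {})
    if prop_cont:
        # Finish the prop modules first: request only the modules that still
        # have data, and leave the generator/list/meta position untouched.
        req["prop"] = "|".join(prop_cont)
        for values in prop_cont.values():
            req.update(values)
    else:
        # All prop modules are done: keep the original prop= and advance the
        # generator, list and meta modules instead.
        for group in ("generator", "list", "meta"):
            for values in grouped_continue.get(group, {}).values():
                req.update(values)
    return req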
API Implementation details: In the example above, where we have a generator & two properties, the next continue would be set to the very first item that had any of the properties incomplete. The property continues will be as before, except that if there are no more categories, clcontinue is set to some magic value like '|' to indicate that it is done and no more SQL requests to the categories tables are needed on subsequent calls. The server should not return the maximum number of pages from the generator if the properties enumeration has not reached them yet (e.g. if generatorLimit=max & linksLimit=1, it will return just the first page with one link on each return).
You can't get away with changing the generator's continue like that and still get correct results, because you can't assume the generator generates pages in the same order that every prop module processes them. Nor can you assume each prop module will process pages in the same order. For example, many prop modules order by page_id but may be ASC or DESC depending on their "dir" parameter.
IMO, if a client wants to ensure it has complete results for any page objects in the result, it should just process all of the prop continuation parameters to completion.
Backwards compatibility: This change might impact any client that will use the presence of the "plcontinue" or "clcontinue" fields as a guide to not use the next "gapcontinue".
That at least is easy enough to avoid: when all non-generator continues are at whatever magic value means "ignore", just don't output any of them. You have to be able to detect this anyway to know when to output the new value for the generator's continue.
A less solvable problem is the one I raised above.
Assume the wiki has pages A and B with links and categories: A(l1,l2,l3,l4,l5,c1,c2,c3), B(l1,c1). This is how the API behaves now:

1 req) prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res) A(l1,l2,c1,c2), gapcontinue=B, plcontinue=l3, clcontinue=c3

The client ignores gapcontinue because there are other continues, and adds the pl & cl continues:
2 req) initial & plcontinue=l3 & clcontinue=c3
2 res) A(l3,l4,c3), gapcontinue=B, plcontinue=l5

This is where the *potential* for the bug is: the client must understand that since there is no more clcontinue but there is still a plcontinue, there are no more categories in this set of pages, so it should not ask for prop=categories until it finishes with plcontinue. Once done, it should resume prop=categories and also add gapcontinue=B.

3 bad req) initial & plcontinue=l5
3 bad res) A(l5,c1,c2), gapcontinue=B, clcontinue=c3

3 good req) initial but with prop=links only & plcontinue=l5
3 good res) A(l5) & gapcontinue=B

4 req) initial & gapcontinue=B
4 res) B(l1,c1) -- done

I think this puts too much unneeded burden on client code to handle these cases correctly. Instead, the API should be simplified to return clcontinue=| in result #2, and results 1 and 2 should have gapcontinue=A. The client could then simply merge all returned continue values into the following request, which would greatly simplify the code for the most common "get everything I requested" scenario, and hence should be the default behavior:
1 req) prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res) A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

2 req) initial & gapcontinue= & plcontinue=l3 & clcontinue=c3
2 res) A(l3,l4,c3), gapcontinue=, plcontinue=l5, clcontinue=|

3 req) initial & gapcontinue= & plcontinue=l5 & clcontinue=|
3 res) A(l5) & gapcontinue=B, plcontinue=, clcontinue=

4 req) initial & gapcontinue=B & plcontinue= & clcontinue=
4 res) B(l1,c1) -- no continue section, done
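For contrast, this is roughly the logic a client has to hard-code today to run the first trace correctly - a sketch specific to this one query, with the module names spelled out by hand and an illustrative requests-based helper:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"   # placeholder endpoint

def run_query_today(original):
    # original e.g. {"generator": "allpages", "prop": "links|categories",
    #                "gaplimit": "1", "pllimit": "2", "cllimit": "2"}
    base = dict(original, action="query", format="json")
    req, gap = dict(base), {}            # gap holds the current gapcontinue
    while True:
        result = requests.get(API_URL, params=dict(req, **gap)).json()
        yield result.get("query", {})
        cont = result.get("query-continue", {})
        prop_cont = {m: v for m, v in cont.items()
                     if m in ("links", "categories")}   # hard-coded module names
        if prop_cont:
            # Continue only the prop modules that still have data, and do NOT
            # apply the new gapcontinue yet, or results would be skipped.
            req = dict(base, prop="|".join(prop_cont))
            for values in prop_cont.values():
                req.update(values)
        elif "allpages" in cont:
            # This block is exhausted: restore prop= and advance the generator.
            req, gap = dict(base), cont["allpages"]
        else:
            break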
That would be quite a change. It would mean the API wouldn't return gapcontinue at all until plcontinue and clcontinue are both exhausted, and then would keep returning the *old* gapcontinue until plcontinue and clcontinue are both exhausted again.
Correct, the API would return an empty gapcontinue until it finishes with the first set, then it will return the beginning of the next set until that is exhausted as well, etc.
This would break some possible use cases which I'm not entirely sure we should break. For example, I can imagine a bot that would use generator=foo&gfoolimit=1&prop=revisions, follow rvcontinue until it finds whichever revision it is looking for, and then ignore rvcontinue in favor of gfoocontinue to move on to the next page. With "dumb continue", it wouldn't be able to do that.
I do not think the API should support the case you described with gaplimit=1, because that fundamentally breaks the original API goal of "get data about many pages with lots of elements on them in one request". I would prefer the client do two separate queries: 1) list pages, 2) many "list revisions for page X" queries. Having a generator with gaplimit=1 does not improve server performance or minimize traffic.
But even if we do find compelling reasons to include that, for the advanced scenario "skip subquery and follow on with the generator" it might make sense to introduce an appendable "|next" keyword, e.g. gapcontinue=A|next, or a gcommand=skipcurrent parameter. I am not sure it is the cleanest solution, but it is certainly cleaner than forcing every client out there to implement the complex logic from above for all common cases.
1 req) prop=categories|links & generator=allpages & gaplimit=1 & pllimit=2 & cllimit=2
1 res) A(l1,l2,c1,c2), gapcontinue=, plcontinue=l3, clcontinue=c3

The client decides it does not need anything else from A, so it adds |next to gapcontinue. The API ignores all other property continues.
2 req) initial & gapcontinue=|next, plcontinue=l3, clcontinue=c3
2 res) B(l1,c1) -- done
The client would still have to know how to manipulate list=/meta=/generator=/prop=, particularly when using more than one of these in the same query. But the rules are simpler, it wouldn't have to know that gclcontinue is for generator=categories while clcontinue is for prop=categories, and it would be easy to know what exactly to include in prop= when continuing to avoid repeated results.
Complex client logic is exactly what I am trying to avoid. Ideally all "continue" values should be joined into a single "query-continue = magic-value" with no interesting user-passable properties.
You can't get away with changing the generator's continue like that and still get correct results, because you can't assume the generator generates pages in the same order every prop module processes them. Nor can you assume each prop module will process pages in the same order. For example, many prop modules order by page_id but may be ASC or DESC on their "dir" parameter.
Totally agree - I forgot about the sub-ordering. So we should keep the same gapcontinue until the set is exhausted. The key here is that if we do not let the client manipulate the continue parameters, the server could later be optimized to return fewer results when they cannot yet be populated.
IMO, if a client wants to ensure it has complete results for any page objects in the result, it should just process all of the prop continuation parameters to completion.
The result set might be huge. It wouldn't be nice to have a 12GB x64 only client lib requirement :)
On Tue, Dec 18, 2012 at 10:03 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
I do not think API should support the case you described with gaplimit=1, because that fundamentally breaks the original API goal of "get data about many pages with lots of elements on them in one request".
Oh? I thought the goal of the API was to provide a machine-usable interface to MediaWiki so people don't have to screen-scrape the HTML pages, which alleviates the worry about whether changes to the user interface are going to break screen-scrapers. I never knew it was all about *bulk* data access *only*.
But even if we do find compelling reasons to include that, for the advanced scenario "skip subquery and follow on with the generator" it might make sense to introduce appendable "|next" value keyword gapcontinue=A|next
How do things decide whether "foocontinue=A|next" is saying "the next foocontinue after A" or really means "A|next"? For example, https://en.wiktionary.org/w/api.php?action=query&titles=secundus&pro... currently returns plcontinue "46486|0|next".
Or are you proposing every module be individually coded to recognize this "|next"?
Ideally all "continue" values should be joined into a single "query-continue = magic-value" of no interesting user-passable properties.
So clients can make absolutely no decisions about processing the data they get back? No thanks.
Why not propose adding something like that as an option, instead of trying to force everyone to do things your way? Say have a parameter dumbcontinue=1 that replaces query-continue with
<query-dumb-continue>prop=links|categories&plcontinue=...&clcontinue=...&wlstart=...&allmessages=...</query-dumb-continue>
Entirely compatible.
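If something like that were added, the client side would indeed become trivial - a sketch in Python, assuming the hypothetical dumbcontinue=1 output suggested above:

from urllib.parse import parse_qsl

def apply_dumb_continue(original_params, dumb_continue):
    # dumb_continue is the text of the suggested <query-dumb-continue>
    # element, e.g. "prop=links|categories&plcontinue=...&wlstart=..."
    req = dict(original_params)
    req.update(parse_qsl(dumb_continue))   # override parameters verbatim
    return req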
IMO, if a client wants to ensure it has complete results for any page objects in the result, it should just process all of the prop continuation parameters to completion.
The result set might be huge. It wouldn't be nice to have a 12GB x64 only client lib requirement :)
Then use a smaller limit on your generator. And don't do this for prop=revisions&rvprop=content.
I do not think API should support the case you described with gaplimit=1, because that fundamentally breaks the original API goal of "get data about many pages with lots of elements on them in one request".
Oh? I thought the goal of the API was to provide a machine-usable interface to MediaWiki so people don't have to screen-scrape the HTML pages, which alleviates the worry about whether changes to the user interface are going to break screen-scrapers. I never knew it was all about *bulk* data access *only*.
Brad, the API has a clear goal (at least that was my goal when I wrote it): to provide access to all the functionality the wiki UI offers. The point here is how *easy* and at the same time how *efficient* the API is. Continuation has, over the past few years, gotten too complex for client libraries to use efficiently when combining multiple requests. The example you gave seems very uncommon, and it can easily be solved by making one extra API call to get a list of articles first - which in a case of O(N) requests would only make it O(N+1), still O(N). That's why I don't think we should even go into the "|next" ability - it would be used very rarely and can be done easily with another call without a generator. See the iteration point below.
But even if we do find compelling reasons to include that, for the advanced scenario "skip subquery and follow on with the generator" it might make sense to introduce appendable "|next" value keyword gapcontinue=A|next
How do things decide whether "foocontinue=A|next" is saying "the next foocontinue after A" or really means "A|next"? For example, https://en.wiktionary.org/w/api.php?action=query&titles=secundus&pro... currently returns plcontinue "46486|0|next".
Or are you proposing every module be individually coded to recognize this "|next"?
Again, unless there are good usage scenarios to keep this, I don't think we ever need this "|next" feature - it was a "just in case" idea, which I doubt we will need.
Ideally all "continue" values should be joined into a single "query-continue = magic-value" of no interesting user-passable properties.
So clients can make absolutely no decisions about processing the data they get back? No thanks.
When you make an SQL query to the server, you don't get to control the "continue" process. You can stop and make another query with different initial parameters. The same goes for iterating through a collection - none of the programming languages offering IEnumerable have stream-control functionality; it is too complicated without clear benefits. The API can be seen as a stream-returning server with some "continue" parameter. If you don't like the result, you make another query. That's how you control it. Documenting the "continue" properties is a sure way to over-complicate API usage and remove the server's ability to optimize the process in the future, without adding any significant benefit.
Why not propose adding something like that as an option, instead of trying to force everyone to do things your way? Say have a parameter dumbcontinue=1 that replaces query-continue with
<query-dumb-continue>prop=links|categories&plcontinue=...&clcontinue=...&wlstart=...&allmessages=...</query-dumb-continue>
Entirely compatible.
This might be a good solution. Need community feedback on this.
IMO, if a client wants to ensure it has complete results for any page objects in the result, it should just process all of the prop continuation parameters to completion.
The result set might be huge. It wouldn't be nice to have a 12GB x64 only client lib requirement :)
Then use a smaller limit on your generator. And don't do this for prop=revisions&rvprop=content.
My bad, I didn't see the "prop" continuation - I thought you meant all of them. Lastly, let's try to keep sarcasm to a minimum in a technical discussion. We have Wikipedia talk pages for that.
On Tue, Dec 18, 2012 at 11:39 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
When you make a SQL query to the server, you don't get to control the "continue" process. You can stop and make another query with different initial parameters. Same goes for iterating through a collection - none of the programming languages offering IEnumerable have stream control functionality - too complicated without clear benefits.
The difference in all those examples is that you're iterating over one list of results. You're not iterating over a list of results and at the same time over multiple sublists of results inside each of the results in the main list.
Documenting the "continue" properties is a sure way to over-complicate API usage and remove server's ability to optimize the process in the future, without adding any significant benefit.
No one is documenting the values of the continue properties, just how the properties are supposed to be used to manipulate the original query.
It seems to me that you're removing the ability for the client to optimize the queries issued (besides forgoing the use of generators entirely and having to make 10× as many queries using titles= or pageids=) for no proposed benefit.
It seems to me that you're removing the ability for the client to optimize the queries issued (besides forgoing the use of generators entirely and having to make 10× as many queries using titles= or pageids=) for no proposed benefit.
not 10x queries --- one additional query per 5000+ requests, for the extreme edge-case scenario you have given.
Your example runs the allpages generator with gaplimit=1 and, for each page, gets a list of revisions. That means you make at least one API request per page. With the change, you would need just one extra query per 5000+ requests to get the list first. A tiny load increase, for a very rare case. I tried to come up with more use cases, but nothing came to mind. Feel free to suggest other use cases.
On the other hand, the proposed benefit is huge for the vast majority of the API users: one simple "no-brainer" way to continue a query once it's issued, without any complex code in API client frameworks. Right now a client framework must understand what is being queried, which params should be set and removed to exhaust all properties, and what to add later. And *every* framework must handle this, without any major benefit, but with an additional chance of doing it in an inefficient or possibly buggy way. My previous email listed all the complex steps frameworks have to take.
Besides, if we introduce versions (which we definitely should, as it gives us a way to move forward, optimize and rework the API structure), we can always keep the old way for compatibility's sake. I think versioning is a better overall way to move forward and to give warnings about incompatible changes than adding extra URL parameters.
not 10x queries --- one additional query per 5000+ requests, for an extremely edge case scenario you have given.
I believe what Brad is talking about is that when you use pageids (or titles), you are usually limited to 50 of them per query. But if you use generator, the limit is usually 500. Which means your approach would lead to 10× as many queries.
Petr Onderka [[en:User:Svick]]
Petr, in Brad's example he used gaplimit=1, which meant he would get one page per result with many revisions.
This is no different than writing titles= or pageids= with just one value.
So if, instead of using a generator, the client makes just one extra API request to get a list of 5000 pages, it can continue as before. Total extra cost: one more request per 5000, for a rare edge case, while getting a major benefit for all other usage cases.
On Tue, Dec 18, 2012 at 1:13 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
On the other hand, the proposed benefit is huge for the vast majority of the API users.
The vast majority of API users use a framework like pywikipedia that has already solved the continuation problem.
your example - run allpages generator with the gaplimit=1
Yes, that was a contrived example to show that someone might not want to be forced into following the prop continues to the end.
Besides, if we introduce versions (which we definitely should, as it gives us a way to move forward, optimize and rework the API structure), we can always keep the old way for compatibility's sake. I think versioning is a better overall way to move forward and to give warnings about incompatible changes than adding extra URL parameters.
The problem with versions is this: what if someone wants "version 1" of query-continue (because "version 2" removed all features), but the latest version of the rest of the API?
The vast majority of API users use a framework like pywikipedia that has already solved the continuation problem.
Not exactly correct - pywikipediabot has not solved it at all; instead they structured their library to avoid the whole problem and to get individual properties' data separately with generators, which causes a much heavier server load. You can't ask pywiki to get page properties (links, categories, etc.) - in fact it doesn't even perform prop=links|categories|... Instead it uses a much more expensive generator=pagelinks for each page you query. Those bots that want this functionality have to go low-level with direct API calls and handle this issue by hand.
There are 30 frameworks listed on the docs site. If even the top one ignores this fundamental issue, how many do you think implement it correctly? I just spent considerable time trying to implement a generic, query-agnostic continue, and was forced to do it in a very hacky way (like detecting a /g..continue/ parameter, cutting it out, removing some prop=... values, and ignoring the warnings the server sends about the unneeded parameters. Not a good generic solution.)
your example - run allpages generator with the gaplimit=1
Yes, that was a contrived example to show that someone might not want to be forced into following the prop continues to the end.
The problem with versions is this: what if someone wants "version 1" of query-continue (because "version 2" removed all features), but the latest version of the rest of the API?
But wouldn't we want people NOT to use the API in an inefficient way, to prevent extra server load? But anyway, I agree, let's not remove abilities - let's introduce a version parameter, and do the simple approach by default. Those who want to use the old legacy behavior will add a &legacycontinue="" parameter.
IMO, we need the version support regardless - it will allow us to restructure parameters and resulting data based on the client's version=xx request. Plus we can finally require 'agent' from all the clients (current javascript clients have no way to pass in the agent string). There was a discussion with Roan a few years ago about it, and versioning is needed in order to do most of these things. http://www.mediawiki.org/wiki/API/REST_proposal/Kickoff_meeting_notes
On Wed, Dec 19, 2012 at 10:52 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Not exactly correct - pywikipediabot has not solved it at all; instead they structured their library to avoid the whole problem and to get individual properties' data separately with generators,
Hmm.
There are 30 frameworks listed on the docs site. If even the top one ignores this fundamental issue, how many do you think implement it correctly?
Mine does, although it's probably not listed there. I haven't tried anyone else's.
I just spent considerable time trying to implement a generic, query-agnostic continue, and was forced to do it in a very hacky way (like detecting a /g..continue/ parameter, cutting it out, removing some prop=... values, and ignoring the warnings the server sends about the unneeded parameters. Not a good generic solution.)
A way to reduce the hackiness is to use action=paraminfo to look up the prefixes for all the prop modules (this takes two queries, and could be cached). Then it's a simple matter to look up the prefix for each node under query-continue and see whether the attribute is $prefix or 'g'+$prefix. The warnings can also be addressed by using the results from action=paraminfo to filter out the params for modules as they're removed from prop=.
Although it would be nice to have a more straightforward method, without throwing out the baby with the bathwater.
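A sketch of the lookup Brad describes, assuming the paraminfo interface of that era (a querymodules= parameter and a per-module "prefix" field); caching and error handling are left out:

import requests

API_URL = "https://en.wikipedia.org/w/api.php"   # placeholder endpoint

def module_prefixes(modules):
    # Ask action=paraminfo for the parameter prefix of each query module
    # (e.g. links -> "pl", categories -> "cl", allpages -> "ap").
    result = requests.get(API_URL, params={
        "action": "paraminfo", "format": "json",
        "querymodules": "|".join(modules)}).json()
    return {m["name"]: m["prefix"] for m in result["paraminfo"]["querymodules"]}

def split_continue(query_continue, prefixes):
    # Classify each value under query-continue as a prop-level continue
    # ($prefix...) or a generator-level continue ("g" + $prefix...).
    prop_level, generator_level = {}, {}
    for module, values in query_continue.items():
        prefix = prefixes[module]
        for name, value in values.items():
            if name.startswith("g" + prefix):
                generator_level[name] = value
            else:
                prop_level.setdefault(module, {})[name] = value
    return prop_level, generator_level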
But anyway, I agree, let's not remove abilities - let's introduce a version parameter, and do the simple approach by default. Those who want to use the old legacy behavior will add a &legacycontinue="" parameter.
But "legacycontinue" isn't a version parameter, it's a feature selection parameter.
On Tue, Dec 18, 2012 at 5:39 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Same goes for iterating through a collection - none of the programming languages offering IEnumerable have stream control functionality - too complicated without clear benefits.
Actually in my C# library [1] (I plan to publicize it more later) a query like generator=allpages&prop=links might result in something like IEnumerable<IEnumerable<Link>> [2]. And iterating the outer IEnumerable corresponds to iterating gapcontinue, while iterating the inner IEnumerable corresponds to plcontinue (of course it's not that simple, since I'm not using limit=1, but I hope you get the idea).
And while this means some more work for the library writer (in this case, me) than your alternative, it also means the user has more control over what exactly is retrieved.
Petr Onderka [[en:User:Svick]]
[1] https://github.com/svick/LINQ-to-Wiki/ [2] Or, more realistically, IEnumerable<Tuple<Page, IEnumerable<Link>>>, but I didn't want to complicate it with even more generics.
Petr, thanks, I will look closely at your library and post my thoughts.
Could you take a look at http://www.mediawiki.org/wiki/Manual:Pywikipediabot/Recipes and see how your library would solve these? Also, if you can think of other common use cases from your library users (not your library internals, as it is just an intermediary), please post them too. I posted the cases I saw in the interwiki & casechecker bots.
Thanks!
Well, I can't tell you any use cases from my library users, because there aren't any (like I said, I didn't actually publicize it yet).
And my library would solve most of those cases the way I explained before: IEnumerable inside IEnumerable (the exact shape depends on the user).
In the case of more than one prop being used, it always continues all props, even if the user iterates only one of them.
Petr Onderka [[en:User:Svick]]
Petr, I played with your library a bit. It has some interesting and creative pieces and uses some cool tech (love Roslyn). It might need a bit of love and polishing, as I think the syntax is too verbose, but that's irrelevant here.
This is your code to list link titles from all non-redirect pages in a wiki.
var source = wiki.Query.allpages()
    .Where(p => p.filterredir == allpagesfilterredir.nonredirects)
    .Pages
    .Select(p => PageResult.Create(p.info, p.links().Select(l => l.title).ToEnumerable()));

foreach (var page in source.Take(2000))          // just the first 2000 pages
    foreach (var linkTitle in page.Data.Take(1)) // first 1 link from each page
        Console.WriteLine(linkTitle);
The "page" foreach starts by getting http://en.wikipedia.org/w/api.php: action=query & meta=siteinfo & siprop=namespaces
The linkTitle foreach causes 18 more API calls to start getting the links, all with plcontinue, before it yields even a single link.
And the reason for it, as Brad correctly noted, is that links are sorted in a different order from titles. At this point you are halfway through the current block, you have made 19 fairly expensive API calls, and if (and that's a big if) you decide to continue with the next gapcontinue based on the first link you get, you still need to follow each plcontinue so that you don't miss any pages.
The only thing you can really do with minimal calls is: get a block of data, take a RANDOM page with links on it, check the first link, and decide whether to go on to the next block. I see absolutely no sense in that usage.
In short - there is no way you can say "next page" until you iterate through every plcontinue in the current set. EXCEPT if you go one page at a time (gaplimit=1), in which case you can safely skip to the next gapcontinue. But this is exactly what I am trying to avoid, because it gives no benefit whatsoever from using the generator. I might even suspect that it costs much more - because running the generator, even with limit=1, has a bigger cost than just querying one specific page's info and filling it out.
Sorry, hit send too fast:
Petr, when you say you have two nested foreach(), the outer foreach does not iterate through the blocks, it iterates through pages. Which means you still must iterate through every plcontinue in the set before issuing the next gapcontinue. In other words, your library does exactly that - a simple iteration. You don't skip blocks of results midway, and your library would benefit from the change. (All this assumes I understood your code correctly.)
On Tue, Dec 18, 2012 at 11:10 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
The linkTitle foreach causes 18 more API calls to start getting the links, all with plcontinue, before it yields even a single link.
Yeah, it's certainly possible there will be many plcontinue calls just to get the first link. But that doesn't mean you have to get all plcontinues when you want only some links.
On Tue, Dec 18, 2012 at 11:16 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Petr, when you say you have two nested foreach(), the outer foreach does not iterate through the blocks, it iterates through pages. Which means you still must iterate through every plcontinue in the set before issuing next gapcontinue.
It doesn't mean that. For example, in the extreme case where you don't want to know any links from this page (say, because you want to filter the articles in a way that cannot be expressed directly by the API), you don't have to use plcontinue for this page at all.
A specific example might be changing your code into (yeah, built specifically to make my point):
foreach (var page in source.Where(p => p.Info.title.Contains("\u2014")).Take(2000))
In this case, the link for "—All You Zombies—" will be retrieved from the first call, so no plcontinue is needed. The link for "—And He Built a Crooked House—" will be retrieved using one plcontinue call. But there are no more articles with that character in the title in the first page, so no more plcontinue calls are necessary, and gapcontinue can be used now. The second page contains no articles with that character at all, so without any plcontinues, gapcontinue will be used right away.
With your “dumb query-continue”, doing this would require many more calls.
Petr Onderka [[en:User:Svick]]
Background:
max aplimit = max pllimit = 500 (5000 for bots)
Server SQL:
pageset = select * from pages where start='!' limit 5000
select * from links where id in pageset limit 5000
Since each wiki page has more than one link, you need to do about 50-100 API calls to get all the links in a block. Btw, it also means that it is by far more efficient to set gaplimit = 50-100, because otherwise the server populates and returns 5000 page headers each time, hugely wasting both SQL and network bandwidth.
Links are sorted by pageid, pages by title. If you need the links for the first page in a block, the chances are that you have to iterate through 50% of all the other pages' links first.
Now let's look at your example:
* If you set gaplimit=100 & pllimit=5000, you get all the links for 100 pages in one call, which is no different from simple-continue.
* If you set "max" for both, and you want 80% of the pages per block, you will most likely have to download 99% of the links - same as downloading everything - simple-continue.
* If you want a small percentage of pages, like 1 per block, then on average you still have to download 50+% of the links. Even in the best-case scenario, if you are lucky, you need one additional links block to know that the first page has no more links.
The proper way to do the last case is to use allpages without links, go through them, and make a list of the 500 page ids you want. Afterwards, download all the links with a different query - pageids=list, not a generator. Assuming 1 needed page per block, you just saved the time and bandwidth of 250 blocks with links! A huge saving, without even counting how much less the SQL servers had to work. That's 250 queries you didn't have to make.
So you see, no matter how you look at this problem, you either 1) simple-stream the whole result set, or 2) do two separate queries - one to get the list of all titles and select the ones you need, and another to get the links for them. A much faster, more efficient, greener solution.
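A sketch of that two-step approach in Python (the filter criterion and the batch size are placeholders; bots with apihighlimits could use batches of 500):

import requests

API_URL = "https://en.wikipedia.org/w/api.php"   # placeholder endpoint

def allpage_ids(session):
    # Step 1: enumerate titles with list=allpages, no props attached.
    params = {"action": "query", "format": "json",
              "list": "allpages", "aplimit": "max"}
    while True:
        result = session.get(API_URL, params=params).json()
        for page in result["query"]["allpages"]:
            yield page["pageid"], page["title"]
        cont = result.get("query-continue", {}).get("allpages")
        if not cont:
            break
        params.update(cont)

session = requests.Session()
wanted = [pid for pid, title in allpage_ids(session)
          if title.endswith("s")]              # placeholder filter

# Step 2: fetch links only for the selected pages, in pageids= batches.
for i in range(0, len(wanted), 50):            # 50 ids per request (500 for bots)
    batch = wanted[i:i + 50]
    result = session.get(API_URL, params={
        "action": "query", "format": "json", "prop": "links", "pllimit": "max",
        "pageids": "|".join(str(p) for p in batch)}).json()
    # ... follow plcontinue for this batch if present, then process the links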
Lastly - if we enable simple query by default, the server can apply much smarter logic: if gaplimit=max & pllimit=max, reduce gaplimit to pllimit/50. In other words, the server will return only the pages it can fill with links, but not much more. This is open for discussion of course, and I haven't finalized how to do this properly.
I hope all this explains it enough. If you have other thoughts, please contact me privately, there is no need to involve the whole list in this.
On Wed, Dec 19, 2012 at 2:43 PM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Lastly - if we enable simple query by default, the server can do much smarter logic - if gaplimit=max & pllimit=max, reduce gaplimit=pllimit/50.
Again, your trying to make the server "smarter" for one case makes it dumber in other cases. What about gaplimit=max&pllimit=max&pltitles=Foo? Or what if you're using generator=categorymembers&gcmtitle=Category:Pages_with_no_links (perhaps to check if any have links so they can be removed from the category) instead of generator=allpages?
Again, your trying to make the server "smarter" for one case makes it dumber in other cases. What about gaplimit=max&pllimit=max&pltitles=Foo? Or what if you're using generator=categorymembers&gcmtitle=Category:Pages_with_no_links (perhaps to check if any have links so they can be removed from the category) instead of generator=allpages?
First - I agreed we can keep both. With versions it won't break anything.
If the user wants to continue using this, they can specify the legacycontinue parameter in addition to version. It's a win-win: the simpler users will use the more direct approach without much thought, and the more advanced users, in the unlikely case they find a use for it, will use legacycontinue. The important thing is that the default is easy, not hard.
Second: please look closely again at my description of the different use cases and their server load. Your examples are absolutely no different - it does not matter whether the properties on the page have many links or few links - the case of pltitles=Foo or no-link categories - you still either want to look at 80% of the page titles in one block, in which case you execute all the subquery continues, or you need just one or two page titles per block, in which case you would have saved a lot of server resources by filtering them first and then asking the server with a pageid list. When I designed the generator, I assumed that ALL needed properties would be returned at once, without any continuations. With the addition of the prop continue, this model breaks down and becomes very inefficient.
Mine does, although it's probably not listed there. I haven't tried anyone else's.
Unfortunately most framework developers do not have your expertise. I wish they did, but they don't.
But "legacycontinue" isn't a version parameter, it's a feature selection parameter.
It's an additional parameter required for legacy support in the new API version. See above.
On Thu, Dec 20, 2012 at 11:21 AM, Yuri Astrakhan yuriastrakhan@gmail.com wrote:
Please look closely again at my description of different use cases and their server load. Your examples are absolutely no different -- It does not matter if the properties on the page have many links or few links - the case of pltitles=Foo or no-link categories - you still will want to look at either 80% of the page titles in one block - in which case you execute all the subquery continues, or you need just one or two page titles per block, in which case you would have saved server resources a lot by filtering them first and then asking the server with a pageid list.
Say your wiki is small, only 5000 pages, and you have apihighlimits. With gaplimit=max&pllimit=max&pltitles=Foo, right now no continuation will be required, because each page will return just zero or one link. Your proposed change to force gaplimit=max to mean something lower than the normal max would require continuation to get all pages.
Say your wiki is small, only 5000 pages, and you have apihighlimits. With gaplimit=max&pllimit=max&pltitles=Foo, right now no continuation will be required, because each page will return just zero or one link. Your proposed change to force gaplimit=max to mean something lower than the normal max would require continuation to get all pages.
You STILL have to do a plcontinue - because without it you will not know whether the <page> element is empty because it has no links, or because the server didn't get to it yet! And with gaplimit == pllimit == 5000, having one link or fewer per page on average guarantees that you will get all 5000 pages without any plcontinue, which is the same as my proposal!
As for lowering gaplimit, it was just an idea for the case where you get MANY links per page; I suggested that the server may scale back if it sees that many <page> elements will not be populated. But this doesn't change anything - you still have to decide whether you need many pages or few pages!
Really, so far I have not heard of any reasonable bot or use case that would need this at all.