https://bugzilla.wikimedia.org/show_bug.cgi?id=72209
Bug ID: 72209
Summary: testExturlusage takes forever on test.wikipedia
Product: Pywikibot
Version: core (2.0)
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: tests
Assignee: Pywikipedia-bugs@lists.wikimedia.org
Reporter: jayvdb@gmail.com
Web browser: ---
Mobile Platform: ---
testExturlusage uses:
for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5)
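For reference, a minimal standalone version of that call looks roughly like this (a sketch assuming a working pywikibot installation configured for test.wikipedia; printing the titles is only for illustration):

    import pywikibot

    mysite = pywikibot.Site('test', 'wikipedia')
    # exturlusage() yields pages whose text links to the given host,
    # restricted here to the User and User talk namespaces (2 and 3).
    for link in mysite.exturlusage('www.google.com', namespaces=[2, 3], total=5):
        print(link.title())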
This returns quickly on test.wikidata, as there is very little data that matches:
https://test.wikidata.org/w/index.php?title=Special%3ALinkSearch&target=...
All of the other Travis build platforms also return the five requested records in a reasonable period of time.
test.wikipedia, however, has a lot of data that matches:
https://test.wikipedia.org/w/index.php?title=Special%3ALinkSearch&target...
PageGenerator on test.wikipedia yields four results after a few API calls; however, after the fourth result it has backed off to requesting data with a geulimit of 1, resulting in the following request/result sequence:
{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'20'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']} {u'query-continue': {u'exturlusage': {u'geuoffset': 21}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}
{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'21'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']} {u'query-continue': {u'exturlusage': {u'geuoffset': 22}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}
{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'22'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']} {u'query-continue': {u'exturlusage': {u'geuoffset': 23}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}
{'inprop': [u'protection'], 'geuprotocol': [u'http'], 'iiprop': [u'timestamp', u'user', u'comment', u'url', u'size', u'sha1', u'metadata'], 'maxlag': ['5'], u'geuoffset': [u'23'], 'generator': [u'exturlusage'], 'format': ['json'], 'prop': [u'info', u'imageinfo', u'categoryinfo'], 'meta': ['userinfo'], 'indexpageids': [u''], u'geulimit': [u'1'], 'action': [u'query'], u'geunamespace': [u'2', u'3'], 'geuquery': [u'www.google.com'], 'uiprop': ['blockinfo', 'hasmsg']} {u'query-continue': {u'exturlusage': {u'geuoffset': 24}}, u'query': {u'userinfo': {u'messages': u'', u'id': 25377, u'name': u'JVbot'}}}
It then proceeds to iterate like this seemingly forever. (I killed it after 10 minutes.)
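The collapse to geulimit=1 follows from the generator capping each request at the number of items still needed. A simplified sketch of that behaviour (illustrative helper only, not the actual api.py code):

    def next_request_limit(total_wanted, yielded_so_far, api_max=500):
        # Ask only for as many items as are still needed, capped at the API
        # maximum -- once 4 of the 5 wanted results have been yielded, every
        # further request asks for just 1 item.
        remaining = total_wanted - yielded_so_far
        return max(1, min(api_max, remaining))

    print(next_request_limit(5, 0))   # 5
    print(next_request_limit(5, 4))   # 1 -> geulimit=1 from here on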
Mpaa mpaa.wiki@gmail.com changed:
What            |Removed          |Added
----------------------------------------------------------------------------
CC              |                 |mpaa.wiki@gmail.com
--- Comment #1 from Mpaa mpaa.wiki@gmail.com ---
There are more results much further on; I do not know whether you got that far.
Try different values of geuoffset (e.g. geuoffset=10000) in the link below.
https://test.wikipedia.org/w/api.php?inprop=protection&geuprotocol=http&...
{ "query-continue": { "exturlusage": { "geuoffset": 10500 } }, "warnings": { "exturlusage": { "*": "geulimit may not be over 500 (set to 5000) for users" } }, "query": { "pageids": [ "12828" ], "pages": { "12828": { "pageid": 12828, "ns": 2, "title": "User:\u05dc\u05e2\u05e8\u05d9 \u05e8\u05d9\u05d9\u05e0\u05d4\u05d0\u05e8\u05d8/monobook.js", "contentmodel": "javascript", "pagelanguage": "en", "touched": "2012-04-10T19:34:24Z", "lastrevid": 112424, "counter": "", "length": 4432, "protection": [] } }, "userinfo": { "id": 25083, "name": "Mpaa" }
Between 12000 and 12500 the results run out:
{ "warnings": { "exturlusage": { "*": "geulimit may not be over 500 (set to 5000) for users" } }, "query": { "userinfo": { "id": 25083, "name": "Mpaa" } } }
--- Comment #2 from Mpaa mpaa.wiki@gmail.com ---
A possible strategy could be to increase the new_limit when the code reaches this condition in api.py, line 1090:

    else:
        # if query-continue is present, self.resultkey might not have been
        # fetched yet
        if "query-continue" not in self.data:
            # No results.
            return
            --> start to increase the counter here? The tricky part is to
                maintain the total number returned = count
It tries to fetch only the number of elements still needed to reach 5. Once it gets down to 1, it stays there for ~12000 queries:
********* 500 500 5 5 0 **********
[[test:User:Nip]]
[[test:User:TeleComNasSprVen]]
********* 500 500 5 3 2 **********
[[test:User:MaxSem/wap]]
********* 500 500 5 2 3 **********
********* 500 500 5 2 3 **********
********* 500 500 5 2 3 **********
********* 500 500 5 2 3 **********
********* 500 500 5 2 3 **********
********* 500 500 5 2 3 **********
[[test:User:HersfoldCiteBot/Citation errors needing manual review]]
********* 500 500 5 1 4 **********
********* 500 500 5 1 4 **********
********* 500 500 5 1 4 **********
......
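The "tricky part" mentioned above is that even if the per-request limit is raised beyond the number of items still wanted, the generator must still stop after exactly the requested total. A minimal sketch of that bookkeeping (illustrative names, not the real QueryGenerator code):

    def take_remaining(batch, total_wanted, already_yielded):
        # Yield at most the number of still-wanted items from a larger batch,
        # so the caller never receives more than total_wanted overall.
        remaining = total_wanted - already_yielded
        for i, item in enumerate(batch):
            if i >= remaining:
                break
            yield item

    batch = ['A', 'B', 'C', 'D']   # pretend the API returned 4 rows
    print(list(take_remaining(batch, total_wanted=5, already_yielded=3)))
    # -> ['A', 'B']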
--- Comment #3 from John Mark Vandenberg jayvdb@gmail.com ---
(In reply to Mpaa from comment #2)
> It tries to fetch only the number of elements still needed to reach 5. Once
> it gets down to 1, it stays there for ~12000 queries.

But MW doesn't return one row, as requested..?

Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and exponentially increase the new_limit until it finds data or reaches the end of the dataset.
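The "exponentially increase the new_limit" idea could look roughly like the sketch below (an illustration of the approach only; the function name is made up, and the actual change is Gerrit 167438, mentioned later in this report):

    def grow_limit_when_sparse(current_limit, rows_returned, api_max=500):
        # If a response carried a query-continue but no rows, double the
        # limit for the next request instead of leaving it pinned at a tiny
        # value; otherwise keep requesting as before.
        if rows_returned == 0:
            return min(current_limit * 2, api_max)
        return current_limit

    limit = 1
    for _ in range(6):                # six empty responses in a row
        limit = grow_limit_when_sparse(limit, rows_returned=0)
        print(limit)                  # 2, 4, 8, 16, 32, 64 (capped at 500)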
--- Comment #4 from Mpaa mpaa.wiki@gmail.com ---
(In reply to John Mark Vandenberg from comment #3)
> (In reply to Mpaa from comment #2)
> > It tries to fetch only the number of elements still needed to reach 5.
> > Once it gets down to 1, it stays there for ~12000 queries.
>
> But MW doesn't return one row, as requested..?

I meant that it will keep sending requests with geulimit=1. So to get to 12000, it will send 12000 requests. It returns one row at a time, containing just query-continue data:
{u'exturlusage': {u'geuoffset': 24}}
{u'exturlusage': {u'geuoffset': 25}}
...

> Yes, pywikibot will need to detect that it is 'getting nowhere slowly', and
> exponentially increase the new_limit until it finds data or reaches the end
> of the dataset.
--- Comment #5 from John Mark Vandenberg jayvdb@gmail.com ---
geulimit=1 says the client wants only one record.

The MW API isn't returning one record. It is moving the cursor forward by one and returning zero records.

It feels like MW is interpreting 'geulimit=1' as 'only look at one record, and return the data if it meets the request criteria'.
--- Comment #6 from John Mark Vandenberg jayvdb@gmail.com ---
The API documentation explains it:

  eunamespace - The page namespace(s) to enumerate.
                NOTE: Due to $wgMiserMode, using this may result in fewer
                than "eulimit" results returned before continuing; in extreme
                cases, zero results may be returned.
                Values (separate with '|'): 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                11, 12, 13, 14, 15, ..
                Maximum number of values 50 (500 for bots)
--- Comment #7 from Gerrit Notification Bot gerritadmin@wikimedia.org ---
Change 167438 had a related patch set uploaded by Mpaa:
api.py: increase api limits when data are sparse
https://gerrit.wikimedia.org/r/167438
Gerrit Notification Bot gerritadmin@wikimedia.org changed:
What            |Removed          |Added
----------------------------------------------------------------------------
Status          |NEW              |PATCH_TO_REVIEW
--- Comment #8 from Mpaa mpaa.wiki@gmail.com ---
Yes, that is what I meant.
--- Comment #9 from Gerrit Notification Bot gerritadmin@wikimedia.org ---
Change 167438 merged by jenkins-bot:
Increase limits in QueryGenerator when data are sparse
https://gerrit.wikimedia.org/r/167438
John Mark Vandenberg jayvdb@gmail.com changed:
What            |Removed          |Added
----------------------------------------------------------------------------
Status          |PATCH_TO_REVIEW  |RESOLVED
Resolution      |---              |FIXED