Pywikipedians,
What is the best way to get the wikitext for every revision of a page?
I've been trying to understand why pywikipedia.fullVersionHistory does not keep going. It seems to do one or two or maybe three fetches of revCount=500 and then it stops --- even if there are many more revisions.
Is there a fix for this?
For example, for Barack_Obama I consistently get 1393 revisions, and the most recent one in that list is from 2006!
Here's how I am calling it:
h = p.fullVersionHistory(getAll=True, reverseOrder=True, revCount=500)
where 'p' is a Page instance.
Advice?
Thanks!
John
-- ___________________________ John R. Frank jrf@mit.edu
Hmm, I had similar trouble maybe two years ago; I will try to remember it. Somehow it was perhaps related to flagged and unflagged versions.
Please make sure you use the newest Pywiki version, send a piece of code, and tell us which wiki you are talking about (I think Obama must have an article in approximately every Wikipedia).
I'll take this occasion to tell everybody that Mr. Obama has a funny name, because "Barack" in Hungarian means peach or apricot. :-)
Hmm, I had similar trouble maybe two years ago; I will try to remember it. Somehow it was perhaps related to flagged and unflagged versions.
Ah, that's interesting. Is there a different API call for flagged pages?
Or is there a more explicit way to iterate over the history? Maybe a lower level function that I should be using?
Please make sure you use the newest Pywiki version, send a piece of code, and tell us which wiki you are talking about (I think Obama must have an article in approximately every Wikipedia).
Using the latest nightly and the attached script, you can see that it only iterates up to 2006 for Mr. Obama...
import wikipedia

urlname = 'Barack_Obama'
site = wikipedia.Site('en')
page = wikipedia.Page(site, urlname)
history = page.fullVersionHistory(getAll=True, reverseOrder=True, revCount=500)

print "Got history for %s of length = %d" % (urlname, len(history))

n = 0
for (revisionId, editTimestamp, username, content) in history:
    print n, len(content), editTimestamp
    n += 1

print 'finished!'
output:
<snip>
1391 39143 2006-09-01T17:26:38Z
1392 39112 2006-09-01T17:28:49Z
finished!
any advice?
Thanks!
jrf
I made some experiments, and the problem definitely exists.
The results (the length of the returned history, each time much smaller than expected):

with reverseOrder=True:
    revCount=500  -> 1393
    revCount=5000 -> 500
    revCount=50   -> infinite loop?

with reverseOrder=False:
    revCount=500  -> 61
    revCount=5000 -> 61
    revCount=50   -> infinite loop?
I am not sure whether the problem is in Pywiki or in the API, but without the API I got yet another wrong result (997). Maybe something is hidden in http://en.wikipedia.org/w/index.php?title=Special:Log&page=Barack+Obama? Pending changes? I'll try to continue tomorrow, but I am afraid I lack admin rights there. When I tried to get the API results directly in the browser, it crashed.
In huwiki it seems to work.
2012/2/10 Bináris wikiposta@gmail.com
When I tried to get the API results directly in the browser, it crashed.
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
How do you see that?
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
If so, is there a way to iterate (gently) past $wgAPIMaxResultSize?
jrf
2012/2/10 John R. Frank jrf@mit.edu
How do you see that?
In the web browser. You can run API queries through your browser and get the XML rendered as HTML. If you visit Manual:API on mediawiki.org, you will find examples. That's where my Firefox crashed. :-) You may also write your own scripts that analyze API results using query.py.
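For instance, a revision query along these lines can be pasted straight into the browser's address bar (the title and parameter values here are only an illustration):

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Barack_Obama&rvlimit=50&rvprop=ids|timestamp|user&format=xml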
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
I don't think so, because that one seems to have been resolved by explicitly stating that the result is truncated. But it's not impossible.
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
I don't think so, because that one seems to have been resolved by explicitly stating that the result is truncated. But it's not impossible.
Right, right. I don't mean that it is an example of the bug, which is resolved; I mean that it is an example of the patch introduced to fix that bug. The result size is apparently larger than the $wgAPIMaxResultSize threshold.
Fortunately it is not called something like $wgAPIMaxResultsYouCanEverSee; the question is how to page forward using a continuation:
http://www.mediawiki.org/wiki/API:Query#Continuing_queries
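If I understand that page correctly, the idea is to repeat the same query and merge whatever query-continue parameters come back into the next request, until none are returned. A rough sketch against the raw API (untested; urllib2 and json are used only for illustration, and the parameter names come from the API docs rather than from pywikipedia):

import json
import urllib
import urllib2

def fetch_all_revisions(title, batch=50):
    # One revisions query; rvdir='newer' walks forward from the oldest edit.
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': batch,
        'rvprop': 'ids|timestamp|user|content',
        'rvdir': 'newer',
        'format': 'json',
    }
    revisions = []
    while True:
        url = 'http://en.wikipedia.org/w/api.php?' + urllib.urlencode(params)
        req = urllib2.Request(url, headers={'User-Agent': 'history-sketch/0.1'})
        data = json.load(urllib2.urlopen(req))
        for page in data['query']['pages'].values():
            revisions.extend(page.get('revisions', []))
        # The server says how to continue; merge its parameters and ask again.
        cont = data.get('query-continue', {}).get('revisions')
        if cont is None:
            break
        params.update(cont)
    return revisions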
I'm reading query.py
jrf
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
If so, is there a way to iterate (gently) past $wgAPIMaxResultSize?
Okay. Found the answer:
Set revCount=50 and it takes 25 minutes before returning any results.
When it does return, it returns them all at once, which means it held them somewhere (presumably in memory) while accumulating them. This should probably be refactored into a generator that does not fetch more from the API until it has yielded all (or most) of what it has already; a rough sketch of that idea follows the timing output below.
Is this the result of throttling perhaps? Seems likely. Is there a way to disable throttling for a single call to fullVersionHistory and instead enforce it between calls to fullVersionHistory for different pages?
21753 200416 2012-02-07T17:25:58Z
21754 200435 2012-02-07T17:45:33Z
21755 200344 2012-02-07T18:10:33Z
21756 200344 2012-02-08T22:42:47Z
finished!

real    25m48.924s
user    1m19.839s
sys     1m4.913s
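For what it's worth, here is roughly what I mean by the generator version: the same continuation loop as in my earlier sketch, but yielding each batch as it arrives instead of accumulating everything first. Again, this is only an untested sketch against the raw API, not a patch to fullVersionHistory itself:

import json
import urllib
import urllib2

def iter_revisions(title, batch=50):
    # Same continuation loop, but each batch is yielded as soon as it arrives
    # instead of being accumulated in memory first.
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': batch,
        'rvprop': 'ids|timestamp|user|content',
        'rvdir': 'newer',
        'format': 'json',
    }
    while True:
        url = 'http://en.wikipedia.org/w/api.php?' + urllib.urlencode(params)
        req = urllib2.Request(url, headers={'User-Agent': 'history-sketch/0.1'})
        data = json.load(urllib2.urlopen(req))
        for page in data['query']['pages'].values():
            for rev in page.get('revisions', []):
                yield rev
        cont = data.get('query-continue', {}).get('revisions')
        if cont is None:
            return
        params.update(cont)

n = 0
for rev in iter_revisions('Barack_Obama'):
    # With format=json the wikitext of a revision is under the '*' key.
    print n, len(rev.get('*', '')), rev['timestamp']
    n += 1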
You have run right into the trap with the Obama article. It is mentioned here as an extreme example: https://www.mediawiki.org/wiki/Scripting :-)
2012/2/10 John R. Frank jrf@mit.edu
Set revCount=50 and it takes 25 minutes before returning any results.
That's what I thought to be an infinite loop. :-) I was just impatient.