Pywikipedians,
What is the best way to get the wikitext for every revision of a page?
I've been trying to understand why pywikipedia.fullVersionHistory does not keep going. It seems to do one or two or maybe three fetches of revCount=500 and then it stops --- even if there are many more revisions.
Is there a fix for this?
For example, for Barack_Obama I consistently get 1393 revisions, and the most recent one in that list is from 2006!
Here's how I am calling it:
h = p.fullVersionHistory(getAll=True, reverseOrder=True, revCount=500)
where 'p' is a Page instance.
Advice?
Thanks!
John
-- ___________________________ John R. Frank jrf@mit.edu
Hmm, I had similar trouble maybe two years ago; I will try to remember it. Somehow it was perhaps related to flagged and unflagged versions.
Please make sure you use the newest Pywiki version, send a piece of code, and tell us which wiki you are talking about (I think Obama must have an article in approximately every Wikipedia).
I'll take this occasion to tell everybody that Mr. Obama has a funny name, because "Barack" in Hungarian means peach or apricot. :-)
Hmm, I had similar trouble maybe two years ago; I will try to remember it. Somehow it was perhaps related to flagged and unflagged versions.
Ah, that's interesting. Is there a different API call for flagged pages?
Or is there a more explicit way to iterate over the history? Maybe a lower level function that I should be using?
Please make sure you use the newest Pywiki version, send a piece of code, and tell us which wiki you are talking about (I think Obama must have an article in approximately every Wikipedia).
Using the latest nightly and the attached script, you can see that it only iterates up to 2006 for Mr. Obama...
import wikipedia

urlname = 'Barack_Obama'
site = wikipedia.Site('en')
page = wikipedia.Page(site, urlname)
history = page.fullVersionHistory(getAll=True, reverseOrder=True, revCount=500)

print "Got history for %s of length = %d" % (urlname, len(history))

n = 0
for (revisionId, editTimestamp, username, content) in history:
    print n, len(content), editTimestamp
    n += 1

print 'finished!'
output:
<snip>
1391 39143 2006-09-01T17:26:38Z
1392 39112 2006-09-01T17:28:49Z
finished!
any advice?
Thanks!
jrf
I made some experiments, and the problem definitely exists.
The results (the length of the returned history, each time much smaller than expected):

with reverseOrder=True:
    revCount=500  -> 1393
    revCount=5000 -> 500
    revCount=50   -> infinite loop?

with reverseOrder=False:
    revCount=500  -> 61
    revCount=5000 -> 61
    revCount=50   -> infinite loop?
I am not sure whether the problem is in Pywiki or in the API, but without the API I got yet another wrong result (997). Maybe something is hidden in http://en.wikipedia.org/w/index.php?title=Special:Log&page=Barack+Obama? Pending changes? I'll try to continue tomorrow, but I am afraid I lack admin rights there. When I tried to get the API results directly in the browser, it crashed.
In huwiki it seems to work.
2012/2/10 Bináris wikiposta@gmail.com
When I tried to get the API results directly in the browser, it crashed.
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
How do you see that?
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
If so, is there a way to iterate (gently) past $wgAPIMaxResultSize?
jrf
2012/2/10 John R. Frank jrf@mit.edu
How do you see that?
In the web browser. You can run API queries through your browser and get the XML rendered as HTML. If you visit Manual:API on mediawiki.org, you will find examples. That's where my Firefox crashed. :-) You may also write your own scripts that analyze API results using query.py.
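For instance, a revision query along these lines can be pasted straight into the browser's address bar (the title and parameter values here are only an illustration):

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Barack_Obama&rvlimit=50&rvprop=ids|timestamp|user&format=xml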
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
I don't think so, because that one seems to have been resolved by explicitly stating that the result is truncated. But it's not impossible.
Is it related to this? https://bugzilla.wikimedia.org/show_bug.cgi?id=26339
I don't think so, because that one seems to have been resolved by explicitly stating that the result is truncated. But it's not impossible.
Right, right. I don't mean that it is an example of the bug, which is resolved; I mean that it is an example of the patch introduced to fix that bug. The result size is apparently larger than the $wgAPIMaxResultSize threshold.
Fortunately it is not called something like $wgAPIMaxResultsYouCanEverSee; the question is how to page forward using a continuation:
http://www.mediawiki.org/wiki/API:Query#Continuing_queries
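If I understand that page correctly, the idea is to repeat the same query and merge whatever query-continue parameters come back into the next request, until none are returned. A rough sketch against the raw API (untested; urllib2 and json are used only for illustration, and the parameter names come from the API docs rather than from pywikipedia):

import json
import urllib
import urllib2

def fetch_all_revisions(title, batch=50):
    # One revisions query; rvdir='newer' walks forward from the oldest edit.
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': batch,
        'rvprop': 'ids|timestamp|user|content',
        'rvdir': 'newer',
        'format': 'json',
    }
    revisions = []
    while True:
        url = 'http://en.wikipedia.org/w/api.php?' + urllib.urlencode(params)
        req = urllib2.Request(url, headers={'User-Agent': 'history-sketch/0.1'})
        data = json.load(urllib2.urlopen(req))
        for page in data['query']['pages'].values():
            revisions.extend(page.get('revisions', []))
        # The server says how to continue; merge its parameters and ask again.
        cont = data.get('query-continue', {}).get('revisions')
        if cont is None:
            break
        params.update(cont)
    return revisions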
I'm reading query.py
jrf
One more thing: there was a warning section in the result saying that it had been truncated. I don't really believe this is connected to the problem, but I can't exclude it at the moment either.
If so, is there a way to iterate (gently) past $wgAPIMaxResultSize?
Okay. Found the answer:
Set revCount=50 and it takes 25 minutes before returning any results.
When it does return, it returns them all at once, which means it held them somewhere (presumably in memory) while accumulating them. This should probably be refactored into a generator that does not fetch more from the API until it has yielded all (or most) of what it has already; a rough sketch of that idea follows the timing output below.
Is this the result of throttling perhaps? Seems likely. Is there a way to disable throttling for a single call to fullVersionHistory and instead enforce it between calls to fullVersionHistory for different pages?
21753 200416 2012-02-07T17:25:58Z
21754 200435 2012-02-07T17:45:33Z
21755 200344 2012-02-07T18:10:33Z
21756 200344 2012-02-08T22:42:47Z
finished!

real    25m48.924s
user    1m19.839s
sys     1m4.913s
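For what it's worth, here is roughly what I mean by the generator version: the same continuation loop as in my earlier sketch, but yielding each batch as it arrives instead of accumulating everything first. Again, this is only an untested sketch against the raw API, not a patch to fullVersionHistory itself:

import json
import urllib
import urllib2

def iter_revisions(title, batch=50):
    # Same continuation loop, but each batch is yielded as soon as it arrives
    # instead of being accumulated in memory first.
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvlimit': batch,
        'rvprop': 'ids|timestamp|user|content',
        'rvdir': 'newer',
        'format': 'json',
    }
    while True:
        url = 'http://en.wikipedia.org/w/api.php?' + urllib.urlencode(params)
        req = urllib2.Request(url, headers={'User-Agent': 'history-sketch/0.1'})
        data = json.load(urllib2.urlopen(req))
        for page in data['query']['pages'].values():
            for rev in page.get('revisions', []):
                yield rev
        cont = data.get('query-continue', {}).get('revisions')
        if cont is None:
            return
        params.update(cont)

n = 0
for rev in iter_revisions('Barack_Obama'):
    # With format=json the wikitext of a revision is under the '*' key.
    print n, len(rev.get('*', '')), rev['timestamp']
    n += 1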
You have run right into the trap with the Obama article. It is mentioned here as an extreme example: https://www.mediawiki.org/wiki/Scripting :-)
2012/2/10 John R. Frank jrf@mit.edu
Set revCount=50 and it takes 25 minutes before returning any results.
That's what I thought to be an infinite loop. :-) I was just impatient.