On 06/04/2012 16:01, Alex Buie wrote:
> Excellent, thanks guys. I'm assuming that I shouldn't have to worry
> about malformed xml (hopefully, haha), which makes it even easier/faster.
The dumps are well-formed XML, of course. The problem is that the tags
are not always in the same order, and the revisions are not always in
chronological order... and of course the revision text is a real mess!
I suggest you have a look at our library and start by using it to build
simple scripts. It's really easy! All you have to do is write a
process_<tag> method for every tag you want to handle (e.g.
process_title for the title tag). Have a look at
https://github.com/volpino/wiki-network/blob/master/revisions_page.py
for an example: it's a simple script that takes a pages-meta-history
dump and extracts the revisions of a specific set of pages to a CSV file.
Feel free to write me for more information ;)
--
f.
"Always code as if the guy who ends up maintaining your code will be a
violent psychopath who knows where you live."
(Martin Golding)
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments
http://about.me/fox91