On 06/04/2012 16:01, Alex Buie wrote:
Excellent, thanks guys. I'm assuming that I shouldn't have to worry about malformed xml (hopefully, haha), which makes it even easier/faster.
The dumps are well-formed XML, of course; the problem is that the tags are not always in the same order and the revisions are not always in chronological order... and of course the revision text is a real mess!
I suggest you have a look at our library and start by using it to build simple scripts. It's really easy! All you have to do is write a method named process_<tag> for every tag you care about (e.g. process_title for the title tag). Have a look at https://github.com/volpino/wiki-network/blob/master/revisions_page.py for an example: it's a simple script that takes a pages-meta-history dump and extracts the revisions of a specific set of pages to a CSV file.
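To give you a rough idea of the process_<tag> pattern, here is a minimal self-contained sketch using Python's standard SAX parser. The class name, the dump snippet, and the dispatch details are illustrative only, not the actual wiki-network API:

```python
import xml.sax

class PageProcessor(xml.sax.handler.ContentHandler):
    """Sketch of a handler that dispatches each closed tag to a
    process_<tag> method, as described above (illustrative, not the
    real wiki-network implementation)."""

    def __init__(self):
        super().__init__()
        self._buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        # Reset the text buffer at the start of every element.
        self._buffer = []

    def characters(self, content):
        # SAX may deliver element text in several chunks; collect them.
        self._buffer.append(content)

    def endElement(self, name):
        # Dispatch to process_<tag> if such a method is defined.
        handler = getattr(self, "process_" + name, None)
        if handler is not None:
            handler("".join(self._buffer))

    def process_title(self, text):
        # Called once per <title> element in the dump.
        self.titles.append(text)

if __name__ == "__main__":
    snippet = b"<mediawiki><page><title>Foo</title></page></mediawiki>"
    proc = PageProcessor()
    xml.sax.parseString(snippet, proc)
    print(proc.titles)  # -> ['Foo']
```

The nice thing about this dispatch style is that adding support for a new tag is just a matter of defining one more process_<tag> method.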
Feel free to write me for more information ;)