[Pywikipedia-l] XMLreader.py

1 Oct 2010


Hi all,
I think there is a bug in xmlreader.py. When parsing a full-history XML
dump (in this case [1]) with this code [2] (look at the try/except block, it
writes a line when it fails), I correctly get the username, timestamp and
revision id, but sometimes the page title and the page id are None or an
empty string.
The first error is:
['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
But if we do:
7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
We get this [3], which is fine: the page title and the page id are present
in the XML, but they are not parsed correctly. And this is not the only page
title and page id that fails.
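One way to cross-check what is really in the dump, independently of
xmlreader.py, is a small script built on the standard library's
xml.etree.ElementTree.iterparse. This is only a minimal sketch and not how
xmlreader.py works internally; the helper names (local, iter_revisions) are
made up for illustration, and it assumes that <title> and the page <id> come
before the first <revision> inside each <page>.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit('}', 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called `name`, else None."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def iter_revisions(stream):
    """Yield [title, page_id, username, timestamp, rev_id] per revision.

    `stream` is a binary file-like object holding a MediaWiki XML dump.
    A stack of open tag names tells the page <id> apart from the
    revision <id> and the contributor <id>.
    """
    stack = []
    title = page_id = None
    for event, elem in ET.iterparse(stream, events=('start', 'end')):
        name = local(elem.tag)
        if event == 'start':
            stack.append(name)
            if name == 'page':
                # New page: forget the previous page's title and id.
                title = page_id = None
            continue

        stack.pop()
        parent = stack[-1] if stack else None
        if parent == 'page' and name == 'title':
            title = elem.text
        elif parent == 'page' and name == 'id':
            page_id = elem.text
        elif name == 'revision':
            username = None
            for child in elem:
                if local(child.tag) == 'contributor':
                    # Anonymous edits carry <ip> instead of <username>.
                    username = (child_text(child, 'username')
                                or child_text(child, 'ip'))
            yield [title, page_id, username,
                   child_text(elem, 'timestamp'), child_text(elem, 'id')]
            elem.clear()  # drop the revision text to keep memory use low
```

To run it over the decompressed dump, the output of the 7za command can be
piped into a script that loops over iter_revisions(sys.stdin.buffer) and
prints any row whose title or page id is None or empty.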
Perhaps I have missed something, because I'm still learning how to parse
XML. Sorry if that is the case.
Regards,
emijrp
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his...
[2] http://pastebin.ca/1951930
[3] http://pastebin.ca/1951937
