Bugs item #3080364, was opened at 2010-10-03 15:51
Message generated for change (Tracker Item Submitted) made by emijrp
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=3080364...

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: General
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: emijrp (emijrp)
Assigned to: Nobody/Anonymous (nobody)
Summary: xmlreader.py fails a lot
Initial Comment: Hi all;
I think there is an error in xmlreader.py. When parsing a full-history XML dump (in this case [1]) with this code [2] (look at the try/except block, which writes to the console when parsing fails), the username, timestamp and revision id come out correctly, but sometimes the page title and the page id are None or an empty string.
The first error is:
['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
Note the empty string for the title and the None for the page id.
But if we run:
7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
We get this [3], which looks fine: the page title and the page id are present in the XML, but they are not parsed correctly. And this is not the only page title and page id that fails.
Perhaps I have missed something, because I'm still learning to parse XML. Sorry in that case.
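To make the expected result concrete, here is a minimal sketch that does not use xmlreader.py at all. It relies only on the standard library's xml.etree.ElementTree.iterparse and assumes the usual dump layout, where <title> and <id> appear inside <page> before the first <revision>. The names iter_revisions, child_text and local are hypothetical helpers, and SAMPLE is made-up data, not taken from the kwwiki dump. It is only an illustration of the behaviour I expect, not a proposed fix.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample for illustration; NOT taken from the kwwiki dump.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Example page</title>
    <id>10</id>
    <revision>
      <id>100</id>
      <timestamp>2004-01-01T00:00:00Z</timestamp>
      <contributor><username>UserA</username><id>1</id></contributor>
    </revision>
    <revision>
      <id>101</id>
      <timestamp>2004-01-02T00:00:00Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop a namespace prefix: '{uri}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Text of the first direct child called `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def iter_revisions(stream):
    """Yield [title, pageid, username, timestamp, revisionid] per revision."""
    page = None
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            if name == "page":
                page = elem  # keep the enclosing page for later lookups
        elif name == "revision" and page is not None:
            contributor = next(
                (c for c in elem if local(c.tag) == "contributor"), None)
            user = None
            if contributor is not None:
                # Anonymous edits carry <ip> instead of <username>.
                user = (child_text(contributor, "username")
                        or child_text(contributor, "ip"))
            yield [child_text(page, "title"), child_text(page, "id"), user,
                   child_text(elem, "timestamp"), child_text(elem, "id")]
            page.remove(elem)  # drop the finished revision to save memory
        elif name == "page":
            page = None
            elem.clear()
```

Calling list(iter_revisions(io.BytesIO(SAMPLE))) gives one list per revision, and every list carries the title and id of the page it belongs to. The page-level values are read from the enclosing <page> element rather than from the revision itself, which is the part that seems to go wrong in my run.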
Regards, emijrp
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his...
[2] http://pastebin.ca/1951930
[3] http://pastebin.ca/1951937
----------------------------------------------------------------------
pywikipedia-bugs@lists.wikimedia.org