As far as I remember, xmlreader can use alternative XML parsing mechanisms: cElementTree, ElementTree, or regexp. The version of cElementTree depends on the Python version. My bet is that the regexp method is at fault, or maybe cElementTree [IMHO this library has never been up to standard].
-- Dmitry
On Tue, Oct 5, 2010 at 2:35 PM, Russell Blau russblau@hotmail.com wrote:
"emijrp" emijrp@gmail.com wrote in message news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...
I think that there is an error in xmlreader.py. When parsing a full-revision XML dump (in this case [1]) using this code [2] (look at the try/except block, which writes a message when parsing fails), I correctly get the username, timestamp and revision id, but sometimes the page title and the page id are None or an empty string.
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his...
[2] http://pastebin.ca/1951930
[3] http://pastebin.ca/1951937
I have been completely unable to replicate this supposed error. I downloaded the same kwwiki dump file that you referenced. I loaded it with xmlreader.XmlDump, ran it through the parser, and counted the number of XMLEntry objects it generated: 4711. Then as a test I opened the same dump as a text file and counted the number of lines that contain the string "<page>": 4711. So the parser is correctly returning one object per page item found in the file.
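The count comparison described above can be sketched roughly as follows. This is not the script that was actually run, and it does not use xmlreader.XmlDump; it is a standard-library stand-in built on xml.etree.ElementTree.iterparse. The helper names (local_name, count_page_lines, count_page_elements) and the two-page SAMPLE document are made up for illustration; with the real dump you would open the downloaded file instead.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_page_lines(text_stream):
    """Count the lines that contain the literal string '<page>'."""
    return sum(1 for line in text_stream if "<page>" in line)


def count_page_elements(xml_stream):
    """Count <page> elements with a streaming parse (hypothetical helper)."""
    count = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) == "page":
            count += 1
            elem.clear()  # drop the finished page's children to save memory
    return count


# Tiny made-up stand-in for a dump file; not taken from the kwwiki dump.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Alpha</title>
    <id>1</id>
  </page>
  <page>
    <title>Beta</title>
    <id>2</id>
  </page>
</mediawiki>
"""

line_count = count_page_lines(io.StringIO(SAMPLE))
element_count = count_page_elements(io.BytesIO(SAMPLE.encode("utf-8")))
print(line_count, element_count)
```

If the two numbers agree, the parser produced one object per "<page>" line, which is the same sanity check as the 4711 versus 4711 comparison.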
Next I ran the parser again with a script that would print out a message if any XMLEntry object had a missing title (None or empty string); no messages.
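A check of this kind might look like the sketch below. Again, this is only an illustration under assumptions: it parses with the standard library rather than with xmlreader, and pages_missing_title, local_name and the SAMPLE data are hypothetical names and content, not anything from the original script or the kwwiki dump.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_missing_title(xml_stream):
    """Yield the page id of every <page> whose <title> is absent or empty."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = page_id = None
        for child in elem:  # direct children only, so revision ids are ignored
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = child.text
        if not title:  # covers both None and the empty string
            yield page_id
        elem.clear()


# Made-up sample: one good page, one empty title, one missing title.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Alpha</title><id>1</id></page>
  <page><title></title><id>2</id></page>
  <page><id>3</id></page>
</mediawiki>
"""

missing = list(pages_missing_title(io.BytesIO(SAMPLE.encode("utf-8"))))
for page_id in missing:
    print("page with missing title, id:", page_id)
```

On a healthy dump the loop prints nothing, which matches the "no messages" result reported here.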
Then I searched for the specific page entry you showed in your pastebin item [3]. The result of this test is shown at [4]. In short, it found exactly the page title you said was missing.
I cannot explain why your results differ from mine, unless perhaps you have a corrupted copy of the dump file or are not using the current version of xmlreader.py.
Russ
[4] http://pastebin.ca/1955170
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l