I think the problem is in the xmlreader.py module. I don't know why, but I think it sometimes clears the title, user, or other variables before it has finished the entire list of revisions for a page. So when you read a revision, these values have disappeared in some cases.
You didn't replicate the exact case. You must use xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only one revision (the last?) for every page, so it shows 4711. But you skipped the errors that happen when parsing the whole dump.
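To make the suspected failure mode concrete, here is a minimal sketch of per-revision parsing that keeps the page-level values until the whole page has been read. It uses only the standard library, not xmlreader.py, so it says nothing about how xmlreader.py is actually written. The sample XML is made up, and it omits the namespace that real dump files put on every tag.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample: one page with two revisions (not real kwwiki data,
# and without the XML namespace that real dumps carry).
SAMPLE = """<mediawiki>
  <page>
    <title>Example</title>
    <id>1</id>
    <revision><id>10</id><contributor><id>7</id></contributor></revision>
    <revision><id>11</id><contributor><id>8</id></contributor></revision>
  </page>
</mediawiki>"""


def iter_revisions(stream):
    """Yield (title, page_id, revision_id) for every revision in stream."""
    title = page_id = None
    in_revision = False
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            if elem.tag == "revision":
                in_revision = True
            continue
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "id" and not in_revision:
            # Only the page-level <id>; revision and contributor ids
            # are seen while in_revision is True.
            page_id = elem.text
        elif elem.tag == "revision":
            in_revision = False
            yield title, page_id, elem.findtext("id")
            elem.clear()  # free this revision only; keep title and page_id
        elif elem.tag == "page":
            # Reset the page-level values only once the page is complete.
            title = page_id = None
            elem.clear()


for row in iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(row)
```

If the reset of title and page_id happened at the end of each revision instead of at the end of the page, every revision after the first would come back with None for both, which is the symptom I am seeing.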
2010/10/5 Russell Blau <russblau@hotmail.com>
"emijrp" <emijrp@gmail.com> wrote in message
news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...
> I think that there is an error in xmlreader.py. When parsing a full
> revision XML (in this case[1]), using this code[2] (look at the
> try-catch, it writes when fails) I get correctly username,
> timestamp and revisionid, but sometimes, the page title and the page
> id are None or empty string.
> [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> [2] http://pastebin.ca/1951930
> [3] http://pastebin.ca/1951937
I have been completely unable to replicate this supposed error. I
downloaded the same kwwiki dump file that you referenced. I loaded it with
xmlreader.XmlDump, ran it through the parser, and counted the number of
XMLEntry objects it generated: 4711. Then as a test I opened the same dump
as a text file and counted the number of lines that contain the string
"<page>": 4711. So the parser is correctly returning one object per page
item found in the file.
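The text-file half of that cross-check can be sketched as follows. This is only an illustration with a made-up fragment in place of the real dump; the function name count_page_lines is hypothetical.

```python
def count_page_lines(lines):
    """Count the lines that contain the literal string "<page>"."""
    return sum(1 for line in lines if "<page>" in line)


# Made-up fragment standing in for an open dump file.
sample_lines = [
    "<mediawiki>\n",
    "  <page>\n",
    "    <title>A</title>\n",
    "  </page>\n",
    "  <page>\n",
    "    <title>B</title>\n",
    "  </page>\n",
    "</mediawiki>\n",
]
print(count_page_lines(sample_lines))
```

For a real dump, an open file object can be passed in place of the list. The result should equal the number of parser entries only when the parser yields one entry per page; a per-revision run would presumably yield more entries than pages.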
Next I ran the parser again with a script that would print out a message if
any XMLEntry object had a missing title (None or empty string); no messages.
Then I searched for the specific page entry you showed in your pastebin item
[3]. The result of this test is shown at [4]. In short, it found exactly the
page title you said was missing.
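A check of the kind described above might look like this minimal sketch. The Entry type is a stand-in for the parser's entry objects, and its field names (title, revisionid) are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in for the parser's entry objects; the field names are assumptions.
Entry = namedtuple("Entry", ["title", "revisionid"])


def entries_missing_title(entries):
    """Return the entries whose title is None or an empty string."""
    return [entry for entry in entries if not entry.title]


# Made-up entries: one good, one with None, one with an empty title.
sample = [Entry("Example", "10"), Entry(None, "11"), Entry("", "12")]
for entry in entries_missing_title(sample):
    print("missing title at revision", entry.revisionid)
```

Running the same check over every revision, rather than one entry per page, is what would show whether the title goes missing partway through a page's history.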
I cannot explain why your results are different than mine, unless perhaps
you have a corrupted copy of the dump file, or are not using the current
version of xmlreader.py.
Russ
[4] http://pastebin.ca/1955170
_______________________________________________
Pywikipedia-l mailing list
Pywikipedia-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l