As far as I remember, xmlreader can use alternative mechanisms for XML
parsing: cElementTree, ElementTree, or regexp. The version of cElementTree
depends on the Python version. My bet is that the regexp method is at fault.
Or maybe cElementTree is at fault [IMHO this library has never been up to
standard].
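
For illustration, the kind of backend fallback mentioned above is often written as follows. This is only a sketch of the generic idiom, not the actual code in xmlreader.py; the names `backend` and `parse_title` are made up for this example.

```python
# Generic fallback-import idiom; NOT copied from xmlreader.py.
try:
    # C-accelerated parser; this module was removed in Python 3.9.
    import xml.etree.cElementTree as ElementTree
    backend = "cElementTree"
except ImportError:
    # Standard API (it picks up the C accelerator itself on Python 3.3+).
    import xml.etree.ElementTree as ElementTree
    backend = "ElementTree"


def parse_title(page_xml):
    """Return the <title> text of a single <page> fragment, or None."""
    return ElementTree.fromstring(page_xml).findtext("title")
```

Printing `backend` on the machine that shows the problem would at least tell which parser is in use there.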
On Tue, Oct 5, 2010 at 2:35 PM, Russell Blau <russblau(a)hotmail.com> wrote:
"emijrp" <emijrp(a)gmail.com> wrote in
I think that there is an error in xmlreader.py.
When parsing a full
revision XML (in this case), using this code (look at the
try-catch, it writes when it fails), I get the username, timestamp
and revision id correctly, but sometimes the page title and the page
id are None or an empty string.
I have been completely unable to replicate this supposed error. I
downloaded the same kwwiki dump file that you referenced. I loaded it with
xmlreader.XmlDump, ran it through the parser, and counted the number of
XMLEntry objects it generated: 4711. Then as a test I opened the same dump
as a text file and counted the number of lines that contain the string
"<page>": 4711. So the parser is correctly returning one object per
item found in the file.
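
The count comparison described above can be reproduced without xmlreader, which helps separate a parser problem from a damaged dump file. The following is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the dump has already been decompressed to a UTF-8 text file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(path):
    """Count <page> elements by streaming the dump through a parser."""
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if local_name(elem.tag) == "page":
            count += 1
            elem.clear()  # release the page that was just counted
    return count


def count_page_lines(path):
    """Count lines of the raw text that contain the string '<page>'."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if "<page>" in line)
```

If the two functions return the same number for the same file, the file itself is structurally sound and any missing pages would have to come from the parsing layer.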
Next I ran the parser again with a script that would print out a message if
any XMLEntry object had a missing title (None or empty string); no such message was printed.
Then I searched for the specific page entry you showed in your pastebin.
The result of this test is shown at . In short, it found exactly the
page title you said was missing.
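
A missing-title check of the kind described above can also be run independently of xmlreader. This is a rough sketch using the standard library only, so it does not exercise xmlreader.py itself; `pages_missing_title` is a hypothetical name, and only the direct children of each <page> element are inspected.

```python
import xml.etree.ElementTree as ET


def pages_missing_title(path):
    """Yield the <id> text of every <page> whose <title> is absent or empty."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        # Tags carry a '{namespace}' prefix in MediaWiki dumps; drop it.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title = page_id = None
        for child in elem:  # direct children only, so revision ids are ignored
            name = child.tag.rsplit("}", 1)[-1]
            if name == "title":
                title = child.text
            elif name == "id":
                page_id = child.text
        if not title or not title.strip():
            yield page_id
        elem.clear()  # release the page once it has been checked
```

An empty result from `list(pages_missing_title(path))` on a given copy of the dump means every page in that copy carries a title, which points back at the local setup rather than the data.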
I cannot explain why your results are different from mine, unless perhaps
you have a corrupted copy of the dump file, or are not using the current
version of xmlreader.py.
Pywikipedia-l mailing list