[Pywikipedia-l] XMLreader.py

Tue Oct 5 22:25:43 UTC 2010

As far as I remember, xmlreader can use alternative XML parsing
mechanisms: cElementTree, ElementTree, or regexps. Which version of
cElementTree you get depends on the Python version. My bet is that the
regexp method is at fault. Or maybe cElementTree is [IMHO this library has
never been up to the standard].
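
One way to narrow this down would be to cross-check the dump with a parser
that does not go through xmlreader at all. The following is only a minimal
sketch using the standard library's ElementTree.iterparse, not xmlreader's
own code path. The SAMPLE data, the namespace URI in it, and the helper
names local() and check_pages() are my own assumptions for illustration; a
real run would pass the decompressed dump file instead of SAMPLE.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up stand-in for a MediaWiki dump. Real dumps declare a versioned
# export namespace on the root element; the URI here is only an example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Kernow</title>
    <id>1</id>
    <revision><id>10</id><timestamp>2010-09-26T00:00:00Z</timestamp></revision>
    <revision><id>11</id><timestamp>2010-09-26T01:00:00Z</timestamp></revision>
  </page>
  <page>
    <title></title>
    <id>2</id>
    <revision><id>20</id><timestamp>2010-09-26T02:00:00Z</timestamp></revision>
  </page>
</mediawiki>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_pages(stream):
    """Return (page_count, ids_of_pages_without_title) for a dump stream."""
    pages = 0
    missing = []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        pages += 1
        title = page_id = None
        for child in elem:  # direct children only, so revision ids are skipped
            if local(child.tag) == "title":
                title = child.text
            elif local(child.tag) == "id":
                page_id = child.text
        if not title:  # covers both None and the empty string
            missing.append(page_id)
        elem.clear()  # drop the page's contents once it has been checked
    return pages, missing


pages, missing = check_pages(io.BytesIO(SAMPLE))
print(pages, missing)
```

If the page count and the list of untitled pages from a check like this
agree with what xmlreader reports on the same file, the dump itself is
probably fine and the difference lies in which parsing path is used.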

-- Dmitry

On Tue, Oct 5, 2010 at 2:35 PM, Russell Blau <russblau at hotmail.com> wrote:

> "emijrp" <emijrp at gmail.com> wrote in message
> news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ at mail.gmail.com...
>
> > I think that there is an error in xmlreader.py. When parsing a full
> > revision XML (in this case[1]), using this code[2] (look at the
> > try/except; it writes a message when it fails), I correctly get the
> > username, timestamp and revision id, but sometimes the page title and
> > the page id are None or an empty string.
>
> > [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> > [2] http://pastebin.ca/1951930
> > [3] http://pastebin.ca/1951937
>
> I have been completely unable to replicate this supposed error.  I
> downloaded the same kwwiki dump file that you referenced.  I loaded it with
> xmlreader.XmlDump, ran it through the parser, and counted the number of
> XMLEntry objects it generated: 4711.  Then as a test I opened the same dump
> as a text file and counted the number of lines that contain the string
> "<page>": 4711.  So the parser is correctly returning one object per page
> item found in the file.
>
> Next I ran the parser again with a script that would print out a message if
> any XMLEntry object had a missing title (None or empty string); no
> messages.
>
> Then I searched for the specific page entry you showed in your pastebin
> item [3]. The result of this test is shown at [4]. In short, it found
> exactly the page title you said was missing.
>
> I cannot explain why your results are different than mine, unless perhaps
> you have a corrupted copy of the dump file, or are not using the current
> version of xmlreader.py.
>
> Russ
>
> [4] http://pastebin.ca/1955170