Re: [Pywikipedia-l] XMLreader.py

5 Oct 2010


"emijrp" emijrp@gmail.com wrote in message
news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...
...
I think that there is an error in xmlreader.py. When parsing a full
revision XML (in this case[1]), using this code[2] (look at the
try-catch, it writes when it fails), I correctly get the username,
timestamp and revisionid, but sometimes the page title and the page
id are None or an empty string.
...
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his...
[2] http://pastebin.ca/1951930
[3] http://pastebin.ca/1951937

I have been completely unable to replicate this supposed error. I
downloaded the same kwwiki dump file that you referenced, loaded it with
xmlreader.XmlDump, ran it through the parser, and counted the number of
XmlEntry objects it generated: 4711. Then, as a test, I opened the same dump
as a text file and counted the number of lines that contain the string
"<page>": 4711. So the parser is correctly returning one object per page
item found in the file.
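
A minimal sketch of that cross-check, for anyone who wants to repeat it. It
does not use xmlreader.XmlDump: as an assumption, it swaps in the standard
library's xml.etree.ElementTree so that it runs without pywikipedia, and it
assumes the dump has already been decompressed to plain XML. The function
names are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_page_lines(path):
    """Count lines of the dump that contain the literal string '<page>'."""
    with open(path, encoding="utf-8") as dump:
        return sum(1 for line in dump if "<page>" in line)


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def count_parsed_pages(path):
    """Count <page> elements seen by a streaming XML parser."""
    total = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if local_name(elem.tag) == "page":
            total += 1
            elem.clear()  # release the page's children; history dumps are large
    return total
```

If the two counts agree for the same file, the parser is producing one
record per page, which is the property checked above.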

Next I ran the parser again with a script that would print a message if
any XmlEntry object had a missing title (None or empty string); it printed
no messages.
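
The missing-title check can be sketched in the same way. This is not the
script used for the test above; it is an illustration that again assumes a
decompressed dump and uses xml.etree.ElementTree in place of xmlreader.py.
The name pages_missing_title is hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def pages_missing_title(path):
    """Yield the 1-based position of each <page> with no usable <title>."""
    position = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        position += 1
        title = None
        for child in elem:  # <title> is a direct child of <page>
            if local_name(child.tag) == "title":
                title = child.text
                break
        if not title or not title.strip():
            yield position
        elem.clear()  # release the page's children before moving on
```

Printing one line per yielded position, for example
`for n in pages_missing_title("dump.xml"): print("page", n, "has no title")`,
gives the same kind of report: no output means every page carried a title.
The file name dump.xml is a placeholder.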

Then I searched for the specific page entry you showed in your pastebin item
[3]. The result of this test is shown at [4]. In short, it found exactly the
page title you said was missing.

I cannot explain why your results are different from mine, unless perhaps
you have a corrupted copy of the dump file, or are not using the current
version of xmlreader.py.

Russ

[4] http://pastebin.ca/1955170
