Re: [Pywikipedia-l] XMLreader.py

7 Nov 2010

I think that the problem is in the xmlreader.py module. I don't know why,
but I think that it sometimes clears the title, user, or other variables
before completing the entire list of revisions for a page. So when you read a
revision, these values have disappeared in some cases.
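
To illustrate the kind of bookkeeping this hypothesis is about, here is a minimal standalone sketch. It is not the xmlreader.py code; it only assumes the usual dump layout (page > title, id, revision > id, contributor > username) and uses the standard library's xml.etree.ElementTree. The names iter_revisions, local_name, find_child and text_of are made up for the example. The page-level values are kept until the closing page tag, and only finished revision elements are cleared.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def find_child(elem, name):
    """Return the first direct child with the given local name, or None."""
    if elem is None:
        return None
    for child in elem:
        if local_name(child.tag) == name:
            return child
    return None


def text_of(elem):
    """Return the element's text, or None if the element is missing."""
    return elem.text if elem is not None else None


def iter_revisions(source):
    """Yield (title, page_id, rev_id, username) for every revision.

    Assumed layout (not taken from xmlreader.py):
    page > title, id, revision > id, contributor > username.
    """
    title = page_id = None
    in_revision = False
    for event, elem in ET.iterparse(source, events=("start", "end")):
        tag = local_name(elem.tag)
        if event == "start":
            if tag == "revision":
                in_revision = True
            continue
        if tag == "title":
            title = elem.text
        elif tag == "id" and not in_revision:
            # An id outside any revision is the page id.
            page_id = elem.text
        elif tag == "revision":
            rev_id = text_of(find_child(elem, "id"))
            contributor = find_child(elem, "contributor")
            username = text_of(find_child(contributor, "username"))
            yield (title, page_id, rev_id, username)
            elem.clear()  # free the finished revision only
            in_revision = False
        elif tag == "page":
            # Reset page-level values only once the whole page is done.
            title = page_id = None
            elem.clear()
```

With this ordering every revision of a page is reported with the same title and page id, because those values are reset only after the last revision has been yielded.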

2010/11/7 emijrp <emijrp(a)gmail.com>

...
  You didn't replicate the exact case. You must use:
 xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only
 one revision (the last?) for every page, so it shows 4711. But you skipped
 the errors which happen when parsing the whole dump.

 2010/10/5 Russell Blau <russblau(a)hotmail.com>

 "emijrp" <emijrp(a)gmail.com> wrote in message
 news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...

  I think that there is an error in xmlreader.py. When parsing a full
 revision XML (in this case [1]), using this code [2] (look at the
 try-catch, it writes when it fails), I correctly get the username,
 timestamp and revisionid, but sometimes the page title and the page
 id are None or an empty string.

 [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-hi…
 [2] http://pastebin.ca/1951930
 [3] http://pastebin.ca/1951937

 I have been completely unable to replicate this supposed error. I
 downloaded the same kwwiki dump file that you referenced. I loaded it with
 xmlreader.XmlDump, ran it through the parser, and counted the number of
 XMLEntry objects it generated: 4711. Then as a test I opened the same dump
 as a text file and counted the number of lines that contain the string
 "<page>": 4711. So the parser is correctly returning one object per page
 item found in the file.
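
 The text-file count described above could be done along these lines. This is only a sketch: the function names are invented here, and the UTF-8 default encoding is an assumption about the dump file.

```python
def count_page_lines(lines):
    """Count the lines that contain the literal string "<page>"."""
    return sum(1 for line in lines if "<page>" in line)


def count_pages_in_file(path, encoding="utf-8"):
    """Open the dump as plain text and count its "<page>" lines.

    The path is whatever dump file you downloaded; the encoding
    default is an assumption, not something stated in the thread.
    """
    with open(path, encoding=encoding) as handle:
        return count_page_lines(handle)
```

 This counts lines, not tags, so it matches the number of pages only if each "<page>" tag sits on its own line, which is how the dump is described in the message above.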

 Next I ran the parser again with a script that would print out a message if
 any XMLEntry object had a missing title (None or empty string); no
 messages.
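
 A check of this kind might look like the sketch below. It is not the script that was actually run. The helper entries_missing_title is a made-up name and works on any iterable of objects with a title attribute. The commented usage assumes that XmlDump exposes a parse() generator and that entries carry a revisionid attribute; only the XmlDump(dumpfilename, allrevisions=True) call itself comes from this thread.

```python
def entries_missing_title(entries):
    """Yield every entry whose title is None or an empty string."""
    for entry in entries:
        if not getattr(entry, "title", None):
            yield entry


# Hypothetical usage with pywikipedia (assumed API, not executed here):
#
#   import xmlreader
#   dump = xmlreader.XmlDump(dumpfilename, allrevisions=True)
#   for entry in entries_missing_title(dump.parse()):
#       print("missing title, revision id:", entry.revisionid)
```

 Running it once with allrevisions=True and once without would show whether the missing titles appear only when every revision is parsed.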

 Then I searched for the specific page entry you showed in your pastebin item
 [3]. The result of this test is shown at [4]. In short, it found exactly the
 page title you said was missing.

 I cannot explain why your results are different from mine, unless perhaps
 you have a corrupted copy of the dump file, or are not using the current
 version of xmlreader.py.

 Russ

 [4] http://pastebin.ca/1955170

