Re: [Pywikipedia-l] XMLreader.py

1 Oct 2010


Furthermore, if you look at the chunk of the dump that I posted, the page
title and page id are there, but the parser doesn't get them.
2010/10/1 emijrp emijrp@gmail.com
...
Hi, thanks for your quick response, but I have a question. Why are deleted
pages included in the dump? Also, the page that triggers the error is not
deleted on the wiki.[1]
[1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
2010/9/30 Dmitry Chichkov dchichkov@gmail.com
Hi Emijrp,
...
That's "normal". Page id/title can be None/empty for deleted pages.
-- Regards, Dmitry
On Thu, Sep 30, 2010 at 9:50 AM, emijrp emijrp@gmail.com wrote:
...
Hi all;
I think that there is an error in xmlreader.py. When parsing a full
revision XML (in this case[1]) using this code[2] (look at the try-catch,
it writes when it fails), I correctly get the username, timestamp and
revision id, but sometimes the page title and the page id are None or an
empty string.
The first error is:
['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
But if we do:
7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null |
egrep -i '2004-10-10T04:24:14Z' -C20
We get this[3], which is OK: the page title and the page id are available
in the XML, but they are not parsed correctly. And this is not the only
page title and page id that fail.
Perhaps I have missed something, because I'm still learning to parse XML.
Sorry in that case.
Regards,
emijrp
[1]
http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his...
[2] http://pastebin.ca/1951930
[3] http://pastebin.ca/1951937
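
For comparison, here is a minimal sketch of how the same five fields could be pulled from a pages-meta-history dump with only the Python standard library. This is not how xmlreader.py works internally. The function name iter_revisions, the helper local, and the sample XML are made up for illustration. The sketch assumes the usual export layout, where <title> and <id> appear once per <page> and have to be carried over to every <revision> inside that page.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Return elem.text, or None when the element is missing."""
    return elem.text if elem is not None else None


def iter_revisions(source):
    """Yield [title, page_id, username, timestamp, rev_id] for each revision.

    Hypothetical helper: title and page id are remembered per <page>,
    because they are not repeated inside each <revision>.
    """
    title = page_id = None
    path = []  # local names of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            path.append(name)
            if name == "page":
                title = page_id = None  # reset for the new page
            continue
        path.pop()
        parent = path[-1] if path else None
        if parent == "page" and name == "title":
            title = elem.text
        elif parent == "page" and name == "id":
            # Only the <id> directly under <page>; revisions and
            # contributors have their own <id> elements.
            page_id = elem.text
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            username = None
            contributor = fields.get("contributor")
            if contributor is not None:
                for child in contributor:
                    if local(child.tag) in ("username", "ip"):
                        username = child.text
            yield [title, page_id, username,
                   text_of(fields.get("timestamp")), text_of(fields.get("id"))]
            elem.clear()  # drop the revision text to keep memory low


# Invented sample data, shaped like a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>7</id>
    <revision>
      <id>100</id>
      <timestamp>2004-01-01T00:00:00Z</timestamp>
      <contributor><username>ExampleUser</username><id>3</id></contributor>
      <text>first</text>
    </revision>
    <revision>
      <id>101</id>
      <timestamp>2004-01-02T00:00:00Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
      <text>second</text>
    </revision>
  </page>
</mediawiki>"""

rows = list(iter_revisions(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

If a reader like this returns a title and page id for the revision that fails above, the values are present in the dump and the problem is on the parsing side.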

Pywikipedia-l mailing list
Pywikipedia-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l

