Re: [Pywikipedia-l] XMLreader.py

1 Oct 2010


Maybe I am wrong, but xqt once told me that the PreloadingGenerator
has problems with the API. I myself had problems with deleted (and
redirect) pages when loading multiple pages at once through the API.
So my assumption is that this XML parser does indeed have a problem
parsing deleted (and maybe redirect) pages, fails to return them all,
and so the PreloadingGenerator does not work with the API.
If I am right about this, the solution to the problem mentioned here
could also solve the preloading-with-API problem. That would be very
nice! But to be sure, I would appreciate a comment from xqt on this ;))
Just some thoughts...
Greetings
DrTrigon
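
One way to test the "fails to return them all" assumption would be to compare the titles requested in a batch with the titles that actually come back. The following is only a minimal sketch: the function names and the sample batch are made up for illustration, the title normalization is deliberately crude, and nothing here is taken from the PreloadingGenerator itself.

```python
def normalize(title):
    """Very rough title normalization: underscores to spaces, trimmed."""
    return title.replace('_', ' ').strip()


def unreturned_titles(requested, returned):
    """Return the requested titles that never came back, in request order."""
    # Entries with a None or empty title cannot be matched, so skip them.
    seen = {normalize(t) for t in returned if t}
    return [t for t in requested if normalize(t) not in seen]


# Made-up batch: the second title stands in for a deleted page whose
# title came back as None.
requested = ['Main_Page', 'Some deleted page', 'Example']
returned = ['Main Page', None, 'Example']
print(unreturned_titles(requested, returned))
```

If the list printed for a real batch is non-empty and consists only of deleted or redirect pages, that would support the assumption above.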
On 01.10.2010 00:52, Dmitry Chichkov wrote:
...
I see. Strange... That indeed looks like a parser bug.
-- Dmitry
On Thu, Sep 30, 2010 at 3:37 PM, emijrp <emijrp@gmail.com> wrote:
Furthermore, if you look at the chunk of the dump that I posted,
the page title and page id are there, but the parser doesn't get them.
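
As an independent cross-check, the dump could also be streamed with Python's standard xml.etree.ElementTree instead of xmlreader.py. This is only a minimal sketch under stated assumptions: the helper names (local, child_text, iter_revisions) and the sample XML are made up for illustration, and it assumes the usual export layout in which <title> and <id> appear before the <revision> elements of a page. It is not how xmlreader.py works internally.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def child_text(elem, *names):
    """Follow a chain of child tag names (namespace ignored); None if absent."""
    for wanted in names:
        elem = next((c for c in elem if local(c.tag) == wanted), None)
        if elem is None:
            return None
    return elem.text


def iter_revisions(source):
    """Yield [title, page_id, username, timestamp, rev_id] per revision."""
    path = []                      # stack of currently open element names
    title = page_id = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        name = local(elem.tag)
        if event == 'start':
            path.append(name)
            continue
        path.pop()
        parent = path[-1] if path else None
        if parent == 'page' and name == 'title':
            title = elem.text
        elif parent == 'page' and name == 'id':
            # Only the page-level <id>; revision and contributor ids
            # have a different parent and are ignored here.
            page_id = elem.text
        elif name == 'revision':
            yield [title, page_id,
                   child_text(elem, 'contributor', 'username'),
                   child_text(elem, 'timestamp'),
                   child_text(elem, 'id')]
            elem.clear()           # free memory on large dumps
        elif name == 'page':
            title = page_id = None
            elem.clear()


# Made-up sample in the general shape of a MediaWiki export.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>7</id>
    <revision>
      <id>100</id>
      <timestamp>2010-01-01T00:00:00Z</timestamp>
      <contributor><username>ExampleUser</username><id>3</id></contributor>
      <text>first</text>
    </revision>
    <revision>
      <id>101</id>
      <timestamp>2010-01-02T00:00:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>second</text>
    </revision>
  </page>
</mediawiki>"""

for row in iter_revisions(io.BytesIO(SAMPLE)):
    print(row)
```

To run it over a real dump, a file path or the decompressed stream could be passed in place of io.BytesIO(SAMPLE). Rows whose title or page id still come out as None would then point at the dump rather than at the parser.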

2010/10/1 emijrp <emijrp@gmail.com <mailto:emijrp@gmail.com>>

    Hi, thanks for your quick response, but I have a question. Why
    are deleted pages included in the dump? Also, the page that
    triggers the error is not deleted on the wiki.[1]

    [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497

    2010/9/30 Dmitry Chichkov <dchichkov@gmail.com
    <mailto:dchichkov@gmail.com>>

        Hi Emijrp,

        That's "normal". Page id/title can be None/empty for deleted
        pages.

        -- Regards, Dmitry


        On Thu, Sep 30, 2010 at 9:50 AM, emijrp <emijrp@gmail.com
        <mailto:emijrp@gmail.com>> wrote:

            Hi all;

            I think that there is an error in xmlreader.py. When
            parsing a full-revision XML dump (in this case[1]) using
            this code[2] (look at the try-except; it writes when it
            fails), I correctly get the username, timestamp and
            revision id, but sometimes the page title and the page
            id are None or an empty string.

            The first error is:
            ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z',
            '4267']

            But if we do:
            7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z
            2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20

            We get this[3], which is OK: the page title and the page
            id are available in the XML, but are not parsed correctly.
            And this is not the only page title and page id that fails.

            Perhaps I have missed something, because I'm still
            learning to parse XML. Sorry in that case.

            Regards,
            emijrp

            [1]
            http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
            [2] http://pastebin.ca/1951930
            [3] http://pastebin.ca/1951937

            _______________________________________________
            Pywikipedia-l mailing list
            Pywikipedia-l@lists.wikimedia.org
            <mailto:Pywikipedia-l@lists.wikimedia.org>
            https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l


