Hi all;
I think there is an error in xmlreader.py. When parsing a full-revision XML dump (in this case[1]) with this code[2] (note the try/except, which logs a line whenever parsing fails), I correctly get the username, timestamp and revision id, but sometimes the page title and the page id are None or an empty string.
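The loop is roughly like this (a simplified sketch, not the exact code at [2]):

    import xmlreader

    # allrevisions=True yields one XmlEntry per <revision>; title and
    # id come from the enclosing <page> element.
    dump = xmlreader.XmlDump('kwwiki-20100926-pages-meta-history.xml.bz2',
                             allrevisions=True)
    for entry in dump.parse():
        if not entry.title or not entry.id:
            # log the fields of the failing revision
            print [entry.title, entry.id, entry.username,
                   entry.timestamp, entry.revisionid]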
The first error is: ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
But if we do: 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
We get this[3], which looks fine: the page title and page id are available in the XML, they just aren't parsed correctly. And this is not the only page title and page id that fail.
Perhaps I have missed something, since I'm still learning to parse XML. Sorry if that is the case.
Regards, emijrp
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his... [2] http://pastebin.ca/1951930 [3] http://pastebin.ca/1951937
Hi Emijrp,
That's "normal". Page id/title can be None/empty for deleted pages.
-- Regards, Dmitry
Hi, thanks for your quick response, but I have a question: why are deleted pages included in the dump? Also, the page from the error is not deleted on the wiki.[1]
[1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
Furthermore, if you look at the chunk of the dump that I posted, the page title and page id are there, but the parser doesn't get them.
I see. Strange... That indeed looks like a parser bug.
-- Dmitry
Maybe I am wrong, but xqt told me once that the PreloadingGenerator has problems with the API. I myself have had problems with deleted (and redirect) pages when loading multiple pages at once through the API, too.
So my assumption is that this XML parser does indeed have a problem parsing deleted (and maybe redirect) pages and thus fails to return them all, which is why the PreloadingGenerator does not work with the API. If I am right, the solution to the problem mentioned here could also solve the preloading-with-API problem. That would be very nice! But to be sure, I would appreciate a comment by xqt on this ;))
Just some thoughts...
Greetings DrTrigon
The dump doesn't include deleted pages or revisions. The dump has the values but the parser doesn't parse them.
Ok! Good to know - thanks! What does this mean, e.g. for preloading? Is the parser meant to parse deleted pages too (and it is an error that it does not), or is this part of the parser's design?
Thanks and greetings!
"emijrp" emijrp@gmail.com wrote in message news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...
I think that there is an error in xmlreader.py. When parsing a full revision XML (in this case[1]), using this code[2] (look at the try-catch, it writes when fails) I get correctly username, timestamp and revisionid, but sometimes, the page title and the page id are None or empty string.
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his... [2] http://pastebin.ca/1951930 [3] http://pastebin.ca/1951937
I have been completely unable to replicate this supposed error. I downloaded the same kwwiki dump file that you referenced, loaded it with xmlreader.XmlDump, ran it through the parser, and counted the number of XmlEntry objects it generated: 4711. Then, as a test, I opened the same dump as a text file and counted the number of lines that contain the string "<page>": 4711. So the parser is correctly returning one object per page item found in the file.
Next I ran the parser again with a script that would print a message if any XmlEntry object had a missing title (None or empty string); no messages.
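The test was essentially the following (a simplified sketch, not my exact script):

    import bz2
    import xmlreader

    FN = 'kwwiki-20100926-pages-meta-history.xml.bz2'

    # One XmlEntry per <page> element (default mode).
    entries = list(xmlreader.XmlDump(FN).parse())
    print len(entries)                                   # -> 4711

    # Compare against the raw number of <page> lines in the dump.
    print sum(1 for line in bz2.BZ2File(FN)
              if '<page>' in line)                       # -> 4711

    # Flag any entry whose title is missing.
    for e in entries:
        if not e.title:
            print 'missing title at', e.timestamp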
Then I searched for the specific page entry you showed in your pastebin item [3]. The result of this test is shown at [4]. In short, it found exactly the page title you said was missing.
I cannot explain why your results are different than mine, unless perhaps you have a corrupted copy of the dump file, or are not using the current version of xmlreader.py.
Russ
[4] http://pastebin.ca/1955170
As far as I remember, xmlreader can use alternative XML parsing mechanisms: cElementTree, ElementTree, or regexps. The version of cElementTree depends on the Python version. My bet is that it's the regexp method's fault. Or maybe cElementTree's fault [IMHO that library has never been up to standard].
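From memory, the backend selection at the top of xmlreader.py is something like this (check the actual source for the exact logic):

    # Typical fallback chain; if neither ElementTree variant can be
    # imported, the module uses its regex-based parser instead.
    try:
        from xml.etree.cElementTree import iterparse  # C accelerator
    except ImportError:
        try:
            from xml.etree.ElementTree import iterparse  # pure Python
        except ImportError:
            iterparse = None  # -> regex_parse is used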
-- Dmitry
I have tested your code with both the bz2 and the 7z dumps, and I get titles with a None value. The first one is the same error that appears in my code.
Reading XML dump...
None 2004-10-10T04:24:14Z
I have the latest version of pywikipediabot and Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56). It could be an error in Python or in cElementTree. What are your versions?
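You can print them with something like this:

    import sys
    import xml.etree.ElementTree
    print sys.version                      # Python build
    print xml.etree.ElementTree.VERSION    # ElementTree library version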
"emijrp" emijrp@gmail.com wrote in message news:AANLkTi=A-=HYv03T+xyhvFurJqCYA-bCjfMhx6N13pGD@mail.gmail.com...
I have tested your code, with the bz2 and 7z dumps, and I get titles with None value. The first one is the same error that apperas in my code.
Reading XML dump... None 2004-10-10T04:24:14Z
I have the last version of pywikipediabot and Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56). Probably, it can be a error of Python or cElementTree. What are your versions?
Pywikipedia [svn+ssh] (r8609, 2010/10/05, 16:21:42)
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
However, I retried the same tests under Python 2.6.5 and got the same results.
Try the following and see if your result is different than mine:
    >>> import xmlreader
    >>> dump = xmlreader.XmlDump("kwwiki-20100926-pages-meta-history.xml.bz2")
    >>> parser = dump.parse()
    Reading XML dump...
    >>> print parser
    <generator object new_parse at 0x0132A968>
If you get <generator object regex_parse ...> instead of new_parse, then you don't have elementtree available, although, as it has been in the standard library since Python 2.5, that would be somewhat surprising.
Russ
I get this:
    >>> dump = xmlreader.XmlDump("kwwiki-20100926-pages-meta-history.xml.bz2")
    >>> parser = dump.parse()
    Reading XML dump...
    >>> print parser
    <generator object new_parse at 0xb7782b94>
Very weird.
What is your OS, and how did you install Python 2.7 (r27:82525)?
cElementTree. What are your versions?
Python 2.6.4 (r264:75706, Jun 4 2010, 18:20:31)
(on f13 x64)
Greetings
You didn't replicate the exact case. You must use xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only one revision (the last?) of every page, which is why it shows 4711, but you skipped the errors that happen when parsing the whole dump.
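That is, something like this (sketch):

    import xmlreader

    FN = 'kwwiki-20100926-pages-meta-history.xml.bz2'

    # Default mode: one XmlEntry per <page>, a single revision each.
    print len(list(xmlreader.XmlDump(FN).parse()))       # 4711

    # allrevisions=True: one XmlEntry per <revision>; the empty
    # titles only show up in this mode.
    bad = [e for e in xmlreader.XmlDump(FN, allrevisions=True).parse()
           if not e.title]
    print len(bad)                                       # > 0 here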
I think the problem is in the xmlreader.py module. I don't know why, but I suspect it sometimes clears the title, the user, or other variables before completing the entire list of revisions for a page, so when you read such a revision, those values have already disappeared.
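I have not traced the exact code path, but one classic way this can happen with iterparse is the memory-saving clear() pattern (a hypothetical sketch; 'dump.xml' is a placeholder):

    from xml.etree.cElementTree import iterparse

    for event, elem in iterparse('dump.xml', events=('end',)):
        if elem.tag.endswith('revision'):
            # If revisions are yielded lazily and the enclosing <page>
            # element has already been cleared when a revision is
            # consumed, page-level fields like <title> and <id> read
            # back as None or an empty string.
            pass
        elif elem.tag.endswith('page'):
            elem.clear()  # frees children, including <title> and <id>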