Hi all;
I think there is an error in xmlreader.py. When parsing a full-revision XML dump (in this case[1]) with this code[2] (note the try/except, which logs a line whenever parsing fails), I correctly get the username, timestamp and revision id, but sometimes the page title and the page id are None or an empty string.
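The loop is roughly like this (a simplified sketch, not the exact code at [2]):

    import xmlreader

    # allrevisions=True yields one XmlEntry per <revision>; title and
    # id come from the enclosing <page> element.
    dump = xmlreader.XmlDump('kwwiki-20100926-pages-meta-history.xml.bz2',
                             allrevisions=True)
    for entry in dump.parse():
        if not entry.title or not entry.id:
            # log the fields of the failing revision
            print [entry.title, entry.id, entry.username,
                   entry.timestamp, entry.revisionid]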
The first error is: ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
But if we do: 7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
We get this[3], which looks fine: the page title and page id are available in the XML, they just aren't parsed correctly. And this is not the only page title and page id that fail.
Perhaps I have missed something, since I'm still learning to parse XML. Sorry if that is the case.
Regards, emijrp
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his... [2] http://pastebin.ca/1951930 [3] http://pastebin.ca/1951937
Hi Emijrp,
That's "normal". Page id/title can be None/empty for deleted pages.
-- Regards, Dmitry
Hi, thanks for your quick response, but I have a question: why are deleted pages included in the dump? Also, the page from the error is not deleted on the wiki.[1]
[1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
Furthermore, if you look at the chunk of the dump that I posted, the page title and page id are there, but the parser doesn't get them.
I see. Strange... That indeed looks like a parser bug.
-- Dmitry
Maybe I am wrong, but xqt told me once that the PreloadingGenerator has problems with the API. I myself have had problems with deleted (and redirect) pages when loading multiple pages at once through the API, too.
So my assumption is that this XML parser does indeed have a problem parsing deleted (and maybe redirect) pages and thus fails to return them all, which is why the PreloadingGenerator does not work with the API. If I am right, the solution to the problem mentioned here could also solve the preloading-with-API problem. That would be very nice! But to be sure, I would appreciate a comment by xqt on this ;))
Just some thoughts...
Greetings DrTrigon
The dump doesn't include deleted pages or revisions. The dump has the values but the parser doesn't parse them.
Ok! Good to know - thanks! What does this mean, e.g. for preloading? Is the parser meant to parse deleted pages too (and it is an error that it does not), or is this part of the parser's design?
Thanks and greetings!
"emijrp" emijrp@gmail.com wrote in message news:AANLkTimu0+xJMBU1f48z8di9deBS_4_gmC_gOB6t82iJ@mail.gmail.com...
I think that there is an error in xmlreader.py. When parsing a full revision XML (in this case[1]), using this code[2] (look at the try-catch, it writes when fails) I get correctly username, timestamp and revisionid, but sometimes, the page title and the page id are None or empty string.
[1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-his... [2] http://pastebin.ca/1951930 [3] http://pastebin.ca/1951937
I have been completely unable to replicate this supposed error. I downloaded the same kwwiki dump file that you referenced, loaded it with xmlreader.XmlDump, ran it through the parser, and counted the number of XmlEntry objects it generated: 4711. Then, as a test, I opened the same dump as a text file and counted the number of lines that contain the string "<page>": 4711. So the parser is correctly returning one object per page item found in the file.
Next I ran the parser again with a script that would print a message if any XmlEntry object had a missing title (None or empty string); no messages.
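The test was essentially the following (a simplified sketch, not my exact script):

    import bz2
    import xmlreader

    FN = 'kwwiki-20100926-pages-meta-history.xml.bz2'

    # One XmlEntry per <page> element (default mode).
    entries = list(xmlreader.XmlDump(FN).parse())
    print len(entries)                                   # -> 4711

    # Compare against the raw number of <page> lines in the dump.
    print sum(1 for line in bz2.BZ2File(FN)
              if '<page>' in line)                       # -> 4711

    # Flag any entry whose title is missing.
    for e in entries:
        if not e.title:
            print 'missing title at', e.timestamp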
Then I searched for the specific page entry you showed in your pastebin item [3]. The result of this test is shown at [4]. In short, it found exactly the page title you said was missing.
I cannot explain why your results are different than mine, unless perhaps you have a corrupted copy of the dump file, or are not using the current version of xmlreader.py.
Russ
[4] http://pastebin.ca/1955170
As far as I remember, xmlreader can use alternative XML parsing mechanisms: cElementTree, ElementTree, or regexps. The version of cElementTree depends on the Python version. My bet is that it's the regexp method's fault. Or maybe cElementTree's fault [IMHO that library has never been up to standard].
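From memory, the backend selection at the top of xmlreader.py is something like this (check the actual source for the exact logic):

    # Typical fallback chain; if neither ElementTree variant can be
    # imported, the module uses its regex-based parser instead.
    try:
        from xml.etree.cElementTree import iterparse  # C accelerator
    except ImportError:
        try:
            from xml.etree.ElementTree import iterparse  # pure Python
        except ImportError:
            iterparse = None  # -> regex_parse is used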
-- Dmitry
I have tested your code with both the bz2 and the 7z dumps, and I get titles with a None value. The first one is the same error that appears in my code.
Reading XML dump...
None 2004-10-10T04:24:14Z
I have the latest version of pywikipediabot and Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56). It could be an error in Python or in cElementTree. What are your versions?
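You can print them with something like this:

    import sys
    import xml.etree.ElementTree
    print sys.version                      # Python build
    print xml.etree.ElementTree.VERSION    # ElementTree library version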
"emijrp" emijrp@gmail.com wrote in message news:AANLkTi=A-=HYv03T+xyhvFurJqCYA-bCjfMhx6N13pGD@mail.gmail.com...
I have tested your code, with the bz2 and 7z dumps, and I get titles with None value. The first one is the same error that apperas in my code.
Reading XML dump... None 2004-10-10T04:24:14Z
I have the last version of pywikipediabot and Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56). Probably, it can be a error of Python or cElementTree. What are your versions?
Pywikipedia [svn+ssh] (r8609, 2010/10/05, 16:21:42)
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
However, I retried the same tests under Python 2.6.5 and got the same results.
Try the following and see if your result is different than mine:
    >>> import xmlreader
    >>> dump = xmlreader.XmlDump("kwwiki-20100926-pages-meta-history.xml.bz2")
    >>> parser = dump.parse()
    Reading XML dump...
    >>> print parser
    <generator object new_parse at 0x0132A968>
If you get <generator object regex_parse ...> instead of new_parse, then you don't have elementtree available, although, as it has been in the standard library since Python 2.5, that would be somewhat surprising.
Russ
I get this:
    >>> dump = xmlreader.XmlDump("kwwiki-20100926-pages-meta-history.xml.bz2")
    >>> parser = dump.parse()
    Reading XML dump...
    >>> print parser
    <generator object new_parse at 0xb7782b94>
Very weird.
What is your OS, and how did you install Python 2.7 (r27:82525)?
cElementTree. What are your versions?
Python 2.6.4 (r264:75706, Jun 4 2010, 18:20:31)
(on f13 x64)
Greetings
You didn't replicate the exact case. You must use xmlreader.XmlDump(dumpfilename, allrevisions=True). I guess you parsed only one revision (the last?) of every page, which is why it shows 4711, but you skipped the errors that happen when parsing the whole dump.
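That is, something like this (sketch):

    import xmlreader

    FN = 'kwwiki-20100926-pages-meta-history.xml.bz2'

    # Default mode: one XmlEntry per <page>, a single revision each.
    print len(list(xmlreader.XmlDump(FN).parse()))       # 4711

    # allrevisions=True: one XmlEntry per <revision>; the empty
    # titles only show up in this mode.
    bad = [e for e in xmlreader.XmlDump(FN, allrevisions=True).parse()
           if not e.title]
    print len(bad)                                       # > 0 here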
I think the problem is in the xmlreader.py module. I don't know why, but I suspect it sometimes clears the title, the user, or other variables before completing the entire list of revisions for a page, so when you read such a revision, those values have already disappeared.
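I have not traced the exact code path, but one classic way this can happen with iterparse is the memory-saving clear() pattern (a hypothetical sketch; 'dump.xml' is a placeholder):

    from xml.etree.cElementTree import iterparse

    for event, elem in iterparse('dump.xml', events=('end',)):
        if elem.tag.endswith('revision'):
            # If revisions are yielded lazily and the enclosing <page>
            # element has already been cleared when a revision is
            # consumed, page-level fields like <title> and <id> read
            # back as None or an empty string.
            pass
        elif elem.tag.endswith('page'):
            elem.clear()  # frees children, including <title> and <id>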