OK, good to know, thanks! What does this mean, e.g. for preloading or anything else? Is the parser supposed to parse deleted pages too (so that not doing so is a bug), or is skipping them part of the parser's design?
Thanks and greetings!
On 01.10.2010 16:22, emijrp wrote:
The dump doesn't include deleted pages or revisions. The page title and page id are present in the dump, but the parser doesn't pick them up.
2010/10/1 Dr. Trigon <dr.trigon@surfeu.ch>
Maybe I am wrong, but xqt once told me that the PreloadingGenerator has problems with the API. I have also had problems with deleted (and redirect) pages when loading multiple pages at once through the API. So my assumption is that this XML parser does indeed have a problem parsing deleted (and maybe redirect) pages, fails to return all of them, and that this is why the PreloadingGenerator does not work with the API. If I am right, the solution to the problem mentioned here could also solve the preloading-with-API problem. That would be very nice! But to be sure, I would appreciate a comment from xqt on this ;))

Just some thoughts...

Greetings
DrTrigon

On 01.10.2010 00:52, Dmitry Chichkov wrote:
> I see. Strange... That indeed looks like a parser bug.
>
> -- Dmitry
>
> On Thu, Sep 30, 2010 at 3:37 PM, emijrp <emijrp@gmail.com> wrote:
>
> Furthermore, if you see the chunk of the dump that I have posted, the page title and page id are there. But the parser doesn't get them.
>
> 2010/10/1 emijrp <emijrp@gmail.com>
>
> Hi, thanks for your quick response, but I have a question. Why are deleted pages included in the dump? Also, the page of the error is not deleted in the wiki.[1]
>
> [1] http://kw.wikipedia.org/wiki/Rebellyans_Kernow_1497
>
> 2010/9/30 Dmitry Chichkov <dchichkov@gmail.com>
>
> Hi Emijrp,
>
> That's "normal". Page id/title can be None/empty for deleted pages.
>
> -- Regards, Dmitry
>
> On Thu, Sep 30, 2010 at 9:50 AM, emijrp <emijrp@gmail.com> wrote:
>
> Hi all;
>
> I think that there is an error in xmlreader.py. When parsing a full revision XML (in this case[1]), using this code[2] (look at the try-catch, it writes when it fails) I correctly get the username, timestamp and revisionid, but sometimes the page title and the page id are None or an empty string.
>
> The first error is:
>     ['', None, 'QuartierLatin1968', '2004-10-10T04:24:14Z', '4267']
>
> But if we do:
>     7za e -bd -so kwwiki-20100926-pages-meta-history.xml.7z 2>/dev/null | egrep -i '2004-10-10T04:24:14Z' -C20
>
> We get this[3], which is OK: the page title and the page id are available in the XML, but not correctly parsed. And this is not the only page title and page id that fails.
>
> Perhaps I have missed something, because I'm learning to parse XML. Sorry in that case.
>
> Regards,
> emijrp
>
> [1] http://download.wikimedia.org/kwwiki/20100926/kwwiki-20100926-pages-meta-history.xml.7z
> [2] http://pastebin.ca/1951930
> [3] http://pastebin.ca/1951937
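For anyone who wants to cross-check the symptom described above without going through xmlreader.py, here is a minimal sketch that streams a pages-meta-history dump with the standard library and attaches the enclosing page's title and id to every revision. It is not how xmlreader.py works; the function names (`iter_revisions`, `local`, `text_of`) and the sample XML are assumptions made up for illustration. The revision fields in the sample echo the failing row quoted in the thread, while the page title, page id, contributor id and namespace URI are placeholders.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def text_of(elem):
    """Return the element's text, or None if the element is missing."""
    return elem.text if elem is not None else None


def iter_revisions(source):
    """Yield (title, page_id, username, timestamp, rev_id) for each revision.

    `source` is a filename or a binary file object containing dump XML.
    """
    path = []                 # local names of the currently open elements
    title = page_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            path.append(name)
            if name == "page":
                title = page_id = None   # forget values from the previous page
            continue
        path.pop()
        parent = path[-1] if path else None
        # Only <title> and <id> that sit directly under <page> describe the
        # page; <revision> and <contributor> carry their own <id> elements.
        if parent == "page" and name == "title":
            title = elem.text
        elif parent == "page" and name == "id":
            page_id = elem.text
        elif name == "revision":
            fields = {local(child.tag): child for child in elem}
            username = None
            contributor = fields.get("contributor")
            if contributor is not None:
                for child in contributor:
                    if local(child.tag) == "username":
                        username = child.text
            yield (title, page_id, username,
                   text_of(fields.get("timestamp")), text_of(fields.get("id")))
            elem.clear()      # drop the revision text to keep memory low


# Hypothetical sample: placeholder page title/id, revision values from the thread.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example page</title>
    <id>1</id>
    <revision>
      <id>4267</id>
      <timestamp>2004-10-10T04:24:14Z</timestamp>
      <contributor><username>QuartierLatin1968</username><id>7</id></contributor>
      <text>placeholder text</text>
    </revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for row in iter_revisions(io.BytesIO(SAMPLE.encode("utf-8"))):
        if not row[0] or row[1] is None:
            print("missing page title/id:", row)
        else:
            print(row)
```

Run against the decompressed dump (for example the output of the 7za command quoted above, saved to a file), any row reported as missing here would point at the dump itself, while rows that come out complete here but empty from xmlreader.py would point at the parser.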
Pywikipedia-l mailing list Pywikipedia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l