Hi,
[I'm resending the message, the original one didn't appear on the list. Maybe because of incomplete bits of XML in the content, which I've removed in this version.]
I'm conducting research on content from all 250+ language versions of Wikipedia. I'm primarily concerned with the page, langlinks, redirect, pagelinks and categorylinks tables for each language version.
I've noticed that the SQL dumps of the redirect tables often contain fewer entries than they should. The problem applies to both large and small wikis.
For a concrete, easy-to-reproduce example, take a small wiki, like the Zulu language version, and compare the contents of the XML dump with the contents of the SQL dump:
$ bzcat zuwiki-20080206-pages-articles.xml.bz2 | grep '#REDIRECT' \
  | sed 's/<[^>]*>//g'
#REDIRECT [[Wikipedia:Sandbox]]
#REDIRECT [[IRiphabliki yaseNingizimu Afrika]]
#REDIRECT [[eThekwini]]
#REDIRECT [[IsiHolandi]]
#REDIRECT [[IGoli]]
#REDIRECT [[UBudha]]
#REDIRECT [[iShayina]]
#REDIRECT [[UJesu Krestu]]
#REDIRECT [[Ikhasi Elikhulu]]
#REDIRECT [[KwaXhosa]]
#REDIRECT [[iGauteng]]
#REDIRECT [[KwaXhosa]]
#REDIRECT[[Mošovce]]
#REDIRECT [[iJapani]]
#REDIRECT [[IsItalian]]
#REDIRECT [[ITheku]]
$ zcat zuwiki-20080206-redirect.sql.gz | grep ^INSERT
INSERT INTO `redirect` VALUES (1720,0,'EThekwini'),(2105,0,'IJapani'),
(2376,0,'ITheku'),(2334,0,'IsItalian'),(2157,0,'Medmaacher_Klaaf:Purodha'),
(2096,0,'Mošovce'),(2120,0,'User:CommonsDelinker'),(2213,3,'Multichill');
Take a look at the XML dump to confirm that redirects like zu:Isi-Dutch -> zu:IsiHolandi do exist.
Besides the obvious "Can you fix it?" question, could you tell me how the SQL dumps of the redirect tables are generated? Maybe the tables are OK and I'm missing something.
Currently I'm processing the XML dumps to generate correct SQL dumps of the redirect tables myself (an easy modification of mwdumper, introducing a new subclass of SqlWriter, solves the problem).
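For illustration, the extraction step can be sketched roughly as follows. mwdumper itself is Java; this is an independent Python sketch with simplified assumptions (the real export XML qualifies tags with a namespace, which is omitted here, and the regex is only a rough approximation of MediaWiki's redirect parsing):

```python
import re
import xml.etree.ElementTree as ET

# Matches a redirect directive at the start of the wikitext, tolerating
# a missing space after #REDIRECT (as in the Mošovce entry above).
REDIRECT_RE = re.compile(r'^#REDIRECT\s*\[\[([^\]|]+)', re.IGNORECASE)

def extract_redirects(xml_text):
    """Yield (source_title, target_title) pairs for redirect pages
    found in a pages-articles XML dump (namespace handling omitted)."""
    root = ET.fromstring(xml_text)
    for page in root.iter('page'):
        title = page.findtext('title')
        text = page.findtext('./revision/text') or ''
        m = REDIRECT_RE.match(text.strip())
        if m:
            yield title, m.group(1).strip()

# Tiny hand-made sample, not taken from a real dump.
sample = """<mediawiki>
  <page><title>Isi-Dutch</title>
    <revision><text>#REDIRECT [[IsiHolandi]]</text></revision></page>
  <page><title>Zulu</title>
    <revision><text>An ordinary article.</text></revision></page>
</mediawiki>"""

for source, target in extract_redirects(sample):
    # rd_title stores the target with spaces replaced by underscores.
    print("(%s,0,'%s')" % (source, target.replace(' ', '_')))
```

A real SqlWriter subclass would additionally map the source title to its page_id and resolve the target's namespace number before emitting the row.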
Regards, Lukasz
On Mon, Mar 3, 2008 at 2:06 PM, Lukasz Bolikowski l.bolikowski@icm.edu.pl wrote:
I've noticed that the SQL dumps of the redirect tables often contain fewer entries than they should. The problem applies to both large and small wikis.
This is because the redirect table didn't always exist. Legacy redirects (those that have not been changed in the last... maybe year?) will only have page.page_is_redirect set to indicate that they're a redirect. The old way was to just parse the text of the page for a target.
It's possible to fix this using maintenance/refreshLinks.php, but nobody's done it on Wikimedia wikis, apparently. It would be a pretty big operation on the large ones, of course, but I don't think it should be unmanageable at all, assuming the script is sane and only tries parsing pages with page_is_redirect but no redirect table row. You'd have to ask someone with shell access to do it.
There seems to be a Bugzilla entry for that: https://bugzilla.wikimedia.org/show_bug.cgi?id=10931
2008/3/4, Simetrical Simetrical+wikilist@gmail.com:
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l