Hi,
[I'm resending this message because the original didn't appear on the list, possibly because of incomplete bits of XML in the content, which I've removed in this version.]
I'm conducting research on content from all 250+ language versions of Wikipedia. I'm primarily concerned with the page, langlinks, redirect, pagelinks and categorylinks tables for each language version.
I've noticed that the SQL dumps of the redirect tables often contain fewer entries than they should. The problem affects both large and small wikis.
For a concrete, easy-to-reproduce example, take a small wiki, such as the Zulu language version, and compare the contents of the XML dump with the contents of the SQL dump:
$ bzcat zuwiki-20080206-pages-articles.xml.bz2 | grep '#REDIRECT' \
  | sed 's/<[^>]*>//g'
#REDIRECT [[Wikipedia:Sandbox]]
#REDIRECT [[IRiphabliki yaseNingizimu Afrika]]
#REDIRECT [[eThekwini]]
#REDIRECT [[IsiHolandi]]
#REDIRECT [[IGoli]]
#REDIRECT [[UBudha]]
#REDIRECT [[iShayina]]
#REDIRECT [[UJesu Krestu]]
#REDIRECT [[Ikhasi Elikhulu]]
#REDIRECT [[KwaXhosa]]
#REDIRECT [[iGauteng]]
#REDIRECT [[KwaXhosa]]
#REDIRECT[[Mošovce]]
#REDIRECT [[iJapani]]
#REDIRECT [[IsItalian]]
#REDIRECT [[ITheku]]
$ zcat zuwiki-20080206-redirect.sql.gz | grep ^INSERT
INSERT INTO `redirect` VALUES (1720,0,'EThekwini'),(2105,0,'IJapani'),(2376,0,'ITheku'),
(2334,0,'IsItalian'),(2157,0,'Medmaacher_Klaaf:Purodha'),
(2096,0,'Mošovce'),(2120,0,'User:CommonsDelinker'),
(2213,3,'Multichill');
Take a look at the XML dump to confirm that redirects like zu:Isi-Dutch -> zu:IsiHolandi do exist.
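The comparison between the two listings can also be sketched in Python. This is only a rough illustration: the data is copied from the two outputs quoted in this message, and `normalize` is a hypothetical helper that approximates MediaWiki title normalization (upper-case first letter, spaces to underscores) while ignoring namespace prefixes, which the real redirect table stores as a separate numeric column.

```python
import re

# Redirect lines as they appear in the XML dump (copied from the grep output).
xml_lines = [
    "#REDIRECT [[Wikipedia:Sandbox]]",
    "#REDIRECT [[IRiphabliki yaseNingizimu Afrika]]",
    "#REDIRECT [[eThekwini]]",
    "#REDIRECT [[IsiHolandi]]",
    "#REDIRECT [[IGoli]]",
    "#REDIRECT [[UBudha]]",
    "#REDIRECT [[iShayina]]",
    "#REDIRECT [[UJesu Krestu]]",
    "#REDIRECT [[Ikhasi Elikhulu]]",
    "#REDIRECT [[KwaXhosa]]",
    "#REDIRECT [[iGauteng]]",
    "#REDIRECT [[KwaXhosa]]",
    "#REDIRECT[[Mošovce]]",
    "#REDIRECT [[iJapani]]",
    "#REDIRECT [[IsItalian]]",
    "#REDIRECT [[ITheku]]",
]

# The INSERT statement from the SQL dump (copied from the zcat output).
sql_insert = (
    "INSERT INTO `redirect` VALUES (1720,0,'EThekwini'),(2105,0,'IJapani'),"
    "(2376,0,'ITheku'),(2334,0,'IsItalian'),(2157,0,'Medmaacher_Klaaf:Purodha'),"
    "(2096,0,'Mošovce'),(2120,0,'User:CommonsDelinker'),(2213,3,'Multichill');"
)


def normalize(title):
    """Assumed approximation of MediaWiki title normalization."""
    title = title.strip().replace(" ", "_")
    return title[:1].upper() + title[1:]


def xml_targets(lines):
    """Collect normalized redirect targets from '#REDIRECT [[...]]' lines."""
    targets = set()
    for line in lines:
        m = re.match(r"#REDIRECT\s*\[\[([^\]|#]+)", line, re.IGNORECASE)
        if m:
            targets.add(normalize(m.group(1)))
    return targets


def sql_targets(insert):
    """Collect the title column from each (id, namespace, 'title') tuple."""
    rows = re.findall(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)", insert)
    return {title for _, _, title in rows}


# Targets present in the XML dump but absent from the SQL dump.
missing = sorted(xml_targets(xml_lines) - sql_targets(sql_insert))
for title in missing:
    print(title)
```

Because the sketch matches on target titles only, it says nothing about which source page each redirect belongs to; it merely shows which targets never appear in the SQL dump.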
Besides the obvious "Can you fix it?" question, could you tell me how the SQL dumps of the redirect tables are generated? Maybe the tables are OK and I'm missing something.
Currently I'm processing the XML dumps to generate the correct SQL dumps for redirect tables (an easy modification of mwdumper introducing a new subclass of SqlWriter solves the problem).
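For illustration, the general idea of pulling redirects out of an XML dump might look like the Python sketch below. It is not the mwdumper modification itself (that one is written in Java). The function name `iter_redirects`, the sample page ids and the simplified redirect pattern are all assumptions made for this example, and resolving namespace prefixes into numeric namespace ids is left out.

```python
import io
import re
import xml.etree.ElementTree as ET

# Simplified pattern: captures the target up to the first ']', '|' or '#'.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_redirects(stream):
    """Yield (page_id, title, target) for every redirect page in the dump."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {}
        for child in elem.iter():
            name = local(child.tag)
            # Keep the first occurrence only: the page <id> precedes the
            # revision and contributor <id> elements in document order.
            if name not in fields:
                fields[name] = child.text or ""
        m = REDIRECT_RE.match(fields.get("text", ""))
        if m:
            yield int(fields["id"]), fields["title"], m.group(1).strip()
        elem.clear()  # free memory for pages already processed


# Tiny hand-written sample in the export format; the ids are made up.
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page>
    <title>Isi-Dutch</title>
    <id>1</id>
    <revision><id>10</id><text>#REDIRECT [[IsiHolandi]]</text></revision>
  </page>
  <page>
    <title>IsiHolandi</title>
    <id>2</id>
    <revision><id>11</id><text>Article text.</text></revision>
  </page>
</mediawiki>"""

redirects = list(iter_redirects(io.BytesIO(sample.encode("utf-8"))))
print(redirects)
```

For a real dump, the stream could be opened with `bz2.open(path, "rb")` from the standard library instead of the in-memory sample.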
Regards,
Lukasz