Hi,
[I'm resending the message, the original one didn't appear
on the list. Maybe because of incomplete bits of XML in
the content, which I've removed in this version.]
I'm conducting research on content from all 250+ language
versions of Wikipedia. I'm primarily concerned with the page,
langlinks, redirect, pagelinks and categorylinks tables for
each language version.
I've noticed that SQL dumps of the redirect tables often
contain fewer entries than they should. The problem applies
to both large and small wikis.
For a concrete, easy-to-reproduce example, take a small wiki,
like the Zulu language version, and compare the contents of
the XML dump with the contents of the SQL dump:
$ bzcat zuwiki-20080206-pages-articles.xml.bz2 | grep '#REDIRECT' \
| sed 's/<[^>]*>//g'
#REDIRECT [[Wikipedia:Sandbox]]
#REDIRECT [[IRiphabliki yaseNingizimu Afrika]]
#REDIRECT [[eThekwini]]
#REDIRECT [[IsiHolandi]]
#REDIRECT [[IGoli]]
#REDIRECT [[UBudha]]
#REDIRECT [[iShayina]]
#REDIRECT [[UJesu Krestu]]
#REDIRECT [[Ikhasi Elikhulu]]
#REDIRECT [[KwaXhosa]]
#REDIRECT [[iGauteng]]
#REDIRECT [[KwaXhosa]]
#REDIRECT[[Mošovce]]
#REDIRECT [[iJapani]]
#REDIRECT [[IsItalian]]
#REDIRECT [[ITheku]]
$ zcat zuwiki-20080206-redirect.sql.gz | grep ^INSERT
INSERT INTO `redirect` VALUES
(1720,0,'EThekwini'),(2105,0,'IJapani'),(2376,0,'ITheku'),
(2334,0,'IsItalian'),(2157,0,'Medmaacher_Klaaf:Purodha'),
(2096,0,'Mošovce'),(2120,0,'User:CommonsDelinker'),
(2213,3,'Multichill');
Take a look at the XML dump to confirm that redirects
like zu:Isi-Dutch -> zu:IsiHolandi do exist, even though
they are absent from the SQL dump above.
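
The comparison above can also be scripted. What follows is a rough sketch, not a definitive tool: it assumes the redirect table rows have the form (rd_from, rd_namespace, 'rd_title'), approximates title normalization by replacing spaces with underscores and upper-casing the first letter, and ignores namespace prefixes entirely (so a target such as Wikipedia:Sandbox will not match a row stored as namespace 4, 'Sandbox'). The helper names are made up for illustration.

```python
import re

# "#REDIRECT [[Target]]", whitespace optional, case-insensitive.
REDIRECT_RE = re.compile(r"#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)
# One "(rd_from,rd_namespace,'rd_title')" tuple (assumed row layout).
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def normalize(title):
    """Approximate title normalization: underscores, upper-case first letter."""
    title = title.strip().replace(" ", "_")
    return title[:1].upper() + title[1:]


def xml_redirect_targets(lines):
    """Collect normalized redirect targets from lines of an XML dump."""
    targets = set()
    for line in lines:
        match = REDIRECT_RE.search(line)
        if match:
            targets.add(normalize(match.group(1)))
    return targets


def sql_redirect_targets(sql_text):
    """Collect rd_title values from the text of a redirect SQL dump."""
    targets = set()
    for _rd_from, _rd_namespace, raw_title in TUPLE_RE.findall(sql_text):
        # Undo backslash escaping such as \' inside the quoted title.
        targets.add(re.sub(r"\\(.)", r"\1", raw_title))
    return targets


# Small excerpt of the zuwiki data quoted above.
xml_lines = [
    "#REDIRECT [[eThekwini]]",
    "#REDIRECT [[IsiHolandi]]",
    "#REDIRECT[[Mošovce]]",
]
sql_text = "INSERT INTO `redirect` VALUES (1720,0,'EThekwini'),(2096,0,'Mošovce');"

missing = sorted(xml_redirect_targets(xml_lines) - sql_redirect_targets(sql_text))
print(missing)  # ['IsiHolandi']
```

For the real files, the lines could be read with bz2.open(..., "rt", encoding="utf-8") and gzip.open(..., "rt", encoding="utf-8") from the Python standard library instead of the inline samples.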
Besides the obvious "Can you fix it?" question, could
you tell me how the SQL dumps of redirect tables are
generated? Maybe the tables are OK, and I'm missing
something.
Currently I'm processing the XML dumps to generate
the correct SQL dumps for redirect tables (an easy
modification of mwdumper introducing a new subclass
of SqlWriter solves the problem).
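
For readers without mwdumper at hand, the same idea can be sketched in Python. This is not the mwdumper modification mentioned above, only an illustration under stated assumptions: the XML export has page elements with id and revision/text children, the redirect table row is (rd_from, rd_namespace, rd_title), and the namespaces argument is a hypothetical prefix-to-number map that would in practice be built from the siteinfo section of the dump.

```python
import io
import re
import xml.etree.ElementTree as ET

REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def local(tag):
    """Strip the XML namespace URI from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_namespace(target, namespaces):
    """Split 'Prefix:Title' into (namespace number, title); default to 0."""
    prefix, sep, rest = target.partition(":")
    key = prefix.strip().replace(" ", "_").lower()
    if sep and key in namespaces:
        return namespaces[key], rest
    return 0, target


def redirect_rows(xml_file, namespaces):
    """Yield (rd_from, rd_namespace, rd_title) for every redirect page."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        page_id, text = None, ""
        for child in elem:  # direct children only, so this is the page id
            name = local(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        text = sub.text or ""
        match = REDIRECT_RE.match(text)
        if match and page_id is not None:
            ns, title = split_namespace(match.group(1), namespaces)
            title = title.strip().replace(" ", "_")
            yield page_id, ns, title[:1].upper() + title[1:]
        elem.clear()  # free the finished page subtree


def to_insert(rows):
    """Format rows as a single INSERT statement (empty string if no rows)."""
    rows = list(rows)
    if not rows:
        return ""
    values = ",".join(
        "({},{},'{}')".format(
            rd_from, ns, title.replace("\\", "\\\\").replace("'", "\\'")
        )
        for rd_from, ns, title in rows
    )
    return "INSERT INTO `redirect` VALUES {};".format(values)


# Tiny made-up dump; the page ids here are illustrative, not real zuwiki ids.
sample = io.BytesIO(
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    "<page><title>Isi-Dutch</title><id>10</id><revision><id>99</id>"
    "<text>#REDIRECT [[IsiHolandi]]</text></revision></page>"
    "<page><title>Other</title><id>11</id><revision><id>100</id>"
    "<text>Plain article text</text></revision></page>"
    "</mediawiki>".encode("utf-8")
)
statement = to_insert(redirect_rows(sample, {"wikipedia": 4}))
print(statement)
```

The page is parsed as a stream so that large dumps do not have to fit in memory; a bz2.open(path, "rb") file object can be passed in place of the in-memory sample.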
Regards,
Lukasz