Jeffrey Vernon Merkey <jmerkey(a)wolfmountaingroup.com> wrote:
NOTE: Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.
Offhand I'd guess that your table schemas are wrong, using case-insensitive collation. Page title fields must be set as binary (varbinary or varchar binary) to ensure you don't get duplicate key errors. Can you double-check?
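For example, something along these lines should show what collation the title column actually ended up with (a quick sketch; substitute your own database name and credentials, I'm reusing the 'endb' name and mysql invocation style from your script):

echo "SHOW FULL COLUMNS FROM page LIKE 'page_title';" | mysql --password=XXXX endb

If the Collation column reports something ending in _ci rather than _bin (or binary), that would explain the duplicate key errors.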
Table schemas are those produced by tables.sql. I use the following script (passwords removed) to create the shell database. Opening the XML dump with hexedit reveals there are in fact a large number of duplicate titles.

I am guessing this may be due to the clustering setup you are using not checking for duplicate titles. <title>Ohm Law</title> is one example to check.
Duplicate titles in the source should be impossible, as the title
records are read straight out of the page table in a single pass.
I've confirmed:

[brion@benet 20070908]$ gzip -dc enwiki-20070908-stub-articles.xml.gz | grep '<title>Ohm Law</title>'
<title>Ohm Law</title>
[brion@benet 20070908]$ bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | grep '<title>Ohm Law</title>'
<title>Ohm Law</title>
[brion@benet 20070908]$ gzip -dc enwiki-20070908-stub-meta-current.xml.gz | grep '<title>Ohm Law</title>'
<title>Ohm Law</title>
[brion@benet 20070908]$ bzip2 -dc enwiki-20070908-pages-meta-current.xml.bz2 | grep '<title>Ohm Law</title>'
<title>Ohm Law</title>
So, no problem in the dump files. Can you confirm these are the affected files? I notice that you didn't mention which dump files you are using ("the latest", but of which wiki, and which data set?), nor have you said how you got SQL out of the XML dumps: with what software, what version of it, what options if any, and what processing if any.
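If you want to rule out duplicates in your copy more thoroughly than grepping one title, a quick sort/uniq pass over the title lines should do it (a small sketch; it relies on each <title> element sitting on its own line, which is how the dumps are written):

bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | grep '<title>' | sort | uniq -d

No output means no byte-for-byte duplicate titles in the file.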
Another possibility is that you're using a dump-to-SQL tool that's broken, perhaps in its handling of namespaces, or that you've done some processing on the XML dump file which damages the namespace list. mwdumper, for instance, requires the <namespaces> info to split titles properly.
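For reference, a plain mwdumper run looks roughly like this (a sketch only; adjust the jar path, output format version, and credentials to your setup):

java -jar mwdumper.jar --format=sql:1.5 enwiki-20070908-pages-articles.xml.bz2 | mysql --password=XXXX endb

If anything else touches the XML before mwdumper sees it, that preprocessing step is worth ruling out as well.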
MW's internal importDump.php (which is much slower) may have unexpected results if your local namespaces don't match, particularly where some namespaces in the dump match local interwiki prefixes (e.g. 'Wikipedia:') and are not defined locally.
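If you are going that route, the usual invocation just feeds the XML on stdin, decompressing on the fly (a sketch, run from the wiki root):

bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | php maintenance/importDump.php

followed by php maintenance/rebuildrecentchanges.php once it finishes.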
Yet another possibility is a corrupt download, where parts of the file
have been duplicated within itself.
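That one is easy to rule out by comparing your local files against the checksum list published in the same dump directory (a sketch; I'm assuming the usual md5sums file name for that dump, which you may need to re-fetch):

md5sum enwiki-20070908-pages-articles.xml.bz2
grep pages-articles.xml.bz2 enwiki-20070908-md5sums.txt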
Or of course you may just be importing the dump twice by mistake somehow.
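A rough sanity check for that is to compare the number of <page> elements in the dump against the number of rows that actually landed in the page table (again substituting your own database name and credentials):

bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | grep -c '<page>'
echo "SELECT COUNT(*) FROM page;" | mysql --password=XXXX endb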
However, given your note about #REDIRECT entries, the most likely explanation is mis-defined indexes (a non-binary collation on the title field).
Can you provide the exact 'page' table definition you're using?
Trunk's maintenance/tables.sql (as of r26282) defines page_title as:
-- The rest of the title, as text.
-- Spaces are transformed into underscores in title storage.
page_title varchar(255) binary NOT NULL,
which should be nicely binary-safe for sorting and unique index matches.
Please confirm that your table has the same definition for the field.
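The quickest way to get the live definition out of MySQL is something like this (substituting your own database name and credentials):

echo "SHOW CREATE TABLE page;" | mysql --password=XXXX endb

The page_title line in that output should show a binary type or a *_bin collation; anything ending in _ci would line up with the duplicate key errors you're seeing.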
Here is the method I use for each MediaWiki version to set up the base tables.

mysqladmin drop endb --password=XXXX
mysqladmin create endb --password=XXXX
echo "grant all privileges on endb.* to wgchr@localhost identified by 'dhbowt';" | mysql --password=XXXX
echo "flush privileges" | mysql --password=XXXX
mysql --password=XXXX endb < /wikidump/en/maintenance/tables.sql
mysql --password=XXXX endb < /wikidump/en/maintenance/wikipedia-interwiki.sql
php maintenance/createBcrat.php WikiSysop XXXX
php maintenance/changePassword.php --user=WikiSysop --password=XXXX
Please confirm the MediaWiki version, the table definition, and your XML-to-SQL conversion.
-- brion vibber (brion @ wikimedia.org)