NOTE: Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.
In short, the DUMPS are broken again.
Jeff ----- Original Message ----- From: Jeffrey Vernon Merkey To: wikitech-l@wikimedia.org Sent: Monday, October 01, 2007 1:11 PM Subject: Duplicate Articles in Database Dump -- MYSQL errors
The last two runs of database dumps are severely broken and have a large number of duplicate titles which cause mysql to throw errors. I have managed to get some good translation runs and full imports, but not without a lot of work, and a lot of wasted time.
How about we fix the database dumps for those of us who need them and stop introfucing breakage.
Thanks. Latest errors and duplicate titles attached. I did not attach the full listing since this will exceed the limits of the mail server for a single message since there are hundreds of errors.
The 20070802 dumps do not seem to have this problem as bad. ERROR 1062 (23000) at line 96187: Duplicate entry '0-Nyquist_theorem' for key 2
ERROR 1062 (23000) at line 113658: Duplicate entry '0-Urysohn_lemma' for key 2
ERROR 1062 (23000) at line 114417: Duplicate entry '0-Turner_syndrome' for key 2
ERROR 1062 (23000) at line 116050: Duplicate entry '0-World_fair' for key 2
ERROR 1062 (23000) at line 125448: Duplicate entry '0-Dining_cryptographer_protocol' for key 2
ERROR 1062 (23000) at line 132820: Duplicate entry '0-Rothmund-Thompson_syndrome' for key 2
ERROR 1062 (23000) at line 136377: Duplicate entry '0-Hansen_disease' for key 2
ERROR 1062 (23000) at line 139130: Duplicate entry '0-Wilson_disease' for key 2
ERROR 1062 (23000) at line 147900: Duplicate entry '0-Falkner_Island' for key 2
ERROR 1062 (23000) at line 170901: Duplicate entry '0-Microsoft_.NET' for key 2
ERROR 1062 (23000) at line 184074: Duplicate entry '0-Ohm_Law' for key 2
ERROR 1062 (23000) at line 204307: Duplicate entry '0-Kaposi_Sarcoma' for key 2
ERROR 1062 (23000) at line 257406: Duplicate entry '0-Bell_inequality' for key 2
ERROR 1062 (23000) at line 289396: Duplicate entry '0-Hollywood_Walk_of_Fame' for key 2
ERROR 1062 (23000) at line 343974: Duplicate entry '0-Sgt._Pepper_Lonely_Hearts_Club_Band' for key 2
ERROR 1062 (23000) at line 350134: Duplicate entry '0-Boltzmann_constant' for key 2
ERROR 1062 (23000) at line 361687: Duplicate entry '0-Gauss_theorem' for key 2
ERROR 1062 (23000) at line 369568: Duplicate entry '0-Charles_law' for key 2
ERROR 1062 (23000) at line 374524: Duplicate entry '0-Lavender_Blue' for key 2
ERROR 1062 (23000) at line 375710: Duplicate entry '0-DeMorgan' for key 2
ERROR 1062 (23000) at line 378180: Duplicate entry '0-Kaposi_sarcoma' for key 2
ERROR 1062 (23000) at line 383521: Duplicate entry '0-Down_syndrome' for key 2
ERROR 1062 (23000) at line 384832: Duplicate entry '0-Maslow_hierarchy_of_needs' for key 2
ERROR 1062 (23000) at line 389129: Duplicate entry '0-Japan_copyright_law' for key 2
ERROR 1062 (23000) at line 397556: Duplicate entry '0-Sainsbury' for key 2
ERROR 1062 (23000) at line 409151: Duplicate entry '0-St._John' for key 2
ERROR 1062 (23000) at line 413091: Duplicate entry '0-Milgram_experiment' for key 2
ERROR 1062 (23000) at line 418176: Duplicate entry '0-Hudson_Bay_Company' for key 2
ERROR 1062 (23000) at line 419421: Duplicate entry '0-Tourette_syndrome' for key 2
ERROR 1062 (23000) at line 423098: Duplicate entry '0-Benny_Goodman_Orchestra' for key 2
ERROR 1062 (23000) at line 427487: Duplicate entry '0-Long_John_Silver' for key 2
ERROR 1062 (23000) at line 432944: Duplicate entry '0-Schroedinger_equation' for key 2
ERROR 1062 (23000) at line 442032: Duplicate entry '0-Hölder_inequality' for key 2
ERROR 1062 (23000) at line 444903: Duplicate entry '0-Jay_Treaty' for key 2
ERROR 1062 (23000) at line 446458: Duplicate entry '0-Back_River' for key 2
ERROR 1062 (23000) at line 453610: Duplicate entry '0-1980' for key 2
ERROR 1062 (23000) at line 478337: Duplicate entry '0-Hudson_Bay' for key 2
ERROR 1062 (23000) at line 479826: Duplicate entry '0-Hilbert_basis_theorem' for key 2
ERROR 1062 (23000) at line 484443: Duplicate entry '0-Saint_Mary' for key 2
Jeffrey Vernon Merkey wrote:
NOTE: Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.
Offhand I'd guess that your table schemas are wrong, using case-insensitive collation. Page title fields must be set as binary (varbinary or varchar binary) to ensure you don't get duplicate key errors.
Can you double-check?
-- brion vibber (brion @ wikimedia.org)
Jeffrey Vernon Merkey wrote:
NOTE: Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.
Offhand I'd guess that your table schemas are wrong, using case-insensitive collation. Page title fields must be set as binary (varbinary or varchar binary) to ensure you don't get duplicate key errors.
Can you double-check?
Table Schema's are those produced by tables.sql. I use the following script (passwords removed) to create the shell database. Opening the XML dump with hexedit reveals there are in fact a large number of duplicate titles.
I am guessing this may be due to the clustering setup you are using not checking for duplicate titles. <title>Ohm Law</title> is one example to check.
Here is the method I use for each MediaWiki version to setup the base tables.
mysqladmin drop endb --password=XXXX mysqladmin create endb --password=XXXX echo "grant all privileges on endb.* to wgchr@localhost identified by 'dhbowt';" | mysql --password=XXXX echo "flush privileges" | mysql --password=XXXX mysql --password=XXXX endb < /wikidump/en/maintenance/tables.sql mysql --password=XXXX endb < /wikidump/en/maintenance/wikipedia-interwiki.sql php maintenance/createBcrat.php WikiSysop XXXX php maintenance/changePassword.php --user=WikiSysop --password=XXXX
Jeff
-- brion vibber (brion @ wikimedia.org)
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org http://lists.wikimedia.org/mailman/listinfo/wikitech-l
jmerkey@wolfmountaingroup.com wrote:
Jeffrey Vernon Merkey wrote:
NOTE: Many of these errors are self-referencing #REDIRECT statements which cause database corruption if not applied in the right order.
Offhand I'd guess that your table schemas are wrong, using case-insensitive collation. Page title fields must be set as binary (varbinary or varchar binary) to ensure you don't get duplicate key errors.
Can you double-check?
Table Schema's are those produced by tables.sql. I use the following script (passwords removed) to create the shell database. Opening the XML dump with hexedit reveals there are in fact a large number of duplicate titles.
I am guessing this may be due to the clustering setup you are using not checking for duplicate titles. <title>Ohm Law</title> is one example to check.
Duplicate titles in the source should be impossible, as the title records are read straight out of the page table in a single pass.
I've confirmed:
[brion@benet 20070908]$ gzip -dc enwiki-20070908-stub-articles.xml.gz | grep '<title>Ohm Law</title>' <title>Ohm Law</title> [brion@benet 20070908]$ bzip2 -dc enwiki-20070908-pages-articles.xml.bz2 | grep '<title>Ohm Law</title>' <title>Ohm Law</title>
[brion@benet 20070908]$ gzip -dc enwiki-20070908-stub-meta-current.xml.gz | grep '<title>Ohm Law</title>' <title>Ohm Law</title> [brion@benet 20070908]$ bzip2 -dc enwiki-20070908-pages-meta-current.xml.bz2 | grep '<title>Ohm Law</title>' <title>Ohm Law</title>
So, no problem in the dump files. Can you confirm these are the affected files? I notice that you didn't mention which dump files you are using ("the latest", but of which wiki, and which data set?), nor have you told how you got SQL out of the XML dumps, with what software, what version of it, what if any options, what if any processing...
Another possibility might be if you're using a dump-to-SQL tool that's broken, perhaps in its handling of namespaces, or if you've done some processing on the XML dump file which damages the namespace list.
mwdumper, for instance, will require the <namespaces> info to properly split titles.
MW's internal importDump.php (which is much slower) may have unexpected results if your local namespaces don't match, particularly where some namespaces in the dump match local interwiki prefixes (eg 'Wikipedia:') and are not defined locally.
Yet another possibility is a corrupt download, where parts of the file have been duplicated within itself.
Or of course you may just be importing the dump twice by mistake somehow.
However given your note about #redirect entries, the most likely explanation is mismarked indexes.
Can you provide the exact 'page' table definition you're using?
Trunk's maintenance/tables.sql (as of r26282) defines page_title as:
-- The rest of the title, as text. -- Spaces are transformed into underscores in title storage. page_title varchar(255) binary NOT NULL,
which should be nicely binary-safe for sorting and unique index matches. Please confirm that your table has the same definition for the field.
Here is the method I use for each MediaWiki version to setup the base tables.
mysqladmin drop endb --password=XXXX mysqladmin create endb --password=XXXX echo "grant all privileges on endb.* to wgchr@localhost identified by 'dhbowt';" | mysql --password=XXXX echo "flush privileges" | mysql --password=XXXX mysql --password=XXXX endb < /wikidump/en/maintenance/tables.sql mysql --password=XXXX endb < /wikidump/en/maintenance/wikipedia-interwiki.sql php maintenance/createBcrat.php WikiSysop XXXX php maintenance/changePassword.php --user=WikiSysop --password=XXXX
Confirm the version and the table definition, and your XML-to-SQL conversion.
-- brion vibber (brion @ wikimedia.org)
wikitech-l@lists.wikimedia.org