I recently dumped (well, someone else dumped it) the data from a MediaWiki database stored in a MySQL v3.23.58 server and imported it into a v4.0.21 server.
On the new instance of the wiki, running on MySQL v4.0.21, I noticed that some pages weren't displaying their text properly and that, on editing such a page, the 'bad' data was replaced with a number of question marks. Without saving the edit, I checked the database and found that the characters there were garbled, and that the garbling originated in the dump file (so it was probably introduced by the dump process). I temporarily do not have access to the original site, so I can't verify the exact original data, but it was served correctly there just before that server was taken offline and the dump produced.
Is this a known issue and is there a way to prevent/correct it? I know that MySQL v4.0.x is recommended for MediaWiki but the reasons given seem to be related to performance.
I saw a note in the list archive related to similar issues with MySQL v4.1.x,
http://mail.wikipedia.org/pipermail/mediawiki-l/2004-November/002245.html
but I'm not certain this is exactly the same issue, since the dump didn't actually turn the characters into question marks and I can't positively identify the characters from the original database that were corrupted. I think one of them was 0xe8 or 0xe9 (è or é, if those display correctly in this email -- &egrave; or &eacute; in HTML encoding) but, being a hopeless English-speaking monoglot, I don't know for sure which would have been used.
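For anyone trying to pin down bytes like these, here is a minimal Python sketch of how 0xe9 behaves under the two encodings in question. The helper name `describe` is made up for illustration, and this only shows the general Latin-1 versus UTF-8 behaviour; it is not a diagnosis of what happened to this particular dump.

```python
def describe(raw: bytes) -> dict:
    """Decode the same raw bytes as Latin-1 and, where possible, as UTF-8."""
    result = {"latin1": raw.decode("latin-1")}  # Latin-1 accepts every byte value
    try:
        result["utf8"] = raw.decode("utf-8")
    except UnicodeDecodeError:
        result["utf8"] = None  # the bytes are not valid UTF-8 on their own
    return result


# A lone 0xE9 byte is "é" in Latin-1 but is not valid UTF-8.
single = describe(b"\xe9")

# The UTF-8 encoding of "é" is two bytes (0xC3 0xA9); read as Latin-1
# those two bytes show up as two unrelated characters.
double = describe("é".encode("utf-8"))

print(single)
print(double)
```

If the garbled text in the database looks like pairs of accented capitals and symbols where a single accented letter should be, that would be consistent with UTF-8 data being read as Latin-1 somewhere along the way.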
John Blumel
John Blumel wrote:
> I recently dumped (well, someone else dumped it) the data from a MediaWiki database stored in a MySQL v3.23.58 server and imported it into a v4.0.21 server.
> I noticed on the new instance of the wiki, running on MySQL v4.0.21, that there were some pages where the text wasn't displaying properly and on Editing the page, the 'bad' data was replaced with a number of question marks. Without saving the Edit, I noticed that characters in the database were garbled (I temporarily do not have access to the original site so I can't verify the exact original data but it was served correctly there just before that server was taken off-line and the dump produced.) and that the "garbling" originated in the dump file (probably created in the dump process).
Using the same encoding?
Latin-1 or UTF-8 wiki?
-- brion vibber (brion @ pobox.com)
On Apr 1, 2005, at 5:03pm, Brion Vibber wrote:
> Using the same encoding?
> Latin-1 or UTF-8 wiki?
Do you mean using the same encoding on the import as the dump? Or, was the correct encoding (to match the DB) used on the dump (or import)? Or something else?
I'll have to check on exactly how the dump was generated. I've been using CocoaMySQL (Mac OS X) to do the imports and have tried both encodings, although perhaps I should try it from the command line to make sure CocoaMySQL is doing what I think it is. (The dump was generated on a Linux system, although I wouldn't think that would make a difference.)
As to the wikis, the following are set to their DefaultSettings.php values:
    $wgInputEncoding  = 'ISO-8859-1'; # LanguageUtf8.php normally overrides this
    $wgOutputEncoding = 'ISO-8859-1'; # unless you set the next option to true:
    $wgUseLatin1      = false;        # Enable ISO-8859-1 compatibility mode
    $wgEditEncoding   = '';
But it sounds like I need to learn a bit more about this.
Assuming you can give an answer based on the information I've provided, what is the correct way to do the dumps and imports? (URL or page name on Meta?)
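In the meantime, one way I could check which encoding the dump file actually holds is to scan it for non-ASCII bytes and see whether they form valid UTF-8. The following is only a rough Python sketch: the function names and the file name `wiki_dump.sql` are hypothetical, and a line that fails the UTF-8 check might be Latin-1 or might simply be damaged already.

```python
def classify_high_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'not-utf-8' for a chunk of dump data."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"  # possibly Latin-1, or already-damaged data
    return "utf-8"


def scan_dump(path: str) -> dict:
    """Count the lines of a dump file that fall into each category."""
    counts = {"ascii": 0, "utf-8": 0, "not-utf-8": 0}
    # Binary mode, so Python does no decoding of its own. Splitting on
    # newlines is safe because UTF-8 multi-byte sequences never contain 0x0A.
    with open(path, "rb") as fh:
        for line in fh:
            counts[classify_high_bytes(line)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Hypothetical default file name; pass the real dump path as an argument.
    print(scan_dump(sys.argv[1] if len(sys.argv) > 1 else "wiki_dump.sql"))
```

A dump with many "not-utf-8" lines would suggest Latin-1 (or mixed) data, while a dump that is entirely valid UTF-8 can still be wrong if the text was converted twice, so the counts are a hint rather than proof.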
John Blumel
mediawiki-l@lists.wikimedia.org