Wikihackers,
You may remember me from last year, pushing for UTF-8 on the English Wikipedia. There was more to it than I thought at the time, so I retreated, checking the site's encoding every few weeks to see if anyone had made a breakthrough. Alas, here we are, still chugging along in trusty old ISO-8859-1.
Knowing that shutting down the site for a few days is unacceptable, I'm working on a technique to keep the site up, already in UTF-8 mode, while 'cur' and 'old' are converted. The site itself would be doing the bulk of the conversion. The idea is to intercept reads from these tables and convert the text to UTF-8 before it is sent to the user. If a conversion is made, it is saved back to the database so the same row never has to be converted again. Here's an outline:
1. Take site offline. :(
2. Convert smaller, essential tables. (user? what else?)
3. Apply code changes to intercept retrievals from cur and old:
   Change A: For _title queries that don't return anything, try converting the query string to Latin1.
   Change B: Check the _text (and _user_text?) strings to see if they are valid UTF-8. If not, convert and update the database.
   Change C: $wgUseLatin1 = false!
4. Take site back online. :D
5. When the dust settles, start an external script to run through the database and update any unpopular entries that haven't been converted.
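For step 5, the external script could be little more than a loop over batches of rows that applies the same check as the intercept. Here is a minimal sketch of that idea, with the database access left out. The names toUtf8IfNeeded and sweepBatch are hypothetical, and the fallback to ISO-8859-1 is my own assumption: some iconv builds may refuse the handful of bytes that Windows-1252 leaves undefined.

```php
<?php
// Hypothetical helper: returns the UTF-8 form of $text, or null if $text
// is already valid UTF-8 and needs no update.
function toUtf8IfNeeded($text) {
	// A UTF-8 to UTF-8 round trip only reproduces the input when the input
	// is valid UTF-8; @ hides the notice iconv raises on bad bytes.
	if (@iconv('UTF-8', 'UTF-8', $text) === $text) {
		return null;
	}
	// Same assumption as the intercept: legacy text is Windows-1252.
	$converted = @iconv('Windows-1252', 'UTF-8', $text);
	if ($converted === false) {
		// Assumed fallback: every byte is defined in ISO-8859-1.
		$converted = iconv('ISO-8859-1', 'UTF-8', $text);
	}
	return $converted;
}

// Hypothetical sweep over one batch of rows, given as id => text.
// Returns only the rows that need an UPDATE, as id => converted text.
function sweepBatch(array $rows) {
	$updates = array();
	foreach ($rows as $id => $text) {
		$fixed = toUtf8IfNeeded($text);
		if ($fixed !== null) {
			$updates[$id] = $fixed;
		}
	}
	return $updates;
}
```

The real script would fetch each batch by id range, issue one UPDATE per entry returned by sweepBatch, and sleep between batches to keep the load down. Since rows that are already valid UTF-8 are skipped, the sweep can be stopped and restarted at any time.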
I've written some code for the cur_text intercept to make sure that it will work. It's running well for me on a test DB I made, a simple English DB converted to Latin1. It only handles the cur_text field right now, but it should work for the others if expanded. I haven't looked at the problem with titles (Change A), but I think it could be tackled similarly.
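To make Change A a bit more concrete, here is a rough sketch of the title fallback. This is untested against the real codebase. findByTitleWithFallback and the $lookup callable are hypothetical stand-ins for whatever actually runs the _title query, and I'm assuming legacy titles are Windows-1252, as the text intercept does.

```php
<?php
// Hypothetical wrapper around a title lookup. $lookup is any callable that
// takes a title string and returns a row, or false when nothing matches.
function findByTitleWithFallback($title, $lookup) {
	// First try the title as requested (UTF-8 once the site is switched).
	$row = call_user_func($lookup, $title);
	if ($row !== false) {
		return $row;
	}
	// No match: the row may still carry its legacy-encoded title.
	$legacy = @iconv('UTF-8', 'Windows-1252', $title);
	// Give up if the title cannot be expressed in the legacy encoding, or
	// if it is pure ASCII and therefore identical in both encodings.
	if ($legacy === false || $legacy === $title) {
		return false;
	}
	return call_user_func($lookup, $legacy);
}
```

When the fallback does find a row, that would be the natural place to rewrite the stored title as UTF-8, just as Change B does for the text.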
Before I continue, I wonder if anyone sees catastrophic problems with this approach? My familiarity with the codebase is still pretty spotty.
Here's the patch:
Index: Database.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Database.php,v
retrieving revision 1.71.2.7
diff --unified=3 -w -B -r1.71.2.7 Database.php
--- Database.php	14 Jan 2005 13:04:04 -0000	1.71.2.7
+++ Database.php	8 Feb 2005 20:10:37 -0000
@@ -681,11 +681,37 @@
 		}
 		$obj = $this->fetchObject( $res );
 		$this->freeResult( $res );
+
+		if (is_array($conds) && !empty($conds['cur_id']))
+			$this->checkEncoding($conds['cur_id'], $obj);
+
 		return $obj;
 
 	}
 
 	/**
+	 * Checks if the object's encoding is UTF-8. If it is not, convert to UTF-8 and update
+	 * the database record as well.
+	 *
+	 * @param $id the cur_id or old_id
+	 * @param $obj object whose fields will be verified and possibly updated
+	 */
+	function checkEncoding($id, $obj) {
+
+		// If a conversion from UTF-8 to UTF-8 does not result in the same string, it is
+		// not a valid UTF-8 string.
+		if ($obj->cur_text !== iconv("UTF-8", "UTF-8", $obj->cur_text)) {
+
+			// Assume it is Windows-1252 (Latin1 + em dashes etc.), then convert.
+			$obj->cur_text = iconv("Windows-1252", "UTF-8", $obj->cur_text);
+
+			// Save the converted string to the db.
+			$cur_text = mysql_real_escape_string($obj->cur_text);
+			$this->query("UPDATE cur SET cur_text = '$cur_text' WHERE cur_id = " . intval($id));
+		}
+	}
+
+	/**
 	 * Removes most variables from an SQL query and replaces them with X or N for numbers.
 	 * It's only slightly flawed. Don't use for anything important.
 	 *