Wikihackers,
You may remember me from last year, pushing for UTF-8 on the English
wikipedia. There was more to it than I thought at the time, so I
retreated, checking the site's encoding every few weeks to see if anyone
had made a breakthrough. Alas, here we are, still chugging along in
trusty old ISO-8859-1.
Knowing that shutting down the site for a few days is unacceptable, I'm
working on a technique to keep the site up, already in UTF-8 mode,
while 'cur' and 'old' are converted. The site itself would be doing the
bulk of the conversion. The idea is to intercept calls to these tables
and convert the encoding before they are sent to the user. If a
conversion is made, save it to the database so we don't have to convert
it again. Here's an outline:
1. Take site offline. :(
2. Convert smaller, essential tables. (user? what else?)
3. Apply code changes to intercept retrievals from cur and old:
Change A: For _title queries that don't return anything, try converting
querystring to Latin1.
Change B: Check the _text (and _user_text ?) strings to see if they are
valid UTF-8. If not, convert and update the database.
Change C: $wgUseLatin1 = false !
4. Take site back online. :D
5. When the dust settles, start an external script to run through the
database and update any unpopular entries that haven't been converted.
I've written some code for the cur_text intercept to make sure that it
will work. It's running well for me on a test DB I made, a simple
English DB converted to Latin1. It's only for the cur_text field right
now, but it should work the others if expanded. I haven't looked at the
problem with titles (Change A), but I think it could be similarly tackled.
Before I continue, I wonder if anyone sees catastrophic problems with
this approach? My familiarity with the codebase is still pretty spotty.
Here's the patch:
Index: Database.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Database.php,v
retrieving revision 1.71.2.7
diff --unified=3 -w -B -r1.71.2.7 Database.php
--- Database.php 14 Jan 2005 13:04:04 -0000 1.71.2.7
+++ Database.php 8 Feb 2005 20:10:37 -0000
@@ -681,11 +681,37 @@
}
$obj = $this->fetchObject( $res );
$this->freeResult( $res );
+
+ if ($conds['cur_id'])
+ $this->checkEncoding($conds['cur_id'], $obj);
+
return $obj;
}
/**
+ * Checks if the object's encoding is UTF-8. If it is not, convert to
UTF-8 and update
+ * the database record as well.
+ *
+ * @param $id the cur_id or old_id
+ * @param $obj object whose fields will be verified and possibly updated
+ */
+ function checkEncoding($id, $obj) {
+
+ // If a conversion from UTF-8 to UTF-8 does not result in the same
string, it is
+ // not a valid UTF-8 string.
+ if ($obj->cur_text != iconv("UTF-8", "UTF-8",
$obj->cur_text)) {
+
+ // Assume it is Windows-1252 (Latin1 + em dashes etc., then convert.
+ $obj->cur_text = iconv("Windows-1252", "UTF-8",
$obj->cur_text);
+
+ //save converted string to db
+ $cur_text = mysql_real_escape_string($obj->cur_text);
+ $this->query("UPDATE cur SET cur_text = '$cur_text' WHERE cur_id =
$id");
+ }
+ }
+
+ /**
* Removes most variables from an SQL query and replaces them with X
or N for numbers.
* It's only slightly flawed. Don't use for anything important.
*
Show replies by date
We'll be converting to UTF-8 when we restructure the database for the
1.5 upgrade. This will be sometime in the next couple of months.
1.5 supports on the fly conversion of the old text blobs on load, so
what needs to be converted is the usernames, titles, comments, etc in
the other support tables. This should be a relatively minor expense of
the overall conversion (which is necessary as it will be removing some
of the nasty bottlenecks of our current database layout).
-- brion vibber (brion @
pobox.com)