UTF-8 for en.wikipedia: convert as you go? - Wikitech-l

8 Feb 2005

Wikihackers,

You may remember me from last year, pushing for UTF-8 on the English 
Wikipedia. There was more to it than I thought at the time, so I 
retreated, checking the site's encoding every few weeks to see if anyone 
had made a breakthrough. Alas, here we are, still chugging along in 
trusty old ISO-8859-1.

Knowing that shutting down the site for a few days is unacceptable, I'm 
working on a technique to keep the site up, already in UTF-8 mode, 
while 'cur' and 'old' are converted. The site itself would do the 
bulk of the conversion. The idea is to intercept reads from these tables 
and convert each row to UTF-8 before it is sent to the user. Whenever a 
row is converted, it is saved back to the database so we don't have to 
convert it again. Here's an outline:

1.	Take site offline. :(

2.	Convert smaller, essential tables. (user? what else?)

3.	Apply code changes to intercept retrievals from cur and old:
	Change A: For _title queries that don't return anything, try converting 
querystring to Latin1.
	Change B: Check the _text (and _user_text ?) strings to see if they are 
valid UTF-8. If not, convert and update the database.
	Change C: $wgUseLatin1 = false !

4.	Take site back online. :D

5.	When the dust settles, start an external script to run through the 
database and update any unpopular entries that haven't been converted.
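Step 5 could be sketched roughly as follows. This is only an illustration under stated assumptions: convertIfNotUtf8() and sweepBatch() are hypothetical names, the in-memory array stands in for a batch of 'cur' rows, and any text that is not valid UTF-8 is assumed to be Windows-1252. Validity is checked with preg_match's /u modifier rather than an iconv round trip.

```php
<?php
// Hypothetical helper: returns the UTF-8 form of $text, or null when the text
// is already valid UTF-8 or the conversion fails.
function convertIfNotUtf8($text) {
    // With the /u modifier, preg_match() only returns 1 if the subject is valid UTF-8.
    if (preg_match('//u', $text) === 1) {
        return null;
    }
    // Assumption: anything that is not valid UTF-8 is Windows-1252.
    $converted = @iconv('Windows-1252', 'UTF-8', $text);
    return $converted === false ? null : $converted;
}

// Sweep one batch of rows (id => text) and return only the rows that need an UPDATE.
function sweepBatch(array $rows) {
    $updates = array();
    foreach ($rows as $id => $text) {
        $converted = convertIfNotUtf8($text);
        if ($converted !== null) {
            $updates[$id] = $converted;
        }
    }
    return $updates;
}

// In-memory stand-in for a batch of cur rows.
$batch = array(
    1 => "plain ASCII",
    2 => "caf\xE9",       // Latin1 e-acute, needs converting
    3 => "caf\xC3\xA9",   // already UTF-8, left alone
);
$updates = sweepBatch($batch); // only row 2 is returned
```

A real sweep would read rows in id ranges and issue one UPDATE per returned entry. Note that a Latin1 string whose bytes happen to form valid UTF-8 would be skipped by this check.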

I've written some code for the cur_text intercept to make sure that it 
will work. It's running well for me on a test DB I made, a simple 
English DB converted to Latin1. It only handles the cur_text field right 
now, but it should work for the others if expanded. I haven't looked at the 
problem with titles (Change A), but I think it could be tackled similarly.
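For Change A, a minimal sketch of the title fallback might look like this. findPageId() is a hypothetical name, the array stands in for the title index of 'cur', and the legacy encoding is assumed to be Windows-1252, as in the patch below.

```php
<?php
// Hypothetical lookup: $table maps stored title bytes => page id.
function findPageId(array $table, $utf8Title) {
    if (isset($table[$utf8Title])) {
        return $table[$utf8Title]; // title is pure ASCII or already converted
    }
    // Retry with the legacy bytes. iconv() returns false when the title has
    // characters with no Windows-1252 form, so no legacy row can exist for it.
    $legacy = @iconv('UTF-8', 'Windows-1252', $utf8Title);
    if ($legacy !== false && $legacy !== $utf8Title && isset($table[$legacy])) {
        return $table[$legacy];
    }
    return null;
}

// In-memory stand-in for the title index.
$table = array(
    "Caf\xE9"       => 10,  // legacy title, not yet converted
    "Z\xC3\xBCrich" => 11,  // already stored as UTF-8
);
$viaFallback = findPageId($table, "Caf\xC3\xA9");
$direct      = findPageId($table, "Z\xC3\xBCrich");
```

In the real code the second lookup would be another _title query, and a hit would presumably also trigger an update of the stored title.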

Before I continue, I wonder if anyone sees catastrophic problems with 
this approach? My familiarity with the codebase is still pretty spotty.

Here's the patch:

Index: Database.php
===================================================================
RCS file: /cvsroot/wikipedia/phase3/includes/Database.php,v
retrieving revision 1.71.2.7
diff --unified=3 -w -B -r1.71.2.7 Database.php

--- Database.php	14 Jan 2005 13:04:04 -0000	1.71.2.7
+++ Database.php	8 Feb 2005 20:10:37 -0000
@@ -681,11 +681,37 @@
  		}
  		$obj = $this->fetchObject( $res );
  		$this->freeResult( $res );
+		
+		if ($obj && is_array($conds) && isset($conds['cur_id']))
+			$this->checkEncoding($conds['cur_id'], $obj);
+
  		return $obj;
  		
  	}
  	
  	/**
+	 * Checks if the object's encoding is UTF-8. If it is not, convert to UTF-8 and update
+	 * the database record as well.
+	 *
+	 * @param $id the cur_id or old_id
+	 * @param $obj object whose fields will be verified and possibly updated
+	 */
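+	function checkEncoding($id, &$obj) { // by reference: PHP 4 would otherwise copy $obj
+		// If a conversion from UTF-8 to UTF-8 does not result in the same string, it is
+		// not a valid UTF-8 string. The @ hides the notice iconv raises on invalid input.
+		if (isset($obj->cur_text) && $obj->cur_text !== @iconv("UTF-8", "UTF-8", $obj->cur_text)) {
+			// Assume it is Windows-1252 (Latin1 + em dashes etc.), then convert.
+			$converted = @iconv("Windows-1252", "UTF-8", $obj->cur_text);
+			if ($converted === false) return; // never store a failed conversion
+			$obj->cur_text = $converted;
+			
+			// Save the converted string to the db; intval() keeps $id numeric.
+			$cur_text = mysql_real_escape_string($converted);
+			$this->query("UPDATE cur SET cur_text = '$cur_text' WHERE cur_id = " . intval($id));
+		}
+	}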
+
+	/**
  	 * Removes most variables from an SQL query and replaces them with X or N for numbers.
  	 * It's only slightly flawed. Don't use for anything important.
  	 *