On Tue, 5 Mar 2002, Jimmy Wales wrote:
> So, where do we stand on the issue of international upgrades?
> I'd like to get back to these quickly, if possible. Starting with Esperanto, and then Polish. And then probably Spanish, although of course we'll now need to co-ordinate with the forpas forked group, so that we minimize the extent of the forkage in the hopes of bringing things back together soon.
There was a discussion some days ago on how to best implement a more or less character set independent underlying system, but it sort of died out without any clear consensus. In particular, no response from lcrocker, who made the initial claim that the present system of using language-tied encodings was Wrong.
As it left off, Jan and I had more or less agreed on something like:
* Check the browser's Accept-Charset header; if it lists UTF-8, use UTF-8. If not, use the most likely encoding for that wiki's language.
* Where necessary, convert characters into/out of HTML entities so that non-Unicode browsers can safely handle all characters.
* Internally, non-ASCII characters will need to be escaped somehow in the search index field to allow correct indexing.
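A rough sketch of the first two points, in Python rather than PHP just to keep it short. The function names and the default charset passed in are made up for illustration; none of this is existing wiki code, and the header parsing is deliberately simplistic.

```python
def parse_accept_charset(header):
    """Return {charset: q} parsed from an Accept-Charset header value."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a charset listed without a q-value counts as fully acceptable
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0
        prefs[name.strip().lower()] = q
    return prefs


def choose_charset(header, wiki_default):
    """Use UTF-8 if the browser lists it (or '*'); else the wiki's usual charset."""
    prefs = parse_accept_charset(header or "")
    if prefs.get("utf-8", prefs.get("*", 0.0)) > 0:
        return "utf-8"
    return wiki_default


def encode_for_browser(text, charset):
    """Encode text; characters the charset lacks become numeric entities."""
    return text.encode(charset, errors="xmlcharrefreplace")
```

A browser that sends no header, or one that does not mention UTF-8, gets whatever is passed as `wiki_default`; anything that charset cannot represent goes out as `&#NNNN;`.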
Somewhat less solid was how to store the actual text internally: Lee suggested Latin-1 (or ASCII?) with HTML entities; I suggested UTF-8.
* UTF-8 is more space- and bandwidth-efficient and doesn't require outgoing transliteration for the many users with relatively current, UTF-8-savvy browsers, but needs to be translated into the native character set / HTML entities for non-UTF-8-savvy browsers (old old old ones, and Netscape 4, which has very buggy Unicode support).
* ASCII+HTML entities won't require outgoing translation for non-UTF-8-savvy browsers that nonetheless understand Unicode-numbered character entities, but may not be much of an improvement for older browsers that don't know that numeric character entities always refer to Unicode code points, not to positions in the current character set. Thus outgoing translation to the browser's character set is recommended to be safe.
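To put a rough number on the space argument, here is a small Python sketch that measures the same text stored both ways. `stored_sizes` is a hypothetical helper, not anything in the codebase, and it assumes decimal numeric entities for the ASCII variant.

```python
def stored_sizes(text):
    """Return (bytes as UTF-8, bytes as ASCII with numeric HTML entities)."""
    as_utf8 = text.encode("utf-8")
    # xmlcharrefreplace turns each non-ASCII character into &#NNN;
    as_entities = text.encode("ascii", errors="xmlcharrefreplace")
    return len(as_utf8), len(as_entities)


if __name__ == "__main__":
    sample = "e\u0125o\u015dan\u011do \u0109iu\u0135a\u016dde"
    print(stored_sizes(sample))
```

Each accented Esperanto letter costs two bytes in UTF-8 against six as a decimal entity, so the gap grows with the share of non-ASCII characters in the text.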
Incoming translation is always required, as edited text will come to us in the character encoding used by the browser and may or may not have HTML entities typed by the user mixed in.
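One possible shape for that incoming step, sketched in Python. The policy here is an assumption rather than anything agreed on the list: decode from the browser's charset, then expand only those entities that stand for non-ASCII characters, so that markup-significant ones like `&lt;` and `&amp;` stay exactly as typed. `normalize_incoming` is a hypothetical name.

```python
import re
from html.entities import name2codepoint

# Matches &#123; &#x1F; and named entities such as &eacute;
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _expand(match):
    body = match.group(1)
    if body.startswith(("#x", "#X")):
        code = int(body[2:], 16)
    elif body.startswith("#"):
        code = int(body[1:])
    else:
        code = name2codepoint.get(body)
    # Leave unknown names, ASCII entities and invalid code points untouched.
    if code is None or code < 128 or code > 0x10FFFF:
        return match.group(0)
    if 0xD800 <= code <= 0xDFFF:  # surrogates cannot be stored as UTF-8
        return match.group(0)
    return chr(code)


def normalize_incoming(raw, browser_charset):
    """Decode submitted bytes and fold non-ASCII entities into real characters."""
    text = raw.decode(browser_charset, errors="replace")
    return _ENTITY.sub(_expand, text)
```

The result is a plain Unicode string, which can then be stored in whichever internal form is settled on.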
The character set translation can probably be done mostly via PHP's iconv support -- however, this is an optional component and must be enabled at compile time (same as the annoying 4-letter minimum for search index terms). Also, some slight customisation of the process is necessary for, e.g., the Esperanto transliteration scheme (basically in place in $RecodeInput/$RecodeOutput).
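To show the kind of customisation meant for Esperanto, here is a Python sketch that assumes the scheme is the common x-convention (cx, gx, hx, jx, sx, ux). The function names echo $RecodeInput/$RecodeOutput, but this is not the existing code and the mapping is my assumption.

```python
import re

# Base letters and their circumflexed/breved forms (c, g, h, j, s, u).
_PAIRS = {"c": "\u0109", "g": "\u011d", "h": "\u0125",
          "j": "\u0135", "s": "\u015d", "u": "\u016d"}

X_TO_UNI = {}
UNI_TO_X = {}
for base, accented in _PAIRS.items():
    X_TO_UNI[base + "x"] = accented
    X_TO_UNI[base.upper() + "x"] = accented.upper()
    X_TO_UNI[base.upper() + "X"] = accented.upper()
    UNI_TO_X[accented] = base + "x"
    UNI_TO_X[accented.upper()] = base.upper() + "x"

_X_RE = re.compile(r"[cghjsuCGHJSU][xX]")


def recode_input(text):
    """Turn x-convention digraphs typed by the user into real letters."""
    return _X_RE.sub(lambda m: X_TO_UNI.get(m.group(0), m.group(0)), text)


def recode_output(text):
    """Turn accented letters back into digraphs for browsers that need them."""
    return "".join(UNI_TO_X.get(ch, ch) for ch in text)
```

This naive version cannot tell a literal "ux" from a transliterated one, so a real implementation would need some escape for that case.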
If there's some consensus on this, we can get crackin' and get this implemented so the upgrades can proceed.
-- brion vibber (brion @ pobox.com)