From: "Brion L. VIBBER" brion@pobox.com
On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" brion@pobox.com
Somewhat less solid was how to store the actual text internally: Lee suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
That's not so obvious. If you replace the &, # and ; with something that is indexed like ' and _ then indexing would work. But like I said, I favour UTF-8 anyway and we can solve the indexing problem with an extra column where everything is more or less represented as entities anyway. (We could even replace uppercase with lowercase there and have case insensitive indexing.)
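A rough sketch of what such an extra index column could hold, for illustration only. The function name `index_key` and the exact mapping (an apostrophe standing in for `&`, underscores for `#` and `;`) are assumptions; the thread only says the entity punctuation would be swapped for characters that get indexed.

```python
def index_key(text: str) -> str:
    """Build a lowercase, ASCII-only key for a hypothetical extra index column."""
    parts = []
    for ch in text.lower():  # lowercase first, for case-insensitive matching
        if ord(ch) < 128:
            parts.append(ch)
        else:
            # Assumed stand-in for an entity such as &#233;:
            # ' replaces &, and _ replaces both # and ;
            parts.append(f"'_{ord(ch)}_")
    return "".join(parts)
```

The article text itself would stay in UTF-8; only the search column would carry this form, and a search term would be passed through the same function before the lookup.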
[Incidentally, if we are to switch to UTF-8, we'll obviously want to do something about the fact that the current English Wikipedia uses ISO-8859-1 high characters extensively. These pages can be converted fairly easily, either as a one-time search & replace or as a normalise-an-old-page-when-we-first-load-it thing.]
I like the one-time-search-and-replace approach. No need to complicate and/or slow down the run-time code with checks and translation code.
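As a sketch of the one-time pass, assuming pages are available as raw bytes (the function name `latin1_to_utf8` is made up for this example): the "already valid UTF-8" check is only a heuristic that makes the pass safe to re-run, since a Latin-1 page could in rare cases happen to be valid UTF-8 too.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode a page stored as ISO-8859-1 into UTF-8."""
    try:
        # Pure ASCII or already valid UTF-8: leave the bytes untouched.
        # Heuristic only; assumed here so the pass can be run twice safely.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Every byte is a valid ISO-8859-1 character, so this cannot fail.
        return raw.decode("iso-8859-1").encode("utf-8")
```

For example, `latin1_to_utf8(b"caf\xe9")` gives `b"caf\xc3\xa9"`, and feeding that result back in returns it unchanged.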
Well, I was hoping there would be some evidence of some kind of consensus before anyone goes writing documents or code! :)
Of course. :-) I am still wondering what our great leader thinks of all this.
We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters
This page is more about the local policy on the English Wikipedia. I would like to see a page with a title like "A common architecture for Wikipedias in all languages". I'd start writing it, but work is really busy at the moment.
The only sane format for URLs would be url-encoded UTF-8. This is the recommended norm (http://www.w3.org/International/O-URL-and-ident.html), it is the most future-proof (can you imagine if we kept all our URLs in EBCDIC instead of ASCII because everybody still had links & bookmarks from their old IBM mainframe days?), and it allows links across languages to be consistently represented.
Completely agreed.
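To make the URL format concrete, a small sketch using Python's standard `urllib.parse`. The helper names are made up for this example, and mapping spaces to underscores is an assumption based on how the page title in the link above is written.

```python
from urllib.parse import quote, unquote

def title_to_url(title: str) -> str:
    """Percent-encode a page title as UTF-8 for use in a URL."""
    # Assumed convention: spaces in titles are written as underscores.
    return quote(title.replace(" ", "_"), safe="", encoding="utf-8")

def url_to_title(encoded: str) -> str:
    """Reverse of title_to_url: decode the UTF-8 bytes, restore spaces."""
    return unquote(encoded, encoding="utf-8").replace("_", " ")
```

With this, the title "Ĉeĥio" becomes `%C4%88e%C4%A5io` on any of the wikis, so a link from one language to another does not depend on the local character set.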
-- Jan Hidders