On Fri, 2002-03-08 at 01:10, Jan Hidders wrote:
From: "Brion L. VIBBER" brion@pobox.com
On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" brion@pobox.com
Somewhat less solid was how to store the actual text internally: Lee suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
That's not so obvious. If you replace the &, # and ; with something that is indexed like ' and _ then indexing would work.
Exactly what I'm saying; in both cases we can't index words including non-ASCII characters unless we munge the text somehow.
Which reminds me, this will not work well for Japanese and Chinese, which don't separate words by spaces... ugh!
But like I said, I favour UTF-8 anyway and we can solve the indexing problem with an extra column where everything is more or less represented as entities anyway.
Don't we already have a separate index column?
(We could even replace uppercase with lowercase there and have case insensitive indexing.)
Hmm, can be done.
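As a rough illustration of the extra index column discussed above, here is a minimal Python sketch. The function name `index_form` and the `_NNN_` token scheme are assumptions made for this example, not anything agreed in the thread. The sketch lowercases the text and rewrites each non-ASCII character as its decimal code point wrapped in underscores, so the result contains only characters a plain ASCII word indexer would accept.

```python
def index_form(text: str) -> str:
    """Return a lowercase, ASCII-only version of text for an index column.

    Hypothetical scheme: each non-ASCII character becomes _<code point>_,
    i.e. a numeric entity with '&#' and ';' swapped for underscores.
    """
    out = []
    for ch in text.lower():
        if ord(ch) < 128:
            # Plain ASCII is kept as-is (already lowercased).
            out.append(ch)
        else:
            # Munge the character into an indexable token.
            out.append(f"_{ord(ch)}_")
    return "".join(out)


print(index_form("Café"))
```

A search term would have to go through the same function before being matched against the column. The scheme does nothing for the Japanese and Chinese word-boundary problem mentioned earlier, and it can collide with text that already contains underscores and digits.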
[Incidentally, if we are to switch to UTF-8, we'll obviously want to do something about the fact that the current English Wikipedia uses ISO-8859-1 high characters extensively. These pages can be converted fairly easily, either as a one-time search & replace or as a normalise-an-old-page-when-we-first-load-it thing.]
I like the one-time-search-and-replace approach. No need to complicate and/or slow down the run-time code with checks and translation code.
Okay. We'll need another update script for poor Jimbo, then!
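The core of such a one-time conversion can be sketched in a few lines of Python. This assumes the stored page text really is ISO-8859-1 and that the script runs exactly once; the function name `latin1_to_utf8` is made up for this example, and reading and writing the actual pages is left out.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8.

    Every byte value is a valid ISO-8859-1 character, so decoding cannot
    fail. Pure ASCII input comes back unchanged.
    """
    return raw.decode("iso-8859-1").encode("utf-8")


old_page = b"caf\xe9"            # "café" stored as ISO-8859-1
new_page = latin1_to_utf8(old_page)
print(new_page)
```

Running the conversion a second time on already converted text would corrupt the high characters, which is one more argument for doing it as a single offline pass rather than at page-load time.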
Well, I was hoping there would be some evidence of some kind of consensus before anyone goes writing documents or code! :)
Of course. :-) I am still wondering what our great leader thinks of all this.
The ways of the Great Leader are mysterious indeed, we must await revelation... :)
We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters
This page is more about the local policy on the English Wikipedia. I would like to see a page with a title like "A common architecture for Wikipedias in all languages". I'd start writing it, but work is really busy at the moment.
Okay, but they are clearly related topics since we'd be affecting the English wikipedia as well! I might throw up a quick page in the morning to start with.
The only sane format for URLs would be url-encoded UTF-8. This is the recommended norm (http://www.w3.org/International/O-URL-and-ident.html), it is the most future-proof (can you imagine if we kept all our URLs in EBCDIC instead of ASCII because everybody still had links & bookmarks from their old IBM mainframe days?), and it allows links across languages to be consistently represented.
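To make "url-encoded UTF-8" concrete, here is a small Python sketch using the standard library's `urllib.parse`. The helper names `title_to_url` and `url_to_title` are hypothetical, and the sample title is only an illustration.

```python
from urllib.parse import quote, unquote


def title_to_url(title: str) -> str:
    """Percent-encode a page title as UTF-8 bytes for use in a URL."""
    # safe="" means reserved characters such as "/" are escaped as well.
    return quote(title, safe="", encoding="utf-8")


def url_to_title(encoded: str) -> str:
    """Reverse title_to_url, decoding the escaped bytes as UTF-8."""
    return unquote(encoded, encoding="utf-8")


encoded = title_to_url("Ĉeĥio")
print(encoded)
print(url_to_title(encoded))
```

Because the encoding does not depend on which language the wiki is in, the same title yields the same URL wherever the link appears.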
Completely agreed.
That's you, me, and Carey Evans then...
-- brion vibber (brion @ pobox.com)