On Fri, 2002-03-08 at 01:10, Jan Hidders wrote:
From: "Brion L. VIBBER" <brion(a)pobox.com>
On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" <brion(a)pobox.com>
Somewhat less solid was how to store the actual text internally: Lee
suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
That's not so obvious. If you replace the &, # and ; with something that is
indexed like ' and _ then indexing would work.
Exactly what I'm saying; in both cases we can't index words including
non-ASCII characters unless we munge the text somehow.
Which reminds me, this will not work well for Japanese and Chinese,
which don't separate words by spaces... ugh!
But like I said, I favour UTF-8 anyway and we can solve the indexing
problem with an extra column where everything is more or less
represented as entities anyway.
Don't we already have a separate index column?
(We could even replace uppercase with lowercase there and have
case-insensitive indexing.)
Hmm, can be done.
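
To make the extra-column idea concrete, here is a minimal sketch of how
the text for such an index column might be derived. The function name
and the exact substitution are assumptions for illustration only: each
non-ASCII character is written as its decimal code point wrapped in
underscores (standing in for the "&#" and ";" of an HTML entity), and
everything is lowercased first so that matching is case-insensitive.

```python
def to_index_text(text: str) -> str:
    """Return an ASCII-only, lowercased form of `text` for an index column.

    Hypothetical helper: the "_<codepoint>_" spelling is an assumed
    stand-in for "&#<codepoint>;" that a word indexer will not split.
    """
    pieces = []
    for ch in text.lower():
        if ord(ch) < 128:
            # Plain ASCII is kept as-is.
            pieces.append(ch)
        else:
            # Non-ASCII: spell out the code point between underscores.
            pieces.append(f"_{ord(ch)}_")
    return "".join(pieces)


print(to_index_text("Café"))  # caf_233_
```

The same function would have to be applied to the search terms, so that
a query for "café" is looked up as "caf_233_". As noted above, this does
nothing for Japanese or Chinese, where words are not separated by
spaces.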
[Incidentally, if we are to switch to UTF-8, we'll obviously want to do
something about the fact that the current English wikipedia uses
ISO-8859-1 high characters extensively. These pages can be converted
fairly easily, either as a one time search & replace or as a
normalise-an-old-page-when-we-first-load-it thing.]
I like the one-time-search-and-replace approach. No need to complicate
and/or slow down the run-time code with checks and translation code.
Okay. We'll need another update script for poor Jimbo, then!
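
As a rough sketch of what the one-time conversion could do per page,
assuming the stored text is available as raw bytes: the function name
is hypothetical, and the check for "already valid UTF-8" is a heuristic
added here so that running the script twice does not double-encode
anything. A Latin-1 page whose high bytes happen to form valid UTF-8
would be left alone, which is rare but possible.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Convert ISO-8859-1 page bytes to UTF-8 (hypothetical helper)."""
    try:
        # Pure ASCII and already-converted pages decode cleanly as UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Otherwise treat the bytes as ISO-8859-1 and re-encode.
        return raw.decode("iso-8859-1").encode("utf-8")


print(latin1_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```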
Well, I was hoping there would be some evidence of some kind of
consensus before anyone goes writing documents or code! :)
Of course. :-) I am still wondering what our great leader thinks of all
this.
The ways of the Great Leader are mysterious indeed, we must await
revelation... :)
We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters
This page is more about the local policy on the English Wikipedia. I would
like to see a page with a title like "A common architecture for Wikipedias
in all languages". I'd start writing it, but work is really busy at the
moment.
Okay, but they are clearly related topics since we'd be affecting the
English wikipedia as well! I might throw up a quick page in the morning
to start with.
The only sane format for URLs would be url-encoded UTF-8. This is the
recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
it is the most future-proof (can you imagine if we kept all our URLs in
EBCDIC instead of ASCII because everybody still had links & bookmarks
from their old IBM mainframe days?), and it allows links across
languages to be consistently represented.
Completely agreed.
That's you, me, and Carey Evans then...
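
For reference, a small sketch of what url-encoded UTF-8 looks like in
practice, using Python's standard urllib.parse. The helper name and the
sample title are made up for illustration; the point is only that the
same title yields the same escaped form regardless of which language's
wiki links to it, whereas a Latin-1 encoding gives different bytes and
cannot represent most non-Western-European titles at all.

```python
from urllib.parse import quote, unquote


def encode_title(title: str) -> str:
    """Percent-encode a page title as UTF-8 (hypothetical helper)."""
    return quote(title, encoding="utf-8")


print(encode_title("Ĉeĥio"))            # %C4%88e%C4%A5io
print(unquote(encode_title("Ĉeĥio")))   # Ĉeĥio

# The same character escapes differently under the two encodings:
print(quote("é", encoding="utf-8"))       # %C3%A9
print(quote("é", encoding="iso-8859-1"))  # %E9
```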
-- brion vibber (brion @ pobox.com)