On Thu, 2002-03-07 at 10:50, Jan Hidders wrote:
From: "Brion Vibber" brion@pobox.com
Somewhat less solid was how to store the actual text internally: Lee suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8.
Nor HTML entities, naturally.
But if we introduce a redundant indexable field where all characters (even the ASCII ones) are represented by their Unicode number, then we would have a way around the four-letter indexing boundary and the problem that you cannot index anything but letters. So in that case I would vote for UTF-8, since that would probably be the most efficient anyway.
Hmm, that's an idea.
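For concreteness, here's a rough sketch of what such a redundant indexable field might look like -- in Python rather than the wiki's PHP, with made-up function names and a made-up U%04X token format; whether hex digits get past the letters-only restriction is a detail that would need checking:

# Sketch of the "redundant indexable field" idea: alongside the UTF-8 text,
# store a derived column in which every character (ASCII included) is spelled
# out as its Unicode code point. Each token is then a fixed-width ASCII "word"
# longer than the four-character fulltext minimum. Function names and the
# token format are illustrative, not from the actual wiki code.

def to_index_form(text):
    """Encode each character as a token like 'U0057' for indexing."""
    return ' '.join('U%04X' % ord(ch) for ch in text)

def query_to_index_form(term):
    """Search terms must be transformed the same way before matching."""
    return to_index_form(term)

if __name__ == '__main__':
    print(to_index_form('Zürich'))
    # -> U005A U00FC U0072 U0069 U0063 U0068

The search code would run the user's query through the same transformation before handing it to the fulltext index, so matching still works character-for-character.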
[Incidentally; if we are to switch to UTF-8, we'll obviously want to do something about the fact that the current English wikipedia uses ISO-8859-1 high characters extensively. These pages can be converted fairly easily, either as a one-time search & replace or as a normalise-an-old-page-when-we-first-load-it thing.]
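The conversion itself is mechanically trivial: reinterpret the stored bytes as ISO-8859-1 and re-encode them as UTF-8. A sketch of the idea, again in Python with an illustrative function name (the real code would live in the PHP scripts and deal with the database plumbing):

# One-time conversion: decode the old bytes as ISO-8859-1, re-encode as UTF-8.
# The same transformation could be applied lazily, normalising an old page the
# first time it is loaded.

def latin1_to_utf8(raw_bytes):
    """Convert a page stored as ISO-8859-1 bytes to UTF-8 bytes."""
    return raw_bytes.decode('iso-8859-1').encode('utf-8')

if __name__ == '__main__':
    old = b'Universit\xe4t G\xf6ttingen'   # ISO-8859-1: 0xE4 = ä, 0xF6 = ö
    print(latin1_to_utf8(old))
    # -> b'Universit\xc3\xa4t G\xc3\xb6ttingen'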
If there's some consensus on this, we can get crackin' and get this implemented so the upgrades can proceed.
Er, I would suggest that before coding we set up a document that describes what the consensus is. It should say which encodings are used for what, when, and where. It should also say which functions take care of this encoding. This would also include the encoding used in URLs.
Well, I was hoping there would be some evidence of some kind of consensus before anyone goes writing documents or code! :)
We can probably add this stuff to http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_fore...
The only sane format for URLs would be url-encoded UTF-8. This is the recommended norm (http://www.w3.org/International/O-URL-and-ident.html), it is the most future-proof (can you imagine if we kept all our URLs in EBCDIC instead of ASCII because everybody still had links & bookmarks from their old IBM mainframe days?), and it allows links across languages to be consistently represented.
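To make that concrete, here's a quick illustration of what a URL-encoded UTF-8 title looks like on the wire, using Python's standard library purely to show the byte sequence (not a proposal for how the wiki scripts should be written):

# A title's characters are first encoded as UTF-8 bytes, then each non-ASCII
# byte is percent-escaped; decoding reverses both steps.
from urllib.parse import quote, unquote

title = 'Zürich'
encoded = quote(title.encode('utf-8'))
print(encoded)            # Z%C3%BCrich
print(unquote(encoded))   # Zürich

The same percent-escaped form works regardless of which language's wikipedia the link points to, which is what makes cross-language links consistent.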
-- brion vibber (brion @ pobox.com)