From: "Brion Vibber" brion@pobox.com
Somewhat less solid was how to store the actual text internally: Lee suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
We cannot index UTF-8. But if we introduce a redundant indexable field where all characters (even the ASCII ones) are represented by their Unicode number, then we would have a way around the 4-letter indexing boundary and the problem that you cannot index anything but letters. So in that case I would vote for UTF-8, since that would probably be the most efficient anyway.
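To make the redundant-field idea concrete, here is a rough sketch in Python. The function names and the "u" + hex token format are assumptions made for illustration, not an agreed design or existing code. Each character of a word is replaced by its Unicode number, so even a one-letter word becomes a token longer than four characters, built only from letters and digits.

```python
def to_index_field(text: str) -> str:
    """Build the (hypothetical) redundant index field for a piece of text.

    Every character, ASCII included, becomes 'u' followed by its Unicode
    number in hex (at least 4 digits). Words stay separated by single spaces.
    """
    return " ".join(
        "".join(f"u{ord(ch):04x}" for ch in word)
        for word in text.split()
    )


def from_index_field(field: str) -> str:
    """Decode an index field back to text (whitespace runs become one space)."""
    words = []
    for token in field.split():
        # 'u' is not a hex digit, so splitting on it recovers each number.
        numbers = token.split("u")[1:]
        words.append("".join(chr(int(n, 16)) for n in numbers))
    return " ".join(words)
```

A search term would have to be passed through the same encoding before it is matched against this field, while the UTF-8 text itself stays in the normal field for display.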
If there's some consensus on this, we can get crackin' and get this implemented so the upgrades can proceed.
Er, I would suggest that before coding we set up a document that describes what the consensus is. It should say which encodings are used for what, when and where. It should also say which functions take care of each encoding. This would also include the encoding used in URLs.
-- Jan Hidders