Clutch wrote:
Toby Bartels wrote:
You can (and we often do, on [[en:]]) use HTML entities, such as &#268; (for "C(", "C" with a hacek, TeX's "\v C").
That approach borks things up. Specifically, it screws websearches. How many people are going to enter, or know to enter, the HTML entity when they type in a search term?
Google is quite capable of finding "&#268;" when a user enters "C(" (well, not literally "C(", but the actual Czech letter itself). If Wikipedia's own search engine isn't, then we should fix that anyway.
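To illustrate what I mean, here is a rough sketch in Python of the step a search indexer would need: decode the character references before indexing, so the stored text matches what the user types. The function name `index_form` and the sample strings are made up for this example; I am not claiming this is how Google or Wikipedia's search actually works.

```python
import html

def index_form(wikitext: str) -> str:
    """Decode HTML character references, so '&#268;' is indexed as the letter itself."""
    return html.unescape(wikitext)

# Hypothetical article text that uses numeric entities, as on a Latin-1 wiki.
stored = "&#268;esk&#225; republika"

# What a user with a Czech keyboard would type (written with escapes for clarity).
query = "\u010cesk\u00e1 republika"

print(index_form(stored) == query)
```

Once the entities are decoded, the comparison no longer depends on whether the author typed the entity or the character.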
Related, but slightly different, it screws up collation. With collation you can find things with diacritics even when you aren't putting the diacritics in yourself, and sorting order gets done properly.
I don't see how this is relevant to text. It *is* relevant to titles, but I already agree with you that UTF-8 would be nice to have for those!
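For titles, the kind of diacritic-insensitive matching you describe could look roughly like the following Python sketch. The helper name `fold_key` and the sample titles are hypothetical. It only strips combining marks after Unicode decomposition, so it is an approximation: real collation follows locale-specific rules (Czech, for instance, treats "C" with a hacek as a separate letter sorted after "C"), which this deliberately ignores.

```python
import unicodedata

def fold_key(title: str) -> str:
    """Return a lookup key with diacritics removed and case folded."""
    # NFD splits a letter such as U+010C into 'C' plus a combining caron.
    decomposed = unicodedata.normalize("NFD", title)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

# Hypothetical titles; the escapes spell Capek and Dvorak with their diacritics.
titles = ["Zebra", "\u010capek", "Cezanne", "Dvo\u0159\u00e1k"]

# A user typing without diacritics still finds the accented title.
matches = [t for t in titles if fold_key(t) == fold_key("dvorak")]
print(matches)

# Sorting on the folded key keeps accented titles next to their plain neighbours.
print(sorted(titles, key=fold_key))
```

This works on any string, so the same key could be applied to body text, but it is titles where lookup and sort order really depend on it.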
I think UTF-8 is the way to go. It's been out for years, and is now widely supported.
Not widely enough, if Anthere's evidence is accurate. (Better evidence would be citations from server logs for the various Latin-1 wikis that people want to switch over.) I don't know anybody who would oppose switching everything to Unicode once it's nearly universally supported -- so that's what it comes down to.
I don't want to get into this argument too much -- I support switching to UTF-8 if it won't screw things up, and I oppose switching if it will screw things up. Other than that, I just have some evidence (from meta) that it *can* screw things up, so we need to watch for it; but switching may well still be the right thing to do! I just wanted to point out that the functionality is there (but not conveniently) in the body of the article (but not the title).
-- Toby