Peter Gervai wrote:
Could you point us to the page and revision of the problem?
A couple of examples:
http://meta.wikipedia.org/w/wiki.phtml?title=What_to_do_with_www.wikipedia.o... http://meta.wikipedia.org/w/wiki.phtml?title=Main_Page&diff=20132&ol...
This happens on meta's Main Page often. Ask Anthere and Erik for other examples.
I'm curious what kind of problem it might have been, as many of the Wikipedias have been in UTF-8 from the start, and we have had no problems whatsoever.
Probably because their browsers work nicely in UTF-8 because they have to. If they didn't they would be useless for any language where UTF-8 is required. In places where UTF-8 isn't required, browsers that can't support it tend to slip by without being fixed or upgraded. If it ain't broke...
However we *do* have problems with the English Wikipedia when pages contain unrepresentable literal characters, which break the page after editing. See the "Budapest" article on Wikitravel, where every special dash and curly quote became a question mark. Truly ugly.
I don't understand. Is Wikitravel in UTF-8?
-- Daniel Mayer (aka mav)
On Tue, Nov 18, 2003 at 04:28:32AM -0500, Daniel Mayer wrote:
Peter Gervai wrote:
Could you point us to the page and revision of the problem?
A couple of examples:
http://meta.wikipedia.org/w/wiki.phtml?title=What_to_do_with_www.wikipedia.o... http://meta.wikipedia.org/w/wiki.phtml?title=Main_Page&diff=20132&ol...
This happens on meta's Main Page often. Ask Anthere and Erik for other examples.
I see. The first is not a good example: Opera 5 is _ancient_, you can't expect anyone to support it, as upgrading is clearly painless.
Second example is indeed valid, but it isn't a problem for you: if the page does not contain non-8859-1 characters, nothing gets garbled. If it does contain others then, well, you *need* utf-8 on that page anyway. (Embed codes are a little bit slow to type, don't you agree? If not, write your reply manually by using embeds. :))
I'm curious what kind of problem it might have been, as many of the Wikipedias have been in UTF-8 from the start, and we have had no problems whatsoever.
Probably because their browsers work nicely in UTF-8 because they have to. If they didn't they would be useless for any language where UTF-8 is required. In places where UTF-8 isn't required, browsers that can't support it tend to slip by without being fixed or upgraded. If it ain't broke...
I understand your problem, it is valid, and that's probably the reason it's a topic on wikitech. Still I believe we can expect editors to use non-ancient browsers (remember, reading is not a problem). As far as I know most browsers handle this very well (including, for example, Unix character-mode browsers).
However we *do* have problems with the English Wikipedia when pages contain unrepresentable literal characters, which break the page after editing. See the "Budapest" article on Wikitravel, where every special dash and curly quote became a question mark. Truly ugly.
I don't understand. Is Wikitravel in UTF-8?
Not at all! It clearly shows what happens when a page _IS_ 8859-1 encoded but editors want to use fancy characters. The same happens when they do it with an old browser on UTF-8 pages. So you get trash either way, and other editors revert it the same way, so you may as well use UTF-8, don't you think? :-)
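The garbling described above is easy to reproduce; here is a minimal Python sketch (the sample text is invented, not the actual Wikitravel content) showing how characters outside Latin-1 become question marks when a page is forced into that encoding:

```python
# A string containing an en dash (U+2013) and curly quotes (U+201C/U+201D),
# none of which exist in ISO-8859-1 (Latin-1).
text = "Budapest \u2013 a \u201ccurly\u201d example"

# Encoding to Latin-1 with the "replace" error handler substitutes "?"
# for every unrepresentable character -- exactly the corruption seen
# in the "Budapest" article after editing.
garbled = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
print(garbled)  # Budapest ? a ?curly? example
```

Once the question marks are stored, the original characters are unrecoverable, which is why every subsequent edit carries the damage forward.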
Peter
Peter Gervai wrote:
On Tue, Nov 18, 2003 at 04:28:32AM -0500, Daniel Mayer wrote:
Peter Gervai wrote:
Could you point us to the page and revision of the problem?
A couple of examples:
http://meta.wikipedia.org/w/wiki.phtml?title=What_to_do_with_www.wikipedia.o... http://meta.wikipedia.org/w/wiki.phtml?title=Main_Page&diff=20132&ol...
This happens on meta's Main Page often. Ask Anthere and Erik for other examples.
I see. The first is not a good example: Opera 5 is _ancient_, you can't expect anyone to support it, as upgrading is clearly painless.
Second example is indeed valid, but it isn't a problem for you: if the page does not contain non-8859-1 characters, nothing gets garbled. If it does contain others then, well, you *need* utf-8 on that page anyway. (Embed codes are a little bit slow to type, don't you agree? If not, write your reply manually by using embeds. :))
I don't really see the problem with typing embed codes manually on, for example, the English Wikipedia. I presume the vast majority of the text will be in English (that is the language, after all), with only a very few words and phrases not in English (Greek or Chinese or other versions of foreign place names). It shouldn't be *too* much of a hassle to type those few phrases with embed codes.
Actually, I think with the current setup, on en: at least, you can type them literally and when you hit submit or preview it'll automatically convert them to the numeric codes. I seem to recall this happening with some Greek text I pasted in (though I could be mistaken).
-Mark
On Nov 18, 2003, at 14:49, Delirium wrote:
Actually, I think with the current setup, on en: at least, you can type them literally and when you hit submit or preview it'll automatically convert them to the numeric codes. I seem to recall this happening with some Greek text I pasted in (though I could be mistaken).
That's a behavior of some, but not all, browsers. A property of those browsers that do this is they also properly submit UTF-8 data when the pages are in UTF-8.
-- brion vibber (brion @ pobox.com)
On Tue, Nov 18, 2003 at 02:49:12PM -0800, Delirium wrote:
Peter Gervai wrote:
On Tue, Nov 18, 2003 at 04:28:32AM -0500, Daniel Mayer wrote:
Peter Gervai wrote:
I don't really see the problem with typing embed codes manually on, for example, the English Wikipedia.
Try to type Greek or Chinese phrases this way. Go on, try. Look at any Chinese-related article, find the name there (don't cheat, don't edit the article), and type its embed codes in your favourite editor. I'd like to know what percentage of Chinese characters you can guess right just by looking at them. :-)
May I put bets on you? ;->
Actually, I think with the current setup, on en: at least, you can type them literally and when you hit submit or preview it'll automatically convert them to the numeric codes. I seem to recall this happening with some Greek text I pasted in (though I could be mistaken).
Yes, you are pretty wrong here. Some browsers actually change those characters to embeds (like Mozilla), which is *illegal*. There is no guarantee at the time of the POST that the server wants HTML-encoded (or any other) text, so the browser does a _wild_guess_ and picks the most used encoding.
Other browsers' authors say that they conform to the standards and are not willing to make wild, baseless guesses, so they do not submit illegal characters in a POST whose encoding type does not support them.
Anyone may debate that, but I see no basis for forcing anyone not to follow the standards.
(And not every browser does it this way. Some do it the other way, submitting anything they like; then PHP and MySQL and the other underlying crap change it to God knows what, the article changes the next time someone edits it, and nobody is going to be able to tell why it happened. It's fun. :))
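For illustration, the "change those characters to embeds" fallback can be approximated in Python; this is a sketch of the effect being described, not any browser's actual code:

```python
# When a form is submitted in Latin-1 but the user typed characters that
# Latin-1 cannot represent, some browsers substitute HTML numeric
# character references (&#NNNN;) instead of the raw characters.
# Python's "xmlcharrefreplace" error handler mimics that substitution.
def submit_as_latin1(text: str) -> bytes:
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(submit_as_latin1("alpha: \u03b1"))  # b'alpha: &#945;'
```

The server then receives `&#945;` as seven plain ASCII bytes, with no way to tell whether the user literally typed that sequence or the browser substituted it -- which is exactly why the behaviour is called a wild guess above.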
grin
Hi,
Actually, I think with the current setup, on en: at least, you can type them literally and when you hit submit or preview it'll automatically convert them to the numeric codes. I seem to recall this happening with some Greek text I pasted in (though I could be mistaken).
Yes, you are pretty wrong here. Some browsers actually change those characters to embeds (like Mozilla), which is *illegal*. There is no guarantee at the time of the POST that the server wants HTML-encoded (or any other) text, so the browser does a _wild_guess_ and picks the most used encoding.
I strongly disagree with cutting backward compatibility. But couldn't we find a technical solution? I.e.:
- storing everything in the database in UTF-8
- converting it automagically into embedded codes (maybe even using HTML entities like &uuml; or &szlig;) before an edit
- converting everything back to UTF-8 after the edit, including both correct UTF-8 characters inserted by the edit and embedded codes
Best regards,
Till Westermayer - till we *)
mailto:till@tillwe.de . www.westermayer.de/till/ . icq 320393072
Habsburgerstr. 82 . 79104 Freiburg . 0761 55697152 . 0160 96619179
On Wed, Nov 19, 2003 at 12:02:00PM +0100, Till Westermayer wrote:
I strongly disagree with cutting backward compatibility. But couldn't we find a technical solution? I.e.:
storing everything in the database in UTF-8
converting it automagically into embedded codes (maybe even using HTML entities like &uuml; or &szlig;) before an edit
Preferably (de)activated by a user preference item. (It's pretty hard to edit with embeds, and it's useless for a modern browser.)
- converting everything back to UTF-8 after the edit, including correct UTF-8 characters inserted by the edit and embedded codes
Apart from my comment: Good idea!
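The proposed round trip could be sketched roughly like this; a minimal Python illustration using numeric references only, with invented function names (the real feature would also need to handle named entities and the user-preference switch mentioned above):

```python
import re

# Storage is UTF-8; the edit box is served as Latin-1.
def to_edit_form(stored: str) -> str:
    # Characters outside Latin-1 become &#NNNN; references so any
    # old browser can display and resubmit them safely.
    return stored.encode("iso-8859-1",
                         errors="xmlcharrefreplace").decode("iso-8859-1")

def from_edit_form(edited: str) -> str:
    # On save, decode numeric references back to real characters,
    # normalizing everything to UTF-8 for storage.
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), edited)

stored = "Gr\u00fc\u00dfe \u2013 \u03b1"   # umlauts survive; dash and alpha become embeds
edit_box = to_edit_form(stored)
print(edit_box)                            # Grüße &#8211; &#945;
assert from_edit_form(edit_box) == stored  # lossless round trip
```

Note that characters already in Latin-1 (like the umlauts) pass through untouched, so ordinary editing stays readable; only the rare characters show up as embeds.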
grin