I was wondering about the pros and cons of utf-8 for the French Wikipedia:
Pros: * improve interwiki links: any non-Latin-1 (ISO-8859-1 = default charset) link would be possible without any transformation. Many non-Latin-1 links are copied and pasted in raw form by interwiki wanderers but not checked afterwards; they are always miscoded, which means time lost fixing them;
* improve orthography (and article naming): French uses the famous <oe> digraph, which is not encoded by Latin-1 (Latin-9 does); every editor must either type the HTML entity &oelig; or choose not to encode it, resulting in misspelled words (one of our bots, Orthogaffe, when it was used for orthographic purposes, had many "oeuvre -> œuvre" replacements to do);
* terminate transcoding problems: many editors do not use Windows and its codepages; others do, but with Win-1252 or Unicode as the default charset. When text is pasted from an application not using strict Latin-1 (but Win-1252, MacRoman, etc.) into a wiki editing area, it is badly transcoded by the wiki software, resulting in many raw quotation marks and <oe> ligatures being replaced by question marks.
========
Cons: * any text containing non-ASCII characters would increase in weight: instead of one byte for a single <c with cedilla>, it would require two; French uses lots of non-ASCII characters, such as é è ç à ù;
* I do not see other cons.
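The byte-count claim in that con is easy to check (a minimal Python sketch; the character list is just the examples from this mail):

```python
# Each accented French character costs 1 byte in Latin-1 but 2 bytes in
# UTF-8; the œ ligature is not representable in Latin-1 at all.
for ch in "éèçàùœ":
    try:
        in_latin1 = len(ch.encode("iso-8859-1"))
    except UnicodeEncodeError:
        in_latin1 = None  # not in Latin-1 (œ needs Latin-9 or an entity)
    print(ch, "latin-1:", in_latin1, "utf-8:", len(ch.encode("utf-8")))
```

So the overhead is at most one extra byte per accented letter, and only for the accented letters.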
Would it be possible, then, to make UTF-8 the default charset for the French Wikipedia?
Vincent Ramos
On Mon, Nov 17, 2003 at 06:06:00PM +0100, Vincent Ramos wrote:
Cons:
- any text containing non ASCII characters would increase
its weight : instead of one byte for a single <c with cedilla>, it would require two; French uses lots of non ASCII characters, as é è ç à ù;
It's not obvious it's going to be bigger at all - you need only 2-3 bytes instead of about 8 (&#4660;) for characters not in Latin 1.
French main page:
Latin 1 (as is):                28035
Naively converted to UTF-8:     28235 (0.7% bigger)
With all &entities; UTF-8-ized: 27867 (0.6% smaller)
The last number is smaller than Latin 1, but some of it is cheating - the French main page contained &codes; for some characters that could be represented directly (like &uuml; for ü), and some of the conversions weren't completely legal (&amp; -> &).
Still, there's no reason to believe UTF-8 is going to be significantly worse space-wise than ISO-8859-1. For the English Wikipedia it's going to be obviously smaller, as it uses even fewer characters from the 128-255 range.
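The size comparison above can be reproduced in miniature (a Python sketch; the sample string is made up, not the actual main page). An &#NNNN; entity costs 6-8 bytes in a Latin-1 page, while the same character costs only 2-3 bytes once the page is UTF-8:

```python
import html

# A Latin-1 page spelling characters as entities; &#339; (œ) genuinely
# needs one, while ï and é could have been written directly - the same
# "cheating" seen on the real main page.
latin1_page = "na&#239;vet&#233; &#339;uvre"
# "UTF-8-izing" the entities yields the characters themselves.
utf8_page = html.unescape(latin1_page)        # "naïveté œuvre"

print(len(latin1_page.encode("iso-8859-1")))  # 28 bytes with entities
print(len(utf8_page.encode("utf-8")))         # 16 bytes as UTF-8
```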
On Mon, 17 Nov 2003 18:31:40 +0100, Tomasz Wegrzanowski taw@users.sf.net wrote:
Still, there's no reason to believe UTF-8 is going to be significantly worse space-wise than ISO-8859-1. For the English Wikipedia it's going to be obviously smaller, as it uses even fewer characters from the 128-255 range.
I agree with you.
Vincent
Vincent Ramos wrote:
I was wondering about the pros and cons of utf-8 for the French
Wikipedia:
%<--
Would it be possible, thus, to make utf-8 default charset
for the French Wikipedia?
Vincent Ramos
Please don't! This is totally unneeded for the French charset and will only cause problems with browsers which don't support UTF-8.
-- Looxix
On Mon, 2003-11-17 at 18:33, Luc Van Oostenryck wrote:
Vincent Ramos wrote:
I was wondering about the pros and cons of utf-8 for the French
Wikipedia:
%<--
Would it be possible, thus, to make utf-8 default charset
for the French Wikipedia?
Vincent Ramos
Please don't! This is totally unneeded for the French charset and will only cause problems with browsers which don't support UTF-8.
Don't you ever use mathematical symbols in France?
Wikitech-l mailing list Wikitech-l@Wikipedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l
On Mon, 17 Nov 2003 19:33:19 +0100, Luc Van Oostenryck luc.vanoostenryck@easynet.be wrote:
Please don't! This is totally unneeded for the French charset and will only cause problems with browsers which don't support UTF-8.
Do you know what percentage of browsers actually have this problem?
The point is: why do other Wikimedia projects use UTF-8 and not fr:? Are there so many non-UTF-8-compliant browsers?
Vincent
Vincent Ramos wrote:
On Mon, 17 Nov 2003 19:33:19 +0100, Luc Van Oostenryck luc.vanoostenryck@easynet.be wrote:
Please don't! This is totally unneeded for the French charset and will only cause problems with browsers which don't support UTF-8.
Do you know what percentage of browsers actually have this problem? The point is: why do other Wikimedia projects use UTF-8
and not fr:? Are there so many non-UTF-8-compliant browsers?
Vincent
I don't know how many browsers are not UTF-8 compliant, but I'm sure there will be problems. My point of view is more like the following: why should we need UTF-8 on fr when 99.9999% of the text is fairly happy with ISO 8859-1 (the charset that most Europeans, and the French in particular, use) and named or numeric HTML entities are OK for the rest.
Note, Vincent, that I understand your needs for special characters.
-- Looxix
On Mon, Nov 17, 2003 at 11:29:23PM +0100, Luc Van Oostenryck wrote:
I don't know how many browsers are not UTF-8 compliant, but I'm sure there will be problems. My point of view is more like the following: why should we need UTF-8 on fr when 99.9999% of the text is fairly happy with ISO 8859-1 (the charset that most Europeans, and the French in particular, use) and named or numeric HTML entities are OK for the rest.
Why should we need ISO 8859-N with named or numeric HTML entities when 100% of text works with UTF-8? The number is much smaller than 99.9999%. It's probably much smaller than 99% too.
I worked on articles about the Japanese language when the Polish Wikipedia used ISO 8859-2, and it wasn't fun. I wasted a lot of time trying to locate which &-entity corresponds to which character in order to fix a typo or make some other change. In the end I installed a local mirror of Wikipedia with UTF-8 and wrote a bunch of Perl scripts that converted from ISO 8859-2 + &codes; to a readable representation.
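The conversion those scripts performed can be sketched in a few lines (a hypothetical Python rendering of the same idea, not Tomasz's actual Perl code):

```python
import html

def iso2_entities_to_utf8(raw: bytes) -> bytes:
    """Turn ISO 8859-2 text full of &#NNNN; entities into plain UTF-8."""
    text = raw.decode("iso-8859-2")   # native Polish letters decode directly
    text = html.unescape(text)        # &#12354; etc. become real characters
    return text.encode("utf-8")

# ń is native to ISO 8859-2; the Japanese あ had to be an entity.
page = "japo\u0144ski &#12354;".encode("iso-8859-2")
print(iso2_entities_to_utf8(page).decode("utf-8"))  # japoński あ
```

Note that html.unescape also expands entities like &amp;, which is exactly the "not completely legal" wrinkle mentioned earlier in the thread; a careful converter would leave those alone.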
On Mon, 17 Nov 2003, Vincent Ramos wrote:
Do you know what percentage do this trick?
The point is: why do other Wikimedia projects use utf-8 and not fr:? Are there so many utf-8 non compliant browsers?
I think it's historical. When switching over to phase 3, ISO 8859-1 was used for the western languages, and UTF-8 only for those languages that really could not do without it, like Japanese. Nowadays, all newly converted languages are immediately put on UTF-8, and Polish (formerly ISO 8859-2) and Russian (formerly ISO 8859-5) have also been switched to UTF-8, but the 'older' ISO 8859-1 languages have all been kept on ISO 8859-1.
Andre Engels
Vincent Ramos wrote:
- improve interwiki links: any non latin-1 (ISO-8859-1 = default charset)
link would be possible without any transformation. Many non latin-1 links are copied and pasted in raw format by interwiki wanderers but not checked afterwards; they are always miscoded, which results in loss of time to fix them;
Most French accented characters are interchangeable between Unicode, ISO-8859-1, and the more recent ISO-8859-15 (which was a modification that allowed for such things as the Euro symbol). It's the other renditions that seem to cause the problem.
- improve orthography (and articles naming): French uses the famous <oe>
digraph, which is not encoded by Latin-1 (Latin-9 does); every editor must either type the HTML entity &oelig; or choose not to encode it, resulting in misspelled words (one of our bots, Orthogaffe, when it was used for orthographic purposes, had many "oeuvre -> œuvre" replacements to do);
I can use Alt+0156 to create the œ ligature, but I believe that this is in the unstable area of coding. It is not normally on the keyboard. Simply using "oe" without the digraph is not a spelling error, but a breach of typographical convention. Unfortunately, most books that I use for French-language reference employ the digraph but do not discuss the problem at all. The book by Léandre Bergeron, "Dictionnaire de la langue québécoise", uses the two-letter format in its listings, also without explanation. Alphabetical lists treat the ligature as if it were two letters, so it should be treated as optional, and the initiating author's choice should be respected. I would simply use the two-letter form and would object if it were changed. In article titles the author's choice should also be respected, but a redirect should be set up from the alternative.
- terminate transcodage problems: many editors do not use Windows
and its codepages; other do, but with Win-1252 or Unicode as default charset. When some text is pasted from an application not using strict latin-1 (but Win-1252, MacRoman, etc.) to some wiki editing area, it is badly transcoded by the Wiki-soft, resulting in many raw quotation marks and <oe> ligatures being replaced by question marks.
Yes, I would be glad to be rid of the annoying question marks, squares, and diamonds.
Cons:
- any text containing non ASCII characters would increase
its weight : instead of one byte for a single <c with cedilla>, it would require two; French uses lots of non ASCII characters, as
The diamonds that appeared in your original letter did not reproduce at all when I quoted the letter for this answer. The c-cedilla is part of ISO-8859-1 and I enter it with Alt+0231. The two-byte encoding should not be a worry; it is unavoidable for all Chinese characters.
Would it be possible, thus, to make utf-8 default charset
for the French Wikipedia?
I believe that it should be the standard for all the Wikis.
Ec
On Mon, 17 Nov 2003 10:44:25 -0800, Ray Saintonge saintonge@telus.net wrote:
I can use Alt+0156 to create the œ ligature, but I believe that this is in the unstable area of coding. It is not normally on the keyboard. Simply using "oe" without a digraph is not a spelling error, but a breach of typographical convention.
[snip]
The question is not clearly settled, and many French linguists and grammarians do not agree. If you read French, may I suggest these documents: http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/25-andre.pdf http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/22-yannis.pdf
The c-cedilla is a part of ISO-8859-1 and I enter it with Alt+0231. The two byte encoding should not be a worry. It is unavoidable for all chinese characters.
Yes; articles involving non-Latin-1 writing systems, linguistics texts, and others are such a burden to format; moreover, HTML entities make the file quite unreadable: you see what I mean...
Would it be possible, thus, to make utf-8 default charset
for the French Wikipedia?
I believe that it should be the standard for all the Wikis.
So do I. Thanks for your answer,
Vincent
Vincent Ramos wrote:
On Mon, 17 Nov 2003 10:44:25 -0800, Ray Saintonge saintonge@telus.net wrote:
I can use Alt+0156 to create the œ ligature, but I believe that this is in the unstable area of coding. It is not normally on the keyboard. Simply using "oe" without the digraph is not a spelling error, but a breach of typographical convention.
The question is not clearly settled, and many French linguists and grammarians do not agree. If you read French, may I suggest these documents: http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/25-andre.pdf http://www.gutenberg.eu.org/pub/GUTenberg/publicationsPDF/22-yannis.pdf
I've looked at them. I find the first to be terribly pedantic, and the second to really be concerned about much broader issues about digraphs in general.
Some of the French arguments are hilarious. Being outmanœuvred by the Icelanders, who wanted to ensure that the thorn was included? Considering the digraph a separate letter so that Georges Perec could avoid counting the e of the digraph as a real e in a novel that he wrote without using the letter e? WOW!!!
Assuming that my facts are not mistaken, the strongest argument against the digraph is that it does not even appear on a standard French keyboard.
Ec