Vincent Ramos wrote:
- improve interwiki links: any non-Latin-1 (ISO-8859-1 = default charset) link would be possible without any transformation. Many non-Latin-1 links are copied and pasted in raw format by interwiki wanderers but not checked afterwards; they are always miscoded, and time is lost fixing them;
Most French accented characters have the same character codes in Unicode, ISO-8859-1, and the more recent ISO-8859-15 (a revision that made room for such things as the euro sign). It's the other encodings that seem to cause the problem.
- improve orthography (and article naming): French uses the famous <oe> digraph, which is not encoded by Latin-1 (Latin-9 does encode it); every editor must either type the HTML entity &oelig; or leave it unencoded, resulting in misspelled words (one of our bots, Orthogaffe, when it was used for spelling corrections, had many "oeuvre -> œuvre" replacements to do);
I can use Alt+0156 to create the œ ligature, but I believe that this is in the unstable area of coding. It is not normally on the keyboard. Simply using "oe" without a digraph is not a spelling error, but a breach of typographical convention. Unfortunately, most of the books that I use for French language reference use the digraph but do not discuss the problem at all. The book by Léandre Bergeron, "Dictionnaire de la langue québécoise", uses the two-letter format in its listings, but also without explanation. Alphabetical lists treat the ligature as if it were two letters, so it should be treated as optional, and the initiating author's choice should be respected. I would simply use the two-letter form and would object if it were changed. In article titles the author's choice should also be respected, but a redirect should be set up from the alternative.
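To make the Alt+0156 point concrete, here is a small Python sketch. It is only an illustration using Python's standard codecs, not anything the wiki software itself runs, and the variable names are mine. It shows that byte 156 is the œ ligature only under Windows-1252; strict ISO-8859-1 treats the same byte as a control code, while ISO-8859-15 and UTF-8 each have their own representation.

```python
# Alt+0156 on Windows enters byte 156 (0x9C) of the Windows-1252 code page.
oe = bytes([156]).decode("cp1252")
print(oe, hex(ord(oe)))  # the ligature and its Unicode code point, U+0153

# Read as strict ISO-8859-1, the same byte is a C1 control character.
as_latin1 = bytes([156]).decode("iso-8859-1")

# ISO-8859-1 has no way to encode the ligature at all.
try:
    oe.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# ISO-8859-15 (Latin-9) and UTF-8 can both represent it.
latin9_bytes = oe.encode("iso-8859-15")  # one byte, 0xBD
utf8_bytes = oe.encode("utf-8")          # two bytes, 0xC5 0x93

print(in_latin1, latin9_bytes, utf8_bytes)
```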
- end transcoding problems: many editors do not use Windows and its codepages; others do, but with Win-1252 or Unicode as the default charset. When some text is pasted from an application not using strict Latin-1 (but Win-1252, MacRoman, etc.) into a wiki editing area, it is badly transcoded by the wiki software, resulting in many curly quotation marks and <oe> ligatures being replaced by question marks.
Yes, I would be glad to be rid of the annoying question marks, squares or diamonds.
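The question marks and squares can be reproduced in a few lines of Python. This is a sketch under an assumption: I do not know exactly how the wiki software transcodes pasted text, so the snippet only imitates the two likely failure modes with Python's standard codecs. The sample string and variable names are hypothetical.

```python
# Sample text with a curly apostrophe (U+2019) and the oe ligature (U+0153).
text = "l\u2019\u0153uvre"

# A Windows-1252 application stores both characters as single bytes.
cp1252_bytes = text.encode("cp1252")  # b'l\x92\x9cuvre'

# Failure mode 1: forcing the text into strict Latin-1 substitutes "?"
# for every character that Latin-1 cannot represent.
forced = text.encode("iso-8859-1", errors="replace")  # b'l??uvre'

# Failure mode 2: reading the Windows-1252 bytes as Latin-1 silently yields
# invisible C1 control characters, which browsers often draw as squares.
misread = cp1252_bytes.decode("iso-8859-1")

# With UTF-8 at both ends, the text survives a round trip unchanged.
round_trip = text.encode("utf-8").decode("utf-8")

print(forced, repr(misread), round_trip == text)
```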
Cons:
- any text containing non-ASCII characters would increase in size: instead of one byte for a single <c with cedilla>, it would require two; French uses lots of non-ASCII characters, as
The diamonds that appeared in your original letter did not reproduce at all when I quoted the letter for this answer. The c-cedilla is a part of ISO-8859-1 and I enter it with Alt+0231. The two-byte encoding should not be a worry. Multibyte encoding is unavoidable for all Chinese characters in any case.
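For a rough sense of the size cost, the following Python sketch compares byte counts. The French sentence and the Chinese character are samples I made up for illustration; they are not measurements of any real article.

```python
# Alt+0231 corresponds to character code 231 (0xE7), the c-cedilla.
c_cedilla = chr(231)
latin1_len = len(c_cedilla.encode("iso-8859-1"))  # one byte in Latin-1
utf8_len = len(c_cedilla.encode("utf-8"))         # two bytes in UTF-8

# In ordinary French prose only a few characters pay the extra byte.
sample = "Le gar\u00e7on a re\u00e7u une le\u00e7on de fran\u00e7ais."
sample_latin1 = len(sample.encode("iso-8859-1"))
sample_utf8 = len(sample.encode("utf-8"))
extra_bytes = sample_utf8 - sample_latin1  # one per accented character

# A Chinese character cannot be stored in Latin-1 at all,
# and takes three bytes in UTF-8.
zhong_utf8_len = len("\u4e2d".encode("utf-8"))

print(latin1_len, utf8_len, sample_latin1, sample_utf8, zhong_utf8_len)
```

The sample sentence grows by one byte for each of its four cedillas, which is a small fraction of its total length.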
Would it be possible, then, to make UTF-8 the default charset for the French Wikipedia?
I believe that it should be the standard for all the Wikis.
Ec