Re: [Wikitech-l] Internationalization issues


Re: [Wikitech-l] requested change...

lcrocker＠nupedia.com

22 Aug 2002

5:11 p.m.

...

I'm not sure I'm the right person to raise this question, but I wondered what the current thinking is on adapting the code for other character sets. If I recall correctly we are now assuming UTF-8, right? What exactly does that mean, btw? That we changed the MySQL character tables for those above 7F? Anything else?

The English Wikipedia, and the German one being tested now, are both ISO-8859-1, not UTF-8. UTF-8 will be needed for Polish and other languages. There won't be much software change involved; just telling MySQL to index the right way.
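To make the difference between the two encodings concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the language of the Wikipedia code). The sample characters and the helper name `encoded_bytes` are assumptions made for this example. It shows that 'ä' is a single byte above 7F in ISO-8859-1 but two bytes in UTF-8, and that the Polish 'ł' cannot be stored in ISO-8859-1 at all.

```python
# Illustration only: how the same characters are stored under the two
# encodings discussed above. `encoded_bytes` is a hypothetical helper.

def encoded_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

german = "ä"
polish = "ł"

# ISO-8859-1 stores 'ä' as one byte above 7F.
print(encoded_bytes(german, "iso-8859-1"))
# UTF-8 stores the same character as two bytes.
print(encoded_bytes(german, "utf-8"))

# The Polish 'ł' has no position in ISO-8859-1, so encoding fails.
try:
    polish.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)

# In UTF-8 it is representable, again as two bytes.
print(encoded_bytes(polish, "utf-8"))
```

Because the byte sequences differ, sorting and indexing rules written for single-byte ISO-8859-1 data do not carry over unchanged, which is presumably what "telling MySQL to index the right way" refers to.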

As for a special notation for accented characters, I'm not fond of the idea. Foreign users should have foreign keyboards. Others should still be able to enter accents by whatever means their OS and browser allow, and I'm not aware of any that don't have some feature for it. I don't like duplicating effort that should be already done elsewhere.


Jan Hidders

22 Aug

9:32 p.m.

New subject: Internationalization issues

On Thu, Aug 22, 2002 at 10:11:41AM -0700, lcrocker@nupedia.com wrote:

...

The English Wikipedia, and the German one being tested now, are both ISO-8859-1, not UTF-8. UTF-8 will be needed for Polish and other languages. There won't be much software change involved; just telling MySQL to index the right way.

That may in fact involve defining our own new character set for MySQL that defines the properties of the subset of UTF-8 that covers English, German and Polish. Or is each Wikipedia going to get its own MySQL server? Anyway, I'll start asking around whether something like that already exists somewhere.

...

As for a special notation for accented characters, I'm not fond of the idea. Foreign users should have foreign keyboards. Others should still be able to enter accents by whatever means their OS and browser allow, and I'm not aware of any that don't have some feature for it.

All I know at the moment is that the request has been made by a member of the German community. I don't know how many people asked for it, why they wanted it or how badly they need it, but I'll ask them. I'm a bit surprised that Magnus hasn't brought this up (I'm not German), but I have the impression he has been busy lately.

...

I don't like duplicating effort that should be already done elsewhere.

The question is not whether you would implement it, but only whether it would be OK to define some hooks so that they can implement it themselves if they want to, without changing any common code.

-- Jan Hidders

Kurt Jansson

23 Aug

5:43 a.m.

New subject: Internationalization issues

...

As for a special notation for accented characters, I'm not fond of the idea. Foreign users should have foreign keyboards.

Of course that's not the problem.

...

Others should still be able to enter accents by whatever means their OS and browser allow, and I'm not aware of any that don't have some feature for it.

I don't know which feature you mean. Some foreign contributors use HTML entities for umlauts, others type ae for ä, oe for ö and ue for ü. The first makes the EditBox look ugly, and links with entities in them don't work; the second has to be corrected by someone. At the moment it's not a big deal, just annoying. But I think it could easily be automated if entities were automatically turned into umlauts when the text is saved. An easier way of entering umlauts, like "o, would make foreign contributors even happier, but that should then be standardized across all Wikipedias that use umlauts, accents, etc.
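A minimal sketch of the "turn entities into umlauts on save" idea, written in Python for illustration only. The mapping `GERMAN_ENTITIES` and the function `entities_to_umlauts` are hypothetical names, not part of the Wikipedia software. The sketch deliberately converts only the named entities for German letters and leaves everything else, such as `&amp;` or `&lt;`, untouched, since decoding those would change the markup.

```python
import re

# Hypothetical, deliberately small mapping: only named entities for
# German letters. Other entities (e.g. &amp;, &lt;) are left alone.
GERMAN_ENTITIES = {
    "auml": "ä", "ouml": "ö", "uuml": "ü",
    "Auml": "Ä", "Ouml": "Ö", "Uuml": "Ü",
    "szlig": "ß",
}

# Matches named entities such as &ouml; and captures the name.
_ENTITY_RE = re.compile(r"&([A-Za-z]+);")

def entities_to_umlauts(text: str) -> str:
    """Replace known German entities; return unknown ones unchanged."""
    def replace(match: re.Match) -> str:
        return GERMAN_ENTITIES.get(match.group(1), match.group(0))
    return _ENTITY_RE.sub(replace, text)

print(entities_to_umlauts("K&ouml;ln &amp; D&uuml;sseldorf"))
```

Restricting the mapping keeps the conversion predictable: text that already contains literal umlauts passes through unchanged, so running it twice gives the same result as running it once.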

Could someone please configure this mailing list the same way as wikipedia-l, so that replies go to the list?

Kurt

Jason Richey

3:43 p.m.

New subject: requested change for wikitech-l behavior

I'm happy to make this change... Of course, I'd like to get the general consensus first. Anybody agree/disagree/not care?

Jason

Kurt Jansson wrote:

...

Could someone please configure this mailing list the same way as wikipedia-l, so that replies go to the list?

Kurt

-- "Jason C. Richey" jasonr@bomis.com

Jan Hidders

6:19 p.m.

New subject: requested change for wikitech-l behavior

On Fri, Aug 23, 2002 at 08:43:01AM -0700, Jason Richey wrote:

...

I'm happy to make this change... Of course, I'd like to get the general consensus first. Anybody agree/disagree/not care?

Agree.

-- Jan Hidders

Pierre Abbat

24 Aug

1:01 a.m.

New subject: requested change for wikitech-l behavior

On Friday 23 August 2002 11:43, Jason Richey wrote:

...

I'm happy to make this change... Of course, I'd like to get the general consensus first. Anybody agree/disagree/not care?

I disagree. To reply to the list when reply-to is not set to the list, one does this:

1. Hit the reply button with two arrows.
2. Highlight the sender's name.
3. Hit the delete button.

It's a bit more complicated if the list address is in the Cc: header, but still involves no typing of addresses. To reply to the sender when reply-to is set to the list, one does this:

1. Highlight the address.
2. Find the sender's address. (It's not in the current window.)
3. Type it. (Attempting to highlight it has a high probability of bringing up a blank message window addressed to him.)

If the list address is set on the mail folder, one need only type L to reply to the list. I don't have it set that way because I have both wikitech-l and wikipedia-l in the same folder.

phma

Jan Hidders

11:14 a.m.

New subject: requested change for wikitech-l behavior

On Fri, Aug 23, 2002 at 09:01:18PM -0400, Pierre Abbat wrote:

...

On Friday 23 August 2002 11:43, Jason Richey wrote:

...
I'm happy to make this change... Of course, I'd like to get the general consensus first. Anybody agree/disagree/not care?

I disagree. To reply to the list when reply-to is not set to the list, one does this:

1. Hit the reply button with two arrows.
2. Highlight the sender's name.
3. Hit the delete button.

It's a bit more complicated if the list address is in the Cc: header, but still involves no typing of addresses. To reply to the sender when reply-to is set to the list, one does this:

1. Highlight the address.
2. Find the sender's address. (It's not in the current window.)
3. Type it. (Attempting to highlight it has a high probability of bringing up a blank message window addressed to him.)

If the list address is set on the mail folder, one need only type L to reply to the list. I don't have it set that way because I have both wikitech-l and wikipedia-l in the same folder.

It's not a matter of how easy it is (on my good old mutt it's even easier than on your Kmail) but how intuitive. I've already received several private mails that were meant for the mailing list, and I have yet to see a message on the mailing list that was intended to be private. That tells me that the list isn't properly configured.

-- Jan Hidders

Kurt Jansson

11:59 a.m.

New subject: requested change for wikitech-l behavior

...

It's not a matter of how easy it is (on my good old mutt it's even easier than on your Kmail) but how intuitive. I've already received several private mails that were meant for the mailing list

Thank god I'm not the only one who hits the same button in all Wikipedia mailing lists without much thinking about it. :-) I know that OE users enjoy the questionable reputation of being very resistant to learning. But I'll switch to Linux soon, and then everything will get better. :-)

Kurt

Jaap van Ganswijk

12:42 p.m.

New subject: Internationalization issues

At 2002-08-23 07:43 +0200, Kurt Jansson wrote:

...

...
As for a special notation for accented characters, I'm not fond of the idea. Foreign users should have foreign keyboards.

Of course that's not the problem.

...
Others should still be able to enter accents by whatever means their OS and browser allow, and I'm not aware of any that don't have some feature for it.

I don't know which feature you mean. Some foreign contributors use HTML entities for umlauts, others type ae for ä, oe for ö and ue for ü. The first makes the EditBox look ugly, and links with entities in them don't work; the second has to be corrected by someone. At the moment it's not a big deal, just annoying. But I think it could easily be automated if entities were automatically turned into umlauts when the text is saved. An easier way of entering umlauts, like "o, would make foreign contributors even happier, but that should then be standardized across all Wikipedias that use umlauts, accents, etc.

As far as I can see, the best way to use accented letters in this (HTML rendering) environment is to use '&auml;' etc., since every visitor can make sure his browser renders it correctly. NN and IE have been able to do so since version 4 or even earlier.

Every other method will depend on whatever font the editor of an article is using and that may not be the same as what the next editor is using or what the Wikipedia web-server is saying to the visitor it is serving.

Perhaps the Wikipedia software could try to translate ae and such to the appropriate HTML abbreviations like &auml;, but that would be risky, because it would have to know in what language each word was written. In Dutch we have 'oe' as a valid combination which is not equal to 'ö', so if Dutch and German were mixed in an article it would cause problems.

Please also consider that font problems may seem to be solved for now, but how local is that solution? And how new and MS-based does the system have to be, and how long will it last? Perhaps someone will invent a much better system in ten years' time. Will all texts become worthless then? At least with the '&auml;' system it's all in ASCII and therefore human-readable and even understandable with a little effort.

Also consider that a lot of PDAs don't use, or only offer, a few standard fonts.

...

Could someone please configure this mailing list the same way as wikipedia-l, so that replies go to the list?

I agree. (Add the 'Reply-To:' header, don't change the 'From:' header.)
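What "add the 'Reply-To:' header, don't change the 'From:' header" amounts to can be sketched with Python's standard `email` package. This is only an illustration of the header change, not the list software's actual configuration, which is not shown in this thread. The addresses and the function name `add_list_reply_to` are placeholders.

```python
from email.message import EmailMessage

# Placeholder address for illustration; not the real list address.
LIST_ADDRESS = "wikitech-l@example.org"

def add_list_reply_to(msg: EmailMessage, list_address: str) -> EmailMessage:
    """Point replies at the list while leaving From: untouched."""
    # Drop any existing Reply-To first so the header is not duplicated
    # (deleting a header that is absent does nothing).
    del msg["Reply-To"]
    msg["Reply-To"] = list_address
    return msg

msg = EmailMessage()
msg["From"] = "someone@example.org"
msg["To"] = LIST_ADDRESS
msg["Subject"] = "Internationalization issues"
msg.set_content("Hello list")

add_list_reply_to(msg, LIST_ADDRESS)
print(msg["From"])      # still the original sender
print(msg["Reply-To"])  # now the list address
```

With this arrangement a plain "reply" goes to the list, while the sender's own address remains visible in From: for anyone who wants to answer privately.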

Greetings, Jaap

Kurt Jansson

1:35 p.m.

New subject: Internationalization issues

...

As far as I can see, the best way to use accented letters in this (HTML rendering) environment is to use '&auml;' etc., since every visitor can make sure his browser renders it correctly. NN and IE have been able to do so since version 4 or even earlier.

You mean every umlaut should be shown in the EditBox as its entity? That would make most German articles very hard to read. It should be possible to enter entities, but they should be shown as umlauts to the 'normal' user. If foreign contributors don't see umlauts but strange characters in their EditBox, maybe they could have an option in their preferences so that they are shown as entities. But I haven't heard of this problem.

...

Perhaps the Wikipedia software could try to translate ae and such to the appropriate HTML abbreviations like &auml;, but that would be risky, because it would have to know in what language each word was written. In Dutch we have 'oe' as a valid combination which is not equal to 'ö', so if Dutch and German were mixed in an article it would cause problems.

I would never suggest this, because 'oe' is also a valid combination in German (e.g. the musical instrument "Oboe", or a city I lived in, "Itzehoe"). Same with "ae" and "ue".
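The risk is easy to demonstrate. The following Python sketch (illustration only; `naive_umlauts` is a hypothetical function, and "Koeln" is an extra example word not taken from the thread) blindly replaces the digraphs and shows how the two words mentioned above get mangled, even though the intended case works.

```python
def naive_umlauts(text: str) -> str:
    """Blindly replace ae/oe/ue with umlauts, ignoring the word itself."""
    for digraph, umlaut in (("ae", "ä"), ("oe", "ö"), ("ue", "ü")):
        text = text.replace(digraph, umlaut)
    return text

# "Koeln" is the kind of input the replacement is meant for;
# "Oboe" and "Itzehoe" are legitimate spellings that it destroys.
for word in ("Koeln", "Oboe", "Itzehoe"):
    print(word, "->", naive_umlauts(word))
```

Without a per-word dictionary or knowledge of the language of each word, the software has no way to tell the two cases apart.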

Kurt

Jan Hidders

3:48 p.m.

New subject: Internationalization issues

At 2002-08-23 07:43 +0200, Kurt Jansson wrote:

...

[...] Some foreign contributors use HTML entities for umlauts, others type ae for ä, oe for ö and ue for ü. The first makes the EditBox look ugly, and links with entities in them don't work; the second has to be corrected by someone. At the moment it's not a big deal, just annoying. But I think it could easily be automated if entities were automatically turned into umlauts when the text is saved. An easier way of entering umlauts, like "o, would make foreign contributors even happier, but that should then be standardized across all Wikipedias that use umlauts, accents, etc.

Well, in some sense it is easy, and in some sense it isn't. It would mean introducing, for the first time, a difference in functionality depending on the language of the Wikipedia concerned, because, for example, translating entities is not the desired behavior on the English Wikipedia at the moment.

In itself this is not hard to implement: define certain functions that are called in the common code and defined in the language-specific part of the code. But 1. deciding which functions we define and what they should do requires some deep architectural thinking, and 2. it would add an extra layer of complexity that makes it a bit harder to determine, for example, the cause of a certain bug (it could be language-specific).
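The hook arrangement described here might look roughly like the following Python sketch. It is only an illustration of the pattern, not the Wikipedia code (which is PHP). All names (`LANGUAGE_HOOKS`, `register_save_hook`, `save_article`, `german_save_hook`) are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical names throughout; this only illustrates the pattern of
# common code calling functions defined per language.
SaveHook = Callable[[str], str]
LANGUAGE_HOOKS: Dict[str, SaveHook] = {}

def register_save_hook(language: str, hook: SaveHook) -> None:
    """Language-specific code registers its own behavior here."""
    LANGUAGE_HOOKS[language] = hook

def save_article(language: str, text: str) -> str:
    """Common code: apply the language's hook if any, else do nothing."""
    hook = LANGUAGE_HOOKS.get(language, lambda unchanged: unchanged)
    return hook(text)

# Language-specific part for German: a single entity as a stand-in.
def german_save_hook(text: str) -> str:
    return text.replace("&ouml;", "ö")

register_save_hook("de", german_save_hook)

print(save_article("de", "K&ouml;ln"))  # German hook applied
print(save_article("en", "K&ouml;ln"))  # no hook registered, unchanged
```

The sketch also shows the cost mentioned in point 2: the same call to `save_article` now behaves differently per language, so a bug report has to say which Wikipedia it came from.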

So Lee is probably right to want a good justification for such an addition. If you say that at the moment it is not a big deal (I assume that only a very small minority of the contributors to the German Wikipedia work on a non-German keyboard, and I only saw a few instances where people had used "&ouml;"), then it is probably best to wait until it does become a big deal.

-- Jan Hidders

Kurt Jansson

5:08 p.m.

New subject: Internationalization issues

...

So Lee is probably right if he wants a good justification for such an addition.

Okay, it seems I was a bit naive about how easy this would be to implement.

Would it be easier to make a script that converts every &szlig; to ß, &ouml; to ö, &uuml; to ü and &auml; to ä, and the same for &Auml;, &Ouml; and &Uuml;, in the German database? I could then call it by hand if you tell me how.

If this also is too complicated or dangerous - forget about it :-)

Kurt

Jan Hidders

26 Aug

8:10 a.m.

New subject: Internationalization issues

On Sat, Aug 24, 2002 at 07:08:16PM +0200, Kurt Jansson wrote:

...

Would it be easier to make a script that converts every &szlig; to ß, &ouml; to ö, &uuml; to ü and &auml; to ä, and the same for &Auml;, &Ouml; and &Uuml;, in the German database? I could then call it by hand if you tell me how.

I assume you want to do this on a regular basis? In that case the script has to behave as a normal user, i.e., the change should look like any other regular minor update. I suppose I could write a script in PHP that you could run and that simply interacts with the site as a normal user. It would do a search (simply by getting the corresponding URL) for "ouml", "uuml" et cetera, and do a minor edit that changes them. But note that this is also pretty easily done by hand.

-- Jan Hidders

Ray Saintonge

3:13 p.m.

New subject: Internationalization issues

Jaap van Ganswijk wrote:

...

Perhaps the Wikipedia software could try to translate ae and such to the appropriate HTML abbreviations like &auml;, but that would be risky, because it would have to know in what language each word was written. In Dutch we have 'oe' as a valid combination which is not equal to 'ö', so if Dutch and German were mixed in an article it would cause problems.

To do this the system also needs to distinguish between an umlaut and a diaeresis. The famous painter Raphael is often spelled with a diaeresis as "Raphaël". It wouldn't do to have him automagically turned into "Raphäl". The important thing to me is not in having the machines put in umlauts or other accents, but having the search engine regard spellings with or without accents as equivalent. This would be a great help for the searcher who doesn't know if a word has an accent or exactly what accent it has. No non-French-speaking person should be required to know about how some verbs change their accent patterns, or the subtleties about an acute or grave accent on a final "e" in Catalan. The uniquely German treatment of umlauted vowels can then probably be treated with redirects.
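One common way to make a search treat accented and unaccented spellings as equivalent is to compare a folded form of both strings. The sketch below uses Python's standard `unicodedata` module; it is an illustration under that assumption, not a description of how the Wikipedia search works, and `fold_accents` and `matches` are hypothetical names.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Return a lowercase, accent-free form of `text` for comparison."""
    # NFD splits 'ë' into 'e' plus a combining diaeresis mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold()

def matches(query: str, title: str) -> bool:
    """True if query and title are equal once accents are ignored."""
    return fold_accents(query) == fold_accents(title)

print(matches("Raphael", "Raphaël"))
print(matches("Tur", "Tür"))
# The German ae/oe/ue convention is NOT covered by accent folding:
print(matches("Tuer", "Tür"))
```

The last comparison is why the German convention is better left to redirects: folding removes marks but does not know that "ue" sometimes stands for "ü". Letters that have no decomposition, such as the Polish "ł", also pass through unchanged and would need separate handling.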

Someone also made an anglo-centric comment about foreign users using foreign keyboards. What does this mean if I write in more than one language? Maybe I should connect a separate keyboard for each language. The computer should be smart enough to know that it does things differently when I'm on my Russian or Turkish or Devanagari keyboard. [;-)]

Eclecticology

PS: At first I thought that I was replying to the list, but apparently it only went to Jaap. I suppose I should use "reply all" when answering to the list. -Ec

Pierre Abbat

6:08 p.m.

New subject: Internationalization issues

On Monday 26 August 2002 11:13, Ray Saintonge wrote:

...

To do this the system also needs to distinguish between an umlaut and a diaeresis. The famous painter Raphael is often spelled with a diaeresis as "Raphaël". It wouldn't do to have him automagically turned into "Raphäl". The important thing to me is not in having the machines put in umlauts or other accents, but having the search engine regard spellings with or without accents as equivalent. This would be a great help for the searcher who doesn't know if a word has an accent or exactly what accent it has. No non-French-speaking person should be required to know about how some verbs change their accent patterns, or the subtleties about an acute or grave accent on a final "e" in Catalan. The uniquely German treatment of umlauted vowels can then probably be treated with redirects.

There are also a few German words in which "ue" is *not* equivalent to "ü": "Tuer" means "doer", the actor form of "tun", and is different from "Tür" meaning "door", and "Guericke", the Magdeburger who made the vacuum pump, is written sometimes "Gericke" but never "Güricke".

phma

Brion VIBBER

7:56 p.m.

New subject: Internationalization issues

Ray Saintonge wrote:

...

Someone also made an anglo-centric comment about foreign users using foreign keyboards. What does this mean if I write in more than one language? Maybe I should connect a separate keyboard for each language. The computer should be smart enough to know that it does things differently when I'm on my Russian or Turkish or Devanagari keyboard. [;-)]

That's why modern computers have software-switchable keyboard maps...

blá blà blä bła bła bła бла бла бла μβλα μβλα μβλα ブラー　ブラー　ブラー

-- brion vibber (brion @ pobox.com)


wikitech-l@lists.wikimedia.org
