Hi,
I am from Malayalam Wikipedia (ml.wikipedia - user:Praveenp), and my language is Malayalam. Please consider one big problem of ours.
After the release of Unicode 5.1.0, there are two kinds of encoding for some characters of the Malayalam alphabet (because of backward compatibility). This causes serious problems in linking, searching, etc. in the MediaWiki software. Currently Windows 7 is the only operating system which supports Unicode 5.1.0 (as far as I know), but a lot of third-party tools for writing and reading Malayalam support the new version. And now a large quantity of data in Wikimedia projects is in the new version. It is not possible to link to or search for titles encoded in pre-Unicode-5.1.0 form from Unicode 5.1.0 or vice versa. Currently one of our namespaces, വർഗ്ഗം (Category), also has one such character, so it is possible to write വര്‍ഗ്ഗം (the 5.0 sequence) as വർഗ്ഗം (the 5.1 form), which renders the same as the first but differs in encoding. It causes problems in categorization also.
Is it possible to put some Unicode equivalence (http://en.wikipedia.org/wiki/Unicode_equivalence) in the MediaWiki software? We need urgent help.
Please also check http://unicode.org/versions/Unicode5.1.0/#Malayalam_Chillu_Characters
   Visual          Representation in 5.0 and prior   Preferred 5.1 representation
1  CHILLU_NN.png   0D23, 0D4D, 200D                  0D7A
2  CHILLU_N.png    0D28, 0D4D, 200D                  0D7B
3  CHILLU_RR.png   0D30, 0D4D, 200D                  0D7C
4  CHILLU_L.png    0D32, 0D4D, 200D                  0D7D
5  CHILLU_LL.png   0D33, 0D4D, 200D                  0D7E
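To make the collision concrete, here is a small editor's sketch (not part of the original mail) showing that the 5.0 sequence and the 5.1 atomic code point for chillu-nn render as the same glyph but compare as different strings, and that even standard Unicode normalization does not unify them:

```python
import unicodedata

# Two encodings of Malayalam chillu-nn that render identically.
old = "\u0d23\u0d4d\u200d"   # NNA + VIRAMA + ZWJ (Unicode 5.0 style)
new = "\u0d7a"               # atomic CHILLU NN (Unicode 5.1)

print(old == new)            # False: byte-wise they differ
print(len(old), len(new))    # 3 1

# NFC/NFD do not unify them either, because the chillu equivalence
# is not a canonical decomposition in the Unicode character database.
print(unicodedata.normalize("NFC", old) == unicodedata.normalize("NFC", new))  # False
```

This is exactly why exact-match title lookup, linking, and categorization break: the wiki compares the raw code point sequences.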
Thanks,
Praveen Prakash wrote:
Currently Windows 7 is the only operating system which supports Unicode 5.1.0 (as far as I know), but a lot of third-party tools for writing and reading Malayalam support the new version.
So do you want everything to be converted to the Unicode 5.0 version, including page titles, namespaces and article content, and for Unicode 5.1 characters sent by browsers during editing to be converted to Unicode 5.0 before storage? We can probably set that up.
-- Tim Starling
Is it possible to implement some method to tell the server that both characters are the same? I have heard that more changes are coming in future versions of Unicode. And now almost half of the incoming data is in the Unicode 5.1 version. I am not sure about reverse converting. Tim Starling wrote:
Praveen Prakash wrote:
Currently Windows 7 is the only operating system which supports Unicode 5.1.0 (as far as I know), but a lot of third-party tools for writing and reading Malayalam support the new version.
So do you want everything to be converted to the Unicode 5.0 version, including page titles, namespaces and article content, and for Unicode 5.1 characters sent by browsers during editing to be converted to Unicode 5.0 before storage? We can probably set that up.
-- Tim Starling
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Praveen Prakash wrote:
Is it possible to implement some method to tell the server that both characters are the same?
No.
I have heard that more changes are coming in future versions of Unicode. And now almost half of the incoming data is in the Unicode 5.1 version. I am not sure about reverse converting.
That link you gave in your last post had a conversion table, it looks pretty straightforward:
CHILLU NN -> NNA, VIRAMA, ZWJ
CHILLU N  -> NA, VIRAMA, ZWJ
CHILLU RR -> RA, VIRAMA, ZWJ
CHILLU L  -> LA, VIRAMA, ZWJ
CHILLU LL -> LLA, VIRAMA, ZWJ
The other new characters would remain unconverted.
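As an editor's sketch (the function name is hypothetical; the mappings are the ones from the table Tim quotes), the 5.1-to-5.0 conversion is a simple one-to-one string translation:

```python
# Map the five atomic Unicode 5.1 chillu code points back to their
# Unicode 5.0 sequences (consonant + VIRAMA + ZWJ).
VIRAMA, ZWJ = "\u0d4d", "\u200d"
CHILLU_TO_50 = {
    0x0D7A: "\u0d23" + VIRAMA + ZWJ,  # CHILLU NN -> NNA, VIRAMA, ZWJ
    0x0D7B: "\u0d28" + VIRAMA + ZWJ,  # CHILLU N  -> NA, VIRAMA, ZWJ
    0x0D7C: "\u0d30" + VIRAMA + ZWJ,  # CHILLU RR -> RA, VIRAMA, ZWJ
    0x0D7D: "\u0d32" + VIRAMA + ZWJ,  # CHILLU L  -> LA, VIRAMA, ZWJ
    0x0D7E: "\u0d33" + VIRAMA + ZWJ,  # CHILLU LL -> LLA, VIRAMA, ZWJ
}

def to_unicode50(text: str) -> str:
    """Downgrade atomic chillus to their Unicode 5.0 sequences."""
    return text.translate(CHILLU_TO_50)

print(to_unicode50("\u0d7c"))  # RA + VIRAMA + ZWJ, i.e. "\u0d30\u0d4d\u200d"
```

Text without chillus passes through unchanged, which is what makes a blanket conversion pass over existing data feasible.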
Gerard Meijssen wrote:
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
The big thing of MediaWiki was that it supported Unicode when this was still a new thing to do. We should support the latest and the best Unicode support.
You did read the post, didn't you? Forcing everyone to buy Windows 7 is not generally the way we do things. Unless the client situation is not as bad as it sounds, we will need a transition period where we support older clients until their market share falls far lower than 50%, which is where, by Praveen's figures, it is now.
-- Tim Starling
On Tue, Dec 1, 2009 at 7:55 AM, Tim Starling tstarling@wikimedia.org wrote:
Praveen Prakash wrote:
Is it possible to implement some method to tell the server that both characters are the same?
No.
I have heard that more changes are coming in future versions of Unicode. And now almost half of the incoming data is in the Unicode 5.1 version. I am not sure about reverse converting.
That link you gave in your last post had a conversion table, it looks pretty straightforward:
CHILLU NN -> NNA, VIRAMA, ZWJ
CHILLU N  -> NA, VIRAMA, ZWJ
CHILLU RR -> RA, VIRAMA, ZWJ
CHILLU L  -> LA, VIRAMA, ZWJ
CHILLU LL -> LLA, VIRAMA, ZWJ
The other new characters would remain unconverted.
Gerard Meijssen wrote:
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
The big thing of MediaWiki was that it supported Unicode when this was still a new thing to do. We should support the latest and the best Unicode support.
You did read the post didn't you? Forcing everyone to buy Windows 7 is not generally the way we do things. Unless the client situation is not as bad as it sounds, we will need to have a transition period where we support older clients until their market share falls far lower than 50%, which is where, by Praveen's figures, it is now.
-- Tim Starling
The letter chillu-k (ൿ) was undefined prior to Unicode 5.1. As far as I know, the letters nta (ന്റ) and tta (റ്റ) need explicit OS support (?) to display, which is not yet available (*Windows 7*??). I am sorry, the exclusion of these letters was not intentional. I included only the chillaksharams because they are causing most of the problems. Frankly, we didn't face any problem caused by the other letters, but it is possible.
The popular transliteration tool for Malayalam typing (*Varamozhi*) and the popular font (*Anjali OldLipi*) currently support Unicode 5.1 on Windows. Recently (two or three days ago) Microsoft announced their own tool for Malayalam typing, which also supports 5.1. Microsoft's default Karthika font for Malayalam also now supports 5.1. But IE6 does not support Unicode 5.1 even with supporting fonts.
In the case of Linux, there are third-party input definitions for SCIM (Mozhi and Inscript), and the default fonts altered for Unicode 5.1 (*Rachana*, *Meera* etc.) are the widely used ones. No Linux OS gives default support for Unicode 5.1 in Malayalam, but it can be fixed with a Firefox extension, altered tools and fonts, etc. Sometimes that needs technical knowledge.
In Mac OS there is no proper support for Malayalam by default. But some Malayalis fixed it, and their fix is in Unicode 5.1.
So the number of people using Unicode 5.1 is increasing day by day.
But there are a lot of people who believe that the new chillaksharam definitions are grammatically incorrect and who are not ready to switch. Personally, I am not with them. Do I need to invite some people from both sides here? I am afraid none of them are ready for an agreement. :-( We have been discussing this problem since before the release of Unicode 5.1 (more than 2 years), and it is still not solved.
Unicode once implemented these atomic chillaksharams (I think in 2004) and then removed them in the next version for further discussion. This is the second inclusion of those characters. It doesn't look like those characters will be compromised in future versions.
This is the current state of Malayalam computing. So I thought it would be better to put equivalence in place than to switch to a particular version. If that is not possible, switching as Tim described is appreciated. I prefer 5.1, which is the future.
2009/12/1 Praveen Prakash me.praveen@gmail.com
The popular transliteration tool for Malayalam typing (*Varamozhi*) and the popular font (*Anjali OldLipi*) currently support Unicode 5.1 on Windows. Recently (two or three days ago) Microsoft announced their own tool for Malayalam typing, which also supports 5.1. Microsoft's default Karthika font for Malayalam also now supports 5.1. But IE6 does not support Unicode 5.1 even with supporting fonts.
Is dynamic reverse conversion client-side using JavaScript possible? This way we could output Unicode 5.1 to everything supporting it, and older or broken browsers and OSes could still display correctly. Sure, it adds a JS dependency, but I do think we can require JS for that.
Marco
On Tue, Dec 1, 2009 at 9:30 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Is dynamic reverse conversion at clientside using javascript possible?
I can't see any justification for requiring JavaScript here. We should be able to do it server-side if it needs to be done at all.
Praveen Prakash wrote:
This is the current state of Malayalam computing. So I thought it would be better to put equivalence in place than to switch to a particular version. If that is not possible, switching as Tim described is appreciated. I prefer 5.1, which is the future.
The only way we can implement equivalence is by converting to a canonical form; this is how it is done in every part of MediaWiki. The ability to treat strings as binary, and to compare them byte-by-byte, is essential to the performance of the system.
It may be possible to convert to the Unicode 5.0 form when we generate the edit page for certain browsers, and to convert back to 5.1 when they save the page. But that would be more complicated to develop than to just convert to Unicode 5.1 all the time.
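The round trip Tim describes can be sketched as a pair of inverse translations (an editor's illustration with hypothetical helper names, using the chillu/sequence pairs from the table earlier in the thread):

```python
VIRAMA, ZWJ = "\u0d4d", "\u200d"
PAIRS = [  # (atomic 5.1 chillu, equivalent 5.0 sequence)
    ("\u0d7a", "\u0d23" + VIRAMA + ZWJ),  # CHILLU NN
    ("\u0d7b", "\u0d28" + VIRAMA + ZWJ),  # CHILLU N
    ("\u0d7c", "\u0d30" + VIRAMA + ZWJ),  # CHILLU RR
    ("\u0d7d", "\u0d32" + VIRAMA + ZWJ),  # CHILLU L
    ("\u0d7e", "\u0d33" + VIRAMA + ZWJ),  # CHILLU LL
]

def for_edit_box(text):
    # Server -> old client: downgrade atomic chillus to 5.0 sequences.
    for new, old in PAIRS:
        text = text.replace(new, old)
    return text

def on_save(text):
    # Old client -> server: normalize 5.0 sequences back to 5.1.
    for new, old in PAIRS:
        text = text.replace(old, new)
    return text

stored = "\u0d2e\u0d7b"                         # stored text with an atomic chillu
assert on_save(for_edit_box(stored)) == stored  # the round trip is lossless
```

Because `on_save` also upgrades any 5.0 sequence the client typed itself, the stored text stays uniformly in the 5.1 form.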
If you say Unicode 5.1 is the best solution for your community, then I'm willing to take your word for that.
-- Tim Starling
Tim Starling wrote:
The only way we can implement equivalence is by converting to a canonical form; this is how it is done in every part of MediaWiki. The ability to treat strings as binary, and to compare them byte-by-byte, is essential to the performance of the system.
It may be possible to convert to the Unicode 5.0 form when we generate the edit page for certain browsers, and to convert back to 5.1 when they save the page. But that would be more complicated to develop than to just convert to Unicode 5.1 all the time.
If possible, that would be very useful. I'm afraid IE6 still has its share.
If you say Unicode 5.1 is the best solution for your community, then I'm willing to take your word for that.
http://ml.wikipedia.org/w/index.php?title=???????????:???????????_(?????????...
Here in this link, most people proposed changing to 5.1.
We are currently using Unicode 5.1 redirects to Unicode 5.0-titled articles in some cases. After converting, both of these titles become the same. Is that a problem?
Tim Starling wrote:
If you say Unicode 5.1 is the best solution for your community, then I'm willing to take your word for that.
Hi,
voting on this subject
http://ml.wikipedia.org/wiki/WP:Panchayath_(Technical)/Unicode_5.1.0
Praveen
On Wed, Dec 23, 2009 at 7:40 AM, Praveen Prakash me.praveen@gmail.com wrote:
Tim Starling wrote:
If you say Unicode 5.1 is the best solution for your community, then I'm willing to take your word for that.
Hi,
voting on this subject
http://ml.wikipedia.org/wiki/WP:Panchayath_(Technical)/Unicode_5.1.0
Praveen
IE 6 fix is simple. http://ml.wikipedia.org/wiki/Help:To_Read_in_Malayalam#For_Windows.2FIE_6_us...
Hi,
If you no longer have this thread, the background is that the Malayalam projects wanted to use, and are using, Unicode 5.1 for five characters that have composed code points in 5.1 and decomposed sequences in 5.0. The equivalences are:
CHILLU NN  0D23, 0D4D, 200D  0D7A
CHILLU N   0D28, 0D4D, 200D  0D7B
CHILLU RR  0D30, 0D4D, 200D  0D7C
CHILLU L   0D32, 0D4D, 200D  0D7D
CHILLU LL  0D33, 0D4D, 200D  0D7E
Somewhere in the server code, these are "normalized" to 5.1 for the ml projects. Problem:
http://ml.wiktionary.org/w/index.php?title=%E0%B4%95%E0%B5%81%E0%B4%B1%E0%B5...
What you see happening is Interwicket trying to create the language links. It adds the correct link(s), to the 5.0 forms on the other wikts; then on the next scan of the language-links tables it removes the links as invalid, since the 5.1 titles don't exist on the other wikts. This then repeats. (;-)
The problem is that it can't write the correct link, as the text normalization "fixes" it.
The other direction isn't a problem: the links are to the 5.0 forms, and when followed they are normalized to 5.1 in the title lookup, and the page is found.
I'm not (yet) suggesting a particular solution; there are several possibilities (from fairly decent to grotesque hackery ...). But would someone tell me where in the server code this is done? I have not been able to find it. Then I can understand it a bit better, and possibly just fix it in the bot code somehow, or suggest a fix server-side.
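The loop Robert describes can be sketched in miniature (an editor's illustration; `normalize_title` is a stand-in for MediaWiki's actual server-side normalization for ml projects, reduced here to the chillu-n pair):

```python
def normalize_title(title):
    # Stand-in for the ml server-side normalization: 5.0 -> 5.1.
    return title.replace("\u0d28\u0d4d\u200d", "\u0d7b")

# The other wiki's page exists only under the 5.0 spelling:
other_wiki_titles = {"\u0d2e\u0d28\u0d4d\u200d"}

# The bot writes a link to that 5.0 title, but the ml server
# normalizes the link text to 5.1 before storing it...
saved_link = normalize_title("\u0d2e\u0d28\u0d4d\u200d")

# ...so on the next scan the stored (5.1) title is not found on the
# remote wiki, the link is removed as invalid, and the cycle repeats.
assert saved_link not in other_wiki_titles
```

The bot can never store a link that survives, because normalization rewrites the very bytes it needs to preserve.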
Best Regards, Robert
We should probably normalise to 5.1 on all wikis. I can view the 5.0 characters but not the 5.1 ones, though.
But would someone tell me where in the server code this is done? I have not been able to find it. Then I can understand a bit better, possibly just fix it in the bot code somehow, or suggest a fix server-side.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/languages/classes/Lan...
I've looked at this a bit more. There are more serious problems.
Apparently, no one converted the 5.0 titles in the wiki to 5.1 when "normalization" was turned on; there are pages that can't be accessed(!). For example, try this (Malayalam for "fish"):
http://ml.wiktionary.org/wiki/%E0%B4%AE%E0%B5%80%E0%B4%A8%E0%B5%8D%E2%80%8D
That gets normalized to 5.1, which is a redirect to the 5.0 form (in this case), which is normalized back to 5.1. (There is a variation in the 5.0 form too, which complicates it.) The content page exists (I can see it in the XML dump), but it can't be accessed because there is no way of referring to it.
Was it necessary to force the normalization to 5.1? I would think just using the 5.1 forms by convention would be, or would have been, entirely adequate? Maybe with a bit of bot conversion? (Moving 5.0 to 5.1 leaving redirects, converting text while leaving iwiki links alone.)
The present state apparently can't be bot-fixed, as (some) content pages can't be read.
As it is, it is impossible to write valid iwiki language links to 5.0 forms on other wikis. One could create 5.1 redirects on the other wikis and link to them, but that doesn't help cases like the above, where one can't even access the content page. There are apparently 998 pages (as of the last XML dump, 3 April) in this state.
Mind you, I'm not sure I have all the details right yet, and I'd like to read through a current dump, now that they are running again.
Robert
On Sat, May 22, 2010 at 2:57 PM, Platonides Platonides@gmail.com wrote:
We should probably normalise to 5.1 on all wikis. I can view the 5.0 characters but not the 5.1 ones, though.
But would someone tell me where in the server code this is done? I have not been able to find it. Then I can understand a bit better, possibly just fix it in the bot code somehow, or suggest a fix server-side.
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/languages/classes/Lan...
Robert Ullmann wrote:
I've looked at this a bit more. There are more serious problems.
Apparently, no-one converted the 5.0 titles in the wiki to 5.1 when "normalization" was turned on; there are pages that can't be accessed. (!) for example, try this (Malayalam for "fish"):
http://ml.wiktionary.org/wiki/%E0%B4%AE%E0%B5%80%E0%B4%A8%E0%B5%8D%E2%80%8D
that gets normalized to 5.1, which is a redirect to the 5.0 form (in this case) which is normalized back to 5.1. (there is a variation in the 5.0 form too that complicates it) The content page exists (I can see it in the XML dump), but can't be accessed because there is no way of referring to it.
Request a run of cleanupTitles.php on mlwiki on Bugzilla.
In December, Praveen Prakash wrote "We are currently using Unicode 5.1 Redirect to unicode 5.0 titled articles in some cases. After converting both these titles become same. Is that a problem?"
Yes, it is ...
Will cleanupTitles resolve collisions with redirects?
Robert
On Sat, May 29, 2010 at 5:26 PM, Platonides Platonides@gmail.com wrote:
Robert Ullmann wrote:
I've looked at this a bit more. There are more serious problems.
Apparently, no-one converted the 5.0 titles in the wiki to 5.1 when "normalization" was turned on; there are pages that can't be accessed. (!) for example, try this (Malayalam for "fish"):
http://ml.wiktionary.org/wiki/%E0%B4%AE%E0%B5%80%E0%B4%A8%E0%B5%8D%E2%80%8D
that gets normalized to 5.1, which is a redirect to the 5.0 form (in this case) which is normalized back to 5.1. (there is a variation in the 5.0 form too that complicates it) The content page exists (I can see it in the XML dump), but can't be accessed because there is no way of referring to it.
Request on bugzilla a run of cleanupTitles.php on mlwiki.
To answer my own question: no, it won't. It will move all the 5.0 pages that have 5.1 redirects to pages named "Broken/ID:nnnn", and the result is a huge mess.
Either the redirects must be deleted first, or the code in cleanupTitles must be fixed to move over redirects.
Robert
On Sat, May 29, 2010 at 5:31 PM, Robert Ullmann rlullmann@gmail.com wrote:
In December, Praveen Prakash wrote "We are currently using Unicode 5.1 Redirect to unicode 5.0 titled articles in some cases. After converting both these titles become same. Is that a problem?"
Yes, it is ...
Will cleanupTitles resolve collisions with redirects?
Robert
On Sat, May 29, 2010 at 5:26 PM, Platonides Platonides@gmail.com wrote:
Robert Ullmann wrote:
I've looked at this a bit more. There are more serious problems.
Apparently, no-one converted the 5.0 titles in the wiki to 5.1 when "normalization" was turned on; there are pages that can't be accessed. (!) for example, try this (Malayalam for "fish"):
http://ml.wiktionary.org/wiki/%E0%B4%AE%E0%B5%80%E0%B4%A8%E0%B5%8D%E2%80%8D
that gets normalized to 5.1, which is a redirect to the 5.0 form (in this case) which is normalized back to 5.1. (there is a variation in the 5.0 form too that complicates it) The content page exists (I can see it in the XML dump), but can't be accessed because there is no way of referring to it.
Request on bugzilla a run of cleanupTitles.php on mlwiki.
On 05/29/2010 05:42 PM, Robert Ullmann wrote:
To answer my own question, no, it won't: it will move all the 5.0 pages that have 5.1 redirects to pages named "Broken/ID:nnnn" and the result is a huge mess.
Either the redirects must be deleted first, or the code in cleanupTitles fixed to move over redirects.
Fixing cleanupTitles.php to allow moving pages over one-rev self-redirects, just like with normal page moving, would IMHO be a good idea anyway. The code can be adapted from Title::isValidMoveTarget() and Title::moveOverExistingRedirect(). In fact, the former method could probably be used as is -- while the situation in cleanupTitles is a bit unusual, in that title normalization will cause the redirect to point to itself, Title::isValidMoveTarget() seems to already have code in place to handle that special case.
Hi, Tim Starling did title cleanups for all Malayalam wikis just after 5.1 normalization was activated; to avoid title conflicts he renamed all titles that used 5.0 chillus by prepending 'Broken/'. So you can see those pages here: http://ml.wiktionary.org/w/index.php?title=%E0%B4%AA%E0%B5%8D%E0%B4%B0%E0%B4%A4%E0%B5%8D%E0%B4%AF%E0%B5%87%E0%B4%95%E0%B4%82%3A%E0%B4%AA%E0%B5%82%E0%B5%BC%E0%B4%B5%E0%B5%8D%E0%B4%B5%E0%B4%AA%E0%B4%A6%E0%B4%B8%E0%B5%82%E0%B4%9A%E0%B4%BF%E0%B4%95&prefix=Broken%2F&namespace=0
We have restored those titles on ml.wikipedia, but not yet on ml.wiktionary. :(
On 29 May 2010 20:34, Ilmari Karonen nospam@vyznev.net wrote:
On 05/29/2010 05:42 PM, Robert Ullmann wrote:
To answer my own question, no, it won't: it will move all the 5.0 pages that have 5.1 redirects to pages named "Broken/ID:nnnn" and the result is a huge mess.
Either the redirects must be deleted first, or the code in cleanupTitles fixed to move over redirects.
Fixing cleanupTitles.php to allow moving pages over one-rev self-redirects, just like with normal page moving, would IMHO be a good idea anyway. The code can be adapted from Title::isValidMoveTarget() and Title::moveOverExistingRedirect(). In fact, the former method could probably be used as is -- while the situation in cleanupTitles is a bit unusual, in that title normalization will cause the redirect to point to itself, Title::isValidMoveTarget() seems to already have code in place to handle that special case.
-- Ilmari Karonen
Tim Starling wrote:
Gerard Meijssen wrote:
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
The big thing of MediaWiki was that it supported Unicode when this was still a new thing to do. We should support the latest and the best Unicode support.
You did read the post didn't you? Forcing everyone to buy Windows 7 is not generally the way we do things. Unless the client situation is not as bad as it sounds, we will need to have a transition period where we support older clients until their market share falls far lower than 50%, which is where, by Praveen's figures, it is now.
I guess you are both right. To me the best solution seems to be: accept both as input (obviously), normalize everything to 5.1 and store it in that form (so our data is consistently 5.1), and for output convert it to 5.0 to avoid problems with clients not yet ready for 5.1. The advantage is that our data is stored in the most modern format, but clients are still served data that they can process. If there are performance problems with the conversion on serving or anything like that, storing the data in 5.0 is of course still good enough. More important than discussing the specific technical details is actually doing it: implementing it.
Marcus Buck User:Slomox
2009/12/1 Marcus Buck wiki@marcusbuck.org:
If there are performance problems with the conversion on serving or anything like that, of course storing the data in 5.0 is still good enough.
You're answering your own question here: converting the data once and storing it in 5.0 so it's ready-to-serve is of course faster and easier than juggling between 5.0 and 5.1 all the time.
Roan Kattouw (Catrope)
Roan Kattouw wrote:
2009/12/1 Marcus Buck wiki@marcusbuck.org:
If there are performance problems with the conversion on serving or anything like that, of course storing the data in 5.0 is still good enough.
You're answering your own question here: converting the data once and storing it in 5.0 so it's ready-to-serve is of course faster and easier than juggling between 5.0 and 5.1 all the time.
That was my first thought, but I notice in Praveen's link that chillu-k has no Unicode 5.0 representation. It's described as "not very common", so maybe this isn't a big deal, but is your thinking just to store (and transmit) the 5.1 code for chillu-k, while converting the rest to 5.0? Or is there some Malayalam typographical convention in use now that we should honor instead?
William
William Pietri wrote:
That was my first thought, but I notice in Praveen's link that chillu-k has no Unicode 5.0 representation. It's described as "not very common", so maybe this isn't a big deal, but is your thinking just to store (and transmit) the 5.1 code for chillu-k, while converting the rest to 5.0? Or is there some Malayalam typographical convention in use now that we should honor instead?
William
Some fonts used a method to draw chillu-k without a Unicode specification, the same as the other chillu constructions (0D15, 0D4D, 200D: ka, virama, ZWJ). The zero-width joiner in these definitions itself causes some problems in searching. MediaWiki does not include the ZWJ in its search results; please check the attachment. Search engines like Google or Yahoo also usually do not include the ZWJ in their results.
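An editor's sketch of the search problem Praveen describes (illustrative only; real search engines strip a larger set of default-ignorable characters than just ZWJ):

```python
# Search engines often strip default-ignorable characters such as
# ZWJ (U+200D) before indexing. For a font-specific chillu-k written
# as KA + VIRAMA + ZWJ, stripping the ZWJ leaves plain KA + VIRAMA,
# so the two different spellings collide in search.
ZWJ = "\u200d"

def strip_ignorable(text):
    return text.replace(ZWJ, "")

chillu_k_50 = "\u0d15\u0d4d" + ZWJ   # ka, virama, zwj (no 5.0 code point)
plain_ka_virama = "\u0d15\u0d4d"     # ka, virama

print(strip_ignorable(chillu_k_50) == plain_ka_virama)  # True
```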
My note about IE6 was not correct. IE6 can render Unicode 5.1 chillus in normal pages, but there are some problems in the edit box and the title bar. (I am using Ubuntu, so I am not directly aware of this problem.)
Praveen
2009/12/1 Marcus Buck wiki@marcusbuck.org:
Tim Starling wrote:
Gerard Meijssen wrote:
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
The big thing of MediaWiki was that it supported Unicode when this was still a new thing to do. We should support the latest and the best Unicode support.
You did read the post didn't you? Forcing everyone to buy Windows 7 is not generally the way we do things. Unless the client situation is not as bad as it sounds, we will need to have a transition period where we support older clients until their market share falls far lower than 50%, which is where, by Praveen's figures, it is now.
I guess you are both right. To me the best solution seems to be: accept both as input (obviously), normalize everything to 5.1 and store it in that codeset (so our data is consistently 5.1). For output convert it to 5.0 to evade problems with clients not yet ready for 5.1. The advantage is, that our data is stored in the most modern format, but still the clients are served data that they can process. If there are performance problems with the conversion on serving or anything like that, of course storing the data in 5.0 is still good enough. More important than discussing the specific technical details is actually doing it, implementing it.
This problem seems closely allied to the Unicode normalization of Hebrew and Arabic, where we chose to go with the official standard, thus breaking at least most then-current Microsoft installations, which had fonts designed for a different sequence of letters and modifiers.
The code would logically belong in the same place I suspect.
See:
Unicode normalization "sorts" Hebrew/Arabic/Myanmar vowels wrongly https://bugzilla.wikimedia.org/show_bug.cgi?id=2399
http://www.mediawiki.org/wiki/Unicode_normalization_considerations
Andrew Dunbar (hippietrail)
Marcus Buck User:Slomox
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
The big thing of MediaWiki was that it supported Unicode when this was still a new thing to do. We should support the latest and the best Unicode support.
NB this is not an issue that is problematic for Malayalam alone. Another script that was updated is Devanagari, used for Hindi, for instance. We also have a request for font support for the Ge'ez script, used for Amharic and a few other languages. Thanks, GerardM
2009/12/1 Tim Starling tstarling@wikimedia.org
Praveen Prakash wrote:
Currently Windows 7 is the only operating system which supports Unicode 5.1.0 (as far as I know), but a lot of third-party tools for writing and reading Malayalam support the new version.
So do you want everything to be converted to the Unicode 5.0 version, including page titles, namespaces and article content, and for Unicode 5.1 characters sent by browsers during editing to be converted to Unicode 5.0 before storage? We can probably set that up.
-- Tim Starling
Gerard Meijssen wrote:
Hoi, Given that we should be moving forward not backward, it makes more sense to provide Unicode 5.1 characters and webfonts.
If you'll indulge my curiosity for a moment, how well is this dealt with on clients? Presumably webfonts would solve the display issue, but I'm wondering about things like copy-pasting, bookmarking, feed readers, and the like.
Naively, I'd expect that whatever we ended up storing internally, the Robustness Principle would suggest we accept either sort of character but emit the older one. But since my work with non-Roman character sets is modest, naiveté is all I have.
William