Indic languages & unicode issues.


CherianTinu Abraham

26 Dec 2010

8:13 a.m.

Hi all, I happened to see Gerard's blog post on issues with the Malayalam Wikipedia and its upgrade to Unicode 5.1: http://ultimategerardm.blogspot.com/2010/12/malayalam-enigma.html Traffic to websites (yes, including Wikipedia) is primarily driven by search, and apparently the upgrade had a negative effect, as shown by the WM stats. As he explains, the problem is that there needs to be canonical equivalence between the old and new Unicode encodings. This is a larger issue, and it is time the Foundation had a stake in the Unicode Consortium (and search engines too? ;) ).

Thoughts ?

Regards Tinu Cherian


Hari Prasad Nadig

26 Dec

8:22 a.m.


From the blog post:


Traffic for the Malayalam Wikipedia http://ml.wikipedia.org/ has gone down dramatically while the number of articles and the number of editors has gone up. WHY??

When you look at the statistics (http://stats.wikimedia.org/EN_India/TablesPageViewsMonthly.htm), you will see that traffic halved in a couple of months.

That is quite sad and at the same time intriguing!

Has this affected other Indian languages as well?

-- Hari Prasad Nadig http://hpnadig.net | http://twitter.com/hpnadig http://flickr.com/hpnadig

Santhosh Thottingal

10:58 a.m.


The issue is very complex. There were heated debates on this topic on the Unicode Indic mailing list for years. In short, the issue is dual encoding: representing the same letter with two different sequences of Unicode character codes. Unicode's decision to bring the second encoding into the standard was widely debated and was opposed mainly by the Malayalam FOSS developer community.

Unicode announced the dual encoding scheme, without any definition of canonical equivalence, in 2005 and reverted it when scholars and developers opposed it. The same proposal was later introduced again, and the FOSS community and language scholars protested it. The SMC community submitted a document with 17 reasons why dual encoding should not be introduced; see http://wiki.smc.org.in/images/2/23/SMC_Unicode_5.1.pdf Similarly, a seminar conducted by the University of Kerala to discuss the issue opposed the proposal; see http://images2.wikia.nocookie.net/__cb20080131071131/fci/images/1/19/Report_...

But the Unicode Technical Committee did not bother to answer either of these reports and went ahead with the decision in Unicode 5.1. The dual encoding scheme has no canonical equivalence definition. Since that equivalence is not in the standard, I doubt whether operating systems will implement it, not to mention search engines.
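To make the dual encoding concrete, here is a minimal Python sketch. It assumes the usual description of the two spellings: before Unicode 5.1 a chillu was written as consonant + virama + ZWJ, and 5.1 added atomic chillu code points U+0D7A to U+0D7F. The table name `CHILLU_TO_BASE` and the function `fold_to_atomic` are hypothetical names for illustration only; this is not what MediaWiki or any search engine is known to do.

```python
import unicodedata

VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Atomic chillu code point (Unicode 5.1) -> base consonant of the older
# spelling, which is written as base + VIRAMA + ZWJ.
CHILLU_TO_BASE = {
    "\u0D7A": "\u0D23",  # CHILLU NN <- NNA
    "\u0D7B": "\u0D28",  # CHILLU N  <- NA
    "\u0D7C": "\u0D30",  # CHILLU RR <- RA
    "\u0D7D": "\u0D32",  # CHILLU L  <- LA
    "\u0D7E": "\u0D33",  # CHILLU LL <- LLA
    "\u0D7F": "\u0D15",  # CHILLU K  <- KA
}


def fold_to_atomic(text: str) -> str:
    """Rewrite every base + VIRAMA + ZWJ chillu sequence as its atomic code point."""
    for atomic, base in CHILLU_TO_BASE.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text


old_style = "\u0D28" + VIRAMA + ZWJ  # chillu N as a three-code-point sequence
new_style = "\u0D7B"                 # chillu N as a single code point

# None of the standard normalization forms makes the two spellings equal.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, old_style) != unicodedata.normalize(form, new_style)

# An explicit, application-level folding step does.
assert fold_to_atomic(old_style) == new_style
```

Because normalization alone does not bridge the two spellings, any matching between them (for search, for example) has to be done by an extra step like the one sketched here, applied consistently to both the indexed text and the query.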

Since the new encoding scheme is defined without backward compatibility, that is, against Unicode's stability policy, the Malayalam FOSS community decided not to implement it until the issues are resolved and is continuing with the Unicode 5.0 encoding. Malayalam news portals also follow Unicode 5.0. Most of Google's tools also continue with Unicode 5.0 based encoding. The Malayalam Wikipedia decided to go ahead with the latest version of Unicode. I had resisted this move on the discussion pages of the Malayalam Wikipedia. The decision was taken by a vote among a small community of editors and not on the basis of proper technical analysis.

Believe it or not, this is how the Malayalam wiki is rendered on a Windows XP box with IE 8 and the OS default font: http://thottingal.in/tmp/ml-wiki-winxp-IE8.png I hope it gives some clue about the issue that Gerard mentioned.

Most of the discussion of the encoding issue happened in Malayalam (on the Malayalam wiki or in blogs), but this English blog post might summarize it: http://www.j4v4m4n.in/2009/11/07/unicode-or-malayalam/

Discussions on the Malayalam Wikipedia (content in Malayalam): http://ml.wikipedia.org/wiki/%E0%B4%B5%E0%B4%BF%E0%B4%95%E0%B5%8D%E0%B4%95%E...)

Thanks Santhosh Thottingal http://thottingal.in

BalaSundaraRaman

27 Dec

12:29 a.m.


Sadly, you're not alone in this, Santhosh. We have had canonical non-equivalence issues and many more (similar to the atomic chillu issue) in Tamil too. :( Part of it was inherited from the umbrellaish ISCII model (done with good intentions, I believe). They put the abugidas of the Indo-Aryan languages and other systems like Tamil (haven't studied other writing systems enough to comment upon) into one bucket and we're still suffering for that. They cite stability when legitimate changes are sought, but allow such breaking changes.

I'm sure you'll be working with the search engines to map the equivalent glyph sequences. Also, please explore mediawiki tech solutions to add redirects or hidden texts (though not ideal).

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture


Ragib Hasan

28 Dec

10:53 p.m.


I'm curious about the issue you are discussing ... is this similar to a long-standing bug that affects Bengali, Assamese, and Bishnupriya Manipuri wikipedias? https://bugzilla.wikimedia.org/show_bug.cgi?id=5948

Ragib

User:Ragib on en and bn

-- Ragib Hasan, Ph.D NSF Computing Innovation Fellow and Assistant Research Scientist

Dept of Computer Science Johns Hopkins University 3400 N Charles Street Baltimore, MD 21218

Website: http://www.ragibhasan.com


BalaSundaraRaman

29 Dec

12:04 a.m.


Ragib,

(copied Tamil Wiki list)

We've faced an issue similar to Bug #5948. Due to non-canonicalisation, there are two articles on the same title in Tamil Wikipedia! http://ta.wikipedia.org/wiki/%E0%AE%AA%E0%AF%87%E0%AE%9A%E0%AF%8D%E0%AE%9A%E... (Tamil discussion)

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture


BalaSundaraRaman

3:38 a.m.


Update: A Tamil Wikipedian, Mahir, went to the core of the issue that I cited in my previous email and identified that, in that instance, it was due to the superfluous use of the zero width non-joiner HTML entity. We're going to file a bug asking MediaWiki to chomp those entities when they occur in inappropriate places.
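As an illustration of how a stray zero width non-joiner can produce two titles that look the same but compare as different, here is a small Python sketch. The title used is an arbitrary Tamil word chosen for the example, not the article from the discussion, and `invisible_joiners` is a hypothetical helper name.

```python
import html
import unicodedata

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def invisible_joiners(title: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for every ZWJ or ZWNJ in the title."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(title)
        if ch in (ZWJ, ZWNJ)
    ]


# An example Tamil word written with escapes so the code points are explicit.
clean = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"

# The same word with the &zwnj; HTML entity inserted, decoded to text.
dirty = html.unescape("\u0BA4\u0BAE\u0BBF&zwnj;\u0BB4\u0BCD")

# The two strings differ, and NFC normalization does not remove the joiner.
assert dirty != clean
assert unicodedata.normalize("NFC", dirty) != unicodedata.normalize("NFC", clean)

print(invisible_joiners(clean))
print(invisible_joiners(dirty))
```

A check like this can be run over a list of titles to find candidates for the kind of duplicate described above before any automatic stripping is put in place.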

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture


santhosh.thottingal＠gmail.com

3:53 a.m.


Qn: What is the definition of "inappropriate places"? Ans: Wikipedia URLs should be considered "identifiers" and should follow the Unicode standard for identifier definitions based on Unicode data. Unicode Standard Annex #31 defines this clearly: http://unicode.org/reports/tr31/ IMHO, MediaWiki should implement this standard.

But for Tamil, I am not aware of any valid pattern where ZWJ or ZWNJ is valid. I am aware of valid patterns for other Indian languages. So in that case we should remove all ZWJ and ZWNJ characters from Tamil URLs. Some time back, the built-in tool in the Malayalam wiki used to allow putting any number of ZWJs in the text, and we corrected the script so that the user cannot put more than one ZWJ or ZWNJ in sequence (this is what UAX #31 says too).
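The two rules described above (drop all joiners from Tamil titles, and never allow more than one joiner in a row elsewhere) could be sketched roughly as follows in Python. This is only an illustration under those two assumptions: it is not an implementation of UAX #31 and not MediaWiki code, and `normalize_title` is a hypothetical name. Treating any title that contains a character from the Tamil block (U+0B80 to U+0BFF) as a Tamil title is also a simplification.

```python
import re

ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

# Two or more joiners in a row (any mix of ZWJ and ZWNJ).
JOINER_RUN = re.compile("[\u200C\u200D]{2,}")


def has_tamil(title: str) -> bool:
    """True if the title contains any character from the Tamil block."""
    return any("\u0B80" <= ch <= "\u0BFF" for ch in title)


def normalize_title(title: str) -> str:
    """Apply the two joiner rules to a page title."""
    if has_tamil(title):
        # Rule 1: Tamil titles keep no ZWJ or ZWNJ at all.
        return title.replace(ZWJ, "").replace(ZWNJ, "")
    # Rule 2: elsewhere, collapse a run of joiners to its first character.
    return JOINER_RUN.sub(lambda m: m.group(0)[0], title)


# A Tamil sequence with a ZWNJ in the middle loses the joiner.
assert normalize_title("\u0B95\u0BCD\u200C\u0BB7") == "\u0B95\u0BCD\u0BB7"

# A Malayalam sequence keeps exactly one joiner out of three.
assert normalize_title("\u0D28\u0D4D\u200D\u200D\u200D") == "\u0D28\u0D4D\u200D"
```

Collapsing a run to its first character is an arbitrary choice made for the sketch; a real rule would need to decide which joiner, if any, is meaningful in each context.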

Thanks Santhosh

BalaSundaraRaman

4:06 a.m.


Thanks Santhosh for the excellent resource (http://unicode.org/reports/tr31/).

Santhosh wrote: "But for Tamil, I am not aware of any valid pattern where ZWJ or ZWNJ is valid."

It's valid only to force the decomposition of ksha (க்ஷ) into k- (க்) followed by sha (ஷ). It's an extremely rare and only an historically relevant grantha character and we can certainly live without this decomposition in urls. In fact, a) people have disputed the inclusion of ksha and sha under the Tamil chart in the first place and b) people have argued that, when used, the default behaviour should be the decomposed form, and a joiner be used to force concatenation.

After seeing the linked resource, we can safely ask for dropping of these characters in titles.

- Sundar

"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture

