Thanks Santhosh for the excellent resource (http://unicode.org/reports/tr31/).
But for Tamil, I am not aware of any valid pattern where ZWJ or ZWNJ is valid.
It's valid only to force the decomposition of ksha (க்ஷ) into k- (க்) followed by sha (ஷ). Ksha is an extremely rare Grantha character of only historical relevance, and we can certainly live without this decomposition in URLs. In fact, (a) people have disputed the inclusion of ksha and sha in the Tamil chart in the first place, and (b) people have argued that, when ksha is used, the default behaviour should be the decomposed form, with a joiner used to force the joined form.
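To make the ksha case concrete, here is a small Python sketch of the two encodings. It assumes the usual Unicode convention that a ZWNJ placed after the pulli (virama) asks the renderer not to form the conjunct; the variable and function names are only for illustration.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# KA + pulli (virama) + SSA: normally rendered as the conjunct ksha.
ksha_joined = "\u0b95\u0bcd\u0bb7"

# The same letters with a ZWNJ after the pulli, requesting k- followed by sha.
ksha_split = "\u0b95\u0bcd" + ZWNJ + "\u0bb7"


def code_points(text):
    """List each character of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, text in (("joined", ksha_joined), ("split", ksha_split)):
        print(label, code_points(text))
```

The two strings differ only by the invisible ZWNJ, so two titles that look alike to a reader can still be different URLs.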
Having seen the linked resource, we can safely ask for these characters to be dropped from titles.
- Sundar
"That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted." - George Boole, quoted in Iverson's Turing Award Lecture
----- Original Message ----
From: "santhosh.thottingal@gmail.com" <santhosh.thottingal@gmail.com>
To: Discussion list on Indian language projects of Wikimedia. <wikimediaindia-l@lists.wikimedia.org>
Sent: Wed, December 29, 2010 3:23:37 PM
Subject: Re: [Wikimediaindia-l] Indic languages & unicode issues.
Update: A Tamil Wikipedian, Mahir, went to the core of the issue that I cited in my previous email and identified that the issue in that instance was due to the superfluous use of the zero width non-joiner HTML entity. We're going to file a bug asking MediaWiki to chomp those entities when they occur in inappropriate places.
Qn: Definition for "inappropriate places"?
Ans: Wikipedia URLs should be treated as "identifiers" and should follow the Unicode standard's definition of identifiers, which is based on Unicode character data. Unicode Standard Annex #31 defines this clearly: http://unicode.org/reports/tr31/ IMHO, MediaWiki should implement this standard.
But for Tamil, I am not aware of any valid pattern where ZWJ or ZWNJ is valid. I am aware of valid patterns for other Indian languages. So in that case we should remove all ZWJ and ZWNJ characters from Tamil URLs. Some time back, the built-in tool in the Malayalam wiki allowed any number of ZWJs to be put into text; we corrected the script so that a user cannot put more than one ZWJ or ZWNJ in sequence (this is what UAX #31 says too).
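A rough Python sketch of the two rules above, for illustration only: strip every ZWJ/ZWNJ from Tamil titles, and for other languages collapse a run of joiners down to its first one. The function name clean_title and the language codes "ta" and "ml" are assumptions of mine, not anything in MediaWiki, and this does not implement the full contextual rules of UAX #31.

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Any single joiner character.
_JOINER = re.compile("[\u200c\u200d]")
# A joiner followed by one or more further joiners.
_JOINER_RUN = re.compile("([\u200c\u200d])[\u200c\u200d]+")


def clean_title(title, lang):
    """Hypothetical helper: normalise joiners in a page title.

    For Tamil ("ta") all joiners are removed; for any other language
    a run of joiners is reduced to its first character.
    """
    if lang == "ta":
        return _JOINER.sub("", title)
    return _JOINER_RUN.sub(r"\1", title)


if __name__ == "__main__":
    tamil = "\u0b95\u0bcd" + ZWNJ + "\u0bb7"
    print([hex(ord(ch)) for ch in clean_title(tamil, "ta")])
    print([hex(ord(ch)) for ch in clean_title("a" + ZWJ * 3 + "b", "ml")])
```

Keeping the first joiner of a run is just one possible choice; a real implementation would have to check whether a joiner is allowed at that position at all.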
Thanks Santhosh
Wikimediaindia-l mailing list Wikimediaindia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l