[Wikimediaindia-l] (OT) On the importance of Unicode

Anivar Aravind anivar.aravind at gmail.com
Thu Feb 24 03:21:32 UTC 2011


On 2/23/11, Gautam John <gautam at prathambooks.org> wrote:
> Dear Anivar:
>
>> There are Four Components
>
> Thanks for the addendum - how important is the rendering engine in the
> scheme of things? Is work on that pretty much done or are there issues
> there too?

If your language has errors in complex glyph formation, it is a
rendering engine issue.
You can find more here:
http://en.wikipedia.org/wiki/Wikipedia:Enabling_complex_text_support_for_Indic_scripts
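
To make "complex glyph formation" concrete, here is a minimal Python
sketch. The sample string and the helper name describe() are my own
choices for illustration. It only shows that a Malayalam conjunct is
stored as several code points; joining them into one glyph is the job
of the rendering engine, not of the encoding.

```python
import unicodedata

# Malayalam conjunct "kka": stored as three code points, but a working
# rendering engine is expected to draw it as a single joined glyph.
CONJUNCT = "\u0d15\u0d4d\u0d15"  # KA + VIRAMA + KA


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    for code, name in describe(CONJUNCT):
        print(code, name)
```

If the three characters show up side by side with a visible virama
instead of one conjunct, the text data is fine and the rendering stack
is what needs fixing.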

Rendering engines like Pango have evolved through more than 10 years of
patching and correction by language communities. Pango works pretty well
for most of the Indic languages.
HarfBuzz (http://www.freedesktop.org/wiki/Software/HarfBuzz) is a
relatively new player in the field, taking code from Pango, Qt and ICU.
HarfBuzz-ng is used in the new Firefox 4 as its default rendering engine.
The Uniscribe engine in Windows-based systems started supporting Indic
fonts from Windows XP SP2 onwards.

Let me give an example of why the rendering engine is important.
For Latin-script wikis there is now a PDF download option, and
PediaPress to print them directly.
But such options are not available for non-Latin wikis.
Character rendering is the block here. PediaPress's library fails
to render non-Latin content, because the library they use does not
make use of rendering engines.

If a teacher goes to an internet cafe to read a Wikipedia entry in an
Indian language, she must ensure the following things before
reading or printing articles:

1. The operating system has Indic support
2. It has a font to display the content correctly
3. The browser renders it well

Only then can she read it or print it in human-readable form. If there
were a PDF export facility with server-side rendering, it would be easy
for her to take it and print it for students.

Some time back Santhosh posted his project Pypdflib on this list for
testing. It is a library for rendering PDFs from Indic language wiki
pages. It uses the functionality of Pango to generate the PDF.
In short, rendering is a major roadblock in taking Wikipedia to the
masses. Projects like Santhosh's are very important to
fill this gap.


>
>> It is font dependent. Conversion maps need to be prepared for
>> each ASCII font to convert data encoded in them to Unicode.
>> Swathanthra Malayalam Computing's Payyans
>> (http://wiki.smc.org.in/Payyans) is a tool developed for converting
>> ASCII to Unicode easily for any Indic language by building a font map
>> for each needed font. This tool helped Malayalam Wiktionary to
>> convert many copyright-expired books in non-standard encodings to
>> Unicode.
>> A popular Firefox extension named Padma uses similar encoding
>> conversion tables to display ASCII news websites in Unicode.
>
> So how do these work? They have built a map for every single ASCII
> encoding/font pair (since this is some ugly hack) and the
> corresponding Unicode value?

Yes. The Payyans wiki page has a how-to for creating font maps.
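
To give a rough idea of what such a font map does, here is a minimal
Python sketch. It is not the Payyans map format or its algorithm: the
table FONT_MAP, its four entries and the function convert() are
hypothetical and do not come from any real font. It shows the two
things a map-based converter usually has to do: replace each legacy
glyph code with its Unicode value (longest match first), and move
vowel signs that legacy fonts place before the consonant to after it,
which is the order Unicode stores them in.

```python
# Hypothetical font map for an imaginary legacy ASCII font.
# Real maps have to be built by hand for each font.
FONT_MAP = {
    "k": "\u0d15",               # MALAYALAM LETTER KA
    "kk": "\u0d15\u0d4d\u0d15",  # conjunct KA + VIRAMA + KA
    "m": "\u0d2e",               # MALAYALAM LETTER MA
    "e": "\u0d46",               # MALAYALAM VOWEL SIGN E (drawn before the consonant)
}

# Vowel signs drawn to the left of the consonant but stored after it in Unicode.
PREBASE_SIGNS = {"\u0d46", "\u0d47", "\u0d48"}  # signs E, EE, AI


def convert(text, font_map=FONT_MAP):
    """Convert legacy-font text to Unicode using a glyph-to-Unicode map."""
    keys = sorted(font_map, key=len, reverse=True)  # try longest keys first
    out = []
    pending = None  # pre-base sign waiting for the consonant that follows it
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                mapped = font_map[key]
                i += len(key)
                break
        else:
            mapped = text[i]  # unmapped characters pass through unchanged
            i += 1
        if mapped in PREBASE_SIGNS:
            if pending:
                out.append(pending)  # flush a sign that never found a consonant
            pending = mapped
        else:
            out.append(mapped)
            if pending:
                out.append(pending)  # place the sign after its consonant cluster
                pending = None
    if pending:
        out.append(pending)
    return "".join(out)


if __name__ == "__main__":
    print(convert("ekk"))
```

A real converter needs more rules than this (split vowel signs, reph
and so on), which is why the per-font mapping work is not trivial.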

>There must be thousands of ASCII
> encoding/font pairs right? Is this even a viable option? Are there
> alternatives to this?

This is the only viable option as of now. Most of the languages have
around 10-20 popular fonts. Creating mapping tables for them is
a big task anyway, but if each language community contributes,
it becomes manageable. The Padma project has already mapped many news
website fonts through the contributions of many people.

There is no other free alternative. BTW, document conversion is a big
business, and many corporates are working in this area to provide
solutions for companies and governments.

>> I don't think this will happen. There is a long history of lobbying
>> CDAC for this from 2001 onwards, and nothing happened. CDAC made enough
>> money by selling ASCII fonts (and still does), and they can't even
>> think about giving them away under a FOSS license. And at frequent
>> intervals they eat more government money making yet another CD to ship
>> with their FOSS project forks (such as Bhaathiya OO, IndiFox etc.) plus
>> these fonts. In the same way, most of the TDIL funding to CDAC for Indic
>> language technology research either produces no output at all or the
>> output is not released, even after TDIL's policy decision to release
>> it under a FOSS license.
>
> I can see the frustration of this - so in your opinion, an effort not
> worth undertaking? Assuming they were ready to use a FOSS license, are
> the fonts good enough to want to use?

In my opinion, efforts on this will be a waste of time and money. I
don't believe in miracles with CDAC.

CDAC Mumbai has a history of GPL-licensing one font series, the Raghu
series, as part of their Indix project. The fonts are by the late
Prof. R. K. Joshi, a famous calligrapher and researcher in typefaces.
http://en.wikipedia.org/wiki/R_K_Joshi
Rebranding his Jana series fonts as the Raghu series and GPLing them
was his long-term effort from inside CDAC. But their font tables need
to be corrected to make them usable.
We did this work for Malayalam, and Raghu-Malayalam is currently
maintained by SMC.
Anyway, it is an exceptional case.

>> Searching and sorting algorithms for Indic languages are in
>> development and are not bug free. Indic support is not yet available
>> in most of the search solutions (including FOSS solutions like Lucene
>> or Solr) because of the complex word formation characteristics.
>
> But if I understand correctly, this is *only* possible using Unicode
> encoding. Right?

Yes. And problems and instability in the Unicode encoding also affect
this. Some time back GerardM's post was shared on this list:
http://ultimategerardm.blogspot.com/2010/12/malayalam-enigma.html
Also read these thoughts on Unicode by Indic language communities:
http://www.j4v4m4n.in/2009/11/07/unicode-or-malayalam/
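
As one illustration of how encoding changes can hurt search, here is a
minimal Python sketch. The example is my own choice and is not taken
from the posts linked above: the Malayalam chillu N can be stored
either as the atomic character added in Unicode 5.1 or as the older
NA + VIRAMA + ZWJ sequence. The two are not canonically equivalent, so
standard normalization does not unify them and a plain string match
misses one form. The helper names same_after_nfc() and fold_chillu()
are hypothetical.

```python
import unicodedata

ATOMIC_CHILLU_N = "\u0d7b"                # MALAYALAM LETTER CHILLU N
SEQUENCE_CHILLU_N = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZERO WIDTH JOINER


def same_after_nfc(a, b):
    """True if the two strings are equal after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def fold_chillu(text):
    """Fold the older sequence form into the atomic form before indexing."""
    return text.replace(SEQUENCE_CHILLU_N, ATOMIC_CHILLU_N)


if __name__ == "__main__":
    # Normalization alone does not make the two spellings compare equal.
    print(same_after_nfc(ATOMIC_CHILLU_N, SEQUENCE_CHILLU_N))
    # An explicit folding step applied to both index and query does.
    print(fold_chillu(SEQUENCE_CHILLU_N) == ATOMIC_CHILLU_N)
```

A search or sort implementation has to apply this kind of folding to
both the indexed text and the query, for every such pair in the script.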


Anivar

>
> Thank you, Anivar.
>
> Best,
>
> Gautam
> ________
> http://social.prathambooks.org/
>


-- 
"[It is not] possible to distinguish between 'numerical' and
'nonnumerical' algorithms, as if numbers were somehow different from
other kinds of precise information." - Donald Knuth


