On 2/23/11, Gautam John gautam@prathambooks.org wrote:
Dear Anivar:
There are Four Components
Thanks for the addendum - how important is the rendering engine in the scheme of things? Is work on that pretty much done or are there issues there too?
If your language has errors in complex glyph formation, it is a rendering engine issue. You can find more here: http://en.wikipedia.org/wiki/Wikipedia:Enabling_complex_text_support_for_Ind...
Rendering engines like Pango evolved through more than 10 years of patching and correction by language communities, and Pango works pretty well for most Indic languages. HarfBuzz (http://www.freedesktop.org/wiki/Software/HarfBuzz) is a relatively new player in the field, built by taking code from Pango, Qt and ICU. HarfBuzz-ng is used in the new Firefox 4 as its default rendering engine. The Uniscribe engine on Windows-based systems started supporting Indic fonts from Windows XP SP2 onwards.
Let me give an example of why the rendering engine is important. For Latin-script wikis there is a PDF download option and PediaPress to print them directly, but such options are not available for non-Latin wikis; character rendering is the block here. PediaPress's library fails to render non-Latin content because the library they use does not make use of rendering engines.
If a teacher goes to an internet cafe to read a Wikipedia entry in an Indian language, she must ensure the following before reading/printing articles:
1. The operating system has Indic support
2. It has a font to display the content correctly
3. The browser renders it well
Only then can she read or print it in human-readable form. If there were a PDF export facility with server-side rendering, it would be easy for her to take it or print it for her students.
Some time back Santhosh posted his project Pypdflib for testing on this list. It is a library for rendering PDFs from Indic-language wiki pages, and it uses Pango's functionality to generate the PDF. In short, rendering is a major roadblock in taking Wikipedia to the masses, and projects like Santhosh's are very important for filling this gap.
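For illustration, here is a minimal sketch of server-side rendering along these lines, using Pango and cairo through the Python GObject bindings. The font name and sample text are just assumptions for the example; Pypdflib itself wraps much more than this (page layout, handling of wiki content, etc.).

# Minimal sketch: render complex-script text into a PDF with Pango + cairo.
# Assumes python3-gi, python3-cairo and a Malayalam font such as Meera are
# installed; the font name and the sample string are illustrative only.
import cairo
from gi.repository import Pango, PangoCairo

surface = cairo.PDFSurface("sample.pdf", 595, 842)   # A4 page, in points
ctx = cairo.Context(surface)

layout = PangoCairo.create_layout(ctx)
layout.set_font_description(Pango.FontDescription("Meera 14"))
layout.set_width(500 * Pango.SCALE)                   # wrap lines at ~500 pt
layout.set_text("സ്വതന്ത്ര മലയാളം കമ്പ്യൂട്ടിങ്", -1)

ctx.move_to(50, 50)
PangoCairo.show_layout(ctx, layout)                   # Pango does the shaping
surface.finish()

Because the shaping happens on the server, the teacher in the example above would not need an Indic-capable OS, font or browser to get a readable printout.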
It is font dependent. Conversion maps need to be prepared for each ASCII font to convert data encoded in them to Unicode. Swathanthra Malayalam Computing's Payyans (http://wiki.smc.org.in/Payyans) is a tool developed for converting ASCII to Unicode easily for any Indic language by building a font map for each needed font. This tool helped Malayalam Wiktionary convert many copyright-expired books in non-standard encodings to Unicode. The popular Firefox extension Padma uses similar encoding conversion tables to display ASCII news websites in Unicode.
So how do these work? They have built a map for every single ASCII encoding/font pair (since this is some ugly hack) and the corresponding Unicode value?
Yes. The Payyans wiki page has a howto for creating font maps.
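To make the idea concrete, here is a rough sketch of what such a map amounts to. The glyph assignments below are purely hypothetical; a real font map (as described on the Payyans wiki) covers every glyph slot of one specific legacy font and also handles reordering of pre-base vowel signs.

# Hypothetical fragment of an ASCII-font-to-Unicode map. The actual
# byte-to-character assignments differ for every legacy font, which is
# why a separate map is needed per font.
FONT_MAP = {
    "A": "\u0d05",   # suppose the glyph at 'A' is MALAYALAM LETTER A
    "e": "\u0d46",   # suppose 'e' is the pre-base vowel sign E
}

def ascii_to_unicode(text, font_map):
    # Replace each legacy glyph code with its Unicode equivalent; real
    # converters also reorder vowel signs typed before the consonant.
    return "".join(font_map.get(ch, ch) for ch in text)

print(ascii_to_unicode("A", FONT_MAP))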
There must be thousands of ASCII encoding/font pairs right? Is this even a viable option? Are there alternatives to this?
This is the only viable option as of now. Most languages have around 10-20 popular fonts. Creating mapping tables for them is admittedly a big task, but if each language community contributes, it becomes manageable. The Padma project has already mapped many news-website fonts through the contributions of many people.
There is no other free alternative. By the way, document conversion is a big business, and many corporates are working in this area to provide solutions for companies and governments.
I don't think this will happen. There is a long history of lobbying for this with CDAC from 2001 onwards, and nothing happened. CDAC has made enough money by selling ASCII fonts (and still does), and they can't even think about giving them away under a FOSS license. At regular intervals they take more government money to make yet another CD shipping their forks of FOSS projects (such as Bharateeya OO, IndiFox, etc.) plus these fonts. In the same way, most of the TDIL funding to CDAC for Indic language technology research produces no output at all, or the output never gets released, even after TDIL's policy decision to release it under a FOSS license.
I can see the frustration of this - so in your opinion, an effort not worth undertaking? Assuming they were ready to use a FOSS license, are the fonts good enough to want to use?
In my opinion, efforts on this will be a waste of time and money. I don't believe in miracles from CDAC.
CDAC Mumbai does have a history of GPL-licensing one font series as part of their IndiX project: the Raghu series, by the late Prof. R. K. Joshi, a famous calligrapher and typeface researcher (http://en.wikipedia.org/wiki/R_K_Joshi). Rebranding his Jana series fonts as the Raghu series and GPLing them was his long-term effort from inside CDAC. But the font tables need to be corrected to make them usable. We did this work for Malayalam, and Raghu-Malayalam is currently maintained by SMC. Anyway, it is an exceptional case.
Searching and sorting algorithms for Indic languages are still in development and are not bug-free. Indic support is not yet available in most search solutions (including FOSS solutions like Lucene or Solr) because of the complex word-formation characteristics.
But if I understand correctly, this is *only* possible using Unicode encoding. Right?
Yes. And problems and instability in the Unicode encoding also affect this. Some time back GerardM's post was shared on this list: http://ultimategerardm.blogspot.com/2010/12/malayalam-enigma.html Also read these critical thoughts on Unicode from Indic language communities: http://www.j4v4m4n.in/2009/11/07/unicode-or-malayalam/
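To illustrate the sorting problem mentioned above, here is a minimal sketch of locale-aware collation with the ICU library through the PyICU bindings. The word list is an assumed example; ICU's collation data for Indic locales keeps improving but is not guaranteed to be complete or bug-free.

# Minimal sketch: sorting Malayalam words with ICU collation (PyICU).
import icu

words = ["ഓണം", "അമ്മ", "കാക്ക"]

# Naive sort by raw Unicode code points; for Indic text this can differ
# from dictionary order.
print(sorted(words))

# Sort using the collation rules of the Malayalam locale.
collator = icu.Collator.createInstance(icu.Locale("ml_IN"))
print(sorted(words, key=collator.getSortKey))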
Anivar
Thank you, Anivar.
Best,
Gautam ________ http://social.prathambooks.org/