On 2/23/11, Gautam John gautam@prathambooks.org wrote:
Dear Anivar:
There are Four Components
Thanks for the addendum - how important is the rendering engine in the scheme of things? Is work on that pretty much done or are there issues there too?
If your language has errors in complex glyph formation, it is a rendering engine issue. You can find more here: http://en.wikipedia.org/wiki/Wikipedia:Enabling_complex_text_support_for_Ind...
Rendering engines like Pango evolved through more than 10 years of patching and correction by language communities, and Pango works pretty well for most Indic languages. HarfBuzz (http://www.freedesktop.org/wiki/Software/HarfBuzz) is a relatively new player in the field, built by taking code from Pango, Qt and ICU. HarfBuzz-ng is used in the new Firefox 4 as its default rendering engine. The Uniscribe engine on Windows-based systems started supporting Indic fonts from Windows XP SP2 onwards.
Let me give an example of why the rendering engine is important. For Latin-script wikis there is a PDF download option and PediaPress to print them directly, but such options are not available for non-Latin wikis; character rendering is the block here. PediaPress's library fails to render non-Latin content because the library they use does not make use of rendering engines.
If a teacher goes to an internet cafe to read a Wikipedia entry in an Indian language, she must ensure the following before reading/printing articles:
1. The operating system has Indic support
2. It has a font to display the content correctly
3. The browser renders it well
Only then can she read or print it in human-readable form. If there were a PDF export facility with server-side rendering, it would be easy for her to take it or print it for her students.
Some time back Santhosh posted his project Pypdflib for testing on this list. It is a library for rendering PDFs from Indic-language wiki pages, and it uses Pango's functionality to generate the PDF. In short, rendering is a major roadblock in taking Wikipedia to the masses, and projects like Santhosh's are very important for filling this gap.
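For illustration, here is a minimal sketch of server-side rendering along these lines, using Pango and cairo through the Python GObject bindings. The font name and sample text are just assumptions for the example; Pypdflib itself wraps much more than this (page layout, handling of wiki content, etc.).

# Minimal sketch: render complex-script text into a PDF with Pango + cairo.
# Assumes python3-gi, python3-cairo and a Malayalam font such as Meera are
# installed; the font name and the sample string are illustrative only.
import cairo
from gi.repository import Pango, PangoCairo

surface = cairo.PDFSurface("sample.pdf", 595, 842)   # A4 page, in points
ctx = cairo.Context(surface)

layout = PangoCairo.create_layout(ctx)
layout.set_font_description(Pango.FontDescription("Meera 14"))
layout.set_width(500 * Pango.SCALE)                   # wrap lines at ~500 pt
layout.set_text("സ്വതന്ത്ര മലയാളം കമ്പ്യൂട്ടിങ്", -1)

ctx.move_to(50, 50)
PangoCairo.show_layout(ctx, layout)                   # Pango does the shaping
surface.finish()

Because the shaping happens on the server, the teacher in the example above would not need an Indic-capable OS, font or browser to get a readable printout.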
It is font dependent. Conversion maps need to be prepared for each ASCII font to convert data encoded in them to Unicode. Swathanthra Malayalam Computing's Payyans (http://wiki.smc.org.in/Payyans) is a tool developed for converting ASCII to Unicode easily for any Indic language by building a font map for each needed font. This tool helped Malayalam Wiktionary convert many copyright-expired books in non-standard encodings to Unicode. The popular Firefox extension Padma uses similar encoding conversion tables to display ASCII news websites in Unicode.
So how do these work? They have built a map for every single ASCII encoding/font pair (since this is some ugly hack) and the corresponding Unicode value?
Yes. The Payyans wiki page has a howto for creating font maps.
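To make the idea concrete, here is a rough sketch of what such a map amounts to. The glyph assignments below are purely hypothetical; a real font map (as described on the Payyans wiki) covers every glyph slot of one specific legacy font and also handles reordering of pre-base vowel signs.

# Hypothetical fragment of an ASCII-font-to-Unicode map. The actual
# byte-to-character assignments differ for every legacy font, which is
# why a separate map is needed per font.
FONT_MAP = {
    "A": "\u0d05",   # suppose the glyph at 'A' is MALAYALAM LETTER A
    "e": "\u0d46",   # suppose 'e' is the pre-base vowel sign E
}

def ascii_to_unicode(text, font_map):
    # Replace each legacy glyph code with its Unicode equivalent; real
    # converters also reorder vowel signs typed before the consonant.
    return "".join(font_map.get(ch, ch) for ch in text)

print(ascii_to_unicode("A", FONT_MAP))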
There must be thousands of ASCII encoding/font pairs right? Is this even a viable option? Are there alternatives to this?
This is the only viable option as of now. Most languages have around 10-20 popular fonts. Creating mapping tables for them is admittedly a big task, but if each language community contributes, it becomes manageable. The Padma project has already mapped many news-website fonts through the contributions of many people.
There is no other free alternative. By the way, document conversion is a big business, and many corporates are working in this area to provide solutions for companies and governments.
I don't think this will happen. There is a long history of lobbying for this with CDAC from 2001 onwards, and nothing happened. CDAC has made enough money by selling ASCII fonts (and still does), and they can't even think about giving them away under a FOSS license. At regular intervals they take more government money to make yet another CD shipping their forks of FOSS projects (such as Bharateeya OO, IndiFox, etc.) plus these fonts. In the same way, most of the TDIL funding to CDAC for Indic language technology research produces no output at all, or the output never gets released, even after TDIL's policy decision to release it under a FOSS license.
I can see the frustration of this - so in your opinion, an effort not worth undertaking? Assuming they were ready to use a FOSS license, are the fonts good enough to want to use?
In my opinion, efforts on this will be a waste of time and money. I don't believe in miracles from CDAC.
CDAC Mumbai does have a history of GPL-licensing one font series as part of their IndiX project: the Raghu series, by the late Prof. R. K. Joshi, a famous calligrapher and typeface researcher (http://en.wikipedia.org/wiki/R_K_Joshi). Rebranding his Jana series fonts as the Raghu series and GPLing them was his long-term effort from inside CDAC. But the font tables need to be corrected to make them usable. We did this work for Malayalam, and Raghu-Malayalam is currently maintained by SMC. Anyway, it is an exceptional case.
Searching and sorting algorithms for Indic languages are still in development and are not bug-free. Indic support is not yet available in most search solutions (including FOSS solutions like Lucene or Solr) because of the complex word-formation characteristics.
But if I understand correctly, this is *only* possible using Unicode encoding. Right?
Yes. And problems and instability in the Unicode encoding also affect this. Some time back GerardM's post was shared on this list: http://ultimategerardm.blogspot.com/2010/12/malayalam-enigma.html Also read these critical thoughts on Unicode from Indic language communities: http://www.j4v4m4n.in/2009/11/07/unicode-or-malayalam/
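To illustrate the sorting problem mentioned above, here is a minimal sketch of locale-aware collation with the ICU library through the PyICU bindings. The word list is an assumed example; ICU's collation data for Indic locales keeps improving but is not guaranteed to be complete or bug-free.

# Minimal sketch: sorting Malayalam words with ICU collation (PyICU).
import icu

words = ["ഓണം", "അമ്മ", "കാക്ക"]

# Naive sort by raw Unicode code points; for Indic text this can differ
# from dictionary order.
print(sorted(words))

# Sort using the collation rules of the Malayalam locale.
collator = icu.Collator.createInstance(icu.Locale("ml_IN"))
print(sorted(words, key=collator.getSortKey))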
Anivar
Thank you, Anivar.
Best,
Gautam ________ http://social.prathambooks.org/