Re: [Wikimediaindia-l] (OT) On the importance of Unicode

23 Feb 2011

On 2/22/11, Gautam John <gautam(a)prathambooks.org> wrote:
...
  On 22 February 2011 22:29, Santhosh Thottingal
 <santhosh.thottingal(a)gmail.com> wrote:

  I think you have some confusion on Unicode and Fonts. Let me try to
  clarify in simple words.
 Yes - I did! And thank you for such a detailed response.

 To see if I have understood this - there are three components:

 1. Input (Different types of keyboard layouts are used but are
 independent of the method of encoding - correct?)
 2. Encoding and storing the input (ASCII is the older method - have
 heard of ISCII as well but do not know what that is - but Unicode is
 the standard.)
 3. Representing, visually for the human user, what has been inputted
 and encoded. (Font or type faces and these are, to an extent,
 independent of the encoding method used.)
There are four components:

1. Input methods (the GOI-approved InScript layout, various popular
layouts, transliteration keyboards, phonetic keyboards)
2. Encoding (Unicode)
3. Fonts (OpenType fonts, i.e. fonts supporting Unicode)
4. Rendering engines (these shape complex glyphs using the OpenType
font tables in the font, e.g. Pango in GNOME, HarfBuzz in KDE, ICU in
OpenOffice and Java-based programs, Uniscribe in Windows, etc.)
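
To make the split between encoding (2) and rendering (4) concrete,
here is a minimal Python sketch using only the standard library. The
Devanagari conjunct "ksha" is my own choice of example, not something
from the thread. The text stores three code points; turning them into
one shaped glyph is left to the font and the rendering engine.

```python
import unicodedata

# The conjunct "ksha" is stored as three code points: KA + VIRAMA + SSA.
# How it is drawn (usually as a single ligature) is decided later by the
# font and the rendering engine, not by the encoding.
ksha = "\u0915\u094d\u0937"


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(ksha):
    print(code_point, name)
```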

...
  But I know that many people still use the terms "data in unicode
  fonts", "data in xyz font" etc. This usage came into existence just
  because, before Unicode was popular, most of the Indian publishers
  used a non-standard way of representing our data: using English (or
  Latin, ASCII) data and changing the font's 'face' to Indian glyphs,
  a "fancy dress" hack. The letter "k" will be shown as Hindi "ka" with
  the help of a font, i.e. the data is still English, but what you
  "see" is Hindi.
 So if I understand correctly, not only is the encoding in ASCII but
 the representation of that encoding is tied to a particular font (that
 was used for representation at entry?) and will only be represented
 properly when using that font? However, what I am trying to understand
 is whether there is consistency across the ASCII encoding? Will ka in
 Hindi be encoded in ASCII only one way or is there a linkage, that I
 do not understand, to the font used to represent it as well? 
ASCII is not like Unicode. It covers only Latin characters, not any
other script. All over India, legacy, non-standard local language
"technologies" (ugly hacks) have gained deep roots. Local newspaper
websites as well as publishing houses seem to use their own
non-standard fonts. This means that documents and web sites get tied
to fonts. These fonts may or may not be freely available and, in some
extreme cases, may no longer be available at all. If you lose the
font, you lose the content as well.

Ka in Hindi may be mapped to the position of A in one font and to the
position of H in another, at the convenience of the font developer.
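
A small Python sketch of why the stored data is ambiguous without the
font. FONT_X and FONT_Y are hypothetical fonts whose assignments I made
up for illustration; they do not describe any real legacy font.

```python
KA = "\u0915"   # DEVANAGARI LETTER KA
KHA = "\u0916"  # DEVANAGARI LETTER KHA

# HYPOTHETICAL legacy fonts: each one draws Hindi glyphs in Latin slots,
# but the two font developers picked different slots.
FONT_X = {"A": KA, "H": KHA}
FONT_Y = {"H": KA, "A": KHA}


def what_reader_sees(stored, font_map):
    """Map each stored ASCII character to the glyph the given font draws."""
    return "".join(font_map.get(ch, ch) for ch in stored)


stored = "A"  # the data on disk is just the Latin letter A
print(what_reader_sees(stored, FONT_X))  # drawn as KA
print(what_reader_sees(stored, FONT_Y))  # drawn as KHA
```

The same stored byte reads as two different letters, so the content
cannot be recovered without knowing which font it was typed in.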

...

 The reason I ask is because if ka in Hindi is always encoded the same
 way irrespective of the font used to represent it, then it should not
 be hard to build an ASCII to Unicode map of encoding that will only
 have to be done once for each language? Though something tells me I am
 way off on this assumption. 
It is font dependent. A conversion map has to be prepared for each
ASCII font in order to convert data encoded in it to Unicode.
Swathanthra Malayalam Computing's Payyans
(http://wiki.smc.org.in/Payyans) is a tool developed for converting
ASCII to Unicode easily for any Indic language by building a font map
for each font needed. This tool helped Malayalam Wiktionary convert
many copyright-expired books in non-standard encodings to Unicode.

A popular Firefox extension named Padma uses similar encoding
conversion tables to display ASCII news websites in Unicode.
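
For illustration, a per-font conversion table could be applied roughly
as in the Python sketch below. This is not how Payyans or Padma are
implemented (I have not reproduced their code or map format), and
EXAMPLE_FONT_MAP is a hypothetical map for a made-up font. The only
technique shown is longest-match table lookup.

```python
def build_converter(font_map):
    """Return a function that converts font-encoded text to Unicode.

    Longer keys are tried first so that a multi-character sequence is
    not split into its single-character parts.
    """
    keys = sorted(font_map, key=len, reverse=True)

    def convert(text):
        out = []
        i = 0
        while i < len(text):
            for key in keys:
                if text.startswith(key, i):
                    out.append(font_map[key])
                    i += len(key)
                    break
            else:
                # No mapping: keep the character (spaces, digits, etc.).
                out.append(text[i])
                i += 1
        return "".join(out)

    return convert


# HYPOTHETICAL map for one made-up legacy font.
EXAMPLE_FONT_MAP = {
    "k": "\u0915",               # KA
    "m": "\u092e",               # MA
    "l": "\u0932",               # LA
    "kS": "\u0915\u094d\u0937",  # KA + VIRAMA + SSA (conjunct "ksha")
}

convert = build_converter(EXAMPLE_FONT_MAP)
print(convert("kml"))  # KA MA LA, i.e. the Hindi word "kamal"
```

A real converter also has to deal with things this sketch ignores, such
as vowel signs that legacy fonts store in visual rather than logical
order, and it needs one such map per font.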

...

  This is true. Fonts exist for all scripts, but the variety and
  quality of the existing fonts vary. Availability of fonts licensed
  under a FOSS-compatible license is also a problem. For a detailed
  list of Indic fonts with license info, see
  http://indlinux.org/wiki/index.php/IndicFontsList
 Thanks, Santosh. This is really useful. Also, are these screen or
 print ready fonts?
Each language community can best answer this question. In Malayalam
we have both screen and print fonts, including one ornamental font.

...
  You are correct. I would say "fonts licensed under any FOSS license"
  instead of "free use/reuse".
 Indeed. FOSS license is what I should have said.

  In fact, the funds were spent (read: wasted) on the development of
  proprietary fonts by government agencies like CDAC. Fonts with
  free(dom) licenses were developed and maintained by FOSS developer
  communities.
 *sigh* In your opinion, would there be any real benefit if they did
 license the ILDC series under a true FOSS license?
I don't think this will happen. There is a long history of lobbying
CDAC for this from 2001 onwards, and nothing has happened. CDAC made
(and still makes) enough money by selling ASCII fonts, and they can't
even think about giving them away under a FOSS license. And every so
often they eat more government money for making yet another CD to
ship with their FOSS project forks (such as Bharateeya OO, IndiFox,
etc.) plus these fonts. In the same way, most of the TDIL funding to
CDAC for Indic language technology research either produces no output
at all or the output does not get released, even after TDIL's policy
decision to release it under a FOSS license.

...
  Each Unicode character is a multi-byte character, while in ASCII it
  is a single byte.
 Ah. Okay. I understand now.
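
A quick way to check the byte counts in Python. The exact number of
bytes depends on which Unicode encoding form is used (UTF-8, UTF-16,
UTF-32); the choice of "k" and Devanagari KA as sample characters is
mine.

```python
def byte_lengths(ch):
    """Return the encoded size of one character in common Unicode encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(byte_lengths("k"))       # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
print(byte_lengths("\u0915"))  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}

# ASCII itself has no slot for KA, so it cannot be stored as ASCII at all.
try:
    "\u0915".encode("ascii")
except UnicodeEncodeError:
    print("KA cannot be encoded as ASCII")
```

So in UTF-8 a Latin letter still takes one byte, while a Devanagari
letter takes three.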

  This is not comparable, since search is not possible in the ASCII
  font way of representing data. Since the data is not in Hindi, but
  we just "see" it as Hindi, one cannot do a search or any such data
  processing on that data.
 If I understand, it is not possible to search within ASCII encoded
 text but this can be done in Unicode encoded text? 
Searching and sorting algorithms for Indic languages are in
development and are not bug free. Indic support is not yet available
in most search solutions (including FOSS solutions like Lucene or
Solr) because of the complex word formation characteristics. Most
existing applications apply exact string-matching algorithms to Indic
content, yielding only 20% of the results. Indic search should use
language- and grammar-aware algorithms.
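
A toy Python sketch of why exact matching misses results. The Hindi
word forms are my own example, and SUFFIXES is a hypothetical,
far-from-complete list; a real solution would need proper
morphological analysis per language, and this does not reproduce the
20% figure.

```python
# Three forms of the Hindi word for "book", written as code points.
KITAB = "\u0915\u093f\u0924\u093e\u092c"  # kitab (singular)
KITABEN = KITAB + "\u0947\u0902"          # kitaben (plural)
KITABON = KITAB + "\u094b\u0902"          # kitabon (oblique plural)

# HYPOTHETICAL suffix list, for illustration only.
SUFFIXES = ("\u0947\u0902", "\u094b\u0902")


def toy_stem(word):
    """Strip one known suffix, if present; otherwise return the word as is."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def exact_search(query, words):
    """Return the words that are identical to the query."""
    return [w for w in words if w == query]


def stemmed_search(query, words):
    """Return the words whose toy stem equals the toy stem of the query."""
    stem = toy_stem(query)
    return [w for w in words if toy_stem(w) == stem]


docs = [KITAB, KITABEN, KITABON]
print(len(exact_search(KITAB, docs)))    # 1
print(len(stemmed_search(KITAB, docs)))  # 3
```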

...

 Thank you very much Santosh - I have learned a lot from this.

 Best,

 Gautam

 _______________________________________________
 Wikimediaindia-l mailing list
 Wikimediaindia-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l

-- 
"[It is not] possible to distinguish between 'numerical' and
'nonnumerical' algorithms, as if numbers were somehow different from
other kinds of precise information." - Donald Knuth
