Re: [Wikimediaindia-l] (OT) On the importance of Unicode

23 Feb 2011

      On 2/22/11, Gautam John gautam@prathambooks.org wrote:
...
On 22 February 2011 22:29, Santhosh Thottingal
santhosh.thottingal@gmail.com wrote:
...
I think you have some confusion on Unicode and Fonts. Let me try to
clarify in simple words.
Yes - I did! And thank you for such a detailed response.
To see if I have understood this - there are three components:

1. Input (Different types of keyboard layouts are used but are
independent of the method of encoding - correct?)
2. Encoding and storing the input (ASCII is the older method - have
heard of ISCII as well but do not know what that is - but Unicode is the
standard.)
3. Representing, visually for the human user, what has been input
and encoded. (Fonts or typefaces, and these are, to an extent,
independent of the encoding method used.)
There are four components:
1. Input methods (the GOI-approved InScript layout, various popular
layouts, transliteration keyboards, phonetic keyboards)
2. Encoding (Unicode)
3. Fonts (OpenType fonts, i.e. fonts supporting Unicode)
4. Rendering engines (these do the shaping of complex glyphs using
the OpenType tables in the fonts, e.g. Pango in GNOME, HarfBuzz in
KDE, ICU in OpenOffice and Java-based programs, Uniscribe in Windows,
etc.)
...
...
But I know that many people still use terms like "data in Unicode
fonts", "data in xyz font", etc. This usage came into existence
because, before Unicode was popular, most Indian publishers
used a non-standard way of representing our data: storing English (or
Latin/ASCII) data and changing the font's 'face' to Indian glyphs, a
"fancy dress" hack. The letter "k" will be shown as Hindi "ka" with the
help of a font, i.e. the data is still English, but what you "see" is
Hindi.
So if I understand correctly, not only is the encoding in ASCII but
the representation of that encoding is tied to a particular font (that
was used for representation at entry?) and will only be represented
properly when using that font? However, what I am trying to understand
is whether there is consistency across the ASCII encoding? Will ka in
Hindi be encoded in ASCII only one way or is there a linkage, that I
do not understand, to the font used to represent it as well?
ASCII is not like Unicode. It covers only Latin letters, not any other
script. All over India, legacy, non-standard local language
"technologies" (ugly hacks) have gained deep roots. Local newspaper
websites as well as publishing houses seem to use their own
non-standard fonts. This means that documents and web sites get tied
to fonts. These fonts may or may not be freely available, and in some
extreme cases, may be no longer available at all. If you lose the
font, you lose the content as well.
Ka in Hindi may be mapped to the position of A in one font and to the
position of H in another, at the convenience of the font
developer.
...
The reason I ask is because if ka in Hindi is always encoded the same
way irrespective of the font used to represent it, then it should not
be hard to build an ASCII to Unicode map of encoding that will only
have to be done once for each language? Though something tells me I am
way off on this assumption.
It is font dependent. A conversion map has to be prepared for
each ASCII font to convert data encoded in it to Unicode.
Swathanthra Malayalam Computing's Payyans
(http://wiki.smc.org.in/Payyans ) is a tool developed for converting
ASCII to Unicode easily for any Indic language by building a font map
for each font needed. This tool helped Malayalam Wiktionary to
convert many copyright-expired books in non-standard encodings to
Unicode.
A popular Firefox extension named Padma uses similar encoding conversion
tables to display ASCII news websites in Unicode.
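The per-font conversion described above can be sketched in a few lines of Python. This is only an illustration: the font names and the key positions in FONT_MAPS are invented, and they describe neither a real legacy font nor the actual map format used by Payyans or Padma. Real converters typically also have to handle multi-character sequences and reordering, which this sketch ignores.

```python
# Hypothetical per-font maps: the Latin positions below are made up for
# illustration and do not correspond to any real legacy font.
FONT_MAPS = {
    "FontA": {"k": "\u0915", "K": "\u0916", "g": "\u0917"},  # ka, kha, ga
    "FontB": {"H": "\u0915", "k": "\u0917"},                 # ka, ga
}


def legacy_to_unicode(text, font_name):
    """Convert font-encoded Latin text to Unicode using that font's map.

    Characters that the map does not cover are passed through unchanged.
    """
    font_map = FONT_MAPS[font_name]
    return "".join(font_map.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The same stored letter "k" means different things under different fonts,
    # which is why one map per font is needed.
    print(legacy_to_unicode("k", "FontA"))
    print(legacy_to_unicode("k", "FontB"))
```

The same stored byte converts to a different Unicode character depending on which font the text was typed for, so the converter must know the font before it can do anything.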
...
...
This is true. Fonts exist for all scripts, but the variety and
quality of the existing fonts vary. Availability of fonts licensed
under a FOSS-compatible license is also a problem. For a detailed list of
Indic fonts with license info, see
http://indlinux.org/wiki/index.php/IndicFontsList
Thanks, Santhosh. This is really useful. Also, are these screen or
print ready fonts?
Each language community can answer this question best. In Malayalam
we have both screen and print fonts, including one ornamental font.
...
...
You are correct.  I would say "fonts licensed under any FOSS license"
instead of "free use/reuse".
Indeed. FOSS license is what I should have said.
...
In fact, the funds were spent (read: wasted) on the development of
proprietary fonts by government agencies like CDAC. Fonts with
free(dom) licenses were developed and maintained by FOSS developer
communities.
*sigh* In your opinion, would there be any real benefit if they did
license the ILDC series under a true FOSS license?
I don't think this will happen. There is a long history of lobbying
CDAC for this, from 2001 onwards, and nothing has happened. CDAC made
(and still makes) enough money by selling ASCII fonts, and they can't
even think about giving them away under a FOSS license. And at frequent
intervals they eat more government money for making yet another CD to
ship these fonts with their FOSS project forks (such as Bharateeya OO,
IndiFox, etc.). In the same way, most of the TDIL funding to CDAC for
Indic language technology research either produces no output at all or
the output does not get released, even after TDIL's policy decision to
release it under a FOSS license.
...
...
Each Unicode character can take multiple bytes when stored (the exact
count depends on the encoding form, such as UTF-8), while in ASCII each
character is a single byte.
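A quick way to see the byte counts is to encode a few characters in Python. The helper name utf8_length is just for this sketch. UTF-8 is assumed here as the storage encoding; other Unicode encoding forms give different counts.

```python
def utf8_length(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "Latin k": "k",
    "Devanagari ka": "\u0915",
    "Malayalam ka": "\u0D15",
}

if __name__ == "__main__":
    for name, ch in samples.items():
        # An ASCII letter stays one byte in UTF-8; Indic letters take more.
        print(name, utf8_length(ch))
```

Plain ASCII letters remain one byte each in UTF-8, so existing English text does not grow; only characters outside the ASCII range need the extra bytes.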
Ah. Okay. I understand now.
...
This is not comparable, since search is not possible with the ASCII-font
way of representing data. Since the data is not in Hindi, but we just
"see" it as Hindi, one cannot do a search or any such data processing on
that data.
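The point can be made concrete with a small Python sketch. The legacy string below assumes a hypothetical font that draws the Latin letters k, m and l as the Hindi letters ka, ma and la; that mapping is invented for illustration.

```python
# What a legacy file actually stores: Latin letters. A hypothetical font
# (assumed for this sketch) would draw them as the Hindi word "kamal".
stored_legacy = "kml"

# The same word stored as real Unicode Devanagari: ka, ma, la.
stored_unicode = "\u0915\u092E\u0932"

# A user searching for the Hindi letter "ka" types the Unicode character.
query = "\u0915"


def contains(haystack, needle):
    """Plain substring search, as a simple text search would do."""
    return needle in haystack


if __name__ == "__main__":
    print(contains(stored_legacy, query))   # the legacy data has no Hindi in it
    print(contains(stored_unicode, query))  # the Unicode data does
```

The search fails on the legacy data because the file contains only Latin letters; the Hindi exists only on screen, while the font is applied.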
If I understand, it is not possible to search within ASCII encoded
text but this can be done in Unicode encoded text?
Searching and sorting algorithms for Indic languages are still in
development and are not bug free. Indic support is not yet available
in most search solutions (including FOSS solutions like Lucene
or Solr) because of the complex word formation characteristics. Most
existing applications try exact string-matching algorithms on
Indic content, yielding only 20% of the results. Indic search
should use language- and grammar-aware algorithms.
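A minimal Python sketch of why exact matching misses results: an inflected Hindi word does not equal its base form, so an exact word match finds nothing. The prefix match shown next to it is only a crude stand-in, not a grammar-aware algorithm, and it would also over-match unrelated words that share the prefix. Both function names are made up for this sketch.

```python
def exact_word_match(query, text):
    """True only if the query equals one of the whitespace-separated words."""
    return query in text.split()


def prefix_word_match(query, text):
    """Crude stand-in for stemming: true if any word starts with the query."""
    return any(word.startswith(query) for word in text.split())


# "gharon mein" (in the houses): the first word is an inflected form of "ghar".
text = "\u0918\u0930\u094B\u0902 \u092E\u0947\u0902"
# "ghar" (house), the base form a user would likely search for.
query = "\u0918\u0930"

if __name__ == "__main__":
    print(exact_word_match(query, text))
    print(prefix_word_match(query, text))
```

A real Indic search would need proper language-specific analysis in place of the prefix test; the sketch only shows where exact matching breaks down.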
...
Thank you very much Santhosh - I have learned a lot from this.
Best,
Gautam

Wikimediaindia-l mailing list
Wikimediaindia-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l
-- 
"[It is not] possible to distinguish between 'numerical' and
'nonnumerical' algorithms, as if numbers were somehow different from
other kinds of precise information." - Donald Knuth
