On 2/22/11, Gautam John gautam@prathambooks.org wrote:
On 22 February 2011 22:29, Santhosh Thottingal santhosh.thottingal@gmail.com wrote:
I think you have some confusion on Unicode and Fonts. Let me try to clarify in simple words.
Yes - I did! And thank you for such a detailed response.
To see if I have understood this - there are three components:
1. Input (Different types of keyboard layouts are used, but they are independent of the method of encoding - correct?)
2. Encoding and storing the input (ASCII is the older method. I have heard of ISCII as well but do not know what that is. Unicode is the standard.)
3. Representing, visually for the human user, what has been input and encoded. (Fonts or typefaces, and these are, to an extent, independent of the encoding method used.)
There are four components:
1. Input methods (the GOI-approved InScript layout, various popular layouts, transliteration keyboards, phonetic keyboards)
2. Encoding (Unicode)
3. Font (OpenType fonts, i.e. fonts supporting Unicode)
4. Rendering engines (these do the shaping of complex glyphs using the OpenType tables in the fonts, e.g. Pango in GNOME, HarfBuzz in KDE, ICU in OpenOffice and Java-based programs, Uniscribe in Windows, etc.)
But I know that many people still use terms like "data in Unicode fonts", "data in xyz font", etc. This usage came into existence because, before Unicode was popular, most Indian publishers used a non-standard way of representing our data: storing English (Latin/ASCII) data and changing the font's 'face' to Indian glyphs. A "fancy dress" hack. The letter "k" will be shown as Hindi "ka" with the help of a font, i.e. the data is still English, but what you "see" is Hindi.
So if I understand correctly, not only is the encoding in ASCII but the representation of that encoding is tied to a particular font (that was used for representation at entry?) and will only be represented properly when using that font? However, what I am trying to understand is whether there is consistency across the ASCII encoding? Will ka in Hindi be encoded in ASCII only one way or is there a linkage, that I do not understand, to the font used to represent it as well?
ASCII is not like Unicode. It only covers the basic Latin alphabet, not any other script. All over India, legacy, non-standard local language "technologies" (ugly hacks) have gained deep roots. Local newspaper websites as well as publishing houses seem to use their own non-standard fonts. This means that documents and web sites get tied to fonts. These fonts may or may not be freely available, and in some extreme cases may no longer be available at all. If you lose the font, you lose the content as well.
Ka in Hindi may be mapped to the position of A in one font and to the position of H in another, at the convenience of the font developer.
The reason I ask is because if ka in Hindi is always encoded the same way irrespective of the font used to represent it, then it should not be hard to build an ASCII to Unicode map of encoding that will only have to be done once for each language? Though something tells me I am way off on this assumption.
It is font dependent. A conversion map has to be prepared for each ASCII font in order to convert data encoded in it to Unicode. Swathanthra Malayalam Computing's Payyans (http://wiki.smc.org.in/Payyans) is a tool developed for converting ASCII to Unicode easily for any Indic language by building a font map for each font needed. This tool helped Malayalam Wiktionary convert many copyright-expired books in non-standard encodings to Unicode.
A popular Firefox extension named Padma uses similar encoding conversion tables to display ASCII news websites in Unicode.
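To make the per-font conversion map idea concrete, here is a minimal sketch in Python. It is not how Payyans or Padma are actually implemented; the font names "FontA" and "FontB" and their mappings are made up for illustration, and real maps are far larger. Real converters may also have to reorder characters, which this sketch ignores.

```python
# Hypothetical font maps (assumption, for illustration only):
# the Latin character stored in the file -> the Unicode character
# that the legacy font displayed for it.
FONT_MAPS = {
    "FontA": {"k": "\u0915", "g": "\u0917"},  # k -> KA, g -> GA
    "FontB": {"A": "\u0915", "k": "\u0917"},  # here KA sits under "A"
}


def legacy_to_unicode(text: str, font_name: str) -> str:
    """Convert legacy font-encoded text to Unicode using that font's map.

    Characters missing from the map are passed through unchanged.
    """
    font_map = FONT_MAPS[font_name]
    return "".join(font_map.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The same stored byte "k" means different letters in different fonts,
    # so the converter must be told which font the data was typed in.
    print(legacy_to_unicode("k", "FontA"))
    print(legacy_to_unicode("k", "FontB"))
```

The point of the sketch is that the font name is a required input: without knowing which font the data was written for, there is no single correct conversion.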
This is true. Fonts exist for all scripts, but the variety and quality of the existing fonts vary. Availability of fonts under FOSS-compatible licenses is also a problem. For a detailed list of Indic fonts with license info, see http://indlinux.org/wiki/index.php/IndicFontsList
Thanks, Santhosh. This is really useful. Also, are these screen or print ready fonts?
Each language community can answer this question best. In Malayalam we have both screen and print fonts, including one ornamental font.
You are correct. I would say "fonts licensed under any FOSS license" instead of "free use/reuse".
Indeed. FOSS license is what I should have said.
In fact, the funds were spent (read: wasted) on the development of proprietary fonts by government agencies like CDAC. Fonts with free(dom) licenses were developed and maintained by FOSS developer communities.
*sigh* In your opinion, would there be any real benefit if they did license the ILDC series under a true FOSS license?
I don't think this will happen. There is a long history of lobbying CDAC for this from 2001 onwards, and nothing happened. CDAC made enough money by selling ASCII fonts (and still does), and they can't even think about giving them away under a FOSS license. And at frequent intervals they consume more government money to make yet another CD to ship with their FOSS project forks (such as Bharateeya OO, IndiFox, etc.) plus these fonts. In the same way, most of the TDIL funding to CDAC for Indic language technology research either produces no output or the output does not get released, even after TDIL's policy decision to release it under a FOSS license.
In Unicode encodings such as UTF-8, each Indic character is stored as multiple bytes (three in UTF-8), while in ASCII each character is a single byte.
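A quick way to check the byte sizes yourself is the small Python sketch below. The helper name byte_lengths is just for this example; the letter used is Devanagari KA (U+0915).

```python
def byte_lengths(ch: str) -> dict:
    """Return how many bytes one character takes in common Unicode encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # "-le" avoids counting a BOM
    }


if __name__ == "__main__":
    print(len("k".encode("ascii")))   # Latin "k" in ASCII: 1 byte
    print(byte_lengths("k"))          # {'utf-8': 1, 'utf-16': 2}
    print(byte_lengths("\u0915"))     # {'utf-8': 3, 'utf-16': 2}
```

Note that plain Latin text stays one byte per character in UTF-8, so the size difference only shows up for the Indic text itself.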
Ah. Okay. I understand now.
This is not comparable, since search is not possible with the ASCII-font way of representing data. Since the data is not in Hindi, but is only "seen" as Hindi, one cannot do a search or any such data processing on that data.
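The search problem can be illustrated with a small Python sketch. The font map here is hypothetical (no real font is implied), and the stored string "kml" stands for text that would display as the Hindi letters ka, ma, la only when the matching legacy font is applied.

```python
# Hypothetical legacy font map (assumption, for illustration only).
FONT_MAP = {"k": "\u0915", "m": "\u092e", "l": "\u0932"}  # KA, MA, LA

# What is actually stored on disk in the legacy approach: Latin letters.
stored_legacy = "kml"

# The same text stored as real Unicode Devanagari.
stored_unicode = "".join(FONT_MAP[ch] for ch in stored_legacy)

# A user types the Hindi letter KA into a search box.
query = "\u0915"

found_in_legacy = query in stored_legacy    # the data contains only Latin letters
found_in_unicode = query in stored_unicode  # the data really contains KA

if __name__ == "__main__":
    print(found_in_legacy)   # False
    print(found_in_unicode)  # True
```

The legacy data can only be searched by someone who knows the font's private mapping and types the Latin stand-in instead, which no general-purpose search tool will do.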
If I understand, it is not possible to search within ASCII encoded text but this can be done in Unicode encoded text?
Searching and sorting algorithms for Indic languages are in development and are not bug free. Indic support is not yet available in most search solutions (including FOSS solutions like Lucene or Solr) because of the complex word formation characteristics. Most existing applications try exact string-matching algorithms on Indic content, yielding only 20% of results. Indic search should use language- and grammar-aware algorithms.
Thank you very much Santhosh - I have learned a lot from this.
Best,
Gautam
Wikimediaindia-l mailing list Wikimediaindia-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikimediaindia-l