[Foundation-l] Frustration with the conversion engines issue

Thu Apr 2 09:38:34 UTC 2009

On Thu, Apr 2, 2009 at 9:49 AM, Ray Saintonge <saintonge at telus.net> wrote:
> Aryeh Gregor wrote:
>> On Wed, Apr 1, 2009 at 11:32 AM, Ziko van Dijk <zvandijk at googlemail.com> wrote:
>>
>>> I am sceptical about automatic conversion. As you said, it is mainly a
>>> solution for reading, but not for writing, because the source text is in one
>>> specific spelling or character system.
>>>
>> Why couldn't that be converted on the fly as well?  Choose one variant
>> as the canonical one, and store only that in the database.  Anyone
>> wanting to use other formats would have the text in the edit box
>> automatically converted to their preferred variant on the fly, and
>> converted back when they saved.
>
> When you declare one version canonical the risk is that you will have
> supporters of the losing version(s) becoming irrationally angry.

Not just that... It is computationally unsustainable.

Even in the simplest cases, such as Serbian script conversion,
conversion is not transitive (although the intransitivity is small and
an approximation works well enough).

So, one of the simplest cases involves the following:
* It is usually thought that the Serbian Cyrillic alphabet carries more
information than Serbian Latin. In Cyrillic, the sound "dzh" is written
with the letter "џ", while in Latin it is written as the digraph "dž".
However, there are cases where the combination "d+zh" is regular; it
is then "дж" in Cyrillic, while in Latin it is written the same way as
the sound "dzh": "dž". So, if you keep the text in Cyrillic as the
canonical version, you'll be able to regenerate the Latin (but not vice
versa).
* However, because of those digraphs, Latin distinguishes a capital
letter from all-capitals (heading) letters. If you convert the Cyrillic
capital letter "Џ" into Latin, you'll put "Dž" as its counterpart.
However, if it is part of an all-capitals heading, let's say "ЏАК",
you'll get "DžAK", while the correct form is "DŽAK".
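
To make the two points above concrete, here is a minimal sketch of a
naive letter-by-letter converter. The mapping table and the function
name are made up for illustration and cover only the letters used in
the examples; this is not how any real conversion engine is written.

```python
# Hypothetical, deliberately incomplete Cyrillic -> Latin table.
CYR_TO_LAT = {
    "џ": "dž", "Џ": "Dž",
    "д": "d", "Д": "D",
    "ж": "ž", "Ж": "Ž",
    "а": "a", "А": "A",
    "к": "k", "К": "K",
}


def naive_cyr_to_lat(text):
    """Convert letter by letter, without looking at the context."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


print(naive_cyr_to_lat("Џак"))  # Džak
print(naive_cyr_to_lat("ЏАК"))  # DžAK, while DŽAK would be correct

# "џ" and "дж" end up as the same Latin string, so the Latin text
# alone cannot tell which Cyrillic spelling to regenerate.
print(naive_cyr_to_lat("џ") == naive_cyr_to_lat("дж"))  # True
```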

Of course, it is possible to solve this by testing whether the
surrounding letters are capital or not (and it is not a big deal in
Serbian anyway). However, this is a very simple case for conversion
rules. Usually, it is much cheaper to do the conversion at the time of
adding/changing text and to keep both versions in the database,
because there are two different sets of conversion rules. The other
option is to keep one meta text in the database, with internal
markup. So, the previous example might look like "{Latin:
{DŽ}AK}".
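
The test on surrounding letters mentioned above could look roughly
like this. It is only a sketch under assumptions: the table is a
hypothetical three-letter subset, and the rule "a neighbouring capital
means all-capitals text" is my simplification, not a complete rule
set.

```python
# Hypothetical lower-case-only table; case is handled in the function.
CYR_TO_LAT = {"џ": "dž", "а": "a", "к": "k"}


def cyr_to_lat(text):
    """Convert to Latin, choosing "Dž" or "DŽ" from the neighbours."""
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # unknown character, pass it through
            continue
        if ch.isupper():
            prev_up = i > 0 and text[i - 1].isupper()
            next_up = i + 1 < len(text) and text[i + 1].isupper()
            if len(lat) > 1 and not (prev_up or next_up):
                # Capital digraph in ordinary text: only first part up.
                lat = lat[0].upper() + lat[1:]
            else:
                # Single letter, or digraph inside all-capitals text.
                lat = lat.upper()
        out.append(lat)
    return "".join(out)


print(cyr_to_lat("Џак"))  # Džak
print(cyr_to_lat("ЏАК"))  # DŽAK
```

Even this small fix needs context that a plain letter table does not
have, which is the cost being discussed.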

And, of course, if there are more than two script/orthography versions
(Kurdish is an example), it would be necessary to write conversion
rules for all combinations. Of course, a lot of generalizations are
possible, but it isn't possible to generalize all of the rules.
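
As a rough illustration of how the number of rule sets grows, assuming
each direction needs its own rules (as in the Serbian case), the
directed pairs can simply be enumerated. The variant names are
placeholders, not real orthographies.

```python
from itertools import permutations


def conversion_pairs(variants):
    """Return every ordered (source, target) pair of distinct variants."""
    return list(permutations(variants, 2))


pairs = conversion_pairs(["variant_a", "variant_b", "variant_c"])
print(len(pairs))  # 6, i.e. n * (n - 1) rule sets for n = 3
```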