[Foundation-l] Frustration with the conversion engines issue
Milos Rancic
millosh at gmail.com
Wed Apr 1 05:55:45 UTC 2009
For a couple of years I have been talking to different people inside
WMF about the need to solve the conversion engine issue systematically.
However, the responses I get are either incomprehension (in the better
cases) or silence.
== Why do we need conversion engines? ==
Unlike, for example, French, English, German and Russian, there are
languages which have more than trivial internal differences. These
differences may be:
* slightly different orthographies, so that a person who knows one
orthography is not able to write in the other;
* slightly different language varieties (or "dialects"), so that a
person who knows one variety is not able to write in the other;
* different scripts, so that a person who knows one script doesn't
know the other [well];
* some combination of the previous possibilities.
The options which we have are:
* Not to care about differences. The best-known situation is that of
the English language projects, which allow writing in both major
varieties. However, the difference between "kilometer" and "kilometre"
is small and it is part of the common knowledge of every educated
English speaker. The other situations known to me are the Persian
language projects (Farsi and Dari are allowed) and the Serbian language
projects (Ekavian and Iyekavian are allowed).
The problem with this approach is that at least one group, usually the
bigger one, doesn't know how to write in the other variety. Speakers of
Farsi don't know how to write Dari, just as speakers of Ekavian don't
know how to write Iyekavian. There are significant problems in
maintaining and expanding articles written in the variety of the
minority group: even with a lot of goodwill, a speaker of the majority
group has to ask a speaker of the minority group to check the
consistency of an article, *if* there are active speakers of the
minority group at the project.
* To make different projects. This is the case with the Belarusian
projects. (Parts of the Belarusian diaspora don't want to write in the
"communist" orthography, while the educational system (including the
educational system for the Belarusian minority in Poland) uses that
orthography.)
I see that as the worst possible solution: instead of having one
project for one language system, there are two projects, which means
that the effort needed to make a good source of knowledge is doubled.
* To use a conversion engine. A few conversion engines have been
implemented: Chinese, Serbian and Kazakh (I think that this is the full
list, but I am not sure). This is the best possible solution *if* it
works.
The smallest issue is in the Serbian case. All literate people in
Serbia know how to write in both scripts, Cyrillic and Latin. The
choice of script is a matter of preference and only rarely a matter of
functional style (usually, materials for children are written in
Cyrillic, while emails are usually written in Latin; formal acts have
to be written in Cyrillic).
Chinese is a little bit more complex because there is a large number
of characters. However, AFAIK, the Simplified and Traditional scripts
share a number of characters, and some of the others may be guessed
from context. But, again, the current implementation can handle only
cases which fulfill the following two conditions: (1) the conversion is
more or less straightforward (more or less one character for one
character) and (2) speakers are able to read and write (at least
partially) the other script.
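To make the "more or less one character for one character" condition
concrete, here is a minimal sketch of such a conversion for the
Serbian Cyrillic-to-Latin direction. It is only an illustration, not
the code of the existing MediaWiki engine; the names CYR_TO_LAT,
build_table and to_latin are made up for this example. It also
assumes mixed-case input: a word written in all capitals would get
"Lj" rather than "LJ".

```python
# Serbian Cyrillic -> Latin, lowercase letters only.
# Three letters (љ, њ, џ) map to Latin digraphs.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def build_table(mapping):
    """Build a str.translate table covering lower- and uppercase letters."""
    table = {}
    for cyr, lat in mapping.items():
        table[ord(cyr)] = lat
        # Uppercase Cyrillic letter -> capitalized Latin form (Љ -> Lj).
        table[ord(cyr.upper())] = lat.capitalize()
    return table


TABLE = build_table(CYR_TO_LAT)


def to_latin(text):
    """Convert Cyrillic letters; every other character passes through."""
    return text.translate(TABLE)


print(to_latin("Београд"))
print(to_latin("Љубљана"))
```

The opposite direction is already less trivial, because a Latin "nj"
is not always the single letter "њ" (for example in "injekcija"), so
even the simplest case needs some exception handling.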
== Problems with the current conversion engine ==
* The current conversion engine is able to convert the text only for
reading. When you switch to edit mode, you are able to see the text
only in one script (the one in which the article is written). This is
not a problem in the Serbian case and it is a small-scale problem in
the Chinese case.
However, this would be a significant problem in cases like
Azerbaijani: an Azerbaijani from Azerbaijan doesn't know the
Perso-Arabic script, while only educated Azerbaijanis from Iran know
the Latin script, and not very well (note that literacy in Iran is
~80%, which is quite low by Western standards; it means that one
person in five doesn't know how to read and write). To get a feeling
for this, make a simple one-to-one conversion engine from the Latin to
the Arabic script for English and try to read the converted text. If
you don't want to bother yourself with right-to-left text, try with
Devanagari.
* The current conversion engine converts *everything* into the output
script. This means that text with mixed scripts will be converted into
one script. This is useful in the Chinese case because contributors may
write text in either script, while readers are able to read it in one
of them. This is a redundant (and sometimes irritating) feature in the
Serbian case because no one writes Serbian texts by mixing Cyrillic and
Latin (except, of course, for scientific purposes).
But it makes the engine useless in the cases where only orthographies
or language varieties need to be converted. For example, if Dari has a
word whose form is X and whose meaning is A (written in Farsi as Y),
and Farsi has a word whose form is also X but whose meaning is B
(written in Dari as Z), the only option which the conversion engine
gives is escape syntax like -{ Dari: X; Farsi: Y }-. Imagine now what
the wiki code would look like if, for example, the genitive case in
Dari were written like the accusative case in Farsi: all syntactic
objects would have to be escaped, which means that almost every
sentence would contain an escape from the regular rules.
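To show what the escape syntax asks of both the engine and the editor,
here is a rough sketch of how such markup could be resolved. It follows
the simplified form used in this message, -{ Dari: X; Farsi: Y }-, and
is not the parser of the existing engine; the names ESCAPE,
parse_escape and render, as well as the convert callback, are
assumptions made only for this example.

```python
import re

# Matches one escape block such as "-{ Dari: X; Farsi: Y }-".
ESCAPE = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def parse_escape(body):
    """Split 'Dari: X; Farsi: Y' into {'Dari': 'X', 'Farsi': 'Y'}."""
    forms = {}
    for part in body.split(";"):
        if ":" in part:
            variety, form = part.split(":", 1)
            forms[variety.strip()] = form.strip()
    return forms


def render(wikitext, variety, convert):
    """Apply `convert` to everything outside escape blocks.

    Inside a block, the form listed for `variety` is used verbatim.
    """
    out = []
    pos = 0
    for match in ESCAPE.finditer(wikitext):
        out.append(convert(wikitext[pos:match.start()]))
        forms = parse_escape(match.group(1))
        # Fall back to the raw block body if the variety is not listed.
        out.append(forms.get(variety, match.group(1).strip()))
        pos = match.end()
    out.append(convert(wikitext[pos:]))
    return "".join(out)


# str.upper stands in for a real rule-based converter.
print(render("a -{ Dari: X; Farsi: Y }- b", "Farsi", str.upper))
```

Every word that the general rules get wrong has to be wrapped like
this by hand, which is why the approach stops being practical once the
exceptions appear in almost every sentence.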
== What do we need? ==
Actually, we don't need a lot to solve this problem. I have the
solution for the most important part of the problem, the linguistic
one. Even if I don't have enough time to deal with all cases, I am
able to find students or professors of linguistics who are willing to
work on those issues for free (they would have scientific papers after
the work is done). We need "just" a PHP programmer who is willing to
work on this problem. And for a couple of years I haven't found any
(even though I know a lot of PHP programmers).
P.S. I am writing this because I've got an email asking for help in
solving an orthography problem. The only option which I am able to
offer them is to make a Python script which would make four articles
from one at their project.