[Foundation-l] Frustration with the conversion engines issue

Milos Rancic millosh at gmail.com
Thu Apr 2 16:23:26 UTC 2009


On Thu, Apr 2, 2009 at 3:52 PM, Aryeh Gregor
<Simetrical+wikilist at gmail.com> wrote:
> I suspect this would be feasible to get working to an acceptable
> level, but only with a lot of effort.  Natural languages are really
> messy.  :(

If you treat words as strings, they are really messy, yes. But if you
treat words as words, you'll have a much better chance of making
something useful :)

As I said, the main problem is the intransitivity of conversions
between written language varieties: converting to another variety and
back does not always return the original text. Knowing that, we can
see that we need either as many records in the database as we have
language varieties, or some meta language inside the database.

As the MW engine is already able to "understand" differences at the
word level, we need the following to solve the described case:

If we choose to have two records (and without using a dictionary!),
the algorithm may be as follows:
* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* Output in Latin is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK". So, by default, we'll get in
Cyrillic: "Љуљашка, коњункција и ЏАК." (Note that "нј" switched to "њ"
because the default conversion for "nj" is "њ".) However, the MW
engine may test all changed words and may realize that "конјункција"
is also a correct conversion for "konjunkcija", so it won't change it.
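The steps above can be sketched in Python. This is only an
illustration, not MW code: the names cyr_to_lat, lat_to_cyr and
reconcile are made up here, and the sketch assumes that words can be
paired by position (a real implementation would need a proper diff).

```python
# Serbian Cyrillic -> Latin, lower case only; upper case is derived below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def cyr_to_lat(text):
    """Naive conversion: an upper-case Cyrillic letter gives a capitalized digraph."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)  # punctuation, spaces, digits
        elif ch.isupper():
            out.append(lat[0].upper() + lat[1:])  # "Џ" -> "Dž"
        else:
            out.append(lat)
    return "".join(out)


def lat_to_cyr(text):
    """Default conversion: "lj", "nj", "dž" are always read as one letter."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair.lower() in LAT_TO_CYR:
            cyr = LAT_TO_CYR[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        ch = text[i]
        cyr = LAT_TO_CYR.get(ch.lower())
        if cyr is None:
            out.append(ch)
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


def reconcile(old_cyr, new_lat):
    """Build the new Cyrillic record from the old one and the edited Latin text.

    A stored Cyrillic word is kept whenever it already converts to the
    edited Latin word; otherwise the default conversion is used.
    """
    old_words = old_cyr.split()
    result = []
    for idx, lat_word in enumerate(new_lat.split()):
        old_word = old_words[idx] if idx < len(old_words) else None
        if old_word is not None and cyr_to_lat(old_word) == lat_word:
            result.append(old_word)  # e.g. "конјункција" survives
        else:
            result.append(lat_to_cyr(lat_word))
    return " ".join(result)


old_cyr = "Љуљашка, конјункција и ЏАК."
new_lat = "Ljuljaška, konjunkcija i DŽAK."
print(lat_to_cyr(new_lat))          # default: Љуљашка, коњункција и ЏАК.
print(reconcile(old_cyr, new_lat))  # checked: Љуљашка, конјункција и ЏАК.
```

The check costs one extra conversion per word, and it needs no
dictionary, because the stored Cyrillic record itself is the evidence
that "нј" was intended.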

If we use just one record (which may be a more reasonable option), we
may use just the Cyrillic or just the Latin variant; or, if we want to
be "fair", we may use random Unicode characters from the Private Use
Areas :) It may look like:

* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {Dž}AK."
* But, Latin wiki code is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK". Then the MW engine compares the
changed word with the stored one, finds that "DŽAK" is the same word
as "ЏАК", and so treats the change as a conversion fix.
* Cyrillic meta markup is: "Љуљашка, конјункција и {Џ=DŽ}АК."
* Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {DŽ}AK."
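A rough Python sketch of the Latin side of this meta markup follows.
The function names are invented for the illustration, braces stand in
for whatever marker characters would really be used, and the Cyrillic
"{Џ=DŽ}" form is not covered.

```python
import re

# Serbian Cyrillic -> Latin, lower case only; upper case is derived below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def cyr_to_lat_meta(text):
    """Convert Cyrillic to Latin meta markup, bracing every digraph."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)
            continue
        if ch.isupper():
            lat = lat[0].upper() + lat[1:]
        out.append("{" + lat + "}" if len(lat) == 2 else lat)
    return "".join(out)


def meta_to_lat(meta):
    """Latin wiki code is the meta markup without the braces."""
    return meta.replace("{", "").replace("}", "")


def meta_to_cyr(meta):
    """Braced digraphs become one letter; everything else goes letter by letter."""
    out = []
    for piece in re.split(r"(\{[^}]*\})", meta):
        if piece.startswith("{"):
            digraph = piece[1:-1]
            cyr = LAT_TO_CYR.get(digraph.lower(), piece)
            out.append(cyr.upper() if digraph[0].isupper() else cyr)
        else:
            for ch in piece:
                cyr = LAT_TO_CYR.get(ch.lower(), ch)
                out.append(cyr.upper() if ch.isupper() else cyr)
    return "".join(out)


def apply_conversion_fix(meta_word, new_lat_word):
    """Return updated meta markup if the edit only changes letter case, else None."""
    old_lat_word = meta_to_lat(meta_word)
    if (len(old_lat_word) != len(new_lat_word)
            or old_lat_word.lower() != new_lat_word.lower()):
        return None  # a real edit, not a conversion fix
    letters = iter(new_lat_word)
    # Keep the braces where they are and take the letters from the edit.
    return "".join(ch if ch in "{}" else next(letters) for ch in meta_word)


meta = cyr_to_lat_meta("Љуљашка, конјункција и ЏАК.")
print(meta)               # {Lj}u{lj}aška, konjunkcija i {Dž}AK.
print(meta_to_lat(meta))  # Ljuljaška, konjunkcija i DžAK.
fixed = apply_conversion_fix("{Dž}AK.", "DŽAK.")
print(fixed)              # {DŽ}AK.
print(meta_to_cyr("{Lj}u{lj}aška, konjunkcija i " + fixed))
```

Because the unbraced "nj" in "konjunkcija" is converted letter by
letter, it comes back as "нј", so the single record is enough to
restore the original Cyrillic text.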

So, there are two options for changing the MW code:
* To have as many tables as there are varieties. This method consumes
more space, but the CPU won't need to work a lot.
* To have one table with meta markup. This method consumes less space,
but more CPU.

Either way, people should declare in which variety they are writing
(in their options if they are logged in, or in the edit form if they
are anonymous).

In both cases we need changes inside of the Edit.php file. In the
second case we don't need to change the DB structure.


