For a couple of years I have been talking to various people inside the WMF about the need to solve the conversion engine issue systematically. However, all of the responses I have been getting show a lack of understanding (in the better cases) or are simply silence.
== Why do we need conversion engines? ==
Unlike, for example, French, English, German and Russian, there are languages which have more than trivial internal differences. These differences may be:
* slightly different orthographies, so that a person who knows one orthography is not able to write in the other;
* slightly different language varieties (or "dialects"), so that a person who knows one variety is not able to write in the other;
* different scripts, so that a person who knows one script doesn't know the other [well];
* some combination of the previous possibilities.
The options which we have are:
* Not to care about the differences. The best-known situation is the English language projects, which allow writing in both major varieties. However, the difference between "kilometer" and "kilometre" is small, and it belongs to the common knowledge of every educated English speaker. The other situations known to me are the Persian language projects (both Farsi and Dari are allowed) and the Serbian language projects (both Ekavian and Iyekavian are allowed).
The problem with such an approach is that at least one group, usually the bigger one, doesn't know how to write in the other variety. Speakers of Farsi don't know how to write Dari, just as speakers of Ekavian don't know how to write Iyekavian. There are significant problems in maintaining and expanding articles written in the minority group's variety: even with a lot of good will, a speaker of the majority group has to ask a speaker of the minority group to check the consistency of an article, *if* there are active speakers of the minority group on the project.
* To make separate projects. This is the case with the Belarusian projects. (Parts of the Belarusian diaspora don't want to write in the "communist" orthography, while the educational system (including the educational system for the Belarusian minority in Poland) uses that orthography.)
I see that as the worst possible solution: instead of having one project for one language system, there are two projects, which means that the effort needed to make a good source of knowledge is doubled.
* To use a conversion engine. There are a few implemented conversion engines: Chinese, Serbian and Kazakh (I think that this is the full list, but I am not sure). This is the best possible solution *if* it works.
The smallest issue is the Serbian case. All literate people in Serbia know how to write in both scripts: Cyrillic and Latin. The choice of script is a matter of preference and rarely of functional style (materials for children will usually be written in Cyrillic, while emails will usually be written in Latin; formal acts have to be written in Cyrillic).
Chinese is a little bit more complex because of the number of characters involved. However, AFAIK, the Simplified and Traditional scripts share a number of characters, and some of the others may be guessed from context.
But, again, the current implementation can solve only cases which fulfill the following two conditions: (1) they are more or less straightforward (more or less one character for one character) and (2) speakers are able to read and write (at least partially) in the other script.
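For illustration, a minimal sketch (in Python) of what such a straightforward, roughly character-for-character conversion looks like; the mapping below is only a small excerpt of the Serbian alphabet, and the names are purely illustrative:

```python
# A small excerpt of the Serbian Cyrillic -> Latin mapping, for
# illustration only; the real table covers the whole alphabet.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "џ": "dž",
    "љ": "lj", "њ": "nj", "у": "u", "ш": "š", "к": "k",
    "А": "A", "Б": "B", "Џ": "Dž", "Љ": "Lj", "Њ": "Nj",
}

def cyr_to_lat(text: str) -> str:
    # Characters without a mapping (punctuation, digits, text already
    # in Latin) pass through unchanged.
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(cyr_to_lat("Љуљашка"))  # -> "Ljuljaška"
```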
== Problems with the current conversion engine ==
* The current conversion engine is able to convert text just for reading. When you switch to edit mode, you are able to see the text in only one script (the one in which the article is written). This is not a problem in the Serbian case, and it is a small-scale problem in the Chinese case.
However, this would be a significant problem for cases like Azerbaijani: an Azerbaijani from Azerbaijan doesn't know the Perso-Arabic script, while only educated Azerbaijanis from Iran know the Latin script, and not well (note that literacy in Iran is ~80%, which is quite low by Western standards; it means that one in five persons doesn't know how to read and write). In other words, make a simple one-to-one conversion engine from Latin to Arabic script for English and try to read the converted text. If you don't want to bother with right-to-left text, try it with Devanagari.
* The current conversion engine converts *everything* into the output script. This means that text in mixed scripts will be converted into one script. This is useful in the Chinese case because contributors may write text in either script, while readers are able to read it in one of them. It is a redundant (and sometimes irritating) feature in the Serbian case because no one writes Serbian texts by mixing Cyrillic and Latin (except, of course, for scientific purposes).
But it makes the engine useless in cases where just orthographies or language varieties need to be converted. For example, if Dari has a word whose form is X and meaning is A (written in Farsi as Y), and Farsi has a word whose form is X (written in Dari as Z) and meaning is B, the only option the conversion engine gives is escape syntax like -{ Dari: X; Farsi: Y }-. Imagine now how the wiki code would look if, for example, the genitive case in Dari is written like the accusative case in Farsi: all syntactic objects would have to be escaped, which means that almost every sentence would have one escape from the regular rules.
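To make the cost concrete, here is a sketch of how such per-span escapes interact with rule-based conversion; the -{ ... }- parsing below is a simplified, hypothetical rendering of the idea, not the actual MediaWiki implementation:

```python
import re

# Simplified, hypothetical handling of inline escapes of the form
# -{ Dari: X; Farsi: Y }-. Each escaped span overrides the regular
# conversion rules; everything else goes through `convert`.
ESCAPE = re.compile(r"-\{(.*?)\}-")

def render(text: str, variant: str, convert) -> str:
    def expand(m: re.Match) -> str:
        forms = {}
        for part in m.group(1).split(";"):
            name, _, value = part.partition(":")
            forms[name.strip()] = value.strip()
        return forms.get(variant, m.group(0))

    out, last = [], 0
    for m in ESCAPE.finditer(text):
        out.append(convert(text[last:m.start()]))  # regular rules
        out.append(expand(m))                      # manual override
        last = m.end()
    out.append(convert(text[last:]))
    return "".join(out)

# If almost every sentence needs such an override, the wiki code
# becomes unreadable -- which is exactly the problem described above.
```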
== What do we need? ==
Actually, we don't need a lot to solve this problem. I have the solution for the most important part of the problem, the linguistic one. Even if I don't have enough time to deal with all the cases, I am able to find students or professors of linguistics who are willing to work on those issues for free (they would get scientific papers out of it after the work is done). We need "just" a PHP programmer who is willing to work on this problem. And for a couple of years I haven't found any (even though I know a lot of PHP programmers).
P.S. I am writing this because I've got an email asking for help in solving an orthography problem. The only option which I am able to give them is to make a Python script which would make four articles from one on their project.
== What do we need? ==
Actually, we don't need a lot to solve this problem. I have the solution for the most important part of the problem, the linguistic one. Even if I don't have enough time to deal with all the cases, I am able to find students or professors of linguistics who are willing to work on those issues for free (they would get scientific papers out of it after the work is done). We need "just" a PHP programmer who is willing to work on this problem. And for a couple of years I haven't found any (even though I know a lot of PHP programmers).
It sounds like a good project for a directed grant. Have you tried contacting potential grant-making organisations? I imagine some awesome things could be done with as little as $100K.
-- Tim Starling
On Wed, Apr 1, 2009 at 2:11 PM, Tim Starling tstarling@wikimedia.org wrote:
It sounds like a good project for a directed grant. Have you tried contacting potential grant-making organisations? I imagine some awesome things could be done with as little as $100K.
First, sorry for forgetting you. You were the only person who responded positively to this idea :)
I am thinking again about funding... Thanks for bringing this idea back to my mind.
Milos, thank you for the very comprehensive presentation of the problem. There are other cases that could be mentioned; it is indeed a problem touching most of the language editions.

I am sceptical about automatic conversion. As you said, it is mainly a solution for reading, but not for writing, because the source text is in one specific spelling or character system. As a result there are mainly two ways to deal with that:
- a split of the Wikipedia into two; this is most likely when there are other linguistic differences, e.g. in vocabulary;
- one variety has so much support in the linguistic community that the minority is small and discouraged from creating a Wikipedia of their own. In that case automatic conversion is a nice convenience, but as editors the minority users more or less have to adapt to the majority. Alas.

Kind regards
Ziko
2009/4/1 Milos Rancic millosh@gmail.com
On Wed, Apr 1, 2009 at 2:11 PM, Tim Starling tstarling@wikimedia.org wrote:
It sounds like a good project for a directed grant. Have you tried contacting potential grant-making organisations? I imagine some awesome things could be done with as little as $100K.
First, sorry for forgetting you. You were the only person who responded positively to this idea :)
I am thinking again about funding... Thanks for bringing this idea back to my mind.
On Wed, Apr 1, 2009 at 11:32 AM, Ziko van Dijk zvandijk@googlemail.com wrote:
I am sceptical about automatic conversion. As you said, it is mainly a solution for reading, but not for writing, because the source text is in one specific spelling or character system.
Why couldn't that be converted on the fly as well? Choose one variant as the canonical one, and store only that in the database. Anyone wanting to use other formats would have the text in the edit box automatically converted to their preferred variant on the fly, and converted back when they saved.
You'd have to be very certain you could roundtrip stuff for this to work well, of course.
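As a sketch, the property to verify on every save would be something like the following (to_variant and to_canonical are hypothetical converter functions, not existing MediaWiki code):

```python
# The roundtrip property: text stored in the canonical variant must
# survive conversion to the editor's variant and back unchanged.
# to_variant() and to_canonical() are hypothetical converters.
def roundtrips(canonical_text: str, to_variant, to_canonical) -> bool:
    edited_view = to_variant(canonical_text)
    return to_canonical(edited_view) == canonical_text
```

Where the check fails, the save would presumably have to fall back to explicit escape markup.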
Aryeh Gregor wrote:
On Wed, Apr 1, 2009 at 11:32 AM, Ziko van Dijk zvandijk@googlemail.com wrote:
I am sceptical about automatic conversion. As you said, it is mainly a solution for reading, but not for writing, because the source text is in one specific spelling or character system.
Why couldn't that be converted on the fly as well? Choose one variant as the canonical one, and store only that in the database. Anyone wanting to use other formats would have the text in the edit box automatically converted to their preferred variant on the fly, and converted back when they saved.
When you declare one version canonical, the risk is that you will have supporters of the losing version(s) becoming irrationally angry.
Ec
On Thu, Apr 2, 2009 at 9:49 AM, Ray Saintonge saintonge@telus.net wrote:
Aryeh Gregor wrote:
On Wed, Apr 1, 2009 at 11:32 AM, Ziko van Dijk zvandijk@googlemail.com wrote:
I am sceptical about automatic conversion. As you said, it is mainly a solution for reading, but not for writing, because the source text is in one specific spelling or character system.
Why couldn't that be converted on the fly as well? Choose one variant as the canonical one, and store only that in the database. Anyone wanting to use other formats would have the text in the edit box automatically converted to their preferred variant on the fly, and converted back when they saved.
When you declare one version canonical, the risk is that you will have supporters of the losing version(s) becoming irrationally angry.
Not just that... It is also computationally unsustainable.
Even in the simplest cases, like Serbian script conversion, the conversion is not transitive (however, the intransitivity is small and approximation works well enough).
So, even one of the simplest cases involves the following:
* Usually, it is thought that the Serbian Cyrillic alphabet carries more information than Serbian Latin. In Cyrillic, the sound "dzh" is marked with the letter "џ", while it is marked with a digraph in Latin: "dž". However, there are cases where the combination "d+zh" is regular, so it is written "дж" in Cyrillic, while in Latin it is marked the same as the sound "dzh": as "dž". This means that if you keep the text in Cyrillic as the canonical version, you'll be able to regenerate the Latin (but not vice versa).
* However, because of those digraphs, Latin distinguishes ordinary capitalized words from all-caps words. If you convert the Cyrillic capital letter "Џ" into Latin, you'll put "Dž" as its counterpart. However, if it is part of an all-caps word, let's say "ЏАК", you'll get "DžAK", while the correct form should be "DŽAK".
Of course, it is possible to solve this by testing whether the surrounding letters are capitals or not (and it is not a big deal in Serbian). However, this is a very simple case of conversion rules. Usually, it is much cheaper to do the conversion at the time of adding/changing text and to keep both versions inside the database, because there are two different sets of rules for conversion. The other option is to keep one meta text inside the database, which would have internal markup. So, the previous example may look like "{Latin: {DŽ}AK}".
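A minimal sketch of that surrounding-letters test (the digraph and letter maps are small excerpts, for illustration only):

```python
# Digraphs such as "Џ" become "Dž" in mixed-case words, but "DŽ" when
# the neighbouring letters are uppercase (all-caps "ЏАК" -> "DŽAK").
DIGRAPHS = {"Џ": "Dž", "Љ": "Lj", "Њ": "Nj"}
SIMPLE = {"А": "A", "К": "K", "у": "u", "а": "a", "ш": "š", "к": "k"}

def convert_char(text: str, i: int) -> str:
    ch = text[i]
    if ch in DIGRAPHS:
        prev_upper = i > 0 and text[i - 1].isupper()
        next_upper = i + 1 < len(text) and text[i + 1].isupper()
        # Inside an all-caps run, uppercase both letters of the digraph.
        if prev_upper or next_upper:
            return DIGRAPHS[ch].upper()
        return DIGRAPHS[ch]
    return SIMPLE.get(ch, ch)

def cyr_to_lat(text: str) -> str:
    return "".join(convert_char(text, i) for i in range(len(text)))

print(cyr_to_lat("ЏАК"))  # -> "DŽAK"
print(cyr_to_lat("Џак"))  # -> "Džak"
```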
And, of course, if there are more than two script/orthography versions (Kurdish is an example), it would be necessary to make conversion rules for all combinations (for n versions, up to n(n-1) directed rule sets). A lot of generalizations are possible, but it isn't possible to generalize all of the rules.
On Thu, Apr 2, 2009 at 3:49 AM, Ray Saintonge saintonge@telus.net wrote:
When you declare one version canonical, the risk is that you will have supporters of the losing version(s) becoming irrationally angry.
Which version was canonical is an implementation detail that wouldn't even be visible to contributors, so this isn't a big deal. Wikis have to pick a canonical display type right now anyway for anonymous users who haven't specified a preference, right?
On Thu, Apr 2, 2009 at 5:38 AM, Milos Rancic millosh@gmail.com wrote:
Even in the simplest cases, like Serbian script conversion, the conversion is not transitive (however, the intransitivity is small and approximation works well enough).
*That's* what would pose difficulties, yes.
Of course, it is possible to solve this by testing whether the surrounding letters are capitals or not (and it is not a big deal in Serbian). However, this is a very simple case of conversion rules. Usually, it is much cheaper to do the conversion at the time of adding/changing text and to keep both versions inside the database, because there are two different sets of rules for conversion. The other option is to keep one meta text inside the database, which would have internal markup. So, the previous example may look like "{Latin: {DŽ}AK}".
I suspect this would be feasible to get working to an acceptable level, but only with a lot of effort. Natural languages are really messy. :(
On Thu, Apr 2, 2009 at 3:52 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
I suspect this would be feasible to get working to an acceptable level, but only with a lot of effort. Natural languages are really messy. :(
If you treat words as strings, they are really messy, yes. But if you treat words as words, you'll have a much better chance of making something useful :)
As I said, the main problem is the intransitivity of conversions between written language varieties. So, if we know that, we are able to realize that we need either as many records in the database as we have language varieties, or some meta language inside the database.
As the MW engine is already able to "understand" differences at the word level, we need the following to solve the described case:
If we choose to have two records (and without using a dictionary!), the algorithm may be the following:
* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* The output in Latin is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK". So, by default, we would get back in Cyrillic: "Љуљашка, коњункција и ЏАК." (Note that "нј" switched to "њ" because the default conversion for "nj" is "њ".) However, the MW engine may test all changed words and realize that "конјункција" is also a correct conversion of "konjunkcija", so it won't change it.
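A minimal sketch of that save-time check in the two-record variant; is_valid_conversion is a hypothetical helper that would compare a Latin word against all legal conversions of the stored Cyrillic word:

```python
# Sketch of the two-record save step: if the editor's change to a
# Latin word is still a legal conversion of the stored Cyrillic word
# (as with "DžAK" -> "DŽAK"), keep the Cyrillic record untouched and
# store only the corrected Latin form; otherwise treat it as a real
# content change. is_valid_conversion() is a hypothetical helper.
def classify_edit(cyr_word: str, new_lat_word: str,
                  is_valid_conversion) -> str:
    if is_valid_conversion(cyr_word, new_lat_word):
        return "conversion-fix"   # update only the Latin record
    return "content-change"       # regenerate the Cyrillic record too
```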
If we use just one record (which may be the more reasonable option), we may use just the Cyrillic or just the Latin variant; or, if we want to be "fair", we may use random Unicode characters from the Private Use Areas :) It may look like this:
* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* The Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {Dž}AK."
* But the Latin wiki code is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK". Then the MW engine compares the changed word, checks whether "DŽAK" is the same as "ЏАК", and as it finds that it is, it treats the change as a conversion fix.
* The Cyrillic meta markup is: "Љуљашка, конјункција и {Џ=DŽ}АК."
* The Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {DŽ}AK."
So, there are two options for changing the MW code:
* To have as many tables as there are varieties. This is a space-consuming method, but the CPU won't need to work a lot.
* To have one table with meta markup, which is a less space-consuming but more CPU-consuming method.
Either way, people should declare in which variety they are writing (in their preferences if they are registered, or in the edit form if they are anonymous).
In both cases we need changes to the Edit.php file. In the second case we don't need to change the DB structure.
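A sketch of how the meta-markup option might be rendered into each variety; it folds the example's per-variant meta texts into a single stored form for simplicity, the {X=Y} token syntax follows the example above, and cyr_to_lat stands for an assumed rule-based converter:

```python
import re

# One stored meta text, Cyrillic-based. Tokens like {Џ=DŽ} carry the
# Cyrillic form on the left and a manual Latin override on the right;
# plain text goes through the rule-based converter (cyr_to_lat is
# assumed to exist, as sketched earlier in the thread).
TOKEN = re.compile(r"\{([^=}]+)(?:=([^}]+))?\}")

def render(meta: str, variant: str, cyr_to_lat) -> str:
    def expand(m: re.Match) -> str:
        cyrillic, latin_override = m.group(1), m.group(2)
        if variant == "cyrillic":
            return cyrillic
        return latin_override or cyr_to_lat(cyrillic)

    out, last = [], 0
    for m in TOKEN.finditer(meta):
        plain = meta[last:m.start()]
        out.append(plain if variant == "cyrillic" else cyr_to_lat(plain))
        out.append(expand(m))
        last = m.end()
    tail = meta[last:]
    out.append(tail if variant == "cyrillic" else cyr_to_lat(tail))
    return "".join(out)

meta = "Љуљашка, конјункција и {Џ=DŽ}АК."
# render(meta, "cyrillic", cyr_to_lat) -> "Љуљашка, конјункција и ЏАК."
# render(meta, "latin", cyr_to_lat)    -> "Ljuljaška, konjunkcija i DŽAK."
```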
Dear Aryeh,
Your idea of "converting on the fly" would not work in many cases. Take for example the ß in the German WP. Swiss (registered) readers can decide via their preferences to see only ss and never ß, because the Swiss do not use ß. That's fine. But vice versa, not every ss is to be converted to ß.
The Germany-Germans write, for example, "Masse" (a mass, with a short "a") and "Maße" (measures, with a long "a"). The Swiss write "Masse" and "Masse" for both. Now, imagine that a Swiss editor writes "Masse": the conversion engine would not know whether this should be converted to "Maße" or not. Only a person who knows German is capable of deciding.
So in the source text the Swiss editor would have to write "Maße", although as a Swiss he is not accustomed to doing so. He usually writes "Masse", and if it is not a Swiss-related article, he will tolerate it if a Germany-German later edits the article and changes it to "Maße".
I would find it an improvement if the ß-conversion were not only a gadget in the preferences, but could also be offered to non-registered users/readers, just as the Serbian WP gives such a choice (Latin or Cyrillic) to readers on the Main Page.
Now, from what Ting and others have told me about the conversion problems in the Chinese and other Wikipedias, I can imagine that the possible benefits are relatively limited.
Ziko
2009/4/2 Aryeh Gregor Simetrical+wikilist@gmail.com
Why couldn't that be converted on the fly as well? Choose one variant as the canonical one, and store only that in the database. Anyone wanting to use other formats would have the text in the edit box automatically converted to their preferred variant on the fly, and converted back when they saved.
Ziko van Dijk wrote:
Dear Aryeh,
Your idea of "converting on the fly" would not work in many cases. Take for example the ß in the German WP. Swiss (registered) readers can decide via their preferences to see only ss and never ß, because the Swiss do not use ß. That's fine. But vice versa, not every ss is to be converted to ß.
The Germany-Germans write, for example, "Masse" (a mass, with a short "a") and "Maße" (measures, with a long "a"). The Swiss write "Masse" and "Masse" for both. Now, imagine that a Swiss editor writes "Masse": the conversion engine would not know whether this should be converted to "Maße" or not. Only a person who knows German is capable of deciding.
There's no reason in principle why a computer can't be as good at making that decision as a human. Such ambiguities are what makes the field of computational linguistics interesting; they're not a reason to be dismissive. We need to find out what is possible with state-of-the-art research systems, and then negotiate, or develop software, to bring that technology to Wikipedia.
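As a toy illustration of that point (the cue words here are hand-picked purely for illustration; a real system would use a trained statistical model over much richer context):

```python
# Toy context-based disambiguation of Swiss "Masse" into German
# "Masse" (mass) or "Maße" (measures). The cue-word sets are
# hand-built assumptions, standing in for a trained model.
MASSE_CUES = {"kilogramm", "gewicht", "physik"}      # mass
MASSE_SZ_CUES = {"zentimeter", "länge", "messen"}    # measures

def disambiguate(words: list[str], i: int) -> str:
    # Look at a window of up to five words on each side.
    context = {w.lower() for w in words[max(0, i - 5):i + 6]}
    if context & MASSE_SZ_CUES:
        return "Maße"
    if context & MASSE_CUES:
        return "Masse"
    return "Masse"  # fall back to the unconverted form

sentence = "die Masse des Körpers beträgt zwei Kilogramm".split()
print(disambiguate(sentence, 1))   # -> "Masse" (via "Kilogramm")

sentence2 = "die Masse des Zimmers in Zentimeter messen".split()
print(disambiguate(sentence2, 1))  # -> "Maße" (via "Zentimeter")
```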
-- Tim Starling
On Wed, Apr 1, 2009 at 5:32 PM, Ziko van Dijk zvandijk@googlemail.com wrote:
- a split of the Wikipedia into two; this is most likely when there are other linguistic differences, e.g. in vocabulary;
Vocabulary is not a problem. This is an option for the Ekavian-Iyekavian conversion engine for Serbian. I made an algorithm with exponential complexity (I am not a programmer :) ), but Robert Stojnic (the maintainer of the Serbian conversion engine) said that it is possible to avoid it.
- one variety has so much support in the linguistic community that the minority is small and discouraged from creating a Wikipedia of their own. In that case automatic conversion is a nice convenience, but as editors the minority users more or less have to adapt to the majority.
If we don't start to deal with those issues, we won't be able to have a better product. If we do start, our chances will become much better :) So, simpler cases may be solved faster and more complex cases more slowly. But once we have something which can solve the simpler cases, we will be able to go further.
Milos Rancic wrote:
Chinese is a little bit more complex because of the number of characters involved. However, AFAIK, the Simplified and Traditional scripts share a number of characters, and some of the others may be guessed from context.
Well, Chinese is not that simple, especially with the different translations of Western names into Chinese, as the following examples show.
But it makes the engine useless in cases where just orthographies or language varieties need to be converted. For example, if Dari has a word whose form is X and meaning is A (written in Farsi as Y), and Farsi has a word whose form is X (written in Dari as Z) and meaning is B, the only option the conversion engine gives is escape syntax like -{ Dari: X; Farsi: Y }-. Imagine now how the wiki code would look if, for example, the genitive case in Dari is written like the accusative case in Farsi: all syntactic objects would have to be escaped, which means that almost every sentence would have one escape from the regular rules.
As far as I know, you can define escapes globally for the whole article. This would make an escape in every sentence unnecessary. Take as an example the following article: http://zh.wikipedia.org/wiki/%E6%96%AF%E6%B4%9B%E5%8D%9A%E4%B8%B9%C2%B7%E7%B...
You see in the left corner of the article (above the info-box) a triangle sign. It explains which global escapes are used in this article for the title and for other words. Indeed, in some articles that list can be quite long, like here: http://zh.wikipedia.org/wiki/%E4%B9%94%E6%B2%BB%C2%B7%E8%B5%AB%E4%BC%AF%E7%8...
Ting
On Wed, Apr 1, 2009 at 2:40 PM, Ting Chen wing.philopp@gmx.de wrote:
As far as I know, you can define escapes globally for the whole article. This would make an escape in every sentence unnecessary. Take as an example the following article: http://zh.wikipedia.org/wiki/%E6%96%AF%E6%B4%9B%E5%8D%9A%E4%B8%B9%C2%B7%E7%B...
You see in the left corner of the article (above the info-box) a triangle sign. It explains which global escapes are used in this article for the title and for other words. Indeed, in some articles that list can be quite long, like here: http://zh.wikipedia.org/wiki/%E4%B9%94%E6%B2%BB%C2%B7%E8%B5%AB%E4%BC%AF%E7%8...
This is true for logosyllabic orthographies of highly analytic languages, like the Chinese situation. Alphabetic orthographies in conjunction with a synthetic language system (Belarusian, Serbian) would turn into a mess under such an implementation. For example, the genitive plural of one noun may be written the same as the second future tense of another verb.