On Thu, Apr 2, 2009 at 3:52 PM, Aryeh Gregor Simetrical+wikilist@gmail.com wrote:
I suspect this would be feasible to get working to an acceptable level, but only with a lot of effort. Natural languages are really messy. :(
If you treat words as strings, they are really messy, yes. But, if you treat words as words, you'll have much better chances to make something useful :)
As I said, the main problem is the intransitivity of conversions between written language varieties. Once we know that, we can see that we need either as many records in the database as we have language varieties, or some meta language inside the database.
As the MW engine is already able to "understand" differences at the word level, we need the following to solve the described case:
If we choose to have two records (and without using a dictionary!), the algorithm may be the following:
* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* Output in Latin is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK".
So, by default, we'd get in Cyrillic: "Љуљашка, коњункција и ЏАК." (Note that "нј" switched to "њ" because the default conversion for "nj" is "њ".) However, the MW engine may test all changed words and realize that "конјункција" is also a correct conversion for "konjunkcija", so it won't change it.
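The two-record check above can be sketched in Python. This is only an illustration, not MW code: the names to_latin, to_cyrillic, same_word and merge_edit are hypothetical, the table covers only lowercase Serbian letters plus simple capitalization, and words are assumed to be separated by whitespace.

```python
# Serbian Cyrillic -> Latin table; the digraph letters cause the trouble.
CYR_TO_LAT = dict(zip("абвгдђежзијклмнопрстћуфхцчш",
                      "abvgdđežzijklmnoprstćufhcčš"))
CYR_TO_LAT.update({"љ": "lj", "њ": "nj", "џ": "dž"})
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def to_latin(text):
    """Naive per-letter conversion; an uppercase Џ becomes "Dž"."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower(), ch)
        out.append(lat.capitalize() if ch.isupper() else lat)
    return "".join(out)


def to_cyrillic(text):
    """Default conversion: greedily prefer digraphs, so "nj" -> "њ"."""
    out, i = [], 0
    while i < len(text):
        for size in (2, 1):
            chunk = text[i:i + size]
            cyr = LAT_TO_CYR.get(chunk.lower())
            if cyr:
                out.append(cyr.upper() if chunk[0].isupper() else cyr)
                i += len(chunk)
                break
        else:
            out.append(text[i])  # punctuation, spaces, unknown characters
            i += 1
    return "".join(out)


def same_word(cyr, lat):
    """True if one spelling is a valid conversion of the other."""
    return to_latin(cyr) == lat or to_cyrillic(lat) == cyr


def merge_edit(stored_cyr, edited_lat):
    """Keep stored Cyrillic words unless the Latin edit really changed them."""
    old, new = stored_cyr.split(), edited_lat.split()
    if len(old) != len(new):  # assumption: fall back to default conversion
        return to_cyrillic(edited_lat)
    return " ".join(c if same_word(c, l) else to_cyrillic(l)
                    for c, l in zip(old, new))


stored = "Љуљашка, конјункција и ЏАК."
print(to_latin(stored))                 # Ljuljaška, konjunkcija i DžAK.
print(to_cyrillic(to_latin(stored)))    # Љуљашка, коњункција и ЏАК.
print(merge_edit(stored, "Ljuljaška, konjunkcija i DŽAK."))
```

The second print shows the intransitivity (the round trip turns "нј" into "њ"), while merge_edit keeps "конјункција" because it is also a valid conversion of "konjunkcija".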
If we use just one record (which may be a more reasonable option), we may use just the Cyrillic or just the Latin variant; or, if we want to be "fair", we may use random Unicode characters from the Private Use Areas :) It may look like:
* We write in Cyrillic: "Љуљашка, конјункција и ЏАК."
* Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {Dž}AK."
* But Latin wiki code is: "Ljuljaška, konjunkcija i DžAK."
* We correct "DžAK" into "DŽAK". Then the MW engine compares the changed word, realizes that "DŽAK" is the same as "ЏАК", and treats the change as a conversion fix.
* Cyrillic meta markup is: "Љуљашка, конјункција и {Џ=DŽ}АК."
* Latin meta markup is: "{Lj}u{lj}aška, konjunkcija i {DŽ}AK."
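A minimal Python sketch of the Latin-side meta markup, assuming braces mark a Latin digraph that stands for a single Cyrillic letter. The function names are hypothetical, and the sketch does not cover the "{Џ=DŽ}" form of the Cyrillic meta markup.

```python
import re

# Single letters and digraph letters are kept apart on purpose.
SINGLE = dict(zip("абвгдђежзијклмнопрстћуфхцчш",
                  "abvgdđežzijklmnoprstćufhcčš"))
DIGRAPH = {"љ": "lj", "њ": "nj", "џ": "dž"}
LAT_SINGLE = {lat: cyr for cyr, lat in SINGLE.items()}
LAT_DIGRAPH = {lat: cyr for cyr, lat in DIGRAPH.items()}


def cyrillic_to_meta(text):
    """Wrap every digraph that comes from one Cyrillic letter in braces."""
    out = []
    for ch in text:
        low = ch.lower()
        if low in DIGRAPH:
            lat = DIGRAPH[low]
            out.append("{" + (lat.capitalize() if ch.isupper() else lat) + "}")
        elif low in SINGLE:
            lat = SINGLE[low]
            out.append(lat.upper() if ch.isupper() else lat)
        else:
            out.append(ch)
    return "".join(out)


def meta_to_latin(meta):
    """Latin wiki code is the meta markup without the braces."""
    return re.sub(r"[{}]", "", meta)


def meta_to_cyrillic(meta):
    """Braced groups become one letter; everything else maps one to one."""
    def convert(match):
        group, ch = match.group(1), match.group(2)
        if group is not None:
            cyr = LAT_DIGRAPH[group.lower()]  # KeyError on unknown groups
            return cyr.upper() if group[0].isupper() else cyr
        cyr = LAT_SINGLE.get(ch.lower(), ch)
        return cyr.upper() if ch.isupper() else cyr
    return re.sub(r"\{([^{}]+)\}|(.)", convert, meta, flags=re.DOTALL)


meta = cyrillic_to_meta("Љуљашка, конјункција и ЏАК.")
print(meta)                  # {Lj}u{lj}aška, konjunkcija i {Dž}AK.
print(meta_to_latin(meta))   # Ljuljaška, konjunkcija i DžAK.
fixed = meta.replace("{Dž}", "{DŽ}")   # the manual correction
print(meta_to_cyrillic(fixed))         # Љуљашка, конјункција и ЏАК.
```

Because the unmarked "nj" in "konjunkcija" is read as two letters, the conversion back to Cyrillic is lossless and the "DŽAK" fix survives in the stored record.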
So, there are two options for changing MW code:
* To have as many tables as there are varieties. This method consumes more space, but the CPU won't need to work a lot.
* To have one table with meta markup, which consumes less space but more CPU.
In either case, people should declare in which variety they are writing (in their options if they are not anonymous, or in the edit form if they are anonymous).
In both cases we need changes inside the Edit.php file. In the second case we don't need to change the DB structure.