As a follow-up, here's an email I wrote a couple of months ago providing some background on language converter issues. --scott
---------- Forwarded message ----------
From: C. Scott Ananian <cananian@wikimedia.org>
Date: Wed, Sep 25, 2013 at 5:35 PM
Subject: Re: Understanding Chinese Wikipedia Language Converter Issues
Some background information[...]:
* https://en.wikipedia.org/wiki/Chinese_Wikipedia has a good historical overview of the process that led to LanguageConverter and a merged zhwiki. It also provides useful statistics on readers/editors of zhwiki.
Competitors to zhwiki include Bǎidù Bǎikē and Hùdòng Zàixiàn. My understanding is that both of these sites are hosted in mainland China, and thus must adhere to legal restrictions that prevent them from publishing content in traditional characters or in dialects other than Putonghua (standard PRC Mandarin).
* http://languagelog.ldc.upenn.edu/nll/?p=6654 gives an English speaker a reasonable analogy for how so many different languages can be 'read' from the same text written in Chinese characters. The Wikipedia article above refers to this as "read in local pronunciation but preserving the vocabulary and grammar of Standard Chinese". This is a reasonable starting point for understanding some of the zhwiki variants.
* http://languagelog.ldc.upenn.edu/nll/?p=3676 considers zh-cn vs zh-tw more specifically; these are the "same language", but they have diverged since 1949 in writing script and, to some degree, in vocabulary (and in pronunciation, though that's less relevant here).
Some language pairs to consider:
Sharing the same wiki:
* zh-cn / zh-tw : "Mainland" Mandarin in simplified characters -vs- "Taiwanese" Mandarin in traditional characters.
* zh-sg / zh-mo : Chinese with Singapore/Malaysian terms (written in simplified characters), and Chinese as used in Macao (written in traditional characters). I don't understand the linguistic issues here, but both variants share the same wiki as zh-cn/zh-tw.
* https://meta.wikimedia.org/wiki/Wikipedias_in_Multi-writing_System lists 9 more Wikipedias that use script conversion while sharing a single wiki.
Split into different wikis:
* zh-hk / zh-yue : Mandarin-as-it-is-read-in-Cantonese (or Mandarin-with-Cantonese-words) -vs- Cantonese-written-in-Chinese-characters. See http://languagelog.ldc.upenn.edu/nll/?p=6501
* zh-classical / zh-min-nan, cdo, wuu, hak, gan : Other related languages and varieties hosted on separate wikis; described in [[Chinese Wikipedia]].
* https://meta.wikimedia.org/wiki/Wikipedias_in_Multi-writing_System lists 26 additional Wikipedias that wish to use some form of script conversion.
LanguageConverter consists of two parts: a script (character-level) converter and a word-level converter; a minimal sketch of this two-part pipeline follows the lists below. Other language pairs which could use this toolbox:
Script conversions:
* ur/hi : Urdu and Hindi are mutually intelligible, just written in different scripts. There are large political differences, though. See https://en.wikipedia.org/wiki/Urdu
* Pinyin and Bopomofo transcriptions/annotations are often used for learning Chinese (including by native speakers). See http://languagelog.ldc.upenn.edu/nll/?p=189. Additionally, Pinyin is widely used as an input method, so it may be worthwhile to allow Pinyin display during authoring/editing.
Word level conversions:
* ar/arz : Arabic and Egyptian Arabic. See https://en.wikipedia.org/wiki/Egyptian_Arabic_Wikipedia#Reaction which mirrors some of the zhwiki issues.
* es-es/es-ar : Spanish as used in Spain and in Latin America. There are other vocabulary differences among the Latin American countries as well, but es-ar (Argentina) is usually the first split made. (OLPC has separate localizations, but eswiki hasn't (yet?) split.)
* pt-pt/pt-br : European Portuguese and Brazilian Portuguese. A fork has been discussed, but 80% of the contributors to ptwiki are Brazilian; see https://en.wikipedia.org/wiki/Portuguese_Wikipedia
* en-us/en-gb : American and British English. Current policy is inconsistent: each article generally keeps whichever variety it was first written in.
* en/sco : English and Scots. See https://en.wikipedia.org/wiki/Scots_Wikipedia
There are probably more, but these are the examples I'm currently familiar with.
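
To make that two-part design concrete, here is a rough sketch in Python of the kind of pipeline involved. This is not MediaWiki's actual implementation (the real LanguageConverter lives in MediaWiki core and uses much larger, community-maintained conversion tables); the table names, sample mappings, and convert() function below are purely illustrative.

    # Illustrative sketch only (NOT MediaWiki's code): a per-character script
    # table plus a word-level vocabulary table for one target variant (zh-tw).
    # The mappings are a tiny hand-picked sample, not a real rule set.

    # Part 1: script conversion -- simplified -> traditional, character by character.
    SCRIPT_TABLE = {"体": "體", "语": "語", "软": "軟", "信": "信", "息": "息"}

    # Part 2: word-level conversion -- variant vocabulary, tried before the
    # character table so multi-character terms beat naive per-character mapping.
    WORD_TABLE_ZH_TW = {
        "软件": "軟體",   # "software": mainland term vs. Taiwan term
        "信息": "資訊",   # "information"
    }

    def convert(text, word_table, script_table):
        """At each position, try the longest word-table match first;
        otherwise fall back to the per-character script table."""
        out, i = [], 0
        max_len = max((len(k) for k in word_table), default=1)
        while i < len(text):
            for length in range(min(max_len, len(text) - i), 0, -1):
                chunk = text[i:i + length]
                if chunk in word_table:
                    out.append(word_table[chunk])
                    i += length
                    break
            else:  # no word-level rule matched; convert a single character
                out.append(script_table.get(text[i], text[i]))
                i += 1
        return "".join(out)

    print(convert("软件", WORD_TABLE_ZH_TW, SCRIPT_TABLE))  # -> 軟體
    print(convert("信息", WORD_TABLE_ZH_TW, SCRIPT_TABLE))  # -> 資訊

The last line shows why the word-level pass matters: character-by-character conversion alone would leave 信息 unchanged (those characters are identical in both scripts), while the zh-tw vocabulary rule substitutes 資訊.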
Putting my cards on the table: it would be nice to better support parallel texts in wikis -- the machine translation fans would greatly appreciate it! Tools for better supporting parallel wikis might also provide some political cover (for example, for Urdu and Hindi, which don't want to admit that they are the same language). But maintaining parallel texts is a rather speculative experiment at this time. By contrast, the technology and ideas behind LanguageConverter are known, and the implementation roadmap for full VisualEditor support (by which I mean editing in your native variant) is well understood (members of the Parsoid and VE teams met during the Tech Days to hash out the steps required). --scott
-- (http://cscott.net)