As a follow-up, here's an email I wrote a couple of months ago
providing some background on language converter issues.
--scott
---------- Forwarded message ----------
From: C. Scott Ananian <cananian(a)wikimedia.org>
Date: Wed, Sep 25, 2013 at 5:35 PM
Subject: Re: Understanding Chinese Wikipedia Language Converter Issues
Some background information[...]:
* https://en.wikipedia.org/wiki/Chinese_Wikipedia has a good historical
overview of the process that led to LanguageConverter and a merged
zhwiki. It also provides useful statistics on readers/editors of
zhwiki.
Competitors to zhwiki include Bǎidù Bǎikē and Hùdòng Zàixiàn. My
understanding is that these are both sites hosted in mainland China,
and thus have to adhere to legal restrictions preventing them from
publishing content with traditional characters or in dialects other
than Putonghua (the standard PRC Mandarin).
* http://languagelog.ldc.upenn.edu/nll/?p=6654 gives a reasonable
analogy for an English-speaker for how it can be that so many
different languages can be 'read' from the same written text in
Chinese characters. The above Wikipedia article refers to this as
"read in local pronunciation but preserving the vocabulary and grammar
of Standard Chinese". This is a reasonable starting point for
understanding some of the zhwiki variants.
* http://languagelog.ldc.upenn.edu/nll/?p=3676 considers zh-cn vs
zh-tw more specifically; these are the "same language" but diverged in
1949, in writing script but also to some degree in vocabulary (and
pronunciation, but that's less relevant).
Some language pairs to consider:
Sharing the same wiki:
* zh-cn / zh-tw : "Mainland" Mandarin in simplified characters -vs-
"Taiwanese" Mandarin in traditional characters.
* zh-sg / zh-mo : Chinese with Singapore/Malaysian terms, and Chinese
as spoken in Macao. I don't understand the linguistic issues here,
but zh-sg is written in simplified characters and zh-mo in traditional
characters, and both share the same wiki as zh-cn/zh-tw.
* https://meta.wikimedia.org/wiki/Wikipedias_in_Multi-writing_System
lists 9 more wikipedias using script conversion and sharing a wiki.
Split into different wikis:
* zh-hk / zh-yue : Mandarin-as-it-is-read-in-Cantonese (or
Mandarin-with-Cantonese-words) -vs-
Cantonese-written-in-Chinese-characters. See
http://languagelog.ldc.upenn.edu/nll/?p=6501
* zh-classical / zh-min-nan, cdo, wuu, hak, gan : Other related
scripts which are hosted on separate wikis; described in [[Chinese
Wikipedia]].
* https://meta.wikimedia.org/wiki/Wikipedias_in_Multi-writing_System
lists 26 additional wikipedias that wish to use some form of script
conversion.
Language converter consists of two parts: a script conversion, and a
word-level converter. Other language pairs which could use this
toolbox:
Script conversions:
* ur/hi : Urdu and Hindi are mutually intelligible, just written in
different scripts. Large political differences, though. See
https://en.wikipedia.org/wiki/Urdu
* Pinyin and Bopomofo transcriptions/annotations are often used for
language learning in Chinese (including by native speakers). See
http://languagelog.ldc.upenn.edu/nll/?p=189 . Additionally, pinyin is
widely used as an input method, so it may be worthwhile to allow
pinyin display during authoring/editing.
Word level conversions:
* ar/arz : Arabic and Egyptian Arabic. See
https://en.wikipedia.org/wiki/Egyptian_Arabic_Wikipedia#Reaction which
mirrors some of the zhwiki issues.
* es-es/es-ar : Spain and Latin American Spanish. There are other
vocabulary differences within the Latin American countries as well,
but es-ar is usually the first split made. (OLPC has separate
localizations, but eswiki hasn't (yet?) split.)
* pt-pt/pt-br : Portuguese and Brazilian Portuguese. A fork has been
discussed but 80% of the contributors to ptwiki are Brazilian, see
https://en.wikipedia.org/wiki/Portuguese_Wikipedia
* en-us/en-gb : American and British English. Current policy is schizophrenic.
* en/sco : English and Scots. See
https://en.wikipedia.org/wiki/Scots_Wikipedia
There are probably more, but these are the examples I'm currently familiar with.
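
To make the "two parts" split concrete, here is a toy Python sketch of a
zh-cn to zh-tw conversion. It is an illustration only, not how
LanguageConverter is actually implemented: the two tables, the
convert() helper, and the choice to try word-level entries (longest
match first) before falling back to per-character script conversion
are all my own assumptions, and the table contents are a handful of
commonly cited examples rather than real conversion data.

```python
# Hypothetical tables for illustration; real conversion tables are far larger.
# Script conversion: simplified character -> traditional character.
CHAR_TABLE = {"软": "軟", "计": "計", "机": "機", "这": "這", "个": "個"}
# Word-level conversion: mainland term -> Taiwan term.
WORD_TABLE = {"软件": "軟體", "计算机": "電腦"}


def convert(text, words=WORD_TABLE, chars=CHAR_TABLE):
    """Convert text by trying the longest word-level match at each
    position, then falling back to character-level script conversion."""
    longest = max(map(len, words), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-character terms win.
        for n in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in words:
                out.append(words[piece])
                i += n
                break
        else:
            # No word-level entry here: convert (or keep) a single character.
            out.append(chars.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(convert("这个计算机软件"))  # prints 這個電腦軟體
```

The point of the sketch is that a pure character map would turn 软件
into 軟件, which is the right script but the wrong word for zh-tw; the
word-level table is what produces 軟體.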
Putting my cards on the table: it would be nice to better support
parallel texts in wikis -- the machine translation fans would greatly
appreciate it! Tools to better support parallel wikis might also
provide some political cover (for example, for Urdu and Hindi, which
don't want to admit that they are the same language). But maintaining
parallel texts is a rather speculative experiment at this time. As a
contrast, the technology and ideas behind LanguageConverter are known
and the implementation roadmap for full Visual Editor support (by
which I mean editing in your native variant) is well understood
(members of the Parsoid and VE teams met during the Tech Days to hash
out the steps required).
--scott
--
( http://cscott.net )