On Thu, 30 Aug 2018 at 00:50, Nicolas VIGNERON <vigneron.nicolas@gmail.com> wrote:

On Wed, 29 Aug 2018 at 17:56, C. Scott Ananian <cananian@wikimedia.org> wrote:
On Wed, Aug 29, 2018, 11:39 AM Michael Everson <everson@evertype.com> wrote:
On 21 Aug 2018, at 17:46, Phake Nick <c933103@gmail.com> wrote:
>
> Wikipedia Pinyin Chinese (coded for now as cmn, Mandarin Chinese): Here's a proposal that I think needs some serious discussion. Please read the discussion on the linked Meta page.
>
> Arguments against:
> • No separate ISO 639-3 language code

Not necessary, as the IANA-registered script and variant subtags can be used: zh-Latn-pinyin

Agreed: BCP47.
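
To make the structure of that tag concrete, here is a minimal sketch that splits a tag such as zh-Latn-pinyin into its language, script and variant subtags. The function name split_tag is made up for illustration, and this is deliberately not a full BCP 47 parser: it silently skips regions, extended language subtags, extensions and private-use subtags, and it only recognises variants of 5 to 8 characters.

```python
def split_tag(tag):
    """Split a simple BCP 47 style tag into language, script and variants.

    Simplified sketch: anything that is not a 4-letter script subtag or a
    5-8 character variant subtag (regions, extensions, ...) is ignored.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "variants": []}
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        if is_script and result["script"] is None and not result["variants"]:
            # Script subtags are conventionally written in title case.
            result["script"] = part.title()
        elif 5 <= len(part) <= 8 and part.isalnum():
            result["variants"].append(part.lower())
    return result


if __name__ == "__main__":
    print(split_tag("zh-Latn-pinyin"))
```

With this sketch, "zh" comes out as the language, "Latn" as the script and "pinyin" as the only variant, which is why no separate ISO 639-3 code is required.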

> • It is proposed that this can be handled with a script converter, per (for example) T193366.

This will never add in capital letters correctly.

Could you describe this issue in more detail?

Hi,

My Chinese is not very good (there might be some mistakes in this mail) but I can see how very difficult it could be to build a pinyin converter.

For instance, 北 is usually běi in pinyin (but probably not always; a sinogram's pronunciation can change depending on the context), the word for "north".
But in 北京 it's Běijīng, the capital of China (literally "northern capital"; the same applies to other cities of the same name), with an uppercase initial (and the same goes for derived words).
These cases are probably not unsolvable, but a table of Chinese words in sinograms with their corresponding pinyin is probably needed.

And it could be worse: the same sequence of sinograms can have different meanings. For instance, 王 is both a common noun meaning "king" and the most common surname in China. So 王是大 could be either "The king is big" or "Mr/Ms Wang is big". Here 王 is at the beginning of the sentence, so it's always uppercased, but it could also be in the middle of a sentence, and then it would be "wáng shì dà" in the first case and "Wáng shì dà" in the second.
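
A minimal sketch of the word table idea, to show where it works and where it stops working. Everything here is hypothetical: the tiny PINYIN and PROPER_NOUNS tables, the function names, and the assumption that the text has already been split into words. Sentence-initial capitalisation is left out on purpose so that only the proper-noun question is visible.

```python
# Hypothetical word table: sinogram word -> lowercase pinyin.
PINYIN = {
    "北": "běi",
    "北京": "běijīng",
    "王": "wáng",
    "是": "shì",
    "大": "dà",
}

# Words that are always proper nouns and can be capitalised from the table.
PROPER_NOUNS = {"北京"}


def word_to_pinyin(word, force_proper=False):
    """Look up one word; capitalise it if it is (or is marked as) a proper noun."""
    base = PINYIN[word]
    if force_proper or word in PROPER_NOUNS:
        return base[0].upper() + base[1:]
    return base


def convert(words, proper_positions=frozenset()):
    """Convert pre-segmented words; proper_positions is a manual annotation."""
    return " ".join(
        word_to_pinyin(word, index in proper_positions)
        for index, word in enumerate(words)
    )


if __name__ == "__main__":
    print(convert(["北"]))
    print(convert(["北京"]))
    print(convert(["王", "是", "大"]))       # reading 王 as "king"
    print(convert(["王", "是", "大"], {0}))  # reading 王 as the surname
```

The table alone handles 北 versus 北京, but for 王 the two readings only differ because the caller passes an extra annotation. A converter would need that information from somewhere outside the table, such as context analysis or manual markup.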

Cheers, ~nicolas
_______________________________________________
Langcom mailing list
Langcom@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/langcom

Ah yeah, space delimitation seems to require more capability than the current LanguageConverter can provide. There are a number of Chinese word segmentation programs available on the internet (including several open source projects); one of them claims 97.5% accuracy (and apparently that's not even the best one). That still means additional work would be needed to integrate one of those programs into LanguageConverter before it could transliterate Chinese text into the recommended pinyin orthography, and manual annotation, like the NoteTA template currently used on Chinese Wikipedia, would probably still be needed to achieve perfect word segmentation and hence perfect space delimitation. Without word segmentation support, the resulting paragraphs would be less readable, comparable to what Vietnamese text looks like.
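
To illustrate what segmentation changes in the output, here is a rough sketch using forward maximum matching, a simple dictionary-based technique. It is not what the programs mentioned above do (those are considerably more sophisticated), and the lexicon, the syllable table and the function names are all made up for the example. Capitalisation is ignored here.

```python
# Hypothetical per-character readings (one fixed reading per sinogram).
SYLLABLES = {
    "我": "wǒ", "在": "zài", "北": "běi",
    "京": "jīng", "大": "dà", "学": "xué",
}


def segment(text, lexicon):
    """Forward maximum matching: take the longest lexicon word at each position.

    Characters that start no known word are emitted as single-character words.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def spaced_pinyin(text, lexicon):
    """Join syllables inside each word and put spaces between words."""
    return " ".join(
        "".join(SYLLABLES[ch] for ch in word)
        for word in segment(text, lexicon)
    )


if __name__ == "__main__":
    sentence = "我在北京大学"
    print(spaced_pinyin(sentence, set()))             # no segmentation support
    print(spaced_pinyin(sentence, {"北京", "大学"}))  # with a small lexicon
```

With an empty lexicon every syllable stands alone, which is the Vietnamese-like result; with the lexicon, 北京 and 大学 are each written as one word. Greedy matching like this makes mistakes on ambiguous text, which is where manual annotation would come in.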

Also, so far only one user has responded to my post on the Chinese Wikipedia Community Portal; let me post a translation of that message here:

"I have also contributed a few pages there (what I did was copy text from Wikipedia, throw it into Google, and then copy the result back). However, I support the deletion (of the incubator); I haven't yet seen any country that formally uses phonetic transliteration for Chinese languages, other than for the Dungan language. As for importing, that's not necessary. As you can see, other than the wiki's main page, which of the pages aren't copied from Chinese Wikipedia?"