On Thu, 30 Aug 2018 at 00:50, Nicolas VIGNERON <vigneron.nicolas(a)gmail.com> wrote:
On Wed, 29 Aug 2018 at 17:56, C. Scott Ananian <cananian(a)wikimedia.org> wrote:
On Wed, Aug 29, 2018, 11:39 AM Michael Everson <everson(a)evertype.com> wrote:
On 21 Aug 2018, at 17:46, Phake Nick <c933103(a)gmail.com> wrote:
Wikipedia Pinyin Chinese (coded for now as cmn, Mandarin Chinese):
Here's a proposal that I think needs some serious discussion. Please read
the discussion on the linked Meta page.
Arguments against:
• No separate ISO 639-3 language code
Not necessary, as the IANA script code extensions can be used, e.g.
zh-Latn-pinyin
Agreed: BCP47.
• It is proposed that this can be handled with a script converter, per
(for example) T193366.
This will never add in capital letters correctly.
Could you describe this issue in more detail?
Hi,
My Chinese is not very good (there might be some mistakes in this mail)
but I can see how very difficult it could be to build a pinyin converter.
For instance, 北 is usually běi in pinyin (but probably not always, since a
sinogram's pronunciation can change depending on the context); it is the
word for "north" in English.
But in 北京 it's Běijīng, the capital of China (lit. "North capital"; the
same goes for other cities of the same name), with an uppercase initial
(and the same goes for derived words).
These cases are probably not unsolvable, but a table of Chinese words in
sinograms with their corresponding pinyin is probably needed.
And it could be worse: the same sequence of sinograms can have different
meanings. For instance, 王 is both a common noun for "king" and the
most common surname in China. So 王是大 could be either "The king is big" or
"Mr/Ms Wang is big". Here 王 is at the beginning of the sentence, so it's
always uppercased, but it could also be in the middle of a sentence, and
then it would be "wáng shì dà" in the first case and "Wáng shì dà" in the
second.
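
To make the two cases above concrete, here is a minimal Python sketch of a table-driven lookup. The table entries and the names `READINGS`, `readings` and `is_ambiguous` are made up for illustration; this is not how LanguageConverter works. It only shows that a word table settles 北京 but leaves 王 with two candidate spellings that only context can choose between.

```python
# Illustrative table only: each word maps to every pinyin spelling it may take.
READINGS = {
    "北": ["běi"],
    "北京": ["Běijīng"],       # proper noun: always capitalised
    "王": ["wáng", "Wáng"],    # "king" vs. the surname
    "是": ["shì"],
    "大": ["dà"],
}


def readings(word):
    """Return the candidate spellings; unknown words come back unchanged."""
    return READINGS.get(word, [word])


def is_ambiguous(word):
    """True when the table alone cannot pick a single spelling."""
    return len(readings(word)) > 1
```

With this table, `is_ambiguous("王")` is true while `is_ambiguous("北京")` is false, so a converter would still need context or a manual hint for the first one.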
Cheers, ~nicolas
_______________________________________________
Langcom mailing list
Langcom(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/langcom
Ah yes, space delimitation seems to require more capability than the
current LanguageConverter can provide. A number of Chinese word
segmentation programs (including several open-source projects) are
available on the internet. One of them claims 97.5% accuracy (and
apparently that's not even the best one), but additional work would still
be needed for LanguageConverter to integrate one of those programs before
it could transliterate Chinese text into the recommended orthography for
pinyin. Manual annotation, like the NoteTA template currently used on
Chinese Wikipedia, would probably still be needed to get perfect word
segmentation and therefore perfect space delimitation.
Without word segmentation support, the resulting paragraphs would be less
readable, comparable to what Vietnamese text looks like (a space between
every syllable).
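
As a rough illustration of the difference, here is a small Python sketch using forward maximum matching, one of the simplest dictionary-based segmentation methods. It is not one of the segmentation programs mentioned above, and the lexicon and function names are hypothetical; a real segmenter would use a large dictionary or a statistical model.

```python
# Hypothetical mini-lexicon mapping words and single characters to pinyin.
LEXICON = {
    "北京": "Běijīng",
    "大学": "dàxué",
    "北": "běi",
    "京": "jīng",
    "大": "dà",
    "学": "xué",
}


def segment(text, lexicon=LEXICON):
    """Forward maximum matching: take the longest lexicon entry at each position."""
    longest = max(len(word) for word in lexicon)
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            # Fall back to a single character when nothing longer matches.
            if text[i:i + size] in lexicon or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return words


def transliterate(words, lexicon=LEXICON):
    """Join the pinyin of each unit with spaces; unknown units pass through."""
    return " ".join(lexicon.get(word, word) for word in words)


with_segmentation = transliterate(segment("北京大学"))
without_segmentation = transliterate(list("北京大学"))
```

Here `with_segmentation` is "Běijīng dàxué", while `without_segmentation` is "běi jīng dà xué": every syllable is spaced apart and the capital letter is lost.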
Also, so far only one user has responded to my post on the Chinese WP
Community Portal. Here is a translation of that message:
"I have also contributed a few pages there (what I did was copy text
from Wikipedia, throw it into Google, and then copy the result back).
However, I support the deletion (of the incubator). I haven't yet seen any
country that formally uses phonetic transliteration for Chinese languages,
other than for the Dungan language. As for importing, that's not
necessary. As you can see, other than the wiki's main page, which of the
pages aren't copied from Chinese Wikipedia?"