yuanml wrote:
To Mark Williamson:
By the way, since when am I trying to compare en/jp and tc/sc? I was merely responding to something somebody else said about SC and TC users "living in the same universe" or something.
I don't think I've lost my point.
tc/sc users share the same conceptual structure of the universe, but en/jp, en/tc, and en/sc do not. For example, the planet Venus in English is a term related to a goddess, but in both sc and tc the planet Venus is related to the same things - gold and stars. In one word, tc/sc is the same language. This is my point.
The tc/sc users share not only the same grammar, but also most of their knowledge systems. Let us set aside Chinese native knowledge, such as Chinese history and folklore, and talk about modern science. Terminologies of modern science have been introduced to China since the Ming Dynasty hundreds of years ago, and increased vastly after 1900. The Chinese knowledge system evolved into its modern form just after the New Culture Movement around 1920. But the tc/sc split happened around 1956, so tc and sc share the same background in their knowledge systems.
From 1949 to the 1980s, tc and sc evolved independently for lack of communication, so some new terminologies are different, such as in computer science. But after the 1980s, communication between tc and sc increased considerably.
Disclaimer: I can't read Chinese, so I don't know whether this is similar to any of the current or proposed solutions, but I have read some of the literature on the subject. My apologies if I'm going over old territory.
The best analogy is (I think) the difference between en-us and en-gb: the differences are mostly "spelling" and idioms. Automatic conversion is entirely possible, but occasionally imperfect. However, it should be possible to paraphrase around these problems where they occur and produce a single text that can be displayed (and edited) in either language and converted to-and-fro.
Perhaps one way to do it would be as in this fictitious example: suppose I have a simplified word that means "fish", which can be transformed to either (say) "FISH" or "STONE" in the traditional script. We could auto-convert this '''into the Wiki source''' at edit time to markup like
[fish=FISH|STONE]
which would display as "fish" highlighted in some way when the page is rendered in simplified script to show there is a potential transliteration problem, and as [FISH|STONE] when rendered in traditional script.
Then it can be cleaned up in markup by writing:
[fish=FISH]
or similar markup, which will force the traditional rendering to the correct word and remove the warning flag for simplified rendering, since there is now a one-to-one mapping. The same would apply in reverse for ambiguous conversions in the opposite direction. With any luck, this could be entirely lexicon-driven and would need no AI research, because we could find all pages containing ambiguities automatically, and then harness the copyediting skills of Wikipedians to find and disambiguate all the problematic text. We could even harness this when idioms or short phrases differ, to go:
[idiom in simplified=IDIOM IN TRADITIONAL]
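The whole scheme above can be sketched in a few lines of code. This is only an illustration of the idea, not any existing MediaWiki implementation: the lexicon, the word segmentation (input is pre-split into words), and the Latin placeholder words standing in for Chinese characters are all made up, following the fish/FISH/STONE example.

```python
import re

# Hypothetical lexicon mapping simplified words to their possible
# traditional renderings. "fish" is one-to-many (ambiguous); "water"
# is one-to-one (safe to convert automatically).
LEXICON = {
    "fish": ["FISH", "STONE"],
    "water": ["WATER"],
}

# Matches the proposed [simplified=TRAD1|TRAD2] markup.
MARKUP = re.compile(r"\[([^=\]]+)=([^\]]+)\]")


def to_wiki_source(simplified_words):
    """Auto-convert simplified text into wiki source at edit time,
    embedding [simp=TRAD1|TRAD2] markup wherever the lexicon offers
    more than one traditional candidate."""
    out = []
    for word in simplified_words:
        trads = LEXICON.get(word, [word])
        if len(trads) == 1:
            out.append(word)  # unambiguous: leave bare in the source
        else:
            out.append("[%s=%s]" % (word, "|".join(trads)))
    return " ".join(out)


def render(wiki_source, script):
    """Render wiki source in 'simplified' or 'traditional' script.
    Unresolved markup shows as the simplified word in the simplified
    view (a real renderer would highlight it as a warning), and as
    [TRAD1|TRAD2] in the traditional view. Once an editor pins the
    markup down to e.g. [fish=FISH], both views render cleanly."""
    out = []
    for token in wiki_source.split():
        m = MARKUP.fullmatch(token)
        if m:
            simp, trads = m.group(1), m.group(2).split("|")
        else:
            simp, trads = token, LEXICON.get(token, [token])
        if script == "simplified":
            out.append(simp)
        elif len(trads) == 1:
            out.append(trads[0])
        else:
            out.append("[%s]" % "|".join(trads))
    return " ".join(out)
```

For instance, `to_wiki_source(["water", "fish"])` yields `water [fish=FISH|STONE]`, which renders as `WATER [FISH|STONE]` in the traditional view until an editor disambiguates it; after cleanup to `[fish=FISH]`, the traditional view shows `FISH` and the simplified view shows plain `fish` with no warning.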
-- Neil