On 11/20/05, Bill Clark <wclarkxoom(a)gmail.com> wrote:
[snip]
There are ways to get much, much better machine translation with a little
extra effort from native speakers of the source language. If the words in an
article are part-of-speech (POS) tagged (noun, verb, adjective, preposition,
etc.) then the quality of machine translation of that text improves
dramatically.
[snip]
I've been making a little effort on and off again to improve the
parsability of Wikipedia articles by Link Grammar
(http://bobo.link.cs.cmu.edu/link/). Generally the formal style used
on most articles provides easy material for link grammar to correctly
parse, and most of the statements that are unparsable contain clear
grammatical or spelling mistakes.
Generally the two biggest sources of parse errors that I've run into
using link grammar, and that cannot be attributed to an obvious
mistake, are the omission of the serial comma and subject-area verbs
which are not in my dictionary. I'm not sure why the serial comma isn't
required in the manual of style, as its omission sometimes causes
human readers to incorrectly group objects.
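As a rough illustration of how the missing serial comma could be
flagged before parsing, here is a minimal sketch. It is a plain regex
heuristic, not anything link grammar does itself. The function name and
the sample sentences are made up for this example, and it will produce
false positives on clauses such as "However, he came and left".

```python
import re

# Heuristic (an assumption for illustration): a comma, then a run of text
# containing no further comma, then "and"/"or" with no comma before it.
# This matches "A, B and C" but not "A, B, and C".
MISSING_SERIAL_COMMA = re.compile(r",\s+[^,.;:]+?\s+(?:and|or)\s+\w+")


def lacks_serial_comma(sentence):
    """Return True if the sentence looks like a list without a serial comma."""
    return MISSING_SERIAL_COMMA.search(sentence) is not None


if __name__ == "__main__":
    # Ambiguous: are Alice and Bob the parents, or two more people?
    print(lacks_serial_comma("I thank my parents, Alice and Bob."))
    # Unambiguous grouping with the serial comma.
    print(lacks_serial_comma("I thank my parents, Alice, and Bob."))
```

A check like this would only produce candidates for a human to review,
since it cannot tell a list from an ordinary compound clause.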
I think that machine readability for Wikipedia should be a long-term
goal, even if we do not intend to use it to facilitate translation.
Generally text which is machine parsable without markup also tends to
be more easily readable by human readers who have widely varying
levels of skill. Once we factor in the improvements in searching,
translation, and machine intelligence, the desirability of machine
parsability becomes clearer.
For example, I've toyed with making my content filtering bot (output
available on freenode IRC in #wikipedia-suspectedits) use link grammar
to parse sentences and detect when someone has negated/inverted the
meaning of a sentence. Unfortunately I can't put this into production
on my bot because the machine parsability of Wikipedia is currently
too low, and link-grammar's performance on difficult-to-parse text is
currently too poor.
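To make the negation idea concrete without depending on a parser, here
is a much cruder sketch. It is not the link grammar approach described
above: it only diffs the word tokens of the old and new versions of a
sentence and reports when the sole change is a negation word being
added or removed. The function names and the word list are assumptions
made for this example.

```python
import difflib
import re

# Small, incomplete list of negation tokens (an assumption for this sketch).
NEGATORS = {"not", "no", "never", "n't"}


def tokenize(text):
    """Lowercase word tokens, with a trailing "n't" split off ("isn't" -> "is", "n't")."""
    text = re.sub(r"n't\b", " n't", text.lower())
    return re.findall(r"[a-z']+", text)


def negation_flipped(old, new):
    """Return True if old and new differ only by added or removed negation tokens."""
    a, b = tokenize(old), tokenize(new)
    changed = []
    matcher = difflib.SequenceMatcher(a=a, b=b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(a[i1:i2])
            changed.extend(b[j1:j2])
    return bool(changed) and all(tok in NEGATORS for tok in changed)


if __name__ == "__main__":
    print(negation_flipped("The treaty was ratified in 1920.",
                           "The treaty was not ratified in 1920."))
    print(negation_flipped("The treaty was ratified in 1920.",
                           "The treaty was signed in 1920."))
```

This misses inversions made by swapping in an antonym and contractions
whose stem changes ("can't", "won't"), and that kind of case is where
a real parse would be needed.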