Re: [Wikitech-l] machine translaton of the articles...

20 Nov 2005

On 11/20/05, Bill Clark <wclarkxoom(a)gmail.com> wrote:
[snip]
...
 There are ways to get much, much better machine translation with a little
 extra effort from native speakers of the source language. If the words in an
 article are part-of-speech (POS) tagged (noun, verb, adjective, preposition,
 etc.) then the quality of machine translation of that text improves
 dramatically. [snip]

I've been making a little effort, on and off, to improve the
parsability of Wikipedia articles by Link Grammar
(http://bobo.link.cs.cmu.edu/link/). Generally the formal style used
on most articles provides easy material for Link Grammar to parse
correctly, and most of the statements that are unparsable contain clear
grammatical or spelling mistakes.

The two biggest sources of parse errors that I've run into using Link
Grammar, and that cannot be attributed to an obvious mistake, are the
omission of the serial comma and subject-area verbs which are not in
my dictionary.  I'm not sure why the serial comma isn't required in
the manual of style, as its omission sometimes causes human readers to
group objects incorrectly.
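
A rough sketch of how a missing serial comma might be flagged before
text reaches a parser. This is a plain regular-expression heuristic, not
Link Grammar, and the function name is hypothetical. It only looks for
"X, Y and Z" shapes, so it will also flag some sentences that are not
lists at all (for example "In 2005, John and Mary left").

```python
import re

# Heuristic pattern (an assumption for illustration): a word followed by a
# comma, then one or more words, then "and"/"or" with no comma before it.
MISSING_SERIAL_COMMA = re.compile(
    r"\b[\w'-]+, (?:[\w'-]+ )+?(?:and|or) [\w'-]+"
)


def find_missing_serial_commas(text):
    """Return the substrings that look like lists without a serial comma."""
    return MISSING_SERIAL_COMMA.findall(text)


if __name__ == "__main__":
    for sentence in (
        "The flag is red, white and blue.",
        "The flag is red, white, and blue.",
    ):
        print(sentence, "->", find_missing_serial_commas(sentence))
```

Anything it reports would still need a human to decide whether a comma
is actually missing.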

I think that machine readability for Wikipedia should be a long-term
goal, even if we do not intend to use it to facilitate translation.
Generally, text which is machine parsable without markup also tends to
be more easily readable by human readers who have widely varying
levels of skill. Once we factor in the improvements in searching,
translation, and machine intelligence, the desirability of machine
parsability becomes clearer.

For example, I've toyed with making my content filtering bot (output
available on freenode IRC in #wikipedia-suspectedits) use Link Grammar
to parse sentences and detect when someone has negated/inverted the
meaning of a sentence. Unfortunately I can't put this into production
on my bot because the machine parsability of Wikipedia is currently
too low, and link-grammar's performance on difficult-to-parse text is
currently too poor.
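
To make the negation idea concrete, here is a minimal sketch that
compares two revisions of a sentence. It is a much cruder stand-in for a
Link Grammar parse: it only diffs the word tokens and reports negation
words that were added or removed. The word list and function names are
assumptions made for illustration, not part of the bot.

```python
import difflib
import re

# Hypothetical, deliberately short list of negation words.
NEGATION_WORDS = {"not", "no", "never", "cannot", "neither", "nor"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def is_negation(token):
    """True for listed negation words and for contractions ending in n't."""
    return token in NEGATION_WORDS or token.endswith("n't")


def negation_changes(old, new):
    """Return (added, removed) negation tokens between two revisions."""
    old_tokens, new_tokens = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed += [t for t in old_tokens[i1:i2] if is_negation(t)]
        added += [t for t in new_tokens[j1:j2] if is_negation(t)]
    return added, removed


if __name__ == "__main__":
    print(negation_changes(
        "The treaty was ratified in 1920.",
        "The treaty was not ratified in 1920.",
    ))
```

A token diff like this cannot tell which clause a negation attaches to,
which is the part a real parse would be needed for.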
