On 11/20/05, Bill Clark wclarkxoom@gmail.com wrote: [snip]
There are ways to get much, much better machine translation with a little extra effort from native speakers of the source language. If the words in an article are part-of-speech (POS) tagged (noun, verb, adjective, preposition, etc.) then the quality of machine translation of that text improves dramatically.
[snip]
I've been making an on-and-off effort to improve the parsability of Wikipedia articles by Link Grammar (http://bobo.link.cs.cmu.edu/link/). The formal style used in most articles generally gives Link Grammar easy material to parse correctly, and most of the statements that are unparsable contain clear grammatical or spelling mistakes.
The two biggest sources of parse errors I've run into with Link Grammar that cannot be attributed to an obvious mistake are the omission of the serial comma and subject-area verbs that are not in my dictionary. I'm not sure why the serial comma isn't required in the manual of style, as its omission sometimes causes human readers to group objects incorrectly.
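To give a rough idea of how sentences with a missing serial comma could be surfaced for review, here is a crude sketch in Python. It is not Link Grammar and does not parse anything: it is a regular-expression heuristic, and the names MISSING_SERIAL_COMMA and flag_missing_serial_comma are made up for illustration. It will also flag sentences that are not lists at all (for example "In 2005, Alice and Bob met."), so its output is only a set of candidates for a human to check.

```python
import re

# Heuristic (an assumption, not a grammar rule): a comma, then an item with no
# further punctuation, then "and"/"or" with no comma in front of it.
MISSING_SERIAL_COMMA = re.compile(r",\s+[^,.;:]+?\s+(?:and|or)\s+", re.IGNORECASE)


def flag_missing_serial_comma(text):
    """Return the sentences in `text` that look like they omit the serial comma."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if MISSING_SERIAL_COMMA.search(s)]


if __name__ == "__main__":
    sample = "The flag is red, white and blue. The flag is red, white, and blue."
    for sentence in flag_missing_serial_comma(sample):
        print("candidate:", sentence)
```

Something like this could at best produce a worklist; deciding whether a flagged sentence really is a list still needs a parser or a person.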
I think that machine readability for Wikipedia should be a long-term goal, even if we do not intend to use it to facilitate translation. Text that is machine parsable without markup also tends to be more easily readable by human readers, whose skill levels vary widely. Once we factor in the improvements in searching, translation, and machine intelligence, the desirability of machine parsability becomes clearer.
For example, I've toyed with making my content filtering bot (output available on freenode IRC in #wikipedia-suspectedits) use Link Grammar to parse sentences and detect when someone has negated or inverted the meaning of a sentence. Unfortunately I can't put this into production on my bot, because the machine parsability of Wikipedia is currently too low and Link Grammar's performance on difficult-to-parse text is too poor.
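To make the idea concrete, here is a minimal sketch of the kind of check I mean, written in Python. It is not what the bot does and it does not use Link Grammar: it is a token-level stand-in that compares the old and new versions of a sentence and reports when the only words added or removed are negation markers. The NEGATIONS list and the function names are assumptions chosen for illustration, and the list is certainly incomplete.

```python
import difflib
import re

# Assumed, incomplete set of negation markers.
NEGATIONS = {"not", "no", "never", "n't", "cannot", "without"}


def tokenize(sentence):
    """Lowercase word tokens, with "isn't" split into "is" + "n't"."""
    spaced = sentence.lower().replace("n't", " n't")
    return re.findall(r"n't|[a-z0-9]+", spaced)


def is_negation_flip(old, new):
    """True if old and new differ only by added or removed negation words."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed.extend(a[i1:i2])  # tokens removed from the old sentence
            changed.extend(b[j1:j2])  # tokens added in the new sentence
    return bool(changed) and all(tok in NEGATIONS for tok in changed)


if __name__ == "__main__":
    before = "The treaty was ratified in 1920."
    after = "The treaty was not ratified in 1920."
    print(is_negation_flip(before, after))
```

A word-level diff like this misses inversions made by swapping an antonym ("accepted" for "rejected") or by restructuring the sentence, which is where a real parse would be needed.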