At 04:27 PM 09/13/2002 -0700, Ray Saintonge wrote:
Would we use an American or British spell-checker?
We create our own spell checker, but then we make the corrections
manually. This is similar to Google's spell checking feature. We first
parse out all the words in every article and make a table with each unique
word and the number of occurrences. This is typically a step in most
indexed search engines, but MySQL is really fast without this.
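
As a rough sketch of that counting step (in Python rather than anything the Wikipedia software actually uses; the count_words name, the sample articles, and the very simple word pattern are all my own assumptions):

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally with one
# internal apostrophe (e.g. "don't"). Real article text needs more care.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def count_words(articles):
    """Build a table of unique word -> number of occurrences."""
    counts = Counter()
    for text in articles:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

# Hypothetical sample data standing in for the article text.
articles = [
    "You will receive a reply. Most people receive one quickly.",
    "Did you recieve my letter? I hope to receive yours.",
]
word_counts = count_words(articles)
```

In practice the article text would come out of the database instead of a hard-coded list, but the resulting table has the same shape.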
It would be a safe assumption that words that are used the most
frequently are probably spelled correctly. True, "recieve" may be a very
common misspelling, but there are probably a lot more occurrences of
"receive". The flip side of this is that words that are used rarely are
more likely to be spelled wrong. Now we don't want to go off blindly
replacing these words (mostly because we wouldn't know what with) but they
are good candidates to review for replacement.
So if we could have an automated script that took these "least frequently
occurring words" and listed them for a human, they could say "Ah, recieve
should be receive, this is a misspelling." Then they enter the
correct spelling and we use the same method mentioned in my previous e-mail
to approve each individual change.
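
Something like the following could do the listing and collect the human's answers. This is only a sketch: rare_words, apply_reviews, the max_count cutoff, and the sample counts are names and numbers I made up for illustration, and the approval step from my previous e-mail is not shown.

```python
from collections import Counter

def rare_words(word_counts, max_count=1):
    """Return (word, count) pairs seen at most max_count times,
    rarest first, then alphabetically."""
    rare = [(w, n) for w, n in word_counts.items() if n <= max_count]
    return sorted(rare, key=lambda item: (item[1], item[0]))

def apply_reviews(candidates, decisions):
    """Keep only the candidates a human supplied a correction for.
    Rare words with no decision are left alone, never auto-replaced."""
    return {w: decisions[w] for w, _ in candidates if w in decisions}

# Hypothetical counts; a real table would come from the parsing step.
word_counts = Counter({"receive": 120, "recieve": 2, "the": 5000, "zymurgy": 1})

candidates = rare_words(word_counts, max_count=2)

# The human looks at the list and only fills in the real misspellings.
# "zymurgy" is rare but correct, so it gets no entry.
decisions = {"recieve": "receive"}
corrections = apply_reviews(candidates, decisions)
```

The point of keeping decisions separate is that the script only proposes words; nothing gets replaced unless a person has typed in the correct spelling.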
I have found that automating the mundane, repetitive portions of tasks
like this makes humans much more accurate. If you have to go through 10
of the same motions for every 1 that requires thought, then you are more
likely to not put any thought into that 1. But if it is only a 2:1 or, even
better, a 1:1 ratio, then you will put much more thought into it.
Again, I don't know if this is even remotely possible with the WikiPedia
software. I'd hate to do this off-line since it would be too easy to get
out of sync. Maybe an alternative interface for these kinds of edits?