At 04:27 PM 09/13/2002 -0700, Ray Saintonge wrote:
Would we use an American or British spell-checker?
We create our own spell checker, but then we make the corrections manually. This is similar to Google's spell-checking feature. We first parse out all the words in every article and build a table with each unique word and its number of occurrences. Building such a table is typically a step in most indexed search engines, though MySQL is really fast without it.
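A minimal sketch of that word-count step, in Python. It assumes the article texts are already available as plain strings; the function name `count_words` and the tokenizing regex are illustrative choices, not anything from the Wikipedia software.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, allowing internal apostrophes ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_words(articles):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    counts = Counter()
    for text in articles:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

# Made-up sample texts standing in for real articles.
sample = ["I receive mail.", "You recieve mail; we receive it."]
counts = count_words(sample)
print(counts["receive"], counts["recieve"])
```

In practice the same table could live in the database, with one row per unique word and a count column.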
It would be a safe assumption that the words used most frequently are probably spelled correctly. True, "recieve" may be a very common misspelling, but there are probably a lot more occurrences of "receive". The flip side is that words that are used rarely are more likely to be spelled wrong. Now, we don't want to go off blindly replacing these words (mostly because we wouldn't know what to replace them with), but they are good candidates to look at for replacement.
So if we had an automated script that took these "least frequently occurring words" and listed them for a human, they could say, "Ah, recieve should be receive; this is a misspelling." Then they enter the correct spelling, and we use the same method mentioned in my previous e-mail to approve each individual change.
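Here is a rough sketch of the listing step, assuming a word-to-count table already exists. The function name `rare_words`, the `max_count` threshold, and the sample counts are all made up for illustration; the script only lists candidates and leaves the decision to the human.

```python
from collections import Counter

def rare_words(counts, max_count=1, limit=50):
    """Return up to `limit` (word, count) pairs with count <= max_count, rarest first."""
    candidates = [(word, n) for word, n in counts.items() if n <= max_count]
    # Sort by count, then alphabetically so the review list is stable.
    candidates.sort(key=lambda pair: (pair[1], pair[0]))
    return candidates[:limit]

# Invented counts, for illustration only.
counts = Counter({"receive": 120, "recieve": 2, "teh": 1, "the": 5000})
for word, n in rare_words(counts, max_count=2):
    print(f"{word}\t{n}")
```

A rare word is not necessarily wrong (proper names and technical terms will show up too), which is why the list goes to a person rather than to an automatic replace.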
I have found that humans are much more accurate when the mundane, repetitive portions of tasks like this are automated. If you have to go through 10 of the same motions for every 1 that requires thought, then you are more likely to not put any thought into that 1. But if the ratio is only 2:1, or even better 1:1, then you will put much more thought into it.
Again, I don't know if this is even remotely possible with the WikiPedia software. I'd hate to do this off-line, since it would be too easy to get out of sync. Maybe there could be an alternative interface for these kinds of edits.