Gerard Meijssen wrote:
Nikola Smolenski wrote:
You don't want to duplicate the entire Russian corpus (with inflections, it could easily rise to ten million words) just so that you could have each word both with and without diacritics. It makes sense to have only canonical spellings in the dictionary, and a bit of code to offer the nearest match when someone tries to retrieve a word spelled in a different way.
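A minimal sketch of that "bit of code", in Python, assuming the diacritics in question are the combining acute and grave accents used as stress marks in Russian teaching texts. The names `strip_stress_marks`, `build_index` and `lookup` are made up for illustration; the fallback to `difflib.get_close_matches` is just one possible way to pick a nearest match.

```python
import difflib
import unicodedata

# Assumption: only these combining marks are treated as optional.
# The breve of "й" and the diaeresis of "ё" are left alone on purpose.
STRESS_MARKS = {"\u0300", "\u0301"}  # combining grave, combining acute


def strip_stress_marks(word):
    """Return the word without combining stress accents."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in STRESS_MARKS)
    return unicodedata.normalize("NFC", kept)


def build_index(canonical_words):
    """Map each stripped, lower-cased key to its canonical spellings."""
    index = {}
    for word in canonical_words:
        key = strip_stress_marks(word).lower()
        index.setdefault(key, []).append(word)
    return index


def lookup(index, query, cutoff=0.7):
    """Exact match on the stripped key, else the nearest keys by similarity."""
    key = strip_stress_marks(query).lower()
    if key in index:
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=3, cutoff=cutoff)
    return [word for k in close for word in index[k]]


if __name__ == "__main__":
    index = build_index(["замок", "слово", "йод"])
    print(lookup(index, "за\u0301мок"))  # query typed with a stress mark
    print(lookup(index, "слова"))        # query that is only a near match
```

Only the canonical spellings are stored; every variant is reduced to the same key at query time, so the corpus is not duplicated.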
As a matter of fact I do want all inflections, even if they are ten million words. I do not expect to have all these inflections to start off with, but as far as I am concerned I want them all. I already have 222,930 Dutch words and they do include many inflections.
There are two approaches to dictionaries: (1) the encyclopedic approach, trying to find (define, spellcheck, explain, ...) "all" words (and their inflections), or (2) the statistics-based approach, trying to find the most commonly used words. I think the OED is of the first kind, while many dictionaries in recent decades (built with the help of computers, extracting word frequency statistics from large text corpora) have been of the latter kind. Some would call (1) a 19th century approach.
The real difference is their handling of the least common words. The encyclopedic approach sees every missing word as a failure, while the statistics-based approach recognizes that there is an infinite number of words anyway (new ones are created every day) and some might be too uncommon to deserve a mention.
As a consequence, spellchecking in the statistics-based approach can never say that a spelling is "wrong" when it is missing from the dictionary, only that it probably is "uncommon" and thus suspect. The remedy for this is a statistics-based dictionary of common misspellings. Wikipedia article history can be used as a source for this. Just find all edits that changed one word, e.g. speling -> spelling, and you will have a fine dictionary of common spelling mistakes.
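A rough sketch of the one-word-edit idea, in Python. It assumes the article history has already been fetched and reduced to (old text, new text) pairs of consecutive revisions; that input format and the names `single_word_change` and `collect_corrections` are hypothetical, and how to get the revisions out of Wikipedia is left out entirely.

```python
from collections import Counter


def single_word_change(old_text, new_text):
    """Return (old_word, new_word) if exactly one word differs, else None."""
    old_words = old_text.split()
    new_words = new_text.split()
    # Edits that add or remove words are not simple one-word replacements.
    if len(old_words) != len(new_words):
        return None
    changed = [(o, n) for o, n in zip(old_words, new_words) if o != n]
    if len(changed) == 1:
        return changed[0]
    return None


def collect_corrections(revision_pairs):
    """Count how often each one-word replacement occurs across all edits."""
    counts = Counter()
    for old_text, new_text in revision_pairs:
        change = single_word_change(old_text, new_text)
        if change is not None:
            counts[change] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample data standing in for real article history.
    pairs = [
        ("a common speling mistake", "a common spelling mistake"),
        ("check the speling", "check the spelling"),
        ("one two", "one two three"),
    ]
    for (wrong, right), n in collect_corrections(pairs).most_common():
        print(wrong, "->", right, n)
```

Not every one-word edit is a spelling fix (it could be a changed fact or a synonym), so in practice the counts would probably need a frequency threshold or a similarity check between the two words before a pair is trusted.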
From a database point of view a Word has one Spelling.
This would be an example of the encyclopedic approach.