On 26/07/05, Lars Aronsson <lars@aronsson.se> wrote:
There are two approaches to dictionaries: (1) the encyclopedic approach, trying to find (define, spellcheck, explain, ...) "all" words (and their inflections), or (2) the statistics-based approach, trying to find the most commonly used words. I think the OED is of the first kind, while many dictionaries in recent decades (built with the help of computers, extracting word frequency statistics from large text corpora) have been of the latter kind. Some would call (1) a 19th century approach.
The real difference is their handling of the least common words. The encyclopedic approach sees every missing word as a failure, while the statistics-based approach recognizes that there is an infinite number of words anyway (new ones are created every day) and some might be too uncommon to deserve a mention.
As a consequence, spellchecking in the statistics-based approach can never say that a spelling is "wrong" when it is missing from the dictionary, only that it probably is "uncommon" and thus suspect. The remedy for this is a statistics-based dictionary of common misspellings. Wikipedia article history can be used as a source for this. Just find all edits that changed one word, e.g. speling -> spelling, and you will have a fine dictionary of common spelling mistakes.
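To make that concrete, here is a minimal sketch (Python, with made-up function names; it assumes you already have pairs of old/new revision texts, e.g. pulled from a history dump) of how such a misspelling dictionary could be collected. Only an illustration of the idea, not a proposed implementation:

    import re
    from collections import Counter

    WORD = re.compile(r"\w+", re.UNICODE)

    def single_word_change(old_text, new_text):
        """Return (misspelling, correction) if exactly one word differs
        between the two revisions and everything else is identical."""
        old_words = WORD.findall(old_text)
        new_words = WORD.findall(new_text)
        if len(old_words) != len(new_words):
            return None
        diffs = [(o, n) for o, n in zip(old_words, new_words) if o != n]
        return diffs[0] if len(diffs) == 1 else None

    def collect_misspellings(revision_pairs):
        """revision_pairs: iterable of (old_text, new_text) taken from
        article history.  Counts (misspelling, correction) pairs, so
        frequent fixes like ('speling', 'spelling') float to the top."""
        counts = Counter()
        for old_text, new_text in revision_pairs:
            change = single_word_change(old_text, new_text)
            if change:
                counts[change] += 1
        return counts

    # collect_misspellings([("a speling error", "a spelling error")])
    # -> Counter({('speling', 'spelling'): 1})

Counting how often the same one-word fix recurs is what separates genuine misspellings from ordinary rewording.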
From a database point of view a Word has one Spelling.
This would be an example of the encyclopedic approach.
It is clear to me that the approach we want to take is the "encyclopedic" one, simply because we can handle it. The Oxford dictionary on paper cannot handle it "elegantly": it becomes unwieldy and spans a whole shelf. A good database can.
It is unacceptable for a Word to have only one Spelling, for the reasons described previously (German a-with-umlaut, Hebrew niqqud and optional vowels, etc.), but I am unable to find out who originally wrote that.
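To show what the alternative looks like, a small sketch of a Word carrying several Spellings (again Python, class names made up, purely illustrative rather than a schema proposal):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Spelling:
        text: str          # the surface form as written
        note: str = ""     # e.g. "ae transliteration", "with niqqud"

    @dataclass
    class Word:
        lemma: str
        spellings: List[Spelling] = field(default_factory=list)

    # A German word whose umlaut is also commonly written as "ae":
    haende = Word("Hände", [Spelling("Hände"),
                            Spelling("Haende", "ae transliteration of ä")])

The point is simply that the relation is one-to-many: the Word is the entry, and each variant written form hangs off it as its own Spelling.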