Heiko Evermann wrote:
Hi Gerard,
Actual work on UW itself is underway. Here you can find the data design: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments, and I am happy to say that many of the comments given have led to changes. I name but a few changes that came about this way: Can sign languages be included? Now they can. Can attestations be included? Now they can.
I want to propose (again) one important change: I think it is important that an entry can be tagged as being correct according to several orthographies within one language. From what I have understood so far, the word de: "ist" (English: "(he) is") must be inserted twice, once for the new German spelling and once for the old one (from before the recent reform), even though this word was not affected by the spelling reform. This applies to 95% of all German words, and each of them gets complete translation coverage into all languages. This is also a problem for Low Saxon (with our wide range of possible spellings). You tried to make your current design plausible to me when we talked about it recently, but I was not convinced that this huge multiplication of entries is a good idea. Maybe I misunderstood you somehow, but I still do not understand it.
The German situation is a bit difficult. In actual fact there are two orthographies only because two Bundesländer did not pass into law that the new spelling would apply there as well. The consequence is that both the old spelling and the new spelling are valid. In a typical situation, the words that have been changed would be dated and marked as outdated. From a practical point of view I would include only the changed words and the new words, and I would treat them as if these two Bundesländer had voted in favour. For lookup purposes the difference comes down to the SELECT statement in the query.
For Low Saxon the situation is different. There are many "correct" ways of spelling a word. Here it is essential to indicate which orthography or dialect a word belongs to. One reason is that many people are quite insistent that only one spelling should be used. This is in marked contrast to the practice for Neapolitan and Sicilian, where all spellings are accepted without much of a fuss.
The reason all words have to be explicitly identified as belonging to an orthography is that it allows us to do other things than just producing lexicological information from the Internet. What in your perception is a "multiplication of entries" is in actual fact no such thing; an expression is registered only once for each language, dialect or orthography.
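To make the lookup side of this exchange concrete, here is a minimal sketch of how tagging an expression with one or more orthographies could reduce the old/new distinction to a condition in a SELECT statement. The table names, column names, the "old"/"new" labels and the helper functions are all assumptions made for illustration; they are not taken from the UW data design.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the actual UW data design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE expression (
    id INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    spelling TEXT NOT NULL
);
CREATE TABLE expression_orthography (
    expression_id INTEGER NOT NULL REFERENCES expression(id),
    orthography TEXT NOT NULL
);
""")

def add_expression(language, spelling, orthographies):
    """Store a spelling once and tag it with every orthography it is valid in."""
    cur = conn.execute(
        "INSERT INTO expression (language, spelling) VALUES (?, ?)",
        (language, spelling),
    )
    conn.executemany(
        "INSERT INTO expression_orthography VALUES (?, ?)",
        [(cur.lastrowid, name) for name in orthographies],
    )

def spellings(language, orthography):
    """Return the spellings valid for one orthography of one language."""
    rows = conn.execute(
        """
        SELECT e.spelling
        FROM expression AS e
        JOIN expression_orthography AS o ON o.expression_id = e.id
        WHERE e.language = ? AND o.orthography = ?
        ORDER BY e.spelling
        """,
        (language, orthography),
    )
    return [row[0] for row in rows]

add_expression("de", "ist", ["old", "new"])  # unaffected by the reform
add_expression("de", "daß", ["old"])         # pre-reform spelling
add_expression("de", "dass", ["new"])        # post-reform spelling
```

In this sketch "ist" is stored once and still shows up in both `spellings("de", "old")` and `spellings("de", "new")`. Whether such a tagging table fits the rest of the design is exactly the open question discussed above.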
Then again, if we create a word count of the Wikipedia content and run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest Wikipedias, given the number of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing from Aspell. We will, however, not get a spellchecker for British or American English in this way.
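As a rough illustration of the word-count idea, the following sketch counts the words in a piece of text and splits them into those a spellchecker would accept and those it does not know. The plain Python set `known` stands in for a spellchecker dictionary; no actual Aspell call is made, and the function names and sample text are invented for this example.

```python
import re
from collections import Counter

def word_counts(text):
    """Count case-folded words; a word is a run of Unicode letters."""
    return Counter(w.lower() for w in re.findall(r"[^\W\d_]+", text))

def split_by_dictionary(counts, known_words):
    """Split counted words into those the word list knows and those it lacks."""
    accepted = {w: n for w, n in counts.items() if w in known_words}
    missing = {w: n for w, n in counts.items() if w not in known_words}
    return accepted, missing

# Stand-in data: 'known' plays the role of the spellchecker's dictionary.
sample = "The wiki is a wiki. The dictionary is open."
known = {"the", "is", "a", "dictionary", "open"}

counts = word_counts(sample)
accepted, missing = split_by_dictionary(counts, known)
```

Here `accepted` corresponds to the list that could be offered for inclusion, and `missing` to the words absent from the spellchecker. As noted above, nothing in such a count distinguishes British from American usage.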
Does that mean that you are thinking about importing huge numbers of words without definitions and without any translations?
When I have lists of words that are known to be correct, for instance because an organisation actively vouches for them, this is certainly the intention. I have a word list of some 222,930 Dutch words that are certified as correct. Those, I think, should be no issue. When the community identifies a list that is correct, I am sure they will want to upload it as well.
Thanks, GerardM