Re: [Wikitech-l] Re: Spell checking in MediaWiki

31 Aug 2005


      Heiko Evermann wrote:
...
Hi Gerard,
...
Actual work on UW itself is underway. Here you can find the data design:
http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This
design is very much open for comments and I am happy to say that many
comments that were given have led to changes. I name but a few changes
that came about this way: can sign languages be included? Now they can.
Can attestations be included? Now they can.
I want to propose (again) to make one important change:
I think it is important that an entry within one language can be tagged as 
being correct according to several orthographies within one language. From 
what I understood so far, I find that the word
de: "ist" (English: "(he) is") must be inserted twice, once for the new German 
spelling and once for the old (before the recent reform), even though this 
word was not affected by the spelling reform. This applies to 95% of all 
German words. And each of them gets complete translation coverage into all 
languages. This is also a problem for Low Saxon (with our wide range of 
possible spellings). You have tried to make your current design plausible to 
me when we talked about it recently, but I was not convinced that this huge 
multiplication of entries is a good idea. Maybe I misunderstood you somehow, 
but I still do not understand it.
The German situation is a bit difficult. In actual fact there are only 
two orthographies because two Bundesländer did not pass as law that the 
new spelling would apply there as well. The consequence is that both old 
spelling and new spelling are valid. In a typical situation, the words 
that have been changed would get a date and be marked as outdated. From a 
practical point of view I would only have the changed words and the new 
words included and I would treat them as if these two Bundesländer had 
voted in favour. For lookup purposes the difference is one condition in 
the SELECT statement of the query.
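To make the lookup point concrete, here is a minimal sketch of how such a query might look. Everything in it is an assumption for illustration and is not taken from the UW data design: the table name `expression`, its columns, the labels "old" and "new", and the convention that a NULL orthography marks a word that the reform did not change.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a few German spellings (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expression ("
        " spelling TEXT NOT NULL,"
        " language TEXT NOT NULL,"
        " orthography TEXT)"  # assumption: NULL means unchanged by the reform
    )
    rows = [
        ("ist", "de", None),    # unchanged word, stored once
        ("daß", "de", "old"),   # pre-reform spelling
        ("dass", "de", "new"),  # post-reform spelling
    ]
    conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    return conn


def is_valid(conn, spelling, language, orthography):
    """Return True if the spelling is accepted in the given orthography."""
    cur = conn.execute(
        "SELECT 1 FROM expression"
        " WHERE spelling = ? AND language = ?"
        " AND (orthography IS NULL OR orthography = ?)",
        (spelling, language, orthography),
    )
    return cur.fetchone() is not None
```

With this layout, switching between old and new spelling changes only the value bound to the last placeholder, and an unchanged word such as "ist" is stored a single time.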
For Low Saxon the situation is different. There are many "correct" 
ways of spelling a word. Here it is essential that it is indicated 
which orthography or dialect a word belongs to. One reason is that many 
people are quite insistent that only one spelling should be used. This 
is in marked contrast to the practice of Neapolitan and Sicilian where 
all spellings are accepted without much of a fuss.
The reason why all words have to be explicitly identified as belonging 
to an orthography is that it allows us to do other things than just 
producing lexicological information from the Internet. What in your 
perception is a "multiplication of entries" is in actual fact no such 
thing; an expression is registered only once for each language, dialect 
or orthography.
...
...
Then again, if we create a wordcount on the Wikipedia content and run it
against a spellchecker, the resulting list should be spelled correctly
and could be included in UW. Particularly for our biggest wikipedias and
the number of topics covered, it should be a list that might be close to
the size of what Aspell has. We will also have a long list of words
missing in Aspell. We will however not get a spellchecker for British or
American English in this way.
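The wordcount idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any existing tool: `known_words` is a plain Python set standing in for the spellchecker's dictionary (no Aspell call is made), and the tokenisation is a simple regular expression that keeps runs of letters.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words; a word is a run of Unicode letters."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def split_by_dictionary(counts, known_words):
    """Split counted words into those the dictionary accepts and those it lacks.

    known_words is a hypothetical stand-in for the spellchecker's word list.
    """
    accepted = {w: n for w, n in counts.items() if w in known_words}
    missing = {w: n for w, n in counts.items() if w not in known_words}
    return accepted, missing
```

The `accepted` part corresponds to the list that could go into UW, while `missing` is the list of words absent from the spellchecker, which would still need review since it mixes genuine new words with misspellings.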
Does that mean that you are thinking about importing huge amounts of words 
without definition and without any translation?
When I have lists of words that are known to be correct, for instance 
because an organisation actively vouches for them, this is certainly the 
intention. I have a wordlist of some 222,930 Dutch words that are 
certified as correct. Those I think should be no issue. When the 
community identifies a list that is correct, I am sure they will want to 
upload it as well.
Thanks,
   GerardM
