Spell checking in MediaWiki

List overview All Threads
Download

newer

older

Dump progress

Re: [MediaWiki-CVS]...

Jeffrey McGee

19 Aug 2005 19 Aug '05

1:30 a.m.

Howdy,

I'd like to add a spell checker to MediaWiki using the pspell library. (Pspell is part of php and it uses libaspell.) It doesn't help that I'm new to the MediaWiki code base and PHP isn't exactly my favorite language. (I wouldn't even call it my _third_ favorite.) Anyway, I'd like to get a little feedback and advice on where to go from here.

I know a few people have proposed working on spell check before: http://mail.wikimedia.org/pipermail/wikitech-l/2004-March/021358.html But, as best I can tell no one has gone anywhere. Does anyone know what happened to User:Archivist's spellchecker?

Right now I have a proof-of-concept running on my computer. You can see it at http://66.205.125.240/spell/index.php/Special:Spellcheck/Main_Page

It is a SpecialPage that reads the article from the database, spell checks it, lets the user choose the words from the drop-down box, and then sends a FauxRequest to EditPage. Eventually I'd like to add it to EditPage, but I started out with a special page so that I did not have to deal with the complexity of EditPage.

Here's how I'd like the final version to work: # There's a button at the bottom of an EditPage beside 'Show Perview' and 'Show Changes' labeled 'Spell Check'. # When the user clicks 'Spell Check', they get a preview of their edit where misspelled words are replaced with drop-down boxes. # The user changes the words they think are mispelled to one of the suggestions or leaves it as is. When they click 'Show Preview', they go back to the preview page.

A few questions:

Do I not need to deal with multi-byte character functions like mb_substr since all the languages use utf-8?

Should the user spell check a preview or the wikitext?

If a word is misspelled in several places, should the user be asked once for the word or should the user be asked everytime the word appears?

Thanks, Jeff McGee

Show replies by date

Walter Vermeir

19 Aug 19 Aug

8:53 p.m.

Jeffrey McGee schreef:

...

Howdy,

I'd like to add a spell checker to MediaWiki using the pspell library.

[cut]

This function can be useful. But now whit the google toolbar for Firefox you can do very easy spell checking. For IE is there also a version. The need is reduced.

http://toolbar.google.com/firefox/

-- Ook een artikeltje schrijven? WikipediaNL, de vrije GNU/FDL encyclopedie http://www.wikipedia.be

Niklas Laxström

9:46 p.m.

On 19/08/05, Walter Vermeir walter@wikipedia.be wrote:

...

This function can be useful. But now whit the google toolbar for Firefox you can do very easy spell checking. For IE is there also a version. The need is reduced.

In my understanding google toolbar supports only English. Does pspell support other languages? And there is people who don't want to install google toolbar anyway.

---- Niklas Laxström

Jeffrey McGee

10:57 p.m.

On 8/19/05, Niklas Laxström niklas.laxstrom@gmail.com wrote:

...

On 19/08/05, Walter Vermeir walter@wikipedia.be wrote:

...
This function can be useful. But now whit the google toolbar for Firefox you can do very easy spell checking. For IE is there also a version. The need is reduced.

In my understanding google toolbar supports only English. Does pspell support other languages? And there is people who don't want to install google toolbar anyway.

Niklas Laxström

Pspell (http://www.php.net/manual/en/ref.pspell.php) uses the GNU Aspell library and Aspell claims to support quite a few languages: http://aspell.sourceforge.net/man-html/Supported.html

I have tested my prototype a little with Spanish: http://66.205.125.240/spell/index.php/User_talk:192.168.1.101 But you won't be able to test it since you can't change $wgLanguageCode.

Jeff

Mark Williamson

20 Aug 20 Aug

5:46 a.m.

While Walter's advice about checking spelling with the Google toolbar can be useful for some languages (I believe it works for English, German, Spanish, French, and Swedish, maybe I missed one?), it won't be useful for everybody else -- what is a Hindi Wikipedian or a Xhosa Wikipedian to do for spellchecking?

Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

Mark

On 19/08/05, Jeffrey McGee jeffamcgee@gmail.com wrote:

...

On 8/19/05, Niklas Laxström niklas.laxstrom@gmail.com wrote:

...
On 19/08/05, Walter Vermeir walter@wikipedia.be wrote:

...
This function can be useful. But now whit the google toolbar for Firefox you can do very easy spell checking. For IE is there also a version. The need is reduced.

In my understanding google toolbar supports only English. Does pspell support other languages? And there is people who don't want to install google toolbar anyway.

Niklas Laxström

Pspell (http://www.php.net/manual/en/ref.pspell.php) uses the GNU Aspell library and Aspell claims to support quite a few languages: http://aspell.sourceforge.net/man-html/Supported.html

I have tested my prototype a little with Spanish: http://66.205.125.240/spell/index.php/User_talk:192.168.1.101 But you won't be able to test it since you can't change $wgLanguageCode.

Jeff _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Heiko Evermann

7:29 a.m.

Hi everyone,

...

Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

Please make sure that it also works with GNU/aspell.

Kind regards,

Heiko Evermann

Sabine Cretella

8:39 a.m.

...

...
Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

Please make sure that it also works with GNU/aspell.

Yes, this is very important indeed - we need it mainly for the minor languages!!!

Also Sicilian, Neapolitan etc.

Ciao, Sabine

___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it

Ginu Gmail

8:18 a.m.

Dear Mark,

For Google am doing the localization. Now am correcting their main page in Malayalam. They told it will take some time to correct that in the main page. And using machine translation for Malayalam will not be good, I can show that through so many examples.

If 1 against 3 is for the right thing, I believe I have to go ahead. As a journalist I am always putting that. If you feel the 3 is right then skip my mails. Carry on the wrong thing in wikipedia if u like.

If u need I can make the same comment from 100's. Dont thing bad in my mail. I am always writing straight

Regards

Ginu George

-----Original Message----- From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Mark Williamson Sent: Saturday, August 20, 2005 9:46 AM To: Wikimedia developers Subject: Re: [Wikitech-l] Re: Spell checking in MediaWiki

Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

Mark

On 19/08/05, Jeffrey McGee jeffamcgee@gmail.com wrote:

...

On 8/19/05, Niklas Laxström niklas.laxstrom@gmail.com wrote:

...
On 19/08/05, Walter Vermeir walter@wikipedia.be wrote:

...
This function can be useful. But now whit the google toolbar for

Firefox

...

...
...
you can do very easy spell checking. For IE is there also a version.

The

...

...
...
need is reduced.

In my understanding google toolbar supports only English. Does pspell support other languages? And there is people who don't want to install google toolbar anyway.

Niklas Laxström

Pspell (http://www.php.net/manual/en/ref.pspell.php) uses the GNU Aspell library and Aspell claims to support quite a few languages: http://aspell.sourceforge.net/man-html/Supported.html

I have tested my prototype a little with Spanish: http://66.205.125.240/spell/index.php/User_talk:192.168.1.101 But you won't be able to test it since you can't change $wgLanguageCode.

Jeff _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l -- No virus found in this incoming message. Checked by AVG Anti-Virus. Version: 7.0.338 / Virus Database: 267.10.13/78 - Release Date: 8/19/2005 -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.338 / Virus Database: 267.10.13/78 - Release Date: 8/19/2005

Mark Williamson

10:35 a.m.

Ginu,

It isn't that I don't believe you.

Please understand: nobody has anything to make a good decision here.

You are telling people, "the Malayalam Wikipedia is wrong". But none of us knows Malayalam. So we cannot decide whether you are correct.

The people already at the Malayalam Wikipedia seem to have no problem. They are all Malayalees, so they are not using machine translation because it is their native language. There are at least 3 of them.

This is not to say that you are wrong -- you might be right. But, do you have anything you might convince people with? To simply say, "I am right, and everybody else is wrong" does not help -- no change will be made based on that alone. But if you have something more, that a grammarian or a PhD in Malayalam language or some such expert agrees with you, or have citations from a book for your corrections, then there may be room for changes.

Again, though, I think the best avenue for sorting this out is for you to discuss it on the Malayalam Wikipedia. Remember, Wikipedia is freely editable by anybody, so you may add comments on discussion pages and invite feedback from the existing community.

Also, Google localisation is by no means a prior qualification -- anybody can translate the user interface for Google, no matter what their knowledge, it is a largely open system, and there are others working on Malayalam localisation for Google as well (in fact, I have e-mailed with some of them before).

I'm also thinking a bit that this may be a font problem. It seems recently somebody made changes to the Malayalam Wikipedia which was due to a bad font, instead of real spelling errors. Please see http://ml.wikipedia.org/wiki/%E0%B4%B5%E0%B4%BF%E0%B4%95%E0%B5%8D%E0%B4%95%E... -- if you are using certain fonts, it will appear with "spelling mistakes", but this is because of the font. You need a better font, to fix the apparent errors.

Mark

On 20/08/05, Ginu Gmail ginu.george@gmail.com wrote:

...

Dear Mark,

For Google am doing the localization. Now am correcting their main page in Malayalam. They told it will take some time to correct that in the main page. And using machine translation for Malayalam will not be good, I can show that through so many examples.

If 1 against 3 is for the right thing, I believe I have to go ahead. As a journalist I am always putting that. If you feel the 3 is right then skip my mails. Carry on the wrong thing in wikipedia if u like.

If u need I can make the same comment from 100's. Don't thing bad in my mail. I am always writing straight

Regards

Ginu George

-----Original Message----- From: wikitech-l-bounces@wikimedia.org [mailto:wikitech-l-bounces@wikimedia.org] On Behalf Of Mark Williamson Sent: Saturday, August 20, 2005 9:46 AM To: Wikimedia developers Subject: Re: [Wikitech-l] Re: Spell checking in MediaWiki

While Walter's advice about checking spelling with the Google toolbar can be useful for some languages (I believe it works for English, German, Spanish, French, and Swedish, maybe I missed one?), it won't be useful for everybody else -- what is a Hindi Wikipedian or a Xhosa Wikipedian to do for spellchecking?

Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

Mark

On 19/08/05, Jeffrey McGee jeffamcgee@gmail.com wrote:

...
On 8/19/05, Niklas Laxström niklas.laxstrom@gmail.com wrote:

...
On 19/08/05, Walter Vermeir walter@wikipedia.be wrote:

...
This function can be useful. But now whit the google toolbar for

Firefox

...
...
...
you can do very easy spell checking. For IE is there also a version.

The

...
...
...
need is reduced.

In my understanding google toolbar supports only English. Does pspell support other languages? And there is people who don't want to install google toolbar anyway.

Niklas Laxström

Pspell (http://www.php.net/manual/en/ref.pspell.php) uses the GNU Aspell library and Aspell claims to support quite a few languages: http://aspell.sourceforge.net/man-html/Supported.html

I have tested my prototype a little with Spanish: http://66.205.125.240/spell/index.php/User_talk:192.168.1.101 But you won't be able to test it since you can't change $wgLanguageCode.

Jeff _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- No virus found in this incoming message. Checked by AVG Anti-Virus. Version: 7.0.338 / Virus Database: 267.10.13/78 - Release Date: 8/19/2005

-- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.338 / Virus Database: 267.10.13/78 - Release Date: 8/19/2005

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Pablo Saratxaga

7:18 p.m.

Kaixo!

On Fri, Aug 19, 2005 at 10:46:08PM -0700, Mark Williamson wrote:

...

Certainly, ispell libraries are a much more appealing option because they support a huge list of languages, and it's not extremely difficult to add more.

I almost agree; only a remark: ispell is now obsolete, it has been superseeded with aspell, which fully supports utf-8 and a lot more languages (as long as the language uses only, or can be decomposed into, up to 210 different elements; that is the case of korean, even if encoding of korean uses several thousands precombined characters; it can be decomposed in basic letters for the spell checking; a korean aspell dictionnary has however yet to be written)

There are some on-line php interfaces that use aspell on the server to spell check some text trough a web interface; so it could indeed be added to the wikipedia, and it may be usefull; but for heavy traffic wikipedias it may also be a serious performance problem if it is used a lot...

There are dictionnaries for aspell for 76 different languages, and of course more can be added; look at: http://aspell.sourceforge.net/man-html/Supported.html

Ki ça vos våye bén, Pablo Saratxaga

http://chanae.walon.org/pablo/ PGP Key available, key ID: 0xD9B85466 [you can write me in Walloon, Spanish, French, English, Catalan or Esperanto] [min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]

Lars Aronsson

22 Aug 22 Aug

7:49 p.m.

Pablo Saratxaga wrote:

...

There are dictionnaries for aspell for 76 different languages, and of course more can be added; look at: http://aspell.sourceforge.net/man-html/Supported.html

But not all Aspell dictionaries are free. This is an area where Wiktionary and the Wikimedia Foundation could make a difference. For example, download the English dictionaries in ftp://ftp.gnu.org/gnu/aspell/dict/en/aspell6-en-6.0-0.tar.bz2 and read the enclosed text file named "Copyright".

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Sabine Cretella

10:21 p.m.

Lars Aronsson wrote:

...

Pablo Saratxaga wrote:

...
There are dictionnaries for aspell for 76 different languages, and of course more can be added; look at: http://aspell.sourceforge.net/man-html/Supported.html

But not all Aspell dictionaries are free. This is an area where Wiktionary and the Wikimedia Foundation could make a difference. For example, download the English dictionaries in ftp://ftp.gnu.org/gnu/aspell/dict/en/aspell6-en-6.0-0.tar.bz2 and read the enclosed text file named "Copyright".

Well Lars, we are not so far away from making a different point in that. It is one of the usages we have in mind with Ultimate Wiktionary. Since there we will have words in all languages and have these words in a relational database it is easy to "extract an actual spellchecker" every now and then. Being Ultimate Wiktionary data under GFDL also the spell checker is under GFDL. We are also thinking about a feedback mechanism that allows for updates and corrections. This means that for example who uses our spell checker in OmegaT or OOo or wherever gives us back the additional and/or corrected data and this can than be integrated. Also every manual edit on UW will contribute to increase the validity of spell checkers. We will for sure have over 200 languages and many millions of words (of course in these several languages). Therefore we will have also "start-up-spellcheckers" (in languages where there are no other spellcheckers) where people can work on online and offline and working on them (for example in any African language offline, where people have no stable internet connection) they also automatically contribute to UW.

Really this would need much more "digging deeper" - but I presume that even with these few lines you can imagine what it would mean to have spellcheckers in over 200 languages in gfdl.

Ciao, Sabine

___________________________________ Yahoo! Mail: gratis 1GB per i messaggi e allegati da 10MB http://mail.yahoo.it

Lars Aronsson

24 Aug 24 Aug

3:18 a.m.

Sabine Cretella wrote:

...

Well Lars, we are not so far away from making a different point in that. It is one of the usages we have in mind with Ultimate Wiktionary. Since there we will have words in all languages and have these words in a relational database it is easy to "extract an actual spellchecker" every now and then.

I keep hearing these promises, but "seeing is believing"! Have you started actual work on UW yet, or are you sitting idle while waiting for Wikidata to be released? Will there be an English free dictionary that can compete in size and quality with Aspell's current dictionary by the end of 2005? Or by the end of 2006?

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Gerard Meijssen

7:44 a.m.

Lars Aronsson wrote:

...

Sabine Cretella wrote:

...
Well Lars, we are not so far away from making a different point in that. It is one of the usages we have in mind with Ultimate Wiktionary. Since there we will have words in all languages and have these words in a relational database it is easy to "extract an actual spellchecker" every now and then.

I keep hearing these promises, but "seeing is believing"! Have you started actual work on UW yet, or are you sitting idle while waiting for Wikidata to be released? Will there be an English free dictionary that can compete in size and quality with Aspell's current dictionary by the end of 2005? Or by the end of 2006?

Hoi, Actual work on UW itself is underway. Here you can find the data desisgn http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments and I am happy to say that many comments that were given have led to changes. I name but a few changes that came about this way; Can sign languages be included - now they can, Can attestations be included - now they can.

As Ultimate Wiktionary is dependent on Wikidata, there is little option for us but to wait untill it is ready. It is really important that Wikidata is done well because it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary.

When both Aspell and Ultimate Wiktionary are considered Free, it should be possible for us to work together. Once we find this cooperation possible, we could host the data currently included in Aspell in UW. In return we would provide a publicly accessible website where it is easy to add new words thay will end up in Aspell. Even when we do not cooperate, there will be languages that currently do not have a spellchecker. These spellcheckers I am particularly exited about because this is where we will be able to add value.

Without a massive infusion of data, it will be hard to predict when we have as many words as Aspell does for languages where Aspell has a dictionary.

Then again, if we create a wordcount on the Wikipedia content, run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest wikipedias and the amount of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing in Aspell. We will however not get a spellchecker for British or American in this way.

Thanks, GerardM.

Thanks, GerardM

Mark Williamson

10:06 a.m.

Hi,

That page states that the ways for writing signed languages are closer to Chinese characters than Latin script.

This is completely incorrect.

Please see below for my suggestion on signed languages.

There are 4 main ways of writing signed languages:

1) With word-for-word glosses in a spoken language. For ASL or BSL this is usually English; for InSL it may be Hindi or another Indian language or English; for Chinese SL it will probably be Chinese. While this is suitable in most cases for writing whole sentences and recording syntax and grammar, it gives no specific information about what a sign looks like and thus is completely unsuitable.

2) Sutton SignWriting. This writing system is copyrighted and use of it is not free. However, it is currently the most widely used of any of the 3 main sign-writing systems today, at least by deaf people (researchers are more likely to use HamNoSys or Stokoe). It is more like Korean letters: each part of any given symbol says something specific about how to form the sign, but they are combined to form what may appear to the uninitiated to be a logographic sign, when in fact it is most certainly not. More information at http://www.omniglot.com/writing/signwriting.htm

3) HamNoSys. A very complex system that can be best compared with a "Narrow transcription" of a spoken language using the IPA http://en.wikipedia.org/wiki/Phonetic_transcription#Narrow_and_broad_transcr... , it is used mostly by researchers. However, it's much easier to represent on computers than Sutton SignWriting. More information at http://www.sign-lang.uni-hamburg.de/Projects/HamNoSys.html

4) Stokoe. Stokoe is actually, in a sense, the basis of HamNoSys. It is more equivalent to the Latin script than any of the other systems, in fact it borrows many letters from it. Its use is restricted mostly to researchers today. Some people accept the minor changes made by it by a BSL researcher, however any more drastic changes are usually considered to be separate systems.

Suggestion: Use http://www.unm.edu/~grvsmth/signsynth/ -- data will be stored as a computer representation of Stokoe, but can be played back. Demo available at http://www.panix.com/~grvsmth/signsynth/ ...

Although the native rendering for SignSynth is VRML (Virtual Reality Markup Language), I imagine it would be quite easy to convert automatically or even to make it render initially in a more widely-used format, such as some sort of video or animation format.

This would limit the space that would be taken up by the storage of so many videos -- there are hundreds of signed languages on the planet today, and to store videos for each word in all of them (or even just the major ones and the 1000 most frequent words) would take up a lot of space.

While it is obviously not perfect (look at the forehead... oh my is she ugly!), it's definitely good enough that someone could imitate it and their imitation would be the proper sign very accurately, and it's also good enough that anybody who can speak sign language (there are better terms than "speak", but they seem awkward to me) should be able to understand it well.

Presumably, the developer of the software could be solicited for further cooperation.

Mark

On 24/08/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Lars Aronsson wrote:

...
Sabine Cretella wrote:

...
Well Lars, we are not so far away from making a different point in that. It is one of the usages we have in mind with Ultimate Wiktionary. Since there we will have words in all languages and have these words in a relational database it is easy to "extract an actual spellchecker" every now and then.

I keep hearing these promises, but "seeing is believing"! Have you started actual work on UW yet, or are you sitting idle while waiting for Wikidata to be released? Will there be an English free dictionary that can compete in size and quality with Aspell's current dictionary by the end of 2005? Or by the end of 2006?

Hoi, Actual work on UW itself is underway. Here you can find the data desisgn http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments and I am happy to say that many comments that were given have led to changes. I name but a few changes that came about this way; Can sign languages be included - now they can, Can attestations be included - now they can.

As Ultimate Wiktionary is dependent on Wikidata, there is little option for us but to wait untill it is ready. It is really important that Wikidata is done well because it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary.

When both Aspell and Ultimate Wiktionary are considered Free, it should be possible for us to work together. Once we find this cooperation possible, we could host the data currently included in Aspell in UW. In return we would provide a publicly accessible website where it is easy to add new words thay will end up in Aspell. Even when we do not cooperate, there will be languages that currently do not have a spellchecker. These spellcheckers I am particularly exited about because this is where we will be able to add value.

Without a massive infusion of data, it will be hard to predict when we have as many words as Aspell does for languages where Aspell has a dictionary.

Then again, if we create a wordcount on the Wikipedia content, run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest wikipedias and the amount of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing in Aspell. We will however not get a spellchecker for British or American in this way.

Thanks, GerardM.

Thanks, GerardM _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Delirium

27 Aug 27 Aug

7:34 a.m.

Mark Williamson wrote:

...

With word-for-word glosses in a spoken language. For ASL or BSL

this is usually English; for InSL it may be Hindi or another Indian language or English; for Chinese SL it will probably be Chinese. While this is suitable in most cases for writing whole sentences and recording syntax and grammar, it gives no specific information about what a sign looks like and thus is completely unsuitable.

Why does this make it completely unsuitable? A large proportion of Chinese characters give no specific information about how they are pronounced---and indeed are pronounced radically different by Mandarin, Cantonese, and Japanese speakers---but that doesn't seem to have led to them being deemed unsuitable for use in a written language. At the very least, using Chinese characters to write Chinese SL is no worse than using Kanji to write Japanese.

-Mark

Mark Williamson

9:46 a.m.

I meant that it's unsuitable for a dictionary.

Any good dictionary will tell you how a Chinese character is pronounced in any given variety.

Chinese characters wouldn't be so difficult to learn for somebody who only knew a signed language (in much of the developing world, deaf people don't learn to read and write spoken languages), because they can be associated with signs on a morphemic basis.

However, a spoken language written alphabetically used for glosses will require that any given deaf person learn said spoken language before being able to read a text.

This also means that a deaf man in Boston who wants to know the ASL equivalent of the English word "boarish" will not get an answer by looking in a dictionary and finding the gloss "boarish" in the field for an ASL translation.

Mark

On 27/08/05, Delirium delirium@hackish.org wrote:

...

Mark Williamson wrote:

...

With word-for-word glosses in a spoken language. For ASL or BSL

this is usually English; for InSL it may be Hindi or another Indian language or English; for Chinese SL it will probably be Chinese. While this is suitable in most cases for writing whole sentences and recording syntax and grammar, it gives no specific information about what a sign looks like and thus is completely unsuitable.

Why does this make it completely unsuitable? A large proportion of Chinese characters give no specific information about how they are pronounced---and indeed are pronounced radically different by Mandarin, Cantonese, and Japanese speakers---but that doesn't seem to have led to them being deemed unsuitable for use in a written language. At the very least, using Chinese characters to write Chinese SL is no worse than using Kanji to write Japanese.

-Mark

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Gerard Meijssen

10:11 a.m.

Mark Williamson wrote:

...

I meant that it's unsuitable for a dictionary.

Any good dictionary will tell you how a Chinese character is pronounced in any given variety.

Chinese characters wouldn't be so difficult to learn for somebody who only knew a signed language (in much of the developing world, deaf people don't learn to read and write spoken languages), because they can be associated with signs on a morphemic basis.

However, a spoken language written alphabetically used for glosses will require that any given deaf person learn said spoken language before being able to read a text.

This also means that a deaf man in Boston who wants to know the ASL equivalent of the English word "boarish" will not get an answer by looking in a dictionary and finding the gloss "boarish" in the field for an ASL translation.

Mark

Hoi First of all, there are three different types of languages. There are the written languages, the spoken languages and the signed languages. They are quite different. A deaf person does not learn a spoken language, he learns a written language. The difference is quite crucial. How a written language relates to a spoken language is something that a deaf person does not apreciate. This relation is often tenuous at best. It makes our written language as abstract as Chines characters for deaf people.

Your assumption that deaf people have to learn written languages is correct. In many ways it is an essential skill. It is awkward that a dictionary needs to have another language to make it accesible. To a large extend this is just that, awkward. It is not a reason not to include sign languages in Ultimate Wiktionary as UW intends to have all words in all languages. When you have a look at a recent version of the datadesign, you will find that I included Wolfgang Georgdorf's methodology of providing metadata about signs in there as well.

So when someone want to find ASL for boarish, and he is literate, he will be able to find it. Being illiterate makes using a computer practically impossible. The idea of having a user interface for ASL will be a dream, it will not be feasible in UW mark I. The best way it might work is that you have the words in the UI and when you click it, you get a signed instruction.

This discussion is not really relevant to the WIKITECH mailinglist so I crosspost it to the WIKTIONARY mailing list where it is of interest.

Thanks, GerardM

...

On 27/08/05, Delirium delirium@hackish.org wrote:

...
Mark Williamson wrote:

...

With word-for-word glosses in a spoken language. For ASL or BSL

this is usually English; for InSL it may be Hindi or another Indian language or English; for Chinese SL it will probably be Chinese. While this is suitable in most cases for writing whole sentences and recording syntax and grammar, it gives no specific information about what a sign looks like and thus is completely unsuitable.

Why does this make it completely unsuitable? A large proportion of Chinese characters give no specific information about how they are pronounced---and indeed are pronounced radically different by Mandarin, Cantonese, and Japanese speakers---but that doesn't seem to have led to them being deemed unsuitable for use in a written language. At the very least, using Chinese characters to write Chinese SL is no worse than using Kanji to write Japanese.

-Mark

Mark Williamson

10:38 p.m.

Hi Gerard,

Please re-read my original e-mail.

Mark.

On 27/08/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Mark Williamson wrote:

...
I meant that it's unsuitable for a dictionary.

Any good dictionary will tell you how a Chinese character is pronounced in any given variety.

Chinese characters wouldn't be so difficult to learn for somebody who only knew a signed language (in much of the developing world, deaf people don't learn to read and write spoken languages), because they can be associated with signs on a morphemic basis.

However, a spoken language written alphabetically used for glosses will require that any given deaf person learn said spoken language before being able to read a text.

This also means that a deaf man in Boston who wants to know the ASL equivalent of the English word "boarish" will not get an answer by looking in a dictionary and finding the gloss "boarish" in the field for an ASL translation.

Mark

Hoi First of all, there are three different types of languages. There are the written languages, the spoken languages and the signed languages. They are quite different. A deaf person does not learn a spoken language, he learns a written language. The difference is quite crucial. How a written language relates to a spoken language is something that a deaf person does not apreciate. This relation is often tenuous at best. It makes our written language as abstract as Chines characters for deaf people.

Your assumption that deaf people have to learn written languages is correct. In many ways it is an essential skill. It is awkward that a dictionary needs to have another language to make it accesible. To a large extend this is just that, awkward. It is not a reason not to include sign languages in Ultimate Wiktionary as UW intends to have all words in all languages. When you have a look at a recent version of the datadesign, you will find that I included Wolfgang Georgdorf's methodology of providing metadata about signs in there as well.

So when someone want to find ASL for boarish, and he is literate, he will be able to find it. Being illiterate makes using a computer practically impossible. The idea of having a user interface for ASL will be a dream, it will not be feasible in UW mark I. The best way it might work is that you have the words in the UI and when you click it, you get a signed instruction.

This discussion is not really relevant to the WIKITECH mailinglist so I crosspost it to the WIKTIONARY mailing list where it is of interest.

Thanks, GerardM

...
On 27/08/05, Delirium delirium@hackish.org wrote:

...
Mark Williamson wrote:

...

With word-for-word glosses in a spoken language. For ASL or BSL

this is usually English; for InSL it may be Hindi or another Indian language or English; for Chinese SL it will probably be Chinese. While this is suitable in most cases for writing whole sentences and recording syntax and grammar, it gives no specific information about what a sign looks like and thus is completely unsuitable.

Why does this make it completely unsuitable? A large proportion of Chinese characters give no specific information about how they are pronounced---and indeed are pronounced radically different by Mandarin, Cantonese, and Japanese speakers---but that doesn't seem to have led to them being deemed unsuitable for use in a written language. At the very least, using Chinese characters to write Chinese SL is no worse than using Kanji to write Japanese.

-Mark

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Delirium

5:05 p.m.

Mark Williamson wrote:

...

However, a spoken language written alphabetically used for glosses will require that any given deaf person learn said spoken language before being able to read a text.

I don't see why that's true at all.

-Mark

Mark Williamson

10:39 p.m.

In the sense that written languages are mirrors of spoken languages, which in most cases they are to a certain extent, this is true.

Nevertheless, it won't help the deaf man who wants to know what "boarish" is in ASL, even though he's literate in English.

Mark

On 27/08/05, Delirium delirium@hackish.org wrote:

...

Mark Williamson wrote:

...
However, a spoken language written alphabetically used for glosses will require that any given deaf person learn said spoken language before being able to read a text.

I don't see why that's true at all.

-Mark

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Mark Williamson

28 Aug 28 Aug

5:22 a.m.

New subject: Fwd: Re: Spell checking in MediaWiki

Gerard, you can find the message to which I was referring at the end of the present message.

Please do not treat me as if I know nothing about the subject, because I certainly do.

Many deaf people learn to speak and to read lips. I don't really think whether or not deaf people learn a spoken language is an issue regarding Wiktionary. But certainly, many deaf people DO learn spoken languages, even if you think (as you appear to) that spoken and written languages are completely distinct entities.

Besides, it's absurd to state that written languages are not related to spoken languages. Written English is nothing more than a transcription of spoken English (in a standard dialect).

As of yet there is no widely-used written language which does not correspond directly or nearly directly to a spoken or signed language.

You say there are three different kinds of languages, spoken, written, and signed. This, too, is absurd!

Written languages are always associated with a spoken language (or in some cases, a signed language). Thus, when we say "Finnish", we're referring to the language spoken by the majority of Finns, as well as the language written by the majority of Finns, because in the minds of all reasonable people they are a single entity.

Do you think that if Korean kids had to learn to write Dutch-written-language in school rather than Korean-written-language, it would be just as easy? Obviously, it would not. This is because knowing spoken Korean makes it much easier to learn written Korean, and knowing spoken Dutch makes it easier to learn written Dutch, because there are many simple correspondences. Learning a different written language, however, is just as difficult as learning a different spoken language.

But again, this isn't the main point. Please read the e-mail below:

---------- Forwarded message ---------- From: Mark Williamson node.ue@gmail.com Date: 24-Aug-2005 03:06 Subject: Re: [Wikitech-l] Re: Spell checking in MediaWiki To: Wikimedia developers wikitech-l@wikimedia.org

Hi,

That page states that the ways for writing signed languages are closer to Chinese characters than Latin script.

This is completely incorrect.

Please see below for my suggestion on signed languages.

There are 4 main ways of writing signed languages:

Presumably, the developer of the software could be solicited for further cooperation.

Mark

On 24/08/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Lars Aronsson wrote:

...
Sabine Cretella wrote:

...
Well Lars, we are not so far away from making a different point in that. It is one of the usages we have in mind with Ultimate Wiktionary. Since there we will have words in all languages and have these words in a relational database it is easy to "extract an actual spellchecker" every now and then.

I keep hearing these promises, but "seeing is believing"! Have you started actual work on UW yet, or are you sitting idle while waiting for Wikidata to be released? Will there be an English free dictionary that can compete in size and quality with Aspell's current dictionary by the end of 2005? Or by the end of 2006?

Hoi, Actual work on UW itself is underway. Here you can find the data desisgn http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments and I am happy to say that many comments that were given have led to changes. I name but a few changes that came about this way; Can sign languages be included - now they can, Can attestations be included - now they can.

As Ultimate Wiktionary is dependent on Wikidata, there is little option for us but to wait untill it is ready. It is really important that Wikidata is done well because it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary.

When both Aspell and Ultimate Wiktionary are considered Free, it should be possible for us to work together. Once we find this cooperation possible, we could host the data currently included in Aspell in UW. In return we would provide a publicly accessible website where it is easy to add new words thay will end up in Aspell. Even when we do not cooperate, there will be languages that currently do not have a spellchecker. These spellcheckers I am particularly exited about because this is where we will be able to add value.

Without a massive infusion of data, it will be hard to predict when we have as many words as Aspell does for languages where Aspell has a dictionary.

Then again, if we create a wordcount on the Wikipedia content, run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest wikipedias and the amount of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing in Aspell. We will however not get a spellchecker for British or American in this way.

Thanks, GerardM.

Thanks, GerardM _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

Heiko Evermann

24 Aug 24 Aug

10:08 p.m.

Hi Gerard,

...

Actual work on UW itself is underway. Here you can find the data desisgn http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments and I am happy to say that many comments that were given have led to changes. I name but a few changes that came about this way; Can sign languages be included - now they can, Can attestations be included - now they can.

I want to propose (again) to make one important change: I think it is important that an entry within one language can be tagged as being correct according to several orthographies within one language. From what I understood so far, I find that the word de: "ist" (English: "(he) is") must be inserted twice, once for the new German spelling and once for the old (before the recent reform). Even thogh this word was not affected by the spelling reform. This applies to 95% of all German words. And each of them gets complete translation coverage into all languages. This is also a problem for Low Saxon (with our wide range of possible spellings). You have tried to make your current design plausible to me when we talked about it recently, but I was not convinced that this huge multiplication of entries is a good idea. Maybe I misunderstood you somehow, but I still do not understand it.

...

Then again, if we create a wordcount on the Wikipedia content, run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest wikipedias and the amount of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing in Aspell. We will however not get a spellchecker for British or American in this way.

Does that mean that you think about importing huge amounts of words without definition and without any translation?

Heiko

Gerard Meijssen

31 Aug 31 Aug

11:03 a.m.

Heiko Evermann wrote:

...

Hi Gerard,

...
Actual work on UW itself is underway. Here you can find the data desisgn http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design This design is very much open for comments and I am happy to say that many comments that were given have led to changes. I name but a few changes that came about this way; Can sign languages be included - now they can, Can attestations be included - now they can.

I want to propose (again) to make one important change: I think it is important that an entry within one language can be tagged as being correct according to several orthographies within one language. From what I understood so far, I find that the word de: "ist" (English: "(he) is") must be inserted twice, once for the new German spelling and once for the old (before the recent reform). Even thogh this word was not affected by the spelling reform. This applies to 95% of all German words. And each of them gets complete translation coverage into all languages. This is also a problem for Low Saxon (with our wide range of possible spellings). You have tried to make your current design plausible to me when we talked about it recently, but I was not convinced that this huge multiplication of entries is a good idea. Maybe I misunderstood you somehow, but I still do not understand it.

The German situation is a bit difficult. In actual fact there are only two orthographies because two Bundeslander did not pass as law that the new spelling would apply there as well. The consequence is that both old spelling and new spelling are valid. In a typical situation, the words that have been changed would get dated and be outdated. From a practical point of view I would only have the changed words and the new words included and I would treat them as if these two Bundeslander had voted in favour. For lookup purposes the difference is a SELECT statement in the query statement.

For Lower Saxon the situation is different. There are many "correct" ways of spelling a word. Here it is essential that it is indicated to what orthography or dialect a word belongs to. One reason is that many people are quiet insistent that only one spelling should be used. This is in a marked contrast to the practice of Neopolitan and Sicilian where all spellings are accepted without much of a fuss.

The argument why all words have to be explicitly identified as belonging to an orthography is because it allows us to do other things than just producing lexicological information from the Internet. What in your perception is an "multiplication of entries" is in actual fact no such thing; an expression is registered only once for each language, dialect or orthography.

...

...
Then again, if we create a wordcount on the Wikipedia content, run it against a spellchecker, the resulting list should be spelled correctly and could be included in UW. Particularly for our biggest wikipedias and the amount of topics covered, it should be a list that might be close to the size of what Aspell has. We will also have a long list of words missing in Aspell. We will however not get a spellchecker for British or American in this way.

Does that mean that you think about importing huge amounts of words without definition and without any translation?

When I have lists of words that are known to be correct for instance because an organisation actively vouches for them, this is certainly the intention. I have a wordlist of some 222.930 Dutch words that are certified as correct. Those I think should be no issue. When the community identifies a list that is correct, I am sure they want to upload it as well.

Thanks, GerardM

Heiko Evermann

8:30 p.m.

Hi Gerard,

Thank you for your answer.

...

The German situation is a bit difficult. In actual fact there are only two orthographies because two Bundeslander did not pass as law that the new spelling would apply there as well. The consequence is that both old spelling and new spelling are valid. In a typical situation, the words that have been changed would get dated and be outdated. From a practical point of view I would only have the changed words and the new words included and I would treat them as if these two Bundeslander had voted in favour. For lookup purposes the difference is a SELECT statement in the query statement.

So you do not want to include the old spelling? From what I understood for Low Saxon you also wanted to include historic spellings. But I may have misunderstood that.

...

The argument why all words have to be explicitly identified as belonging to an orthography is because it allows us to do other things than just producing lexicological information from the Internet. What in your perception is an "multiplication of entries" is in actual fact no such thing; an expression is registered only once for each language, dialect or orthography.

So number of entries = (number of languages) x (number of dialects) x (number of orthographies)?

What are you planning to do with American English vs. British English?

You would have two entries: 1) title=colour lang=EN dialect=EN_US orthography=USA-official 2) title=color lang=EN dialect=EN_GB orthography=GB official

That is fine. But what about "bus"? would you have two entries? 1) title=bus lang=EN dialect=EN_US orthography=USA-official 2) title=bus lang=EN dialect=EN_GB orthography=GB official

That (to my understanding) would double the entries for English, wouldn't it? And the translation of de:Bus would list en_US: bus, en_GB:bus?

Kind regards,

Heiko Evermann

Gerard Meijssen

9:27 p.m.

Heiko Evermann wrote:

...

Hi Gerard,

Thank you for your answer.

...
The German situation is a bit difficult. In actual fact there are only two orthographies because two Bundeslander did not pass as law that the new spelling would apply there as well. The consequence is that both old spelling and new spelling are valid. In a typical situation, the words that have been changed would get dated and be outdated. From a practical point of view I would only have the changed words and the new words included and I would treat them as if these two Bundeslander had voted in favour. For lookup purposes the difference is a SELECT statement in the query statement.

So you do not want to include the old spelling? From what I understood for Low Saxon you also wanted to include historic spellings. But I may have misunderstood that.

Sorry, good try but no cigar. The words that are spelled differently will both be in there. They will both have a record in ValidExpression where the old spelling will have a value in the ValidUntil field and the new spelling will have a value in the ValidFrom field.

There is room for historic orthographies, it may prove instructive in demonstrating the ongoing Germanisation of Lower Saxon

...

...
The argument why all words have to be explicitly identified as belonging to an orthography is because it allows us to do other things than just producing lexicological information from the Internet. What in your perception is an "multiplication of entries" is in actual fact no such thing; an expression is registered only once for each language, dialect or orthography.

So number of entries = (number of languages) x (number of dialects) x (number of orthographies)?

What are you planning to do with American English vs. British English?

You would have two entries:

title=colour lang=EN dialect=EN_US orthography=USA-official 2) title=color lang=EN dialect=EN_GB orthography=GB official

That is fine. But what about "bus"? would you have two entries?

title=bus lang=EN dialect=EN_US orthography=USA-official 2) title=bus lang=EN dialect=EN_GB orthography=GB official

That (to my understanding) would double the entries for English, wouldn't it? And the translation of de:Bus would list en_US: bus, en_GB:bus?

Kind regards,

Heiko Evermann

First of all I am not a specialist when it comes to the spelling of American English or British English. Depending on there being an official body that identifies correctly spelled English, a spelling can be either validated by one organisation or by two organisations. When this is the case, there is no need for duplication. This is functionality implicitly there in the data design.

The examples that you show bear no relation to what UW will look like nor how the edit screens will look like I am happy to say :) There is this big difference in the attitude of the way Lower Saxon is treating its orthograhies and the way Sicilan or Napolitan orthographies are treated. The Lower Saxon seem really eager to have only one orthography and therefore a mix of the different spellings is not likely to find much apreciation by many.

The duplication of words that are spelled the same in different dialects or orthographies is inherent in the database design. This is essential if you want to have definitions and etymology in these dialects or orthographies. If you are willing to accept that definitions and etymology can be spelled in orthographies other than Sass there could be a solution but as the nds.wikipedia also has to standardise on Sass, I think this is a rather unlikely scenario.

Thanks, GerardM

Heiko Evermann

1 Sep 1 Sep

12:57 p.m.

Hi Gerard,

...

The duplication of words that are spelled the same in different dialects or orthographies is inherent in the database design. This is essential if you want to have definitions and etymology in these dialects or orthographies. If you are willing to accept that definitions and etymology can be spelled in orthographies other than Sass there could be a solution but as the nds.wikipedia also has to standardise on Sass, I think this is a rather unlikely scenario.

Definition and etymology would be the same. Your approach would be a duplication of efforts. It would be sufficient to allow one entry to belong to several orthographis, as in 1:n instead of 1:1. So this is not inherent in the database design. It is the design bug that I complain about for some time. 1:n would allow us to enter the data the way we think appropriate. And it still leaves us the opportunity to add individual entries when other users really think that explanations must also be duplicated along the orthographies (which I really doubt). So they can, if they want to, but they are not forced.

Kind regards,

Heiko

-- Lust, ein paar Euro nebenbei zu verdienen? Ohne Kosten, ohne Risiko! Satte Provisionen f�r GMX Partner: http://www.gmx.net/de/go/partner

Gerard Meijssen

4:59 p.m.

Heiko Evermann wrote:

...

Hi Gerard,

...
The duplication of words that are spelled the same in different dialects or orthographies is inherent in the database design. This is essential if you want to have definitions and etymology in these dialects or orthographies. If you are willing to accept that definitions and etymology can be spelled in orthographies other than Sass there could be a solution but as the nds.wikipedia also has to standardise on Sass, I think this is a rather unlikely scenario.

Definition and etymology would be the same. Your approach would be a duplication of efforts. It would be sufficient to allow one entry to belong to several orthographis, as in 1:n instead of 1:1. So this is not inherent in the database design. It is the design bug that I complain about for some time. 1:n would allow us to enter the data the way we think appropriate. And it still leaves us the opportunity to add individual entries when other users really think that explanations must also be duplicated along the orthographies (which I really doubt). So they can, if they want to, but they are not forced.

Kind regards,

Heiko

Hoi, As you want your definitions and etymology in Sass, these defintions are not the same as the ones in an other orthography. It is therefore not a duplication of efforts, it is the consequence of things in Sass. You cannot both insist on Sass and have it apply for other orthographies or dialects as well. You CAN state that words written the same way or in a different way mean the same thing and if people have selected Sass as an orthography they are interested in, they may see that it has an etymology or a definition.

I am interested in who your "we" is. I did recently discuss this design in a four hour session with language engineers, I discussed the German / Dutch / English - American / Lower Saxon languages and they agreed with me that this design allows for other purposes than just lexicon lookup. I have discussed the design with many people and so far the only "we" who asks for "this" is you.

What you call a design bug is in actual fact a design feature. One point that you are missing is that the meaning of a word spelled the same between dialects may be different. This is exactly one reason why this duplication is needed. The same is true for etymology; when a word is used for a first time in THAT orthography or dialect may be different as well, it may arrive from another dialect and not from another language..

Thanks, GerardM

Mark Williamson

2 Sep 2 Sep

9:11 a.m.

Ahh, but Gerard, the words have the same origins, even if you spell them differently.

So Sass will spell it "greutens", and AS will spell it "groytens", but they both have the same origin, and the same translations.

Mark

On 01/09/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Heiko Evermann wrote:

...
Hi Gerard,

...
The duplication of words that are spelled the same in different dialects or orthographies is inherent in the database design. This is essential if you want to have definitions and etymology in these dialects or orthographies. If you are willing to accept that definitions and etymology can be spelled in orthographies other than Sass there could be a solution but as the nds.wikipedia also has to standardise on Sass, I think this is a rather unlikely scenario.

Definition and etymology would be the same. Your approach would be a duplication of efforts. It would be sufficient to allow one entry to belong to several orthographis, as in 1:n instead of 1:1. So this is not inherent in the database design. It is the design bug that I complain about for some time. 1:n would allow us to enter the data the way we think appropriate. And it still leaves us the opportunity to add individual entries when other users really think that explanations must also be duplicated along the orthographies (which I really doubt). So they can, if they want to, but they are not forced.

Kind regards,

Heiko

Hoi, As you want your definitions and etymology in Sass, these defintions are not the same as the ones in an other orthography. It is therefore not a duplication of efforts, it is the consequence of things in Sass. You cannot both insist on Sass and have it apply for other orthographies or dialects as well. You CAN state that words written the same way or in a different way mean the same thing and if people have selected Sass as an orthography they are interested in, they may see that it has an etymology or a definition.

I am interested in who your "we" is. I did recently discuss this design in a four hour session with language engineers, I discussed the German / Dutch / English - American / Lower Saxon languages and they agreed with me that this design allows for other purposes than just lexicon lookup. I have discussed the design with many people and so far the only "we" who asks for "this" is you.

What you call a design bug is in actual fact a design feature. One point that you are missing is that the meaning of a word spelled the same between dialects may be different. This is exactly one reason why this duplication is needed. The same is true for etymology; when a word is used for a first time in THAT orthography or dialect may be different as well, it may arrive from another dialect and not from another language..

Thanks, GerardM _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Heiko Evermann

8:51 p.m.

Hi Mark,

...

Ahh, but Gerard, the words have the same origins, even if you spell them differently.

So Sass will spell it "greutens", and AS will spell it "groytens", but they both have the same origin, and the same translations.

Thanks a lot for this remark.

Kind regards,

Heiko

Gerard Meijssen

3 Sep 3 Sep

9:49 a.m.

Heiko Evermann wrote:

...

Hi Mark,

...
Ahh, but Gerard, the words have the same origins, even if you spell them differently.

So Sass will spell it "greutens", and AS will spell it "groytens", but they both have the same origin, and the same translations.

Thanks a lot for this remark.

Kind regards,

Heiko

Hoi, When a word's etymology is considered, its origin is often mistaken by people because they look for the "ultimate" root for a word. They then decide that it is for instance derived from a Latin or Greek word and there is means so and so. When you look at etymology in this way, sure you are "correct". An alternative way of looking is from what language did it derive in this language, dialect; this etymology may be quite different. It may be from French in stead. Historically this connection is far more relevant. Linguistically I would say it is an enrichment if you are aware of this flow.

Now I know that you know. And I also know that this discussion has nothing to do with the subjectmatter discussed on the Wikitech-l. I will from now on only discuss technical matters that are genuinely technical on wikitech-l. Discussions on the use of Ultimate Wiktionary I will only answer to on the Wiktionary-l.

Thanks, Gerard

Nikola Smolenski

2 Sep 2 Sep

9:19 a.m.

On Thursday 01 September 2005 17:59, Gerard Meijssen wrote:

...

Heiko Evermann wrote:

...
...
The duplication of words that are spelled the same in different dialects or orthographies is inherent in the database design. This is essential if you want to have definitions and etymology in these dialects or orthographies. If you are willing to accept that definitions and etymology can be spelled in orthographies other than Sass there could be a solution but as the nds.wikipedia also has to standardise on Sass, I think this is a rather unlikely scenario.

Definition and etymology would be the same. Your approach would be a duplication of efforts. It would be sufficient to allow one entry to belong to several orthographis, as in 1:n instead of 1:1. So this is not inherent in the database design. It is the design bug that I complain about for some time. 1:n would allow us to enter the data the way we think appropriate. And it still leaves us the opportunity to add individual entries when other users really think that explanations must also be duplicated along the orthographies (which I really doubt). So they can, if they want to, but they are not forced.

I am interested in who your "we" is. I did recently discuss this design

Well, count me in, for the start.

...

What you call a design bug is in actual fact a design feature. One point that you are missing is that the meaning of a word spelled the same between dialects may be different. This is exactly one reason why this duplication is needed. The same is true for etymology; when a word

That is not reason for duplication at all. If a string of letters has different meanings in different dialects, each of these meanings should have a row in "word" table, with different wordID. Each of these wordIDs should be related to a SEPARATE spellingID, not the same one, even though it's still the same string of letters; for one, their languageIDs are different. Each of these spellingIDs, if valid, should of course be related to separate ValidSpellingID; and it should be possible that each of these ValidSpellingIDs is related to several spelling authorities. It is the norm, rather than exception, that different authorities recommend same spelling for same word.

Heiko Evermann

8:50 p.m.

Hi Gerard

...

What you call a design bug is in actual fact a design feature. One point that you are missing is that the meaning of a word spelled the same between dialects may be different. This is exactly one reason why this duplication is needed. The same is true for etymology; when a word is used for a first time in THAT orthography or dialect may be different as well, it may arrive from another dialect and not from another language..

No Gerard, this is only needed when there really *is* such a difference. You force us to duplicate our entries even where it is not needed at all.

Besides: to my understanding the question of American English vs. British English has not been answered. I do see the same problem there. It is the same problem in the word entries and in the definitions.

Heiko

Ray Saintonge

31 Aug 31 Aug

9:50 p.m.

Heiko Evermann wrote:

...

What are you planning to do with American English vs. British English?

You would have two entries:

title=colour lang=EN dialect=EN_US orthography=USA-official 2) title=color lang=EN dialect=EN_GB orthography=GB official

That is fine. But what about "bus"? would you have two entries?

title=bus lang=EN dialect=EN_US orthography=USA-official 2) title=bus lang=EN dialect=EN_GB orthography=GB official

That (to my understanding) would double the entries for English, wouldn't it? And the translation of de:Bus would list en_US: bus, en_GB:bus?

There is no such thing as "official" orthography in English. Also, further variations may often apply for other English speaking countries.

Tomer Chachamu

24 Aug 24 Aug

10:24 p.m.

On 24/08/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

As Ultimate Wiktionary is dependent on Wikidata, there is little option for us but to wait untill it is ready. It is really important that Wikidata is done well because it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary.

That makes... 1 project? ;)

Mark Williamson

25 Aug 25 Aug

4:45 a.m.

Yes, I hadn't noticed the "it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary"... I assume Gerard meant something serious by that, but whatever it was, I catch no ball (as they say in Singapore).

Mark

On 24/08/05, Tomer Chachamu the.r3m0t@gmail.com wrote:

...

On 24/08/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...
As Ultimate Wiktionary is dependent on Wikidata, there is little option for us but to wait untill it is ready. It is really important that Wikidata is done well because it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary.

That makes... 1 project? ;) _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Gerard Meijssen

7:23 a.m.

Mark Williamson wrote:

...

Yes, I hadn't noticed the "it will not serve only Ultimate Wiktionary but also Ultimate Wiktionary"... I assume Gerard meant something serious by that, but whatever it was, I catch no ball (as they say in Singapore).

Mark

Hoi, Wikidata is an engine that allows you to use relational data in a Wiki environment. This means that it can be used for other things than Ultimate Wiktionary. I have a dream that one day Wikispecies will have its data in a relational database and that it will take off in a way that it now fails to do.

The Ultimate Wiktionary IS one project but it does not only deliver Ultimate Wiktionary, it also brings you Wikidata :)

Thanks, GerardM

Christian Thiele

20 Aug 20 Aug

7:25 p.m.

Hello everyone,

I just wanted to add that Opera supports spell checking in input boxes by using GNU Aspell. You just have to install Aspell plus the languages you need (for Windows see http://aspell.net/win32/) and restart Opera.

But I have to add, that - in my opinion - aspell isn't a good spell checking engine and the dictionaries are very incomplete.

Sincerely Christian Thiele

Mark Williamson

10:31 p.m.

My only problem with Aspell is that for some reason, I can't get non-Roman scripts to work under Win32... Other than that, I think it's quite amazing.

Mark

On 20/08/05, Christian Thiele APPER@apper.de wrote:

...

Hello everyone,

I just wanted to add that Opera supports spell checking in input boxes by using GNU Aspell. You just have to install Aspell plus the languages you need (for Windows see http://aspell.net/win32/) and restart Opera.

But I have to add, that - in my opinion - aspell isn't a good spell checking engine and the dictionaries are very incomplete.

Sincerely Christian Thiele _______________________________________________ Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- SI HOC LEGERE SCIS NIMIVM ERVDITIONIS HABES QVANTVM MATERIAE MATERIETVR MARMOTA MONAX SI MARMOTA MONAX MATERIAM POSSIT MATERIARI ESTNE VOLVMEN IN TOGA AN SOLVM TIBI LIBET ME VIDERE

Pablo Saratxaga

21 Aug 21 Aug

1:27 a.m.

Kaixo!

On Sat, Aug 20, 2005 at 03:31:42PM -0700, Mark Williamson wrote:

...

My only problem with Aspell is that for some reason, I can't get non-Roman scripts to work under Win32... Other than that, I think it's quite amazing.

You need an aspell >= 0.60, and it seems nobody has compiled the new aspell for Windows yet... until 0.60 there was no utf-8 support; only one 8bit support;

Ki ça vos våye bén, Pablo Saratxaga

Nikola Smolenski

23 Aug 23 Aug

1:27 a.m.

On Saturday 20 August 2005 20:25, Christian Thiele wrote:

...

But I have to add, that - in my opinion - aspell isn't a good spell checking engine and the dictionaries are very incomplete.

What would you suggest to use instead?

6897

Age (days ago)

6912

Last active (days ago)

wikitech-l@lists.wikimedia.org

40 comments

15 participants

tags (0)

participants (15)

Christian Thiele
Delirium
Gerard Meijssen
Ginu Gmail
Heiko Evermann
Jeffrey McGee
Lars Aronsson
Mark Williamson
Niklas Laxström
Nikola Smolenski
Pablo Saratxaga
Ray Saintonge
Sabine Cretella
Tomer Chachamu
Walter Vermeir