Ultimate Wiktionary and design decisions


Gerard Meijssen

23 Jul 2005

5:47 p.m.

Hoi,

I had an interesting conversation with Brion. We do not agree on everything. One of the things we do not agree on are redirects.

In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

In a Wikipedia context I am 100% with Brion. In a Wiktionary context it is a different matter. As only correctly spelled words should be in a Wiktionary, errors should be deleted. Some of our Wiktionaries for historical reasons are capitalising their articles. In essence this means that from a spelling point of view the names of the lemmas are irrelevant. However, many people assume that the name of the article indicates that a word is spelled correctly. To remedy this, more and more wiktionaries are moving away from first character capitalisation and making it possible to have correctly spelled words as a lemma.

When a wiktionary has made this move away from first character capitalisation, the interwiki and interproject links within the Wikimedia projects need to be fixed. After this, the redirects can in my opinion be removed. I think this is appropriate because users expect that an application behaves in certain ways. When new content is added to a non-capitalised Wiktionary, the word foo will not have a redirect in Foo and consequently it behaves differently from the content predating the move to non-capitalisation. Also words like Kinder and kinder are not related at all. The redirect at Kinder will be replaced at some stage breaking the existing redirect and consequently not providing the continuance that Brion holds dear.
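The collision described above can be sketched in a few lines of Python. This is only an illustration of the titling behaviour under discussion, not MediaWiki's actual code; the function name `stored_title` is hypothetical.

```python
def stored_title(word: str, capitalise_first: bool) -> str:
    """Return the title a wiki would store for `word` (simplified sketch)."""
    if capitalise_first and word:
        return word[0].upper() + word[1:]
    return word


words = ["foo", "Kinder", "kinder"]

# With first-character capitalisation, "Kinder" and "kinder" share one title.
capitalised = sorted({stored_title(w, True) for w in words})
# Without it, each spelling keeps its own title.
case_sensitive = sorted({stored_title(w, False) for w in words})

print(capitalised)     # ['Foo', 'Kinder']
print(case_sensitive)  # ['Kinder', 'foo', 'kinder']
```

Once the wiki is case-sensitive, a redirect left at Kinder occupies the title that the unrelated capitalised word needs, which is the situation the paragraph above describes.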

For the Ultimate Wiktionary I have documented some of the design criteria. It can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage The Data design can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design

One crucial decision is that only correct spelling is allowed. This means that all incorrect spelling will be amended or deleted. As Ultimate Wiktionary is a database, it does not cater for things like redirects. I urge you to have a look at both the design criteria and the design itself because this is the time when it is relatively easy to make changes. Once Erik starts coding the UW database, having finished Wikidata and the GEMET implementation, the moment has passed us by.

Thanks, GerardM


Heiko Evermann

23 Jul

10:23 p.m.

Hi Gerard,

...

I had an interesting conversation with Brion. We do not agree on everything. One of the things we do not agree on are redirects.

In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

This might work for English, but not for all languages. I think that redirects should be possible. This would make sense for:

1) verb conjugation. I would like to simply link most of the conjugated forms to the main verb entry, as long as no other language entry uses the same heading. Example: Low Saxon conjugation of "to be". "He was" => "he weer". I would simply redirect this to "wesen" (en: to be). I would not like to be forced to write a whole article stating that "he weer" is 3rd person singular, simple past of "wesen". I would like to include the conjugation table only once.

2) dialect variations (of which we have a lot in Low Saxon): for "to be" we have "ween/wesen/sien" as regional dialects. I would redirect two of them to "wesen". I would not like to be forced to write a whole article stating that "ween" is a variant of "wesen".

3) orthography variations. We also have several competing orthographies for Low Saxon. I would like to be able to just redirect to the main orthography that we use.

In the end it might be nice to have all such explanations, but nds.wiktionary.org still has to be written, and it would simplify this writing a lot.

...

One crucial decision is that only correct spelling is allowed.

This is a no-go for minor languages where "correct" spelling does not exist.

Kind regards,

Heiko Evermann

Nikola Smolenski

24 Jul

12:05 p.m.

On Saturday 23 July 2005 21:23, Heiko Evermann wrote:

...

Hi Gerard,

...
In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

This might work for English, but not for all languages. I think that redirects should be possible. This would make sense for:

1) verb conjugation. I would like to simply link most of the conjugated forms to the main verb entry, as long as no other language entry uses the same heading. Example: Low Saxon conjugation of "to be". "He was" => "he weer". I would simply redirect this to "wesen" (en: to be). I would not like to be forced to write a whole article stating that "he weer" is 3rd person singular, simple past of "wesen". I would like to include the conjugation table only once.

2) dialect variations (of which we have a lot in Low Saxon): for "to be" we have "ween/wesen/sien" as regional dialects. I would redirect two of them to "wesen". I would not like to be forced to write a whole article stating that "ween" is a variant of "wesen".

3) orthography variations. We also have several competing orthographies for Low Saxon. I would like to be able to just redirect to the main orthography that we use.

Even if redirects were allowed, you should do all of this by filling in the database correctly. Each inflected form of a word would have its own entry. Dialect variations would all be present, and marked as belonging to a dialect or subdialect (the end of my discussion with Gerard is about how exactly to do this). There is a field "spellingauthority" where you could specify which of the competing orthographies endorses a particular spelling of a word.
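A minimal sketch of what "filling in the database" could look like, assuming a flat record per form. Only the field name `spellingauthority` comes from the message; the other field names, the dialect labels and the helper functions are hypothetical and are not taken from the actual Ultimate Wiktionary data design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    spelling: str
    lemma: str                               # headword this form belongs to
    grammar: Optional[str] = None            # free-text grammatical note
    dialect: Optional[str] = None            # hypothetical dialect label
    spellingauthority: Optional[str] = None  # field named in the message


# Low Saxon forms mentioned in the thread; dialect labels are placeholders.
ENTRIES = [
    Entry("wesen", lemma="wesen"),
    Entry("ween", lemma="wesen", dialect="dialect A"),
    Entry("sien", lemma="wesen", dialect="dialect B"),
    Entry("weer", lemma="wesen", grammar="3rd person singular"),
]


def forms_of(lemma: str) -> list:
    """All spellings recorded for a headword, in insertion order."""
    return [e.spelling for e in ENTRIES if e.lemma == lemma]


def lemmas_for(spelling: str) -> list:
    """Headwords a given spelling belongs to (what a redirect would point at)."""
    return sorted({e.lemma for e in ENTRIES if e.spelling == spelling})


print(forms_of("wesen"))   # ['wesen', 'ween', 'sien', 'weer']
print(lemmas_for("ween"))  # ['wesen']
```

In this shape the relation that a redirect expresses implicitly ("ween" belongs with "wesen") becomes an explicit, queryable field.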

Brion Vibber

23 Jul

10:49 p.m.

Gerard Meijssen wrote:

...

I had an interesting conversation with Brion. We do not agree on everything. One of the things we do not agree on are redirects.

In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

The gap between my thinking and Gerard's thinking appears to be that Gerard considers redirects to be canonical content; thus having a "wrong spelling" in a URL somehow implies that this is a "correct" spelling, which is therefore wrong and should be removed.

To my mind however, redirects are not content. They are metacontent: compatibility tools used to provide corrections, by sending a visitor to the correct page if they're using an outdated link.

In the case of a "misspelling" which has been moved to the correct spelling, this means a reference to the wrong spelling is automatically corrected when a visitor returns to that page. Not only is the visitor now brought to the correct location of the entry, but it's prominently labeled with the correct spelling at the top.

They do not imply that a "wrong" spelling is correct, but rather do the exact opposite. Gerard's argument for deleting redirects thus, in my opinion, fails.
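The behaviour described here, where the visitor lands on the correct entry and is told where they came from, can be sketched as follows. The dictionaries and the `render` function are hypothetical stand-ins, not MediaWiki internals.

```python
PAGES = {"foo": "Entry text for foo."}  # canonical pages
REDIRECTS = {"Foo": "foo"}              # old title -> current title


def render(title: str):
    """Return (heading, notice, body), or None if nothing matches."""
    if title in PAGES:
        return (title, None, PAGES[title])
    target = REDIRECTS.get(title)
    if target in PAGES:
        # The heading shows the correct spelling; the notice names the old link.
        return (target, f"(Redirected from {title})", PAGES[target])
    return None


print(render("Foo"))
# ('foo', '(Redirected from Foo)', 'Entry text for foo.')
```

The old title never appears as a heading, which is the sense in which a redirect is metacontent rather than content.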

More generally, it's completely irresponsible for a web-based resource to rearrange content pages without providing a redirect from the old URL. This is a basic principle which applies just as much to Wiktionary as to Wikipedia, just as much to Hewlett Packard's driver web pages as to Slashdot postings, just as much to a database of autogenerated earthquake reports or a collection of press releases as to an online academic journal.

Wiktionary has no special dispensation: if we want people to believe this is a legitimate resource which can be used and referenced, maintaining URL compatibility is a non-negotiable requirement.

...

When a wiktionary has made this move away from first character capitalisation, the interwiki and interproject links within the Wikimedia projects need to be fixed. After this, the redirects can in my opinion be removed.

This would be harmful to the project, as it would still break every such link that may exist offsite:

* personal bookmarks
* links from other sites
* URLs published as references in papers or books
* etc

Some online links will eventually be noticed and corrected. Others never will be. Offline links (printed as references in papers, etc) will be broken forever unless someone notices and recreates the redirect. (Unless Gerard comes and deletes it again. ;)

For a primarily web-based resource, it would be the height of unprofessional behavior to render something like 99% of our pages inaccessible from the URLs they've been at for years.

Deleting redirects is bad for our users, it's bad for anyone who wants to rely on our resources as reference works, it's bad for anyone who wants to link to our site, it's just plain bad.

-- brion vibber (brion @ pobox.com)

Nikola Smolenski

24 Jul

12:21 p.m.

On Saturday 23 July 2005 21:49, Brion Vibber wrote:

...

Gerard Meijssen wrote:

...
I had an interesting conversation with Brion. We do not agree on everything. One of the things we do not agree on are redirects.

In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

The gap between my thinking and Gerard's thinking appears to be that Gerard considers redirects to be canonical content; thus having a "wrong spelling" in a URL somehow implies that this is a "correct" spelling, which is therefore wrong and should be removed.

[...]

...

More generally, it's completely irresponsible for a web-based resource to rearrange content pages without providing a redirect from the old URL. This is a basic principle which applies just as much to Wiktionary as to Wikipedia, just as much to Hewlett Packard's driver web pages as to Slashdot postings, just as much to a database of autogenerated earthquake reports or a collection of press releases as to an online academic journal.

I, in fact, agree with both of you :) I agree that a web-based resource should have its URLs as permanent as possible, but I also think that in the UW it should be possible to specify in the database anything which was in traditional wiktionaries done with redirects.

A simple way to solve this could be to make redirects function differently than they usually do on MediaWiki: instead of silently displaying content of another page under the same URL, they could simply display "Contents of this page has moved to [[that page]]." That way, URLs would still exist and be useful, while there would be absolutely no danger of mistaking redirects for canonical content.
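As a rough sketch of this proposal, the old title would render a short stub instead of the target's content. The `MOVED` table and `render` function are hypothetical names; the notice text is the one quoted in the message.

```python
PAGES = {"foo": "Entry text for foo."}
MOVED = {"Foo": "foo"}  # old title -> new title


def render(title: str):
    """Return the page text, a 'moved' stub, or None for an unknown title."""
    if title in PAGES:
        return PAGES[title]
    if title in MOVED:
        # Show only a pointer; the target's content is never displayed here.
        return f"Contents of this page has moved to [[{MOVED[title]}]]."
    return None


print(render("Foo"))  # Contents of this page has moved to [[foo]].
```

The old URL keeps working, at the cost of one extra click for the reader.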

Brion Vibber

5:37 p.m.

Nikola Smolenski wrote:

...

A simple way to solve this could be to make redirects function differently than they usually do on MediaWiki: instead of silently displaying content of another page under the same URL, they could simply display "Contents of this page has moved to [[that page]]."

It's not "silent"; there's a big fat "redirected from XYZ" message.

...

That way, URLs would still exist and be useful, while there would be absolutely no danger of mistaking redirects for canonical content.

There is already no danger of mistaking redirects for canonical content.

-- brion vibber (brion @ pobox.com)

Gerard Meijssen

6:08 p.m.

Brion Vibber wrote:

...

Nikola Smolenski wrote:

...
A simple way to solve this could be to make redirects function differently than they usually do on MediaWiki: instead of silently displaying content of another page under the same URL, they could simply display "Contents of this page has moved to [[that page]]."

It's not "silent"; there's a big fat "redirected from XYZ" message.

...
That way, URLs would still exist and be useful, while there would be absolutely no danger of mistaking redirects for canonical content.

There is already no danger of mistaking redirects for canonical content.

-- brion vibber (brion @ pobox.com)

Hoi,

In an e-mail by Heiko Evermann the practice of using redirects to indicate alternate spellings is explicitly raised as the way to go. I do think that is an unfortunate choice, as it does not indicate where the alternate spelling comes from. It does, however, prove that the assertion that the URL carries no meaning as to the correctness of the spelling is manifestly wrong.

Thanks, GerardM

Timwi

1:08 p.m.

Brion Vibber wrote:

...

The gap between my thinking and Gerard's thinking appears to be that Gerard considers redirects to be canonical content; thus having a "wrong spelling" in a URL somehow implies that this is a "correct" spelling, which is therefore wrong and should be removed.

On the other hand, your thinking is based on the assumption that without redirects, the URLs that people currently link to will stop working. This in turn is assuming that the Ultimate Wiktionary will replace the existing Wiktionaries at their current URLs, which according to my understanding of the plan isn't going to happen. Gerard said that UW will at first exist alongside the existing Wiktionaries, and he also said that if the community is in favour of keeping the existing Wiktionaries in operation indefinitely, then so be it.

Therefore, Ultimate Wiktionary is not going to be at en.wiktionary.org. Once we decide to scrap the old Wiktionaries, we can therefore easily have a RewriteRule to forward from capitalised [langcode].wiktionary.org to ultimate.wiktionary.org or whatever it will be (I would prefer just wiktionary.org).
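The forwarding idea can be illustrated with a small Python function. This is not an Apache RewriteRule, only a sketch of the same mapping; the host name `ultimate.wiktionary.org` is the tentative one from the message, and the host pattern and the function name `forward` are assumptions.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed shape of the per-language hosts, e.g. en.wiktionary.org.
OLD_HOST = re.compile(r"^[a-z-]+\.wiktionary\.org$")
NEW_HOST = "ultimate.wiktionary.org"  # tentative name from the message


def forward(url: str) -> str:
    """Rewrite a per-language Wiktionary URL to the new host, keeping the path."""
    parts = urlsplit(url)
    host = parts.hostname
    if host and host != NEW_HOST and OLD_HOST.match(host):
        return urlunsplit(parts._replace(netloc=NEW_HOST))
    return url


print(forward("http://en.wiktionary.org/wiki/Foo"))
# http://ultimate.wiktionary.org/wiki/Foo
```

The path is passed through unchanged; how a capitalised title such as Foo should map onto an entry on the new site is left open here, as it is in the message.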

However, this is not saying that I disagree with you, Brion; in fact, I quite agree that redirects should continue to exist, though for other reasons than yours. I don't see any point in having separate entries for "goes" and "went", or for "color" and "colour". This would be a tremendous undertaking for languages that are heavily inflected (http://verbix.com/webverbix/cache/31.etmek.html), and its usefulness to the reader is highly doubtful. At the same time, however, there is a need for disambiguation pages: If I browse to a word which exists in two languages, and in both languages it is an inflected form of another word, I need to be asked which of those words I meant.
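The distinction between a plain redirect and a disambiguation step can be sketched like this. The table and the `resolve` function are hypothetical; "sang" is used as an example of a form that is inflected in both English (sing) and German (singen).

```python
from collections import defaultdict

# (inflected form, language code, lemma)
FORMS = [
    ("goes", "en", "go"),
    ("went", "en", "go"),
    ("sang", "en", "sing"),
    ("sang", "de", "singen"),
]

INDEX = defaultdict(list)
for form, lang, lemma in FORMS:
    INDEX[form].append((lang, lemma))


def resolve(form: str):
    """Decide whether a form can redirect directly or needs disambiguation."""
    candidates = INDEX.get(form, [])
    if len(candidates) == 1:
        return ("redirect", candidates[0])
    if candidates:
        return ("disambiguate", candidates)
    return ("missing", None)


print(resolve("went"))  # ('redirect', ('en', 'go'))
print(resolve("sang"))  # ('disambiguate', [('en', 'sing'), ('de', 'singen')])
```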

Timwi

Brion Vibber

5:38 p.m.

Timwi wrote:

...

Brion Vibber wrote:

...
The gap between my thinking and Gerard's thinking appears to be that Gerard considers redirects to be canonical content; thus having a "wrong spelling" in a URL somehow implies that this is a "correct" spelling, which is therefore wrong and should be removed.

On the other hand, your thinking is based on the assumption that without redirects, the URLs that people currently link to will stop working. This in turn is assuming that the Ultimate Wiktionary will replace the existing Wiktionaries at their current URLs, which according to my understanding of the plan isn't going to happen.

Gerard's been talking about existing Wiktionaries and his practice of removing redirects from them.

UW doesn't even enter into this.

-- brion vibber (brion @ pobox.com)

Rob Lanphier

25 Jul

10:23 a.m.

New subject: Cool URIs don't change (Ultimate Wiktionary and design decisions)

On Sat, 2005-07-23 at 22:49 +0200, Brion Vibber wrote:

...

More generally, it's completely irresponsible for a web-based resource to rearrange content pages without providing a redirect from the old URL. This is a basic principle which applies just as much to Wiktionary as to Wikipedia, just as much to Hewlett Packard's driver web pages as to Slashdot postings, just as much to a database of autogenerated earthquake reports or a collection of press releases as to an online academic journal.

100% agreed on this point, and one that many people don't seem to understand. So it bears repeating:

...

More generally, it's completely irresponsible for a web-based resource to rearrange content pages [well, just re-read the paragraph above :) ]

Actually, it bears restating. Here's an old Tim Berners-Lee rant on the subject, called "Cool URIs Don't Change", that I often trot out whenever the topic rears its ugly head, which it does way too frequently: http://www.w3.org/Provider/Style/URI

In my mind, making an exception for Wiktionary projects would be an extremely disappointing reversal of very good policy maintained by the rest of the Wikimedia projects.

Rob

Gerard Meijssen

11:30 a.m.

New subject: Cool URIs don't change (Ultimate Wiktionary and design decisions)

Rob Lanphier wrote:

...

On Sat, 2005-07-23 at 22:49 +0200, Brion Vibber wrote:

...
More generally, it's completely irresponsible for a web-based resource to rearrange content pages without providing a redirect from the old URL. This is a basic principle which applies just as much to Wiktionary as to Wikipedia, just as much to Hewlett Packard's driver web pages as to Slashdot postings, just as much to a database of autogenerated earthquake reports or a collection of press releases as to an online academic journal.

100% agreed on this point, and one that many people don't seem to understand. So it bears repeating:

...
More generally, it's completely irresponsible for a web-based resource to rearrange content pages [well, just re-read the paragraph above :) ]

Actually, it bears restating. Here's an old Tim Berners-Lee rant on the subject, called "Cool URIs Don't Change", that I often trot out whenever the topic rears its ugly head, which it does way too frequently: http://www.w3.org/Provider/Style/URI

In my mind, making an exception for Wiktionary projects would be an extremely disappointing reversal of very good policy maintained by the rest of the Wikimedia projects.

Rob

Hoi, Well Tim Berners-Lee has one good line on the Wiktionary situation: "Do you really feel that the old URIs cannot be kept running? If so, you chose them very badly." I was not there at the time, but what I understood from some is that the Wiktionary people were told that they could not have uncapitalised articles. This demand for uncapitalised articles did not go away and over time the cost of this change increased. It is certainly true for the Dutch wiktionary that there was a long time between the first request and the moment it was granted. I do not know if there ever was a technical reason why we could not have it; the only argument that I remember was that we needed something to do with case insensitive search, something we are still waiting for. All this procrastination meant that several thousand words had to be changed, by hand in the Dutch case and with a script in the English case.

It is "nice" to have persistence in your links. It is cool if it is just a technical choice because then it is a no-brainer. In the Wiktionary case people do use redirects for correctly spelled words, and as they are not interested in creating articles for the inflections they use redirects, or they use redirects for the wrong spellings, or they use redirects for different orthographies. So yes, it is disappointing, but it is not a reversal of policy but the correction of a bad situation.

Thanks, GerardM

Magnus Manske

4:26 p.m.

New subject: Cool URIs don't change (Ultimate Wiktionary and design decisions)

Frankly, I don't see what the "(Ultimate) Wiktionary redirect problem" is all about. You want to keep old links working, but not use #REDIRECTs for bad spelling?

Why not introduce #MISPEELING ? (or a variant ;-)

That would function like a redirect, with the exception that the destination page will, additionally, say on the very first line "You looked for the word or phrase WRONGSPELLING. We believe you might mean GOODSPELLING (this page), of which WRONGSPELLING is a misspelled form."
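A sketch of how such a directive might sit next to the existing one. `#MISPEELING` is the tongue-in-cheek keyword from the message and is not an actual MediaWiki feature; the regular expression, the `follow` function and the sample pages are hypothetical.

```python
import re

# Matches "#REDIRECT [[target]]" or the proposed "#MISPEELING [[target]]".
DIRECTIVE = re.compile(r"^#(REDIRECT|MISPEELING)\s*\[\[(.+?)\]\]", re.IGNORECASE)


def follow(title: str, pages: dict):
    """Return (body, notice) for a requested title."""
    text = pages.get(title, "")
    match = DIRECTIVE.match(text)
    if not match:
        return text, None
    kind, target = match.group(1).upper(), match.group(2)
    if kind == "MISPEELING":
        notice = (
            f"You looked for the word or phrase {title}. We believe you might "
            f"mean {target} (this page), of which {title} is a misspelled form."
        )
    else:
        notice = f"(Redirected from {title})"
    return pages.get(target, ""), notice


pages = {
    "receive": "Entry text for receive.",
    "recieve": "#MISPEELING [[receive]]",
    "Receive": "#REDIRECT [[receive]]",
}
body, notice = follow("recieve", pages)
print(notice)
```

The old link still resolves, while the notice states outright that the requested spelling is considered wrong.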

Am I missing some intricate problem here?

Magnus

Gerard Meijssen

5:48 p.m.

New subject: Cool URIs don't change (Ultimate Wiktionary and design decisions)

Magnus Manske wrote:

...

Frankly, I don't see what the "(Ultimate) Wiktionary redirect problem" is all about. You want to keep old links working, but not use #REDIRECTs for bad spelling?

Why not introduce #MISPEELING ? (or a variant ;-)

That would function like a redirect, with the exception that the destination page will, additionally, say on the very first line "You looked for the word or phrase WRONGSPELLING. We believe you might mean GOODSPELLING (this page), of which WRONGSPELLING is a misspelled form."

Am I missing some intricate problem here?

Magnus

Hoi, I am afraid you miss the point. At issue is the big muddle that is the current en.wiktionary redirects, where you can find technical redirects alongside redirects that carry an implied meaning (link to the correct spelling, link to a headword, etc.).

The Ultimate Wiktionary will not have redirects, the reason being that it starts from nothing and all removals will be done as per the policies of the project. This is true for any project anyway. Articles are removed when this is decided upon.

Thanks, Gerard

Andrew Dunbar

10:02 a.m.

On 7/24/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Hoi,

I had an interesting conversation with Brion. We do not agree on everything. One of the things we do not agree on are redirects.

In my opinion, Wiktionary should not have redirects. A word is either spelled correctly and it will have its lemma or it is not and there will not be a lemma with the incorrect spelling. In Brion's opinion there are links to lemmas and as we need to ensure that these links remain ok, we need redirects to make this possible.

In a Wikipedia context I am 100% with Brion. In a Wiktionary context it is a different matter. As only correctly spelled words should be in a Wiktionary, errors should be deleted. Some of our Wiktionaries for historical reasons are capitalising their articles.

"Historical reasons" is surely not the only reason a Wiktionary uses first-character capitalisation, turning off first-character capitalisation is not the only way to achieve correctly spelled article titles, and having correctly spelled article titles has been denied as a reason for turning off first-character capitalisation by some.

Don't forget that capitalisation of the first letter is only one issue as regards correct orthography in article titles. Another thing to watch out for is the three variations for making English compounds: two words with a space, two words with hyphenation, and one compound word. For any term, any one, two or three of these variants may be considered correct.

Other test cases which have recently met strong resistance are Latin words correctly spelled in all capitals, which was the only possible spelling while Latin was a living language, and using the correct non-ambiguous apostrophe character which has been widely available on home computers for 20 years.

Other languages also have optional or compatibility spellings: In French it is officially correct to indicate accents on capital letters but there is a de facto rule to leave them out. German specifically allows ä, ö, and ü to be spelled as ae, oe, and ue. In Switzerland there is no letter ß, the correct spelling being ss instead. This is also specifically considered correct in the other German-speaking countries. Latin and Old English often have macrons to show long vowels and rarely have breves to show short vowels.

Old English and Middle English also had various fashions but no official spelling, with various exotic letters being used at different times and under various circumstances, resulting in varied spellings of many words. For example, ð and þ were mostly interchangeable.

Ancient Greek and Modern Greek have different accent marks which look quite similar but have different names and different places in Unicode. But the Modern Greek accents are still much more common in Ancient Greek on the Internet.

Hebrew geresh is often represented by ASCII apostrophe, Hebrew gershayim by ASCII double quote, Hebrew maqaf by ASCII hyphen, and Hawaiian okina by ASCII apostrophe. Turkish long vowels (actually more complicated than this) can be indicated by use of the circumflex accent according to the official orthographical rules. Russian (and some other Cyrillic script languages) can optionally indicate where the stress is, and in some contexts it is the norm. With Hebrew and most languages in Arabic script, all short vowels are optional, as are a number of other "letters" such as dagesh, shadda, sukun, and a host of more exotic ones.
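One way to make the dedicated characters and their ASCII stand-ins find each other is to fold both to the same form before comparing. The sketch below uses Python's built-in `str.translate`; the table covers only the four characters named above, and the function name `fold` is hypothetical.

```python
# Dedicated Unicode characters and the ASCII characters often typed instead.
ASCII_STANDINS = str.maketrans({
    "\u05f3": "'",   # HEBREW PUNCTUATION GERESH
    "\u05f4": '"',   # HEBREW PUNCTUATION GERSHAYIM
    "\u05be": "-",   # HEBREW PUNCTUATION MAQAF
    "\u02bb": "'",   # MODIFIER LETTER TURNED COMMA (okina)
})


def fold(text: str) -> str:
    """Replace the dedicated characters with their common ASCII stand-ins."""
    return text.translate(ASCII_STANDINS)


print(fold("Hawai\u02bbi") == fold("Hawai'i"))  # True
```

Folding is only suitable for matching; it says nothing about which form should be the displayed spelling.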

Hebrew also has accents which only occur in religious works plus there are plene and defective spellings and both have vowels etc as optional extras on top.

In some Polynesian languages macrons and glottal stops are optional; in others they are compulsory.

Chinese, Japanese, and Korean have written variants of many characters which have the same meaning and sound with all being correct. They also have variants which exist only due to computer encodings and quirks in how various fonts were designed.

For some languages different optional features of orthography can interact to form many combinations and permutations, all of which are correct spellings.

There are surely quite a few more examples I haven't even become aware of yet.

...

In essence this means that from a spelling point of view the names of the lemmas are irrelevant. However, many people assume that the name of the article indicates that a word is spelled correctly. To remedy this, more and more wiktionaries are moving away from first character capitalisation and making it possible to have correctly spelled words as a lemma.

Or they are moving due to rhetoric like this email rather than for any good reason. Remember that in print dictionaries the norm is to include different meanings and parts of speech, and even derivatives - regardless of capitalisation - into one article or at least on the one page.

English Wiktionary still considers only first letter capitalisation and ASCII apostrophes and Russian without stress marks to be correct enough to be titles, in the last case even as redirects! (if this is the meaning of "lemma" you mean). What do other Wiktionaries do?

...

When a wiktionary has made this move away from first character capitalisation, the interwiki and interproject links within the Wikimedia projects need to be fixed. After this, the redirects can in my opinion be removed. I think this is appropriate because users expect that an application behaves in certain ways. When new content is added to a non-capitalised Wiktionary, the word foo will not have a redirect in Foo and consequently it behaves differently from the content predating the move to non-capitalisation. Also words like Kinder and kinder are not related at all.

Don't you mean that "not all words like Kinder and kinder are related"? This is almost the opposite meaning. Also many words are related. Even in German it is common for a noun and another part of speech to be intimately related and share an identical spelling apart from capitalisation.

...

The redirect at Kinder will be replaced at some stage breaking the existing redirect and consequently not providing the continuance that Brion holds dear.

For the Ultimate Wiktionary I have documented some of the design criteria. It can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage The Data design can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design

One crucial decision is that only correct spelling is allowed. This means that all incorrect spelling will be amended or deleted. As Ultimate Wiktionary is a database, it does not cater for things like redirects. I urge you to have a look at both the design criteria and the design itself because this is the time when it is relatively easy to make changes. Once Erik starts coding the UW database, having finished Wikidata and the GEMET implementation, the moment has passed us by.

Please list, out of the above points, what is and what is not considered a correct spelling as far as Ultimate Wiktionary is concerned. Please then indicate whether every correct spelling is also suitable as a headword/article title/lemma or whatever you wish to call it.

Hippietrail.

...

Thanks, GerardM

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

Nikola Smolenski

11:42 a.m.

On Monday 25 July 2005 09:02, Andrew Dunbar wrote:

...

On 7/24/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

accent according to the official orthographical rules, Russian (and some other Cyrillic script languages) can optionally indicate where the stress is and in some contexts it is the norm. With Hebrew and most

Yes, and one of the places where it is the norm is in a dictionary ;)

Regardless of how this is resolved in the end, it would make sense to build in at least some ability to determine such things automatically. You don't want to duplicate the entire Russian corpus (with inflections, it could easily rise to ten million words) so that you could have each one of them with and without diacritics. It makes sense to have only canonical spellings in the dictionary, and a bit of code to offer the nearest match when someone tries to retrieve a word spelled in a different way.
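A small sketch of that "bit of code" for the Russian case, assuming the dictionary stores the stressed form as canonical and that stress is written with the combining acute accent (U+0301). The word list and function names are hypothetical.

```python
import unicodedata

STRESS = "\u0301"  # combining acute accent used as a stress mark


def strip_stress(word: str) -> str:
    """Remove stress marks only; other marks (such as the breve of й) stay."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(STRESS, ""))


# Canonical entries, stored with stress marks where applicable.
CANONICAL = ["молоко\u0301", "вода\u0301", "чай"]
INDEX = {strip_stress(w): w for w in CANONICAL}


def nearest(query: str):
    """Return the canonical spelling for a query typed with or without stress."""
    return INDEX.get(strip_stress(query))


print(nearest("молоко") == "молоко\u0301")  # True
```

Only the stress mark is removed rather than every combining character, because stripping all of them after decomposition would also turn й into и.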

...

...
One crucial decision is that only correct spelling is allowed. This means that all incorrect spelling will be amended or deleted. As Ultimate Wiktionary is a database, it does not cater for things like redirects. I urge you to have a look at both the design criteria and the design itself because this is the time when it is relatively easy to make changes. Once Erik starts coding the UW database, having finished Wikidata and the GEMET implementation, the moment has passed us by.

Please list, out of the above points, what is and what is not considered a correct spelling as far as Ultimate Wiktionary is concerned. Please then indicate whether every correct spelling is also suitable as a headword/article title/lemma or whatever you wish to call it.

The way I see it, this decision is a political and not a technical one. Each word could have several spellings, each of which is related to a spelling authority. If you want common misspellings in the dictionary, simply have "Common misspelling" as a spelling authority. Similarly, nothing prevents you from having several different spellings of the same word attributed to a single spelling authority, which solves all the problems you mentioned above.
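The relation described here can be sketched as a table of (word, spelling, authority) rows. "Common misspelling" is the authority suggested in the message; "authority A" and "authority B" are placeholders, and the function names are hypothetical.

```python
# (word, spelling, spelling authority)
SPELLINGS = [
    ("colour", "colour", "authority A"),
    ("colour", "color", "authority B"),
    ("receive", "receive", "authority A"),
    ("receive", "receive", "authority B"),
    ("receive", "recieve", "Common misspelling"),
]


def spellings_for(word: str, authority=None) -> list:
    """Spellings of a word, optionally limited to one authority."""
    found = []
    for w, spelling, auth in SPELLINGS:
        if w == word and authority in (None, auth) and spelling not in found:
            found.append(spelling)
    return found


def authorities_for(spelling: str) -> list:
    """Which authorities list this spelling."""
    return sorted({auth for _, s, auth in SPELLINGS if s == spelling})


print(spellings_for("colour"))     # ['colour', 'color']
print(authorities_for("recieve"))  # ['Common misspelling']
```

Whether a spelling counts as correct then becomes a question of which authorities a reader or project chooses to accept, not of whether a row exists.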

Tomer Chachamu

12:24 p.m.

On 25/07/05, Nikola Smolenski smolensk@eunet.yu wrote:

...

The way I see it, this decision is a political and not a technical one. Each word could have several spellings, each of which is related to a spelling authority. If you want common misspellings in the dictionary, simply have "Common misspelling" as a spelling authority. Similarly, nothing prevents you from having several different spellings of the same word attributed to a single spelling authority, which solves all the problems you mentioned above.

Surely it would be better for the search function to try some other possibilities, such as removing glottal stops from the input and searching again.

Besides, didn't we forget the Unicode possibility of the same character entered in two ways? http://en.wikipedia.org/wiki/Canonical_equivalence

Actually, ideally every word would have a "search name" which is worked out based on the language. It would only contain the compulsory characters, with a certain decision on e.g. German ä ö ü (to transform them in one direction or the other). It could also decompose all precomposed characters into sets of characters. This is a little bit like how some databases store the soundex index.

I think this would handle most cases. I certainly agree that redirects are a necessary technical feature for the rarer cases.
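The "search name" idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name search_name, the two rule tables and the choice to rewrite German ä/ö/ü/ß towards ae/oe/ue/ss are all made up for the example. Only marks listed as optional for a language are removed, so that compulsory diacritics (for example the breve in Russian й) survive.

```python
import unicodedata

# Assumed per-language rewrites; the direction of the German mapping is a choice.
REWRITES = {"de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}}

# Assumed per-language optional marks; U+0301 is the combining acute accent,
# used here as the optional Russian stress mark.
OPTIONAL_MARKS = {"ru": {"\u0301"}}

def search_name(word, lang):
    """Derive a lookup key from a spelling using language-specific rules."""
    text = unicodedata.normalize("NFC", word.lower())
    for src, dst in REWRITES.get(lang, {}).items():
        text = text.replace(src, dst)
    optional = OPTIONAL_MARKS.get(lang, set())
    # Decompose so optional marks become separate code points, drop them,
    # then recompose whatever compulsory marks remain.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in optional)
    return unicodedata.normalize("NFC", kept)

print(search_name("Mädchen", "de"), search_name("Maedchen", "de"))
```

Both the stored spelling and the user's query would be passed through the same function, and the comparison would be made on the resulting keys.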

Gerard Meijssen

12:40 p.m.

Tomer Chachamu wrote:

...

On 25/07/05, Nikola Smolenski smolensk@eunet.yu wrote:

...
The way I see it, this decision is a political and not a technical one. Each word could have several spellings, each of which is related to a spelling authority. If you want common misspellings in the dictionary, simply have "Common misspelling" as a spelling authority. Similarly, nothing prevents you from having several different spellings of the same word attributed to a single spelling authority, which solves all the problems you mentioned above.

Surely it would be better for the search function to try some other possibilities, such as removing glottal stops from the input and searching again.

Besides, didn't we forget the Unicode possibility of the same character entered in two ways? http://en.wikipedia.org/wiki/Canonical_equivalence

Actually, ideally every word would have a "search name" which is worked out based on the language. It would only contain the compulsory characters, with a certain decision on e.g. German ä ö ü (to transform them in one direction or the other). It could also decompose all precomposed characters into sets of characters. This is a little bit like how some databases store the soundex index.

I think this would handle most cases. I certainly agree that redirects are a necessary technical feature for the rarer cases.

Hoi, Redirects will not exist in the UW. There is no need for them. Either they are correct spellings or they are not. If we MUST have a situation where links into UW need to be maintained, the words will go into Misspelling. Consequently the information will be utterly different from what it was before, and it cannot be maintained that the external party's URL will point to the same information, because instead of a word with a meaning and all the other stuff it will now give a "You are wrong". This will be handled either at the Spelling level or at the Misspelling level.

Then again, a user may want to see only a limited number of languages, and consequently what is shown will also not be the same for everyone. The words "dei" and "die" are both common typos, but for different languages. So what are we going to show when someone asks for dei or die?

Thanks, GerardM

Tomer Chachamu

6:32 p.m.

On 25/07/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Tomer Chachamu wrote:

...
Actually, ideally every word would have a "search name" which is worked out based on the language. It would only contain the compulsory characters, with a certain decision on e.g. German ä ö ü (to transform them in one direction or the other). It could also decompose all precomposed characters into sets of characters. This is a little bit like how some databases store the soundex index.

I think this would handle most cases. I certainly agree that redirects are a necessary technical feature for the rarer cases.

Hoi, Redirects will not exist in the UW. There is no need for them. Either they are correct spellings or they are not. If we MUST have a situation where links into UW need to be maintained, the words will go into Misspelling. Consequently the information will be utterly different from what it was before, and it cannot be maintained that the external party's URL will point to the same information, because instead of a word with a meaning and all the other stuff it will now give a "You are wrong". This will be handled either at the Spelling level or at the Misspelling level.

There can be more than one correct spelling, as Andrew Dunbar detailed. Where exactly is the database design?

...

Then again, a user may want to see only a limited number of languages, and consequently what is shown will also not be the same for everyone. The words "dei" and "die" are both common typos, but for different languages. So what are we going to show when someone asks for dei of [or] die ??

ult.wiktionary.org/wikt/dei: like current Wiktionary, i.e. a list of all meanings in all languages.
ult.wiktionary.org/wikt/dei/de: German meaning only.

Gerard Meijssen

6:56 p.m.

Tomer Chachamu wrote:

...

On 25/07/05, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...
Tomer Chachamu wrote:

...
Actually, ideally every word would have a "search name" which is worked out based on the language. It would only contain the compulsory characters, with a certain decision on e.g. German ä ö ü (to transform them in one direction or the other). It could also decompose all precomposed characters into sets of characters. This is a little bit like how some databases store the soundex index.

I think this would handle most cases. I certainly agree that redirects are a necessary technical feature for the rarer cases.

Hoi, Redirects will not exist in the UW. There is no need for them. Either they are correct spellings or they are not. If we MUST have a situation where links into UW need to be maintained, the words will go into Misspelling. Consequently the information will be utterly different from what it was before, and it cannot be maintained that the external party's URL will point to the same information, because instead of a word with a meaning and all the other stuff it will now give a "You are wrong". This will be handled either at the Spelling level or at the Misspelling level.

There can be more than one correct spelling, as Andrew Dunbar detailed. Where exactly is the database design?

Obviously there can be more than one correct spelling for the same language. Different orthographies need to be identified as such.

The design, including an ERD, can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_data_design and some notes on its usage can be found here: http://meta.wikimedia.org/wiki/Ultimate_Wiktionary_decisions_on_its_usage

...

...
Then again, a user may want to see only a limited number of languages, and consequently what is shown will also not be the same for everyone. The words "dei" and "die" are both common typos, but for different languages. So what are we going to show when someone asks for dei of [or] die ??

ult.wiktionary.org/wikt/dei: like current Wiktionary, i.e. a list of all meanings in all languages.
ult.wiktionary.org/wikt/dei/de: German meaning only.

The URLs as you state them do not take into account the effect that filtering will have. Filtering will be specified at the user level. You may filter to include only zu, de, fr and xh, or nl, it, nds and fy. It just depends on the user settings in the user interface. So your suggested URLs are a bit off from what is proposed.
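To make the difference with a language-in-the-URL scheme concrete, here is a minimal sketch of per-user filtering. The entry records, the glosses and the function name visible_entries are invented for the illustration and do not reflect the actual UW design.

```python
# Hypothetical entries sharing one spelling across several languages.
ENTRIES = [
    {"spelling": "die", "language": "de", "gloss": "the (feminine article)"},
    {"spelling": "die", "language": "en", "gloss": "to stop living"},
    {"spelling": "die", "language": "nl", "gloss": "that, those"},
]

def visible_entries(spelling, user_languages, entries=ENTRIES):
    """Return entries for a spelling, limited to the user's chosen languages."""
    return [
        entry for entry in entries
        if entry["spelling"] == spelling and entry["language"] in user_languages
    ]

# Two users ask for the same word but have different filter settings.
print([e["language"] for e in visible_entries("die", {"de", "fr", "zu", "xh"})])
print([e["language"] for e in visible_entries("die", {"nl", "it", "nds", "fy"})])
```

The same request yields different results per user, which is why a single URL cannot promise identical content to everyone.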

Thanks, GerardM

Andrew Dunbar

3:35 p.m.

On 7/25/05, Tomer Chachamu the.r3m0t@gmail.com wrote:

...

On 25/07/05, Nikola Smolenski smolensk@eunet.yu wrote:

...
The way I see it, this decision is a political and not a technical one. Each word could have several spellings, each of which is related to a spelling authority. If you want common misspellings in the dictionary, simply have "Common misspelling" as a spelling authority. Similarly, nothing prevents you from having several different spellings of the same word attributed to a single spelling authority, which solves all the problems you mentioned above.

Surely it would be better for the search function to try some other possibilities, such as removing glottal stops from the input and searching again.

Besides, didn't we forget the Unicode possibility of the same character entered in two ways? http://en.wikipedia.org/wiki/Canonical_equivalence

The current version of MediaWiki does Unicode canonical normalisation. I would hope that UW would too. Compatibility normalisation is a different kettle of fish though.
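The distinction between canonical and compatibility normalisation can be shown with Python's standard unicodedata module. This is only a small demonstration of the Unicode normalisation forms, not a description of how MediaWiki implements them; the helper name same_canonical is made up for the example.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # a followed by COMBINING DIAERESIS
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

def same_canonical(a, b):
    """True if two strings are canonically equivalent (compared in NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two encodings of ä differ as raw strings but are canonically equivalent.
print(precomposed == decomposed, same_canonical(precomposed, decomposed))

# Canonical normalisation leaves the ligature alone; compatibility
# normalisation (NFKC) replaces it with the two letters f and i.
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature))
```

Canonical normalisation only unifies different encodings of the same character, while compatibility normalisation also folds characters that are visually or semantically distinct, which is why it needs a separate decision.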

...

Actually, ideally every word would have a "search name" which is worked out based on the language. It would only contain the compulsory characters, with a certain decision on e.g. German ä ö ü (to transform them in one direction or the other).

Well, since there are German words which can be spelled with ae, oe, ue but not with ä, ö, ü, the diacritic versions have to be the canonical ones. The same goes for French œ, which can always be spelled as oe, while not every oe can be spelled as œ. However, ß is always the wrong spelling in Switzerland, and not every ss can be spelled ß in Germany and Austria either.

...

It could also decompose all precomposed characters into sets of characters. This is a little bit like how some databases store the soundex index.

Unicode canonical normalisation does this.

...

I think this would handle most cases. I certainly agree that redirects are a necessary technical feature for the rarer cases.

Or maybe thinking outside the MediaWiki box altogether if UW is to be designed from the ground up.

...

Wikitech-l mailing list Wikitech-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/wikitech-l

-- http://linguaphile.sf.net

Gerard Meijssen

12:28 p.m.

Nikola Smolenski wrote:

...

On Monday 25 July 2005 09:02, Andrew Dunbar wrote:

...
On 7/24/05, Gerard Meijssen gerard.meijssen@gmail.com wrote: accent according to the official orthographical rules, Russian (and some other Cyrillic script languages) can optionally indicate where the stress is and in some contexts it is the norm. With Hebrew and most

Yes, and one of the places where it is the norm is in a dictionary ;)

Regardless of how this is resolved in the end, it would make sense to build in at least some ability to determine such things automatically. You don't want to duplicate the entire Russian corpus (with inflections, it could easily rise to ten million words) just so that you could have each word both with and without diacritics. It makes sense to have only canonical spellings in the dictionary, plus a bit of code to offer the nearest match when someone tries to retrieve a word spelled in a different way.

As a matter of fact I do want all inflections, even if they are ten million words. Now I do not expect to have all these inflections to start off with, but as far as I am concerned I want them all. I already have 222.930 Dutch words and they do include many inflections. I asked Brion about this at one stage and he saw no problem with a big database. From a disk space point of view it does indeed not amount to that much .. :)

As we explicitly want to use the UW as a repository of correct spellings, a repository that can be used by Open and Free software projects, we explicitly want the physical records. When we have technology to generate inflections, well and good, but we will still want all the correctly spelled words.

...

...
...
One crucial decision is that only correct spelling is allowed. This means that all incorrect spelling will be amended or deleted. As Ultimate Wiktionary is a database, it does not cater for things like redirects. I urge you to have a look at both the design criteria and the design itself because this is the time when it is relatively easy to make changes. Once Erik starts coding the UW database, having finished Wikidata and the GEMET implementation, the moment has passed us by.

Please list, out of the above points, what is and what is not considered a correct spelling as far as Ultimate Wiktionary is concerned. Please then indicate whether every correct spelling is also suitable as a headword/article title/lemma or whatever you wish to call it.

The way I see it, this decision is a political and not a technical one. Each word could have several spellings, each of which is related to a spelling authority. If you want common misspellings in the dictionary, simply have "Common misspelling" as a spelling authority. Similarly, nothing prevents you from having several different spellings of the same word attributed to a single spelling authority, which solves all the problems you mentioned above.

I have, for one compelling reason, added a table Misspelling. This is where the absolutely wrong spellings may go. Its function? To prevent people from adding wrong spellings time and again. So this table is to grow organically. Now there is this massive file on en.wikipedia and en.wiktionary; this file contains typos. This table is not really there for the typos but for the words that are spelled wrong for the "right" reason. Meaning can relate to several almost identical Words (and by implication Spellings). This may mean that several orthographies are implied. These orthographies are to be named and identified. Common misspelling is not an orthography; if anything it is the antithesis of orthography.

From a database point of view a Word has one Spelling. Given the ERD, that is very much technical and non-negotiable. It is the Spelling that is validated by a Spelling Authority.

Thanks, GerardM

Lars Aronsson

26 Jul

2:28 p.m.

Gerard Meijssen wrote:

...

Nikola Smolenski wrote:

...
You don't want to duplicate the entire Russian corpus (with inflections, it could easily rise to ten million words) just so that you could have each word both with and without diacritics. It makes sense to have only canonical spellings in the dictionary, plus a bit of code to offer the nearest match when someone tries to retrieve a word spelled in a different way.

As a matter of fact I do want all inflections, even if they are ten million words. Now I do not expect to have all these inflections to start off with, but as far as I am concerned I want them all. I already have 222.930 Dutch words and they do include many inflections.

There are two approaches to dictionaries: (1) The encyclopedic approach, trying to find (define, spellcheck, explain, ...) "all" words (and their inflections), or (2) the statistics based approach, trying to find the most commonly used words. I think the OED is of the first kind, while many dictionaries in recent decades (built with the help of computers, extracting word frequency statistics from large text corpora) have been of the latter kind. Some would call (1) a 19th century approach.

The real difference is their handling of the least common words. The encyclopedic approach sees every missing word as a failure, while the statistics based approach recognizes that there is an infinite number of words anyway (new ones are created every day) and some might be too uncommon to deserve a mention.

As a consequence, spellchecking in the statistics based approach can never say that a spelling is "wrong" when it is missing from the dictionary, only that it probably is "uncommon" and thus suspect. The remedy for this is a statistics based dictionary of common misspellings. Wikipedia article history can be used as a source for this. Just find all edits that changed one word, e.g. speling -> spelling, and you will have a fine dictionary of common spelling mistakes.
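A minimal sketch of how such one-word corrections could be pulled out of two revisions of a text, using Python's standard difflib module. The function name single_word_corrections and the whitespace tokenisation are assumptions for the example; fetching revisions from the article history is left out.

```python
import difflib

def single_word_corrections(old_text, new_text):
    """Return (before, after) pairs for each one-word replacement in an edit."""
    old_words, new_words = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only spans where exactly one word was swapped for one other word.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((old_words[i1], new_words[j1]))
    return pairs

print(single_word_corrections("the speling of this word",
                              "the spelling of this word"))
```

Counting how often each pair occurs across many edits would give the frequency statistics. The sketch cannot tell a typo fix from a change of wording (e.g. big -> large), so some filtering, for instance on string similarity, would still be needed.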

...

From a database point of view a Word has one Spelling.

This would be an example of the encyclopedic approach.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Tomer Chachamu

27 Jul

12:59 a.m.

On 26/07/05, Lars Aronsson lars@aronsson.se wrote:

...

There are two approaches to dictionaries: (1) The encyclopedic approach, trying to find (define, spellcheck, explain, ...) "all" words (and their inflections), or (2) the statistics based approach, trying to find the most commonly used words. I think the OED is of the first kind, while many dictionaries in recent decades (built with the help of computers, extracting word frequency statistics from large text corpora) have been of the latter kind. Some would call (1) a 19th century approach.

The real difference is their handling of the least common words. The encyclopedic approach sees every missing word as a failure, while the statistics based approach recognizes that there is an infinite number of words anyway (new ones are created every day) and some might be too uncommon to deserve a mention.

As a consequence, spellchecking in the statistics based approach can never say that a spelling is "wrong" when it is missing from the dictionary, only that it probably is "uncommon" and thus suspect. The remedy for this is a statistics based dictionary of common misspellings. Wikipedia article history can be used as a source for this. Just find all edits that changed one word, e.g. speling -> spelling, and you will have a fine dictionary of common spelling mistakes.

...
From a database point of view a Word has one Spelling.

This would be an example of the encyclopedic approach.

It is clear to me that the approach we want to take is the "encyclopedic" one, simply because we can handle it. The Oxford dictionary in paper cannot handle it "elegantly" as it becomes unwieldy, spans a whole shelf. A good database can.

It is unacceptable for a Word to have one Spelling for reasons described previously (German a-with-umlaut, Hebrew niqqud and optional vowels, etc), but I am unable to find out who originally wrote that.

Gerard Meijssen

9:28 a.m.

Tomer Chachamu wrote:

...

On 26/07/05, Lars Aronsson lars@aronsson.se wrote:

...
There are two approaches to dictionaries: (1) The encyclopedic approach, trying to find (define, spellcheck, explain, ...) "all" words (and their inflections), or (2) the statistics based approach, trying to find the most commonly used words. I think the OED is of the first kind, while many dictionaries in recent decades (built with the help of computers, extracting word frequency statistics from large text corpora) have been of the latter kind. Some would call (1) a 19th century approach.

The real difference is their handling of the least common words. The encyclopedic approach sees every missing word as a failure, while the statistics based approach recognizes that there is an infinite number of words anyway (new ones are created every day) and some might be too uncommon to deserve a mention.

As a consequence, spellchecking in the statistics based approach can never say that a spelling is "wrong" when it is missing from the dictionary, only that it probably is "uncommon" and thus suspect. The remedy for this is a statistics based dictionary of common misspellings. Wikipedia article history can be used as a source for this. Just find all edits that changed one word, e.g. speling -> spelling, and you will have a fine dictionary of common spelling mistakes.

...
From a database point of view a Word has one Spelling.

This would be an example of the encyclopedic approach.

It is clear to me that the approach we want to take is the "encyclopedic" one, simply because we can handle it. The Oxford dictionary in paper cannot handle it "elegantly" as it becomes unwieldy, spans a whole shelf. A good database can.

It is unacceptable for a Word to have one Spelling for reasons described previously (German a-with-umlaut, Hebrew niqqud and optional vowels, etc), but I am unable to find out who originally wrote that.

Hoi, You have to realise that the naming convention used is that Word and Spelling refer to records in the database. Consequently it is unacceptable to have but one Spelling to one Word. You can have all the spellings that are correct and associate them through Relation and SynTrans. Now the different Words have to be marked for what they are. The niqqud follow some system and as such have to be identified; this leads to collections of the same type of niqqud that can be identified.

Thanks, GerardM

Nikola Smolenski

9:18 a.m.

On Monday 25 July 2005 11:28, Gerard Meijssen wrote:

...

Nikola Smolenski wrote:

...
On Monday 25 July 2005 09:02, Andrew Dunbar wrote:

...
On 7/24/05, Gerard Meijssen gerard.meijssen@gmail.com wrote: accent according to the official orthographical rules, Russian (and some other Cyrillic script languages) can optionally indicate where the stress is and in some contexts it is the norm. With Hebrew and most

Yes, and one of the places where it is the norm is in a dictionary ;)

Regardless of how this is resolved in the end, it would make sense to build in at least some ability to determine such things automatically. You don't want to duplicate the entire Russian corpus (with inflections, it could easily rise to ten million words) just so that you could have each word both with and without diacritics. It makes sense to have only canonical spellings in the dictionary, plus a bit of code to offer the nearest match when someone tries to retrieve a word spelled in a different way.

As a matter of fact I do want all inflections, even if they are ten million words. Now I do not expect to have all these inflections to

Of course, but do you want ten million words with diacritics, and then ten million of the same words without diacritics?

...

From a database point of view a Word has one Spelling. Given the ERD, that is very much technical and non-negotiable. It is the Spelling that is validated by a Spelling Authority.

Simply change the relation spelling.spellingID-word.spellingID from 1-inf to inf-inf, i.e. from one-to-many to many-to-many. That way each word could have multiple spellings, each validated by a spelling authority.

By the way, what is the table "table" (related to "pronunciation") for?
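A rough sketch of what such a many-to-many relation could look like, using SQLite from Python's standard library. The table and column names (word, spelling, word_spelling, authority) and the authority labels are simplified assumptions for the illustration and are not taken from the actual ERD.

```python
import sqlite3

def build_schema(conn):
    """Create a hypothetical many-to-many link between words and spellings."""
    conn.executescript("""
        CREATE TABLE word (word_id INTEGER PRIMARY KEY, language TEXT NOT NULL);
        CREATE TABLE spelling (spelling_id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        CREATE TABLE word_spelling (
            word_id INTEGER NOT NULL REFERENCES word(word_id),
            spelling_id INTEGER NOT NULL REFERENCES spelling(spelling_id),
            authority TEXT NOT NULL,  -- the authority validating this pairing
            PRIMARY KEY (word_id, spelling_id, authority)
        );
    """)

def spellings_of(conn, word_id):
    """Return (spelling, authority) rows for one word."""
    rows = conn.execute(
        "SELECT s.text, ws.authority FROM word_spelling ws "
        "JOIN spelling s ON s.spelling_id = ws.spelling_id "
        "WHERE ws.word_id = ? ORDER BY s.text, ws.authority",
        (word_id,),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.execute("INSERT INTO word (word_id, language) VALUES (1, 'de')")
conn.executemany("INSERT INTO spelling (spelling_id, text) VALUES (?, ?)",
                 [(1, "Straße"), (2, "Strasse")])
# Authority labels are illustrative only.
conn.executemany("INSERT INTO word_spelling VALUES (?, ?, ?)",
                 [(1, 1, "Standard German"), (1, 2, "Swiss Standard German")])
print(spellings_of(conn, 1))
```

With the link table, one word can carry several validated spellings, and one spelling can in turn be shared by several words.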

participants (10)

Andrew Dunbar
Brion Vibber
Gerard Meijssen
Heiko Evermann
Lars Aronsson
Magnus Manske
Nikola Smolenski
Rob Lanphier
Timwi
Tomer Chachamu