This is a message based on a conversation about revamping the Wiktionary service so that it contains more structured data. In its current form it is a wiki about words. This does not allow for easy connections to be made between thesaurus words, translations, etc.
Hopefully, with some help, brainstorming, and experience we can hammer out a better system to store, look up, and access dictionary words.
I've got loads of ideas about how to implement some of this, but others know more about how to include rollbacks, versioning, and deletion of material.
If people are interested, I can post some more in-depth ideas about what (I think) would need to be built and get some feedback about planning for other languages, special cases, migration, and future additions.
-brian
At 12:14 11/02/2004 +0000, you wrote:
This is a message based on a conversation about revamping the Wiktionary service so that it contains more structured data. In its current form it is a wiki about words. This does not allow for easy connections to be made between thesaurus words, translations, etc.
I am interested, as I am writing a spellchecker and the data should come from Wiktionary.
Hopefully, with some help, brainstorming, and experience we can hammer out a better system to store, look-up, and access dictionary words.
I've got loads of ideas about how to implement some of this, but others know more about how to include rollbacks, versioning, and deletion of material.
If people are interested, I can post some more in-depth ideas about what (I think) would need to be built and get some feedback about planning for other languages, special cases, migration, and future additions.
I currently have 266K en words and propose that the spellchecker be partly a PHP module, to increase its speed.
Along with this, the parser for wiki markup could also be in the module.
Dave Caroline aka archivist usually hanging about on irc
I am interested, as I am writing a spellchecker and the data should come from Wiktionary.
--- The only problem with pulling data from Wiktionary is: how trustworthy is it? I spell stuff wrong all the time, and if I put an incorrect spelling such as 'sincerly' into Wiktionary, would the spellchecker consider it to be correct? I do like the idea though. I know loads about web services and eventually would like to write some plug-ins for other applications that can search Wiktionary for thesaurus words, definitions, translations, etc.
The other question about the spellchecker would be: should you be able to download 'spellcheck' files so it can be run locally? This might be an option so that smaller devices like PDAs can leverage Wiktionary and run local look-ups.
The spellchecker could also be used in reverse, so when people mistype words they are searching for, the spellchecker could give you something like 'NO RESULTS FOUND, DID YOU REALLY MEAN: ...'
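As a rough illustration of the "did you really mean" idea, here is a minimal Python sketch using the standard library's difflib. The word list, the function name did_you_mean and the 0.75 similarity cutoff are all assumptions made up for the example; a real list would be exported from Wiktionary.

```python
import difflib

# Hypothetical word list; in practice it would be exported from Wiktionary.
KNOWN_WORDS = ["sincerely", "sincere", "since", "dictionary", "thesaurus", "translation"]


def did_you_mean(query, words=KNOWN_WORDS, limit=3):
    """Return [query] if the word is known, otherwise similar known spellings."""
    if query in words:
        return [query]
    # get_close_matches ranks candidates by similarity, best match first.
    return difflib.get_close_matches(query, words, n=limit, cutoff=0.75)


suggestions = did_you_mean("sincerly")
if suggestions:
    print("NO RESULTS FOUND, DID YOU REALLY MEAN:", ", ".join(suggestions))
```

This only suggests words that are already in the list, so it does nothing about the trust problem above: a misspelling entered into the list would be offered as a suggestion like any other word.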
-brian
On Wednesday 11 February 2004 13:14, brian suda wrote:
This is a message based on a conversation about revamping the Wiktionary service so that it contains more structured data. In its current form it is a wiki about words. This does not allow for easy connections to be made between thesaurus words, translations, etc.
Hopefully, with some help, brainstorming, and experience we can hammer out a better system to store, look-up, and access dictionary words.
I've got loads of ideas about how to implement some of this, but others know more about how to include rollbacks, versioning, and deletion of material.
If people are interested, I can post some more in-depth ideas about what (I think) would need to be built and get some feedback about planning for other languages, special cases, migration, and future additions.
Though I am not active in Wiktionary, I have been thinking about this for a while. I think that MediaWiki software, great as it is, is not adequate for creating a dictionary and that new software has to be written from scratch. I call this kind of software WikiBase - a database in which (more or less) each field acts like a wiki page - it could be changed at any time, has an edit history, could be discussed, etc.
In its simplest form, the database of a dictionary might have four tables:
words: ID|word
concepts: ID|concept
languages: ID|language
meanings: wordID|languageID|conceptID
They could be connected like this:
words: 1|egg 2|jaje
concepts: 1|Something laid by a hen
languages: 1|English 2|Serbian
meanings: 1|1|1 2|2|1
That is, the word "egg" in the language called "English" has the meaning of "Something laid by a hen", and the word "jaje" in the language called "Serbian" has the same meaning. Now, this has obvious flaws, but I have not envisioned the database to be so simple. To cut to the point, I think that the following structure would be enough to satisfy all the needs of a dictionary (WARNING: long and sometimes confusing text ahead):
writings: ID|spelling
readings: ID|reading
languages: ID|language|dialect|place|time|group
basics: ID|basic
words: ID|languageID|writingID|readingID
grammar: wordID|relation|wordID
concepts: basicID|relation|basicID
meanings: wordID|basicID
Now, how would all this work? I will use as an example the English word "hair", for which three Serbian words exist: "kosa" (hair on one's head), "dlaka" (a hair) or "malja" (a hair on the body).
Table "writings" contains the exact words as written on paper:
writings: 1|hair 2|kosa 3|dlaka 4|malja 5|hairs
Table "readings" contains readings of the words. I guess that it might be the easiest to use an internal format for this, which could be externally represented as IPA or SAMPA. Of course, for some languages, the readings could be autogenerated.
readings: 1|hejr 2|kosa 3|dlaka 4|mal<sup>j</sup>a 5|mal<sup>j</sup>e
(Note that here some IDs are the same; this of course need not be the case.)
Table "languages" contains data about languages. I was thinking big and allowed for defining various dialects, regions, the exact (or not exact) time at which a word was in use, and slang (of a certain social group). Perhaps this table needs a bit more work, but the basic idea is there. In this example, I'll use only the language name and forget about the rest:
languages: 1|English 2|Serbian
The last of the basic tables, "basics", describes basic concepts.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
The rest of the tables show relations among the IDs of these tables. Table "words" shows which writing has which reading in which language.
words: 1|1|1|1 2|2|2|2 3|2|3|3 4|2|4|4 5|1|5|NULL 6|2|NULL|5
I'll expand the table:
1|English|hair|hejr 2|Serbian|kosa|kosa 3|Serbian|dlaka|dlaka 4|Serbian|malja|mal<sup>j</sup>a 5|English|hairs|NULL 6|Serbian|NULL|mal<sup>j</sup>e
Note that how the English word "hairs" is read is not yet known, and how a certain Serbian word is actually written is also not yet known. It doesn't matter.
Now, table "grammar" explains grammatical relations between the words:
1|root|1 2|root|2 3|root|3 4|root|4 5|plural|1 6|plural|4
Expanded:
hair|root|hair kosa|root|kosa dlaka|root|dlaka malja|root|malja hairs|plural|hair malje|plural|malja
I will explain what this "root" property means later when I explain how to actually query the database.
Table "concepts" is similar, except that it explains relations of the basic concepts:
1|mass/root|1 2|root|2 3|root|3 4|root|4 2|includes|3 2|includes|4
I will not expand the table but rather show the table "basics" again.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
FINALLY, table "meanings" attaches words to concepts.
1|1 2|1 2|2 3|3 4|4 5|5
Now, how to read the dictionary. Suppose that you want to know what the word "hairs" means in the English language. You go to the table "writings" and find "hairs", which has an ID of 5. Then you go to the table "languages" and find "English" with an ID of 1. You go to the table "words" and see that the ID for this word is 5 (along the way you might pick up the reading of the word). Now you go to the table "grammar" and see that word 5 is actually the plural form of word 1. In the same table you examine word 1 and see that it is a root word; that is, one attached to a concept. Having found that out, you search in "meanings" for the concepts attached to root word 1 and you will see that there are two: concepts 1 and 2. In the table "concepts" you look them up and find that concept 2 is a root concept and concept 1 is also a root concept, but a mass one; that is, of a thing that comes in an undistinguished mass. Finally you have:
'''hairs''':
1. ((rarely) plural denoting different kinds of) A bunch of hairs on someone's head
2. (plural) A keratinous outgrowth that covers human body
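The lookup just described could be sketched roughly as follows, using Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the column name targetID and the function define are invented for the example, the "readings" table is left out because the lookup does not need it, and one "meanings" row (word 1 to concept 2) is added so that the data matches the walk-through, which finds two concepts for word 1.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT);
CREATE TABLE words     (id INTEGER PRIMARY KEY, languageID INTEGER,
                        writingID INTEGER, readingID INTEGER);
CREATE TABLE grammar   (wordID INTEGER, relation TEXT, targetID INTEGER);
CREATE TABLE concepts  (basicID INTEGER, relation TEXT, targetID INTEGER);
CREATE TABLE meanings  (wordID INTEGER, basicID INTEGER);
""")
db.executemany("INSERT INTO writings VALUES (?, ?)",
               [(1, "hair"), (2, "kosa"), (3, "dlaka"), (4, "malja"), (5, "hairs")])
db.executemany("INSERT INTO languages VALUES (?, ?)", [(1, "English"), (2, "Serbian")])
db.executemany("INSERT INTO basics VALUES (?, ?)", [
    (1, "A bunch of hairs on someone's head"),
    (2, "A keratinous outgrowth that covers human body"),
    (3, "A single hair on someone's head"),
    (4, "A single hair that is not on someone's head"),
])
db.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1), (2, 2, 2, 2), (3, 2, 3, 3), (4, 2, 4, 4),
    (5, 1, 5, None), (6, 2, None, 5),
])
db.executemany("INSERT INTO grammar VALUES (?, ?, ?)", [
    (1, "root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
    (5, "plural", 1), (6, "plural", 4),
])
db.executemany("INSERT INTO concepts VALUES (?, ?, ?)", [
    (1, "mass/root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
    (2, "includes", 3), (2, "includes", 4),
])
# Rows as printed in the message, plus (1, 2), which is an ASSUMPTION added
# so that word 1 has the two concepts the walk-through mentions.
db.executemany("INSERT INTO meanings VALUES (?, ?)",
               [(1, 1), (2, 1), (2, 2), (3, 3), (4, 4), (5, 5), (1, 2)])


def define(spelling, language):
    """Return (grammar relation, concept relation, meaning) tuples for a word."""
    # Step 1: writing + language -> word ID.
    row = db.execute(
        """SELECT w.id FROM words w
           JOIN writings wr ON wr.id = w.writingID
           JOIN languages l ON l.id = w.languageID
           WHERE wr.spelling = ? AND l.language = ?""",
        (spelling, language)).fetchone()
    if row is None:
        return []
    # Step 2: "grammar" says how this word relates to its root word.
    link = db.execute("SELECT relation, targetID FROM grammar WHERE wordID = ?",
                      (row[0],)).fetchone()
    relation, root_id = link if link else ("root", row[0])
    # Step 3: concepts attached to the root word, with each concept's own kind.
    rows = db.execute(
        """SELECT c.relation, b.basic FROM meanings m
           JOIN basics b ON b.id = m.basicID
           JOIN concepts c ON c.basicID = b.id AND c.targetID = b.id
           WHERE m.wordID = ? ORDER BY b.id""",
        (root_id,)).fetchall()
    return [(relation, kind, text) for kind, text in rows]


for number, (relation, kind, text) in enumerate(define("hairs", "English"), 1):
    print(f"{number}. ({relation}, {kind}) {text}")
```

Turning "plural" plus "mass/root" into a gloss such as "(rarely) plural denoting different kinds of" would be the job of a presentation layer, which this sketch does not attempt.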
Want to get the Serbian translation? You go back through the tables, starting from basic concepts 1 and 2. You see in table "concepts" that concept 2 includes concepts 3 and 4. You go to the table "meanings" and see that concepts 1, 2, 3 and 4 correspond to words 1, 2, 3 and 4; grammar is not important now, and in table "words" you see that only words 2, 3 and 4 are Serbian, and in "writings" that they are "kosa", "dlaka" and "malja". You may now go back and get their exact meanings and find the exact translation that you need - more than the usual dictionary has to offer.
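The translation step can also be sketched in a few lines of plain Python, with the table rows copied into dictionaries and lists. The function name translate and the data layout are assumptions for illustration only, and the "includes" relation is followed just one level deep, which is all this example needs.

```python
# Table rows copied from the example; the layout is an illustrative choice.
WRITINGS = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
LANGUAGES = {1: "English", 2: "Serbian"}
# word ID -> (languageID, writingID); word 6 has no known writing yet.
WORDS = {1: (1, 1), 2: (2, 2), 3: (2, 3), 4: (2, 4), 5: (1, 5), 6: (2, None)}
CONCEPTS = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
            (2, "includes", 3), (2, "includes", 4)]
MEANINGS = [(1, 1), (2, 1), (2, 2), (3, 3), (4, 4), (5, 5)]  # (wordID, basicID)


def translate(basic_ids, target_language):
    """Return the writings in target_language attached to the given concepts."""
    # Widen the search with concepts that the starting concepts include.
    wanted = set(basic_ids)
    wanted |= {dst for src, rel, dst in CONCEPTS
               if rel == "includes" and src in basic_ids}
    # Every word attached to any wanted concept is a candidate.
    word_ids = {word for word, basic in MEANINGS if basic in wanted}
    language_id = {name: i for i, name in LANGUAGES.items()}[target_language]
    result = []
    for word_id in sorted(word_ids):
        lang, writing = WORDS[word_id]
        if lang == language_id and writing is not None:
            result.append(WRITINGS[writing])
    return result


print(translate({1, 2}, "Serbian"))  # ['kosa', 'dlaka', 'malja']
```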
This database system allows for much more than the current free-form Wiktionary or the usual dictionaries. It would be easy to create software suited to specific needs that would browse the database on or off line. All grammatical forms of a word are noted, and it is easy to make a basic machine translation from one language to another. It is easy to look up a word in an unknown language when you don't know its root (this is often a problem, especially with electronic dictionaries). I have not shown it in this example, but table "concepts" would include .. which would enable searching for them and the creation of basic thesauri for all languages. It would also be possible to extract separate professional subdictionaries, etc.
Now, if you have bothered to read all this, you might as well spend just a bit more time to tell me what you think about it. I would be especially grateful if someone could find something that cannot fit into this kind of dictionary. I already see a possible flaw; that is, that concept 1 in some cultures is not a mass concept; but I am certain that this could be overcome.
Final note: I think that the dictionary should not be under the GFDL; rather, under a similar licence which would allow full copyright of a work derived from a subset of the database, but not of one derived from its superset. In other words, it would be possible for someone to take this dictionary, lay the words out on paper, print it, and sell it, and no one would have the right to photocopy or reprint it; but if that someone wants to add some words to the dictionary, he must add them to the database first, which would then enable anyone else to add them to their dictionaries.
Nikola Smolenski wrote:
Now, if you have bothered to read all this, you might as well spend just a bit more time to tell me what you think about it. I would be especially grateful if someone could find something that cannot fit into this kind of dictionary. I already see a possible flaw; that is, that concept 1 in some cultures is not a mass concept; but I am certain that this could be overcome.
Respectfully, may I call this scheme hair-brained, though I know that "hair" in that expression is a common error when "hare" should be used. At least it's naïve. People don't read instructions except as an absolute last resort. If they needed such complicated instructions to understand a difference in meaning focused on one concept in only 2 languages, they would put the explanation down and do something else. Serbian and Croatian are much more closely related, but I'm sure that the subtleties that make them different are difficult to explain, especially to an English speaker who doesn't know much about either one.
Among the expressions which use "hair" in English we have:
The gun has a hair trigger = The trigger mechanism is very sensitive to the slightest pressure
He's got him by the short hairs = He's got him in a difficult position that is equal to pulling on his pubic hairs
I had some of the hair of the dog that bit me = I had some of what I drank last night to help relieve the hangover.
How's that for examples to start with? :-)
The point is that language structure is extremely complicated, and in much of what matters there is rarely a simple one-to-one correspondence between languages. That's often why machine translations look so much like they came from a machine.
Final note: I think that the dictionary should not be under the GFDL; rather, under a similar licence which would allow full copyright of a work derived from a subset of the database, but not of one derived from its superset. In other words, it would be possible for someone to take this dictionary, lay the words out on paper, print it, and sell it, and no one would have the right to photocopy or reprint it; but if that someone wants to add some words to the dictionary, he must add them to the database first, which would then enable anyone else to add them to their dictionaries.
Copyrights on dictionaries are highly disputable. They are often a combination of material that may be both in and out of copyright. The 1913 Webster which is being used as a starting point for many of the English words is well out of copyright. Single definitions can always be considered fair use. Claiming copyrights as you suggest may be more complicated than it's worth.
Ec
On Thursday 12 February 2004 00:25, Ray Saintonge wrote:
Now, if you have bothered to read all this, you might as well spend just a bit more time to tell me what you think about it. I would be especially grateful if someone could find something that cannot fit into this kind of dictionary. I already see a possible flaw; that is, that concept 1 in some cultures is not a mass concept; but I am certain that this could be overcome.
Respectfully, may I call this scheme hair-brained, though I know that "hair" in that expression is a common error when "hare" should be used. At least it's naïve. People don't read instructions except as an absolute last resort. If they needed such complicated instructions to understand a difference in meaning focused on one concept in only 2 languages, they would put the explanation down and do something
Well, I hope that the final interface won't be as complicated as it could get. When you add a new word you would probably search to see if it exists in another language. If it does, you would just connect the new word with its already defined meaning. Any grammatical forms, such as plurals, declensions, etc., would be derived from that word, rather than inserted separately. If it doesn't exist, you would create a new word and a new basic concept and connect them (this connecting would be done by the software; all you would have to do is type the word on the left and the meaning on the right, select the word's type and language, and click on 'Submit'). The hardest case might occur when a word already exists, but with a slightly different meaning; then, you would probably "split" an existing meaning in two or join it with another.
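A minimal Python sketch of that "add a word" step might look like the following. The class and method names are hypothetical, and matching an existing meaning by exact text is a stand-in for the search a contributor would actually do through the interface; splitting or joining meanings is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Dictionary:
    basics: dict = field(default_factory=dict)    # basicID -> meaning text
    words: dict = field(default_factory=dict)     # wordID -> (language, spelling)
    meanings: set = field(default_factory=set)    # (wordID, basicID) pairs

    def find_basic(self, meaning):
        """Return the ID of an existing basic concept with this text, if any."""
        for basic_id, text in self.basics.items():
            if text == meaning:
                return basic_id
        return None

    def add_word(self, spelling, language, meaning):
        """Add a word and connect it to an existing or new basic concept."""
        word_id = len(self.words) + 1
        self.words[word_id] = (language, spelling)
        basic_id = self.find_basic(meaning)
        if basic_id is None:
            # No such meaning yet: create the basic concept first.
            basic_id = len(self.basics) + 1
            self.basics[basic_id] = meaning
        # The software, not the contributor, records the connection.
        self.meanings.add((word_id, basic_id))
        return word_id, basic_id


d = Dictionary()
print(d.add_word("egg", "English", "Something laid by a hen"))
print(d.add_word("jaje", "Serbian", "Something laid by a hen"))
```

In this sketch both words end up attached to the same basic concept, so the second contributor only supplies the word and picks the meaning.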
else. Serbian and Croatian are much more closely related, but I'm sure that the subtleties that make them different are difficult to explain, especially to an English speaker who doesn't know much about either one.
Not really, and I think that all subtleties could be handled with this notion of concepts vs. basic concepts.
Among the expressions which use "hair" in English we have:
The gun has a hair trigger = The trigger mechanism is very sensitive to the slightest pressure
He's got him by the short hairs = He's got him in a difficult position that is equal to pulling on his pubic hairs
I had some of the hair of the dog that bit me = I had some of what I drank last night to help relieve the hangover.
How's that for examples to start with? :-)
All right, but these are expressions. They would be added to the writings table and their meanings to the basics table, so it is possible.
The point is that language structure is extremely complicated, and in much of what matters there is rarely a simple one-to-one correspondence between languages. That's often why machine translations look so much like they came from a machine.
This scheme allows for correspondences that are not one-to-one: many-to-one, one-to-many, even half-to-half.
If it's going on the web, and contains semantics, why not put it on the Semantic Web?
See: http://www.w3.org/2004/01/sws-pressrelease
in particular - RDF Thesaurus: http://www.w3c.rl.ac.uk/SWAD/rdfthes.html (the titles are links)
Cheers, Danny.
----
wikitech-l@lists.wikimedia.org