On Wednesday 11 February 2004 13:14, brian suda wrote:
This is a message based on a conversation about revamping the Wiktionary service so that it contains more structured data. In its current form it is a wiki about words. This does not allow for easy connections to be made between thesaurus words, translations, etc.
Hopefully, with some help, brainstorming, and experience we can hammer out a better system to store, look-up, and access dictionary words.
I've got loads of ideas about how to implement some of this, but others know more about how to include rollbacks, versioning, and deletion of material.
If people are interested, I can post some more in-depth ideas about what (I think) would need to be built and get some feedback about planning for other languages, special cases, migration, and future additions.
Though I am not active in Wiktionary, I have been thinking about this for a while. I think that the MediaWiki software, great as it is, is not adequate for creating a dictionary and that new software has to be written from scratch. I call this kind of software WikiBase - a database in which (more or less) each field acts like a wiki page - it can be changed at any time, has an edit history, can be discussed, etc.
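To make the WikiBase idea a little more concrete, here is a minimal Python sketch of a single field that keeps its own edit history. The names `Revision` and `WikiField`, the editor names, and the choice to record a rollback as a new revision are all assumptions made for illustration, not an existing design:

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    value: str
    editor: str
    comment: str = ""


@dataclass
class WikiField:
    """One database field that keeps its full edit history, like a wiki page."""
    revisions: list = field(default_factory=list)

    def edit(self, value, editor, comment=""):
        # Every change appends a revision; nothing is overwritten.
        self.revisions.append(Revision(value, editor, comment))

    @property
    def current(self):
        # The visible value is simply the latest revision.
        return self.revisions[-1].value if self.revisions else None

    def rollback(self, editor):
        # A rollback is itself a new revision restoring the previous value.
        if len(self.revisions) < 2:
            raise ValueError("nothing to roll back to")
        self.edit(self.revisions[-2].value, editor, "rollback")


# Hypothetical editors and values, purely for demonstration.
definition = WikiField()
definition.edit("Something laid by a hen", "alice")
definition.edit("Something layed by a hen", "bob", "typo introduced")
definition.rollback("carol")
print(definition.current)
print(len(definition.revisions))
```

Keeping the rollback as an ordinary revision means the history itself is never rewritten, which is roughly how wiki pages behave.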
In its simplest form, the database of a dictionary might have four tables:
words: ID|word
concepts: ID|concept
languages: ID|language
meanings: wordID|languageID|conceptID
They could be connected like this:
words: 1|egg 2|jaje
concepts: 1|Something laid by a hen
languages: 1|English 2|Serbian
meanings: 1|1|1 2|2|1
That is, the word "egg" in the language called "English" has the meaning "Something laid by a hen", and the word "jaje" in the language called "Serbian" has the same meaning. Now, this has obvious flaws, but I never envisioned the database being this simple. To cut to the point, I think that the following structure would be enough to satisfy all the needs of a dictionary (WARNING: long and sometimes confusing text ahead):
writings: ID|spelling
readings: ID|reading
languages: ID|language|dialect|place|time|group
basics: ID|basic
words: ID|languageID|writingID|readingID
grammar: wordID|relation|wordID
concepts: basicID|relation|basicID
meanings: wordID|basicID
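As a rough sketch, the structure above could be written down as SQL tables, here created in an in-memory SQLite database from Python. The column types, the foreign keys, and the renamed columns (`period` for "time", `social_group` for "group", `target_word_id` and `target_basic_id` for the second ID in each relation) are assumptions of this sketch, chosen only to avoid SQL keywords and duplicate column names:

```python
import sqlite3

SCHEMA = """
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT NOT NULL);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT NOT NULL,
                        dialect TEXT, place TEXT, period TEXT,
                        social_group TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT NOT NULL);
-- writing_id and reading_id may be NULL: either can be unknown.
CREATE TABLE words     (id INTEGER PRIMARY KEY,
                        language_id INTEGER NOT NULL REFERENCES languages(id),
                        writing_id INTEGER REFERENCES writings(id),
                        reading_id INTEGER REFERENCES readings(id));
CREATE TABLE grammar   (word_id INTEGER NOT NULL REFERENCES words(id),
                        relation TEXT NOT NULL,
                        target_word_id INTEGER NOT NULL REFERENCES words(id));
CREATE TABLE concepts  (basic_id INTEGER NOT NULL REFERENCES basics(id),
                        relation TEXT NOT NULL,
                        target_basic_id INTEGER NOT NULL REFERENCES basics(id));
CREATE TABLE meanings  (word_id INTEGER NOT NULL REFERENCES words(id),
                        basic_id INTEGER NOT NULL REFERENCES basics(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, sorted by name.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```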
Now, how would all this work? I will use as an example the English word "hair", for which three Serbian words exist: "kosa" (hair on one's head), "dlaka" (a hair) or "malja" (a hair on the body).
Table "writings" contains the exact words as written on paper:
writings: 1|hair 2|kosa 3|dlaka 4|malja 5|hairs
Table "readings" contains the readings of the words. I guess that it might be easiest to use an internal format for this, which could be externally represented as IPA or SAMPA. Of course, for some languages, the readings could be autogenerated.
readings: 1|hejr 2|kosa 3|dlaka 4|mal<sup>j</sup>a 5|mal<sup>j</sup>e
(Note that here some IDs are the same; this of course need not be the case.)
Table "languages" contains data about languages. I was thinking big and allowed for defining various dialects, regions, the exact (or approximate) time at which a word was in use, and slang (of a certain social group). Perhaps this table needs a bit more work, but the basic idea is there. In this example, I'll use only the language name and forget about the rest:
languages: 1|English 2|Serbian
The last of the basic tables, "basics", describes basic concepts.
basics: 1|A bunch of hairs on someone's head 2|A keratinous outgrowth that covers human body 3|A single hair on someone's head 4|A single hair that is not on someone's head
The rest of the tables show relations among the IDs of these tables. Table "words" shows which writing has which reading in which language.
words: 1|1|1|1 2|2|2|2 3|2|3|3 4|2|4|4 5|1|5|NULL 6|2|NULL|5
I'll expand the table:
1|English|hair|hejr 2|Serbian|kosa|kosa 3|Serbian|dlaka|dlaka 4|Serbian|malja|mal<sup>j</sup>a 5|English|hairs|NULL 6|Serbian|NULL|mal<sup>j</sup>e
Note that how the English word "hairs" is read is currently not known, and how a certain Serbian word is actually written is also currently not known. It doesn't matter.
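The expansion above is just a join of "words" against the three basic tables. A minimal sketch with SQLite from Python, assuming simplified column names (`language_id`, `writing_id`, `reading_id`) and using LEFT JOINs so that words with an unknown writing or reading are still listed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE words     (id INTEGER PRIMARY KEY, language_id INTEGER,
                        writing_id INTEGER, reading_id INTEGER);
""")
conn.executemany("INSERT INTO writings VALUES (?, ?)", [
    (1, "hair"), (2, "kosa"), (3, "dlaka"), (4, "malja"), (5, "hairs")])
conn.executemany("INSERT INTO readings VALUES (?, ?)", [
    (1, "hejr"), (2, "kosa"), (3, "dlaka"),
    (4, "mal<sup>j</sup>a"), (5, "mal<sup>j</sup>e")])
conn.executemany("INSERT INTO languages VALUES (?, ?)", [
    (1, "English"), (2, "Serbian")])
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 1), (2, 2, 2, 2), (3, 2, 3, 3), (4, 2, 4, 4),
    (5, 1, 5, None),   # "hairs": reading not known yet
    (6, 2, None, 5)])  # Serbian plural: writing not known yet

rows = conn.execute("""
    SELECT w.id, l.language, wr.spelling, r.reading
    FROM words AS w
    JOIN languages AS l ON l.id = w.language_id
    LEFT JOIN writings AS wr ON wr.id = w.writing_id
    LEFT JOIN readings AS r ON r.id = w.reading_id
    ORDER BY w.id
""").fetchall()

# Render each row in the ID|language|writing|reading form used above.
expanded = ["|".join("NULL" if v is None else str(v) for v in row)
            for row in rows]
for line in expanded:
    print(line)
```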
Now, table "grammar" explains grammatical relations between the words:
1|root|1 2|root|2 3|root|3 4|root|4 5|plural|1 6|plural|4
Expanded:
hair|root|hair kosa|root|kosa dlaka|root|dlaka malja|root|malja hairs|plural|hair malje|plural|malja
I will explain what this "root" property means later when I explain how to actually query the database.
Table "concepts" is similar, except that it explains relations of the basic concepts:
1|mass/root|1 2|root|2 3|root|3 4|root|4 2|includes|3 2|includes|4
I will not expand the table but rather show the table "basics" again.
basics: 1|A bunch of hairs on someone's head 2|A keratinous outgrowth that covers human body 3|A single hair on someone's head 4|A single hair that is not on someone's head
FINALLY, table "meanings" attaches words to concepts.
1|1 1|2 2|1 3|3 4|4
Now, how to read the dictionary. Suppose that you want to know what the word "hairs" means in the English language. You go to the table "writings" and find "hairs", which has an ID of 5. Then you go to the table "languages" and find "English" with an ID of 1. You go to the table "words" and see that the ID for this word is 5 (along the way you might pick up the reading of the word). Now you go to the table "grammar" and see that word 5 is actually the plural form of word 1. In the same table you examine word 1 and see that it is a root word; that is, one attached to a concept. Once you have found that out, you search in "meanings" for the concepts attached to root word 1 and see that there are two: concepts 1 and 2. In the table "concepts" you look them up and find that concept 2 is a root concept and concept 1 is also a root concept, but a mass one; that is, a thing that comes in an undistinguished mass. Finally you have:
'''hairs''': 1. ((rarely) plural denoting different kinds of) A bunch of hairs on someone's head 2. (plural) A keratinous outgrowth that covers human body
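The lookup just described can be sketched in a few lines of Python over plain dictionaries and lists. Two things here are assumptions: the `meanings` rows follow the walk-through (word 1 attached to concepts 1 and 2, "kosa" to concept 1, "dlaka" to 3, "malja" to 4), and every inflected form is taken to point directly at its root word. The helper names `key_of` and `look_up` are made up for this sketch:

```python
writings = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
languages = {1: "English", 2: "Serbian"}
basics = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# word ID -> (language ID, writing ID); readings are left out here.
words = {1: (1, 1), 2: (2, 2), 3: (2, 3), 4: (2, 4), 5: (1, 5), 6: (2, None)}
grammar = [(1, "root", 1), (2, "root", 2), (3, "root", 3), (4, "root", 4),
           (5, "plural", 1), (6, "plural", 4)]
concepts = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# Assumed rows (word ID, basic ID), reconstructed from the walk-through.
meanings = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def key_of(table, value):
    """Return the ID under which a value is stored in a dict table."""
    return next(k for k, v in table.items() if v == value)


def look_up(spelling, language):
    """Return (grammatical form, concept kind, definition) for a word."""
    word_id = key_of(words, (key_of(languages, language),
                             key_of(writings, spelling)))
    # grammar tells us the form of the word and which root word it hangs off.
    form, root_id = next((rel, tgt) for w, rel, tgt in grammar
                         if w == word_id)
    senses = []
    for w, basic_id in meanings:
        if w != root_id:
            continue
        # The self-referencing row in concepts says what kind of root it is.
        kind = next(rel for b, rel, tgt in concepts
                    if b == basic_id and tgt == basic_id)
        senses.append((form, kind, basics[basic_id]))
    return senses


result = look_up("hairs", "English")
for sense in result:
    print(sense)
```

Turning the `(form, kind)` pair into a label such as "(rarely) plural denoting different kinds of" would be a separate presentation step on top of this.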
Want to get the Serbian translation? You go back through the tables, starting from basic concepts 1 and 2. You see in the table "concepts" that concept 2 includes concepts 3 and 4. You go to the table "meanings" and see that concepts 1, 2, 3 and 4 correspond to words 1, 2, 3 and 4; grammar is not important now, and in the table "words" you see that only words 2, 3 and 4 are Serbian, and in "writings" that they are "kosa", "dlaka" and "malja". You may now go back, get their exact meanings and find the exact translation that you need - more than a usual dictionary has to offer.
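Going in the other direction is similar. Below is a minimal Python sketch, again assuming `meanings` rows reconstructed from the walk-through rather than a definitive data set, and following "includes" relations only one level deep; `translate` is a made-up helper name:

```python
writings = {1: "hair", 2: "kosa", 3: "dlaka", 4: "malja", 5: "hairs"}
languages = {1: "English", 2: "Serbian"}
basics = {
    1: "A bunch of hairs on someone's head",
    2: "A keratinous outgrowth that covers human body",
    3: "A single hair on someone's head",
    4: "A single hair that is not on someone's head",
}
# word ID -> (language ID, writing ID)
words = {1: (1, 1), 2: (2, 2), 3: (2, 3), 4: (2, 4), 5: (1, 5), 6: (2, None)}
concepts = [(1, "mass/root", 1), (2, "root", 2), (3, "root", 3),
            (4, "root", 4), (2, "includes", 3), (2, "includes", 4)]
# Assumed rows (word ID, basic ID), reconstructed from the walk-through.
meanings = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 4)]


def translate(basic_ids, language):
    """Return (spelling, definition) pairs in the target language."""
    # Widen the search with every concept that a starting concept includes.
    wanted = set(basic_ids)
    for basic_id, relation, target in concepts:
        if relation == "includes" and basic_id in basic_ids:
            wanted.add(target)
    language_id = next(k for k, v in languages.items() if v == language)
    found = []
    for word_id, basic_id in meanings:
        word_language, writing_id = words[word_id]
        if (basic_id in wanted and word_language == language_id
                and writing_id is not None):
            found.append((writings[writing_id], basics[basic_id]))
    return found


# Start from the two concepts found for English "hair".
translations = translate([1, 2], "Serbian")
for spelling, definition in translations:
    print(spelling, "-", definition)
```

Each Serbian word comes back paired with the exact concept it covers, which is what lets the reader pick the right translation.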
This database system allows for much more than the current free-form Wiktionary or the usual dictionaries. It would be easy to create software suited to specific needs that would browse the database on or off line. All grammatical forms of a word are recorded, and it is easy to make basic machine translation from one language to another. It is easy to look up a word in an unknown language when you don't know its root (this is often a problem, especially with electronic dictionaries). I have not shown it in this example, but table "concepts" would include .. which would enable searching for them and the creation of basic thesauri for all languages. It would also be possible to extract separate professional subdictionaries etc. etc.
Now, if you have bothered to read all this, you might as well spend just a bit more time to tell me what you think about it. I would be especially grateful if someone could find something that cannot fit into this kind of dictionary. I already see a possible flaw; that is, that concept 1 is not a mass concept in some cultures; but I am certain that this could be overcome.
Final note: I think that the dictionary should not be under the GFDL; rather, under a similar licence which would allow full copyright of a work derived from a subset of the database, but not of one derived from its superset. In other words, it would be possible for someone to take this dictionary, lay the words out on paper, print it, and sell it, and no one would have the right to photocopy or reprint it; but if that someone wants to add some words to the dictionary, he must add them to the database first, which would then enable anyone else to add them to their dictionaries.