Re: [Wikitech-l] wiktionary

11 Feb 2004


      On Wednesday 11 February 2004 13:14, brian suda wrote:
...
this is a message based on a conversation about revamping the wiktionary
service so that it contained more structured data. In its current form it
is a Wiki about words. This does not allow for easy connections to be made
between thesaurus words, translations, etc.
Hopefully, with some help, brainstorming, and experience we can hammer out
a better system to store, look up, and access dictionary words.
I've got loads of ideas about how to implement some of this, but others
know more about how to include rollbacks, versioning, and deletion of
material.
If people are interested I can post some more in-depth ideas about what (I
think) would need to be built and get some feedback about planning for
other languages, special cases, migration, and future additions.
Though I am not active in Wiktionary, I have been thinking about this for a
while. I think that MediaWiki software, great as it is, is not adequate for
creating a dictionary and that new software has to be written from scratch. I
call this kind of software WikiBase - a database in which (more or less) each
field acts like a wiki page: it can be changed at any time, has an edit
history, can be discussed etc.
In its simplest form, the database of a dictionary might have four tables:
words: ID|word
concepts: ID|concept
languages: ID|language
meanings: wordID|languageID|conceptID
They could be connected like this:
words:
1|egg
2|jaje
concepts:
1|Something laid by a hen
languages:
1|English
2|Serbian
meanings:
1|1|1
2|2|1
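The four tables and sample rows above can be tried out directly. What follows is only a minimal sketch, assuming SQLite via Python's standard sqlite3 module; the snake_case column names and the helper name describe are illustrative choices, not part of the proposal.

```python
import sqlite3

# In-memory database; table layout follows the four-table sketch above.
# Column names (word_id, language_id, concept_id) are assumed spellings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words     (id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE concepts  (id INTEGER PRIMARY KEY, concept TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE meanings  (word_id INTEGER, language_id INTEGER, concept_id INTEGER);
INSERT INTO words VALUES (1, 'egg'), (2, 'jaje');
INSERT INTO concepts VALUES (1, 'Something laid by a hen');
INSERT INTO languages VALUES (1, 'English'), (2, 'Serbian');
INSERT INTO meanings VALUES (1, 1, 1), (2, 2, 1);
""")


def describe(conn):
    """Return (word, language, concept) triples by joining all four tables."""
    return conn.execute("""
        SELECT w.word, l.language, c.concept
        FROM meanings m
        JOIN words w     ON w.id = m.word_id
        JOIN languages l ON l.id = m.language_id
        JOIN concepts c  ON c.id = m.concept_id
        ORDER BY m.word_id
    """).fetchall()


rows = describe(conn)
for word, language, concept in rows:
    print(f'"{word}" ({language}): {concept}')
```

Each row of "meanings" resolves to one word, one language and one concept, so two words sharing a concept ID are translations of each other.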
That is, the word "egg" in the language called "English" has the meaning of
"Something laid by a hen", and the word "jaje" in the language called
"Serbian" has the same meaning. Now, this has obvious flaws, but I have not
envisioned the database to be so simple. To cut to the point, I think that
the following structure would be enough to satisfy all the needs of a
dictionary (WARNING: long and sometimes confusing text ahead):
writings: ID|spelling
readings: ID|reading
languages: ID|language|dialect|place|time|group
basics: ID|basic
words: ID|languageID|writingID|readingID
grammar: wordID|relation|wordID
concepts: basicID|relation|basicID
meanings: wordID|basicID
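As a rough illustration, the eight tables listed above could be declared like this. It is a sketch under assumptions: SQLite through Python's sqlite3 module, snake_case column names, and the invented names related_word_id and related_basic_id for the second ID column of "grammar" and "concepts". The wiki-style edit history for each field is not modelled here.

```python
import sqlite3

# Hypothetical DDL for the eight tables. "group" and "time" are quoted
# because GROUP is an SQL keyword; the column types are guesses.
SCHEMA = """
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT NOT NULL);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT NOT NULL);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT NOT NULL,
                        dialect TEXT, place TEXT, "time" TEXT, "group" TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT NOT NULL);
CREATE TABLE words     (id INTEGER PRIMARY KEY,
                        language_id INTEGER NOT NULL REFERENCES languages(id),
                        writing_id  INTEGER REFERENCES writings(id),
                        reading_id  INTEGER REFERENCES readings(id));
CREATE TABLE grammar   (word_id INTEGER REFERENCES words(id),
                        relation TEXT NOT NULL,
                        related_word_id INTEGER REFERENCES words(id));
CREATE TABLE concepts  (basic_id INTEGER REFERENCES basics(id),
                        relation TEXT NOT NULL,
                        related_basic_id INTEGER REFERENCES basics(id));
CREATE TABLE meanings  (word_id INTEGER REFERENCES words(id),
                        basic_id INTEGER REFERENCES basics(id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created, in alphabetical order.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The writing and reading columns of "words" are left nullable on purpose, since the proposal allows a word whose spelling or reading is not yet known.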
Now, how would all this work? I will use as an example the English word
"hair", for which three Serbian words exist: "kosa" (hair on one's head),
"dlaka" (a hair) or "malja" (a hair on body).
Table "writings" contains the exact words as written on paper:
writings:
1|hair
2|kosa
3|dlaka
4|malja
5|hairs
Table "readings" contains readings of the words. I guess that it might be the 
easiest to use an internal format for this, which could be externally 
represented as IPA or SAMPA. Of course, for some languages, the readings 
could be autogenerated.
readings:
1|hejr
2|kosa
3|dlaka
4|mal<sup>j</sup>a
5|mal<sup>j</sup>e
(Note that here some IDs are the same; this of course need not be the case.)
Table "languages" contains data about languages. I was thinking big and 
allowed for defining various dialects, regions, the exact (or not
exact) time at which a word was in use, and slang (of a certain social 
group). Perhaps this table needs a bit more work, but the basic idea is 
there. In this example, I'll use only the language name and forget about the 
rest:
languages:
1|English
2|Serbian
The last of the basic tables, "basics", describes basic concepts.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
The rest of the tables show relations among the IDs of these tables. Table
"words" shows which writing has which reading in which language.
words:
1|1|1|1
2|2|2|2
3|2|3|3
4|2|4|4
5|1|5|NULL
6|2|NULL|5
I'll expand the table:
1|English|hair|hejr
2|Serbian|kosa|kosa
3|Serbian|dlaka|dlaka
4|Serbian|malja|mal<sup>j</sup>a
5|English|hairs|NULL
6|Serbian|NULL|mal<sup>j</sup>e
Note that how the English word "hairs" is read is currently not known, and
how a certain Serbian word is actually written is also currently not known.
It doesn't matter.
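The expansion of "words" shown above is what a join over the three basic tables produces. This sketch assumes SQLite via Python's sqlite3 module and snake_case column names; LEFT JOIN is used so that a word with an unknown writing or reading still appears, with NULL in the missing column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
CREATE TABLE readings  (id INTEGER PRIMARY KEY, reading TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE words     (id INTEGER PRIMARY KEY, language_id INTEGER,
                        writing_id INTEGER, reading_id INTEGER);
INSERT INTO writings VALUES (1,'hair'),(2,'kosa'),(3,'dlaka'),(4,'malja'),(5,'hairs');
INSERT INTO readings VALUES (1,'hejr'),(2,'kosa'),(3,'dlaka'),
                            (4,'mal<sup>j</sup>a'),(5,'mal<sup>j</sup>e');
INSERT INTO languages VALUES (1,'English'),(2,'Serbian');
INSERT INTO words VALUES (1,1,1,1),(2,2,2,2),(3,2,3,3),(4,2,4,4),
                         (5,1,5,NULL),(6,2,NULL,5);
""")

# LEFT JOIN keeps words whose writing or reading is still unknown (NULL).
expanded = conn.execute("""
    SELECT w.id, l.language, wr.spelling, r.reading
    FROM words w
    JOIN languages l      ON l.id  = w.language_id
    LEFT JOIN writings wr ON wr.id = w.writing_id
    LEFT JOIN readings r  ON r.id  = w.reading_id
    ORDER BY w.id
""").fetchall()

for row in expanded:
    # Print in the same ID|language|writing|reading layout as the table above.
    print("|".join("NULL" if value is None else str(value) for value in row))
```

An inner join on "writings" or "readings" would silently drop words 5 and 6, which is why the optional columns are joined with LEFT JOIN.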
Now, table "grammar" explains grammatical relations between the words:
1|root|1
2|root|2
3|root|3
4|root|4
5|plural|1
6|plural|4
Expanded:
hair|root|hair
kosa|root|kosa
dlaka|root|dlaka
malja|root|malja
hairs|plural|hair
malje|plural|malja
I will explain what this "root" property means later when I explain how to 
actually query the database.
Table "concepts" is similar, except that it explains relations of the 
basic concepts:
1|mass/root|1
2|root|2
3|root|3
4|root|4
2|includes|3
2|includes|4
I will not expand the table but rather show the table "basics" again.
basics:
1|A bunch of hairs on someone's head
2|A keratinous outgrowth that covers human body
3|A single hair on someone's head
4|A single hair that is not on someone's head
FINALLY, table "meanings" attaches words to concepts.
1|1
2|1
2|2
3|3
4|4
5|5
Now, how to read the dictionary. Suppose that you want to know what the word
"hairs" means in the English language. You go to the table "writings" and
find "hairs", which has an ID of 5. Then you go to the table "languages" and
find "English" with an ID of 1. You go to the table "words" and see that the
ID for this word is 5 (along the way you might pick up the reading of the
word). Now you go to the table "grammar" and see that word 5 is actually the
plural form of word 1. In the same table you examine word 1 and see that it
is a root word; that is, one attached to a concept. Having found that out,
search in "meanings" for the concepts attached to root word 1 and you will
see that there are two: concepts 1 and 2. In the table "concepts" you search
for them and find out that concept 2 is a root concept and concept 1 is also
a root concept, but a mass one; that is, of a thing that comes in an
undistinguished mass. Finally you have:
'''hairs''':
1. ((rarely) plural denoting different kinds of) A bunch of hairs on someone's 
head
2. (plural) A keratinous outgrowth that covers human body
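The lookup just described can be followed step by step in code. This is a sketch only, assuming SQLite via Python's sqlite3 module; the function name lookup and the column names are hypothetical. One loud assumption: the "meanings" rows used here are chosen to agree with the walkthrough (word 1 attached to concepts 1 and 2), which is not exactly the set of rows listed earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE basics    (id INTEGER PRIMARY KEY, basic TEXT);
CREATE TABLE words     (id INTEGER PRIMARY KEY, language_id INTEGER, writing_id INTEGER);
CREATE TABLE grammar   (word_id INTEGER, relation TEXT, related_word_id INTEGER);
CREATE TABLE concepts  (basic_id INTEGER, relation TEXT, related_basic_id INTEGER);
CREATE TABLE meanings  (word_id INTEGER, basic_id INTEGER);
INSERT INTO writings VALUES (1,'hair'),(2,'kosa'),(3,'dlaka'),(4,'malja'),(5,'hairs');
INSERT INTO languages VALUES (1,'English'),(2,'Serbian');
INSERT INTO basics VALUES
  (1,'A bunch of hairs on someone''s head'),
  (2,'A keratinous outgrowth that covers human body'),
  (3,'A single hair on someone''s head'),
  (4,'A single hair that is not on someone''s head');
INSERT INTO words VALUES (1,1,1),(2,2,2),(3,2,3),(4,2,4),(5,1,5),(6,2,NULL);
INSERT INTO grammar VALUES (1,'root',1),(2,'root',2),(3,'root',3),(4,'root',4),
                           (5,'plural',1),(6,'plural',4);
INSERT INTO concepts VALUES (1,'mass/root',1),(2,'root',2),(3,'root',3),(4,'root',4),
                            (2,'includes',3),(2,'includes',4);
-- ASSUMPTION: rows picked to match the walkthrough, not the listed table.
INSERT INTO meanings VALUES (1,1),(1,2),(2,1),(3,3),(4,4);
""")


def lookup(conn, spelling, language):
    """Writing -> word -> root word -> attached concepts, as in the text."""
    found = conn.execute("""
        SELECT w.id FROM words w
        JOIN writings wr ON wr.id = w.writing_id
        JOIN languages l ON l.id = w.language_id
        WHERE wr.spelling = ? AND l.language = ?
    """, (spelling, language)).fetchone()
    if found is None:
        return []
    # One hop through "grammar": the relation names the form, the target is the root.
    hop = conn.execute(
        "SELECT relation, related_word_id FROM grammar WHERE word_id = ?",
        (found[0],)).fetchone()
    form, root_id = hop if hop is not None else ("root", found[0])
    # The self-referencing row in "concepts" tells what kind of concept it is.
    senses = conn.execute("""
        SELECT c.relation, b.basic
        FROM meanings m
        JOIN basics b   ON b.id = m.basic_id
        JOIN concepts c ON c.basic_id = b.id AND c.related_basic_id = b.id
        WHERE m.word_id = ?
        ORDER BY b.id
    """, (root_id,)).fetchall()
    return [(form, kind, basic) for kind, basic in senses]


senses = lookup(conn, "hairs", "English")
for number, (form, kind, basic) in enumerate(senses, start=1):
    print(f"{number}. ({form}, {kind}) {basic}")
```

Turning the pair (plural, mass/root) into a gloss such as "(rarely) plural denoting different kinds of" would need a further rule table per language, which is left out of this sketch.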
Want to get the Serbian translation? You go back through the tables, starting
from basic concepts 1 and 2. You see in table "concepts" that concept 2
includes concepts 3 and 4. You go to the table "meanings" and see that
concepts 1, 2, 3 and 4 correspond to words 1, 2, 3 and 4; grammar is now not
important, and in table "words" you see that only words 2, 3 and 4 are in the
Serbian language, and in "writings" that they are "kosa", "dlaka" and
"malja". You may now go back and get their exact meanings and find the exact
translation that you need - more than a usual dictionary has to offer.
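The reverse walk, from concepts back to words of another language, can be sketched the same way. Assumptions as before: SQLite via Python's sqlite3 module, hypothetical column names and function name translate, and "meanings" rows chosen to agree with the walkthrough rather than copied from the listed table. Only one level of "includes" is followed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE writings  (id INTEGER PRIMARY KEY, spelling TEXT);
CREATE TABLE languages (id INTEGER PRIMARY KEY, language TEXT);
CREATE TABLE words     (id INTEGER PRIMARY KEY, language_id INTEGER, writing_id INTEGER);
CREATE TABLE concepts  (basic_id INTEGER, relation TEXT, related_basic_id INTEGER);
CREATE TABLE meanings  (word_id INTEGER, basic_id INTEGER);
INSERT INTO writings VALUES (1,'hair'),(2,'kosa'),(3,'dlaka'),(4,'malja'),(5,'hairs');
INSERT INTO languages VALUES (1,'English'),(2,'Serbian');
INSERT INTO words VALUES (1,1,1),(2,2,2),(3,2,3),(4,2,4),(5,1,5),(6,2,NULL);
INSERT INTO concepts VALUES (1,'mass/root',1),(2,'root',2),(3,'root',3),(4,'root',4),
                            (2,'includes',3),(2,'includes',4);
-- ASSUMPTION: rows picked to match the walkthrough, not the listed table.
INSERT INTO meanings VALUES (1,1),(1,2),(2,1),(3,3),(4,4);
""")


def translate(conn, basic_ids, target_language):
    """Return spellings in target_language for the concepts and what they include."""
    marks = ",".join("?" * len(basic_ids))
    # Step 1: widen the concept set with everything the starting concepts include.
    included = [row[0] for row in conn.execute(
        "SELECT related_basic_id FROM concepts "
        f"WHERE relation = 'includes' AND basic_id IN ({marks})",
        list(basic_ids))]
    all_ids = sorted(set(basic_ids) | set(included))
    marks = ",".join("?" * len(all_ids))
    # Step 2: concepts -> words -> keep the target language -> writings.
    rows = conn.execute(f"""
        SELECT DISTINCT w.id, wr.spelling
        FROM meanings m
        JOIN words w     ON w.id  = m.word_id
        JOIN languages l ON l.id  = w.language_id
        JOIN writings wr ON wr.id = w.writing_id
        WHERE m.basic_id IN ({marks}) AND l.language = ?
        ORDER BY w.id
    """, all_ids + [target_language]).fetchall()
    return [spelling for _, spelling in rows]


serbian = translate(conn, [1, 2], "Serbian")
print(serbian)
```

Only the number of placeholders is interpolated into the SQL text; the concept IDs and the language name are still passed as bound parameters.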
This database system allows for much more than the current free-form
Wiktionary or the usual dictionaries. It would be easy to create software
suited to specific needs that would browse the database on or off line. All
grammatical forms of a word are noted and it is easy to make basic machine
translation from one language to another. It is easy to look up a word in an
unknown language when you don't know its root (this is often a problem,
especially with electronic dictionaries). I have not shown it in this
example, but table "concepts" would include .. which would enable searching
for them and creation of basic thesauri for all languages. It would also be
possible to extract separate professional subdictionaries etc.
Now, if you have bothered to read all this, you might as well spend just a
bit more time to tell me what you think about it. I would be especially
grateful if someone could find something that cannot fit into this kind of
dictionary. I already see a possible flaw; that is, that concept 1 in some
cultures is not a mass concept; but I am certain that this could be overcome.
Final note: I think that the dictionary should not be under GFDL; rather,
under a similar licence which would allow full copyright of a work derived
from a subset of the database, but not of one derived from its superset; in
other words, it would be possible for someone to take this dictionary, lay
the words out on paper, print it, and sell it, and no one would have the
right to photocopy or reprint it; but if that someone wants to add some words
to the dictionary, he must add them to the database first, which would then
enable anyone else to add them to their dictionaries.
