[Wiktionary-l] Some methods for generation of dictionaries

Milos Rancic millosh at gmail.com
Sat Mar 10 05:29:08 UTC 2007


I am forwarding to you the first (not complete) version of the page
http://meta.wikimedia.org/wiki/User:Millosh/Dictionaries .

At the end of this month I'll have some software for generating such
dictionaries. So it would be good to hear what you think about it and
whether anyone is interested in joining this project. Maybe Gerard
could think about how to implement something like this in OmegaWiki,
too :)

The page is not complete (stages 2 and 3 are not described yet), but I
think that you can follow my idea anyway. I'll complete the page in
the next few weeks and I'll let you know when it is done.

* * *

At the moment I am working on a Serbian dictionary of synonyms.
During that work I got some ideas about working on Wiktionaries:

Let's say that one word with its synonyms/translations is enough for
one entry in Wiktionary. (Maybe I should read some Wiktionary
documentation, but I suppose that this is the minimum.)

In short, this may be done for dozens of languages on dozens of
Wiktionaries.

==Stage 1, one language dictionary==
*Take a dictionary between English (or any other language) and your
language. Of course, take it in a machine-readable format (not
encrypted).
*Take the first word in (let's say) English.
*Take its first translation in your language. Connect this word in
your language with the other translations of the English word.
*Find which other English words have the same translation. Connect the
word with the other translations of those words, too.
*You will get a list of connected words. There will be a lot of
mess, but you will be able to devise some simple methods for cleaning
up most of it. The rest will be cleaned up by humans, because this is
a wiki :)
*Of course, you may do this with many different dictionaries...
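
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the planned software: the `bilingual` mapping, its `E…`/`S…` identifiers and the `connect_words` helper are hypothetical names, and a real run would first have to parse a machine-readable dictionary into this shape.

```python
from collections import defaultdict

# Hypothetical sample data: headword in English (E...) -> translations in
# "your" language (S...). The identifiers are placeholders, not real words.
bilingual = {
    "E1": ["S1", "S2"],
    "E2": ["S1", "S3"],
    "E3": ["S4"],
}


def connect_words(bilingual):
    """Connect each target-language word with every other word that
    appears next to it as a translation of the same headword."""
    connected = defaultdict(set)
    for translations in bilingual.values():
        for word in translations:
            connected[word].update(t for t in translations if t != word)
    return connected


connected = connect_words(bilingual)
print(sorted(connected["S1"]))  # ['S2', 'S3']
```

S1 is reached through both E1 and E2, so it collects the other translations of both headwords. S4 has no co-translations and ends up with an empty set.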

Imagine that we analyzed two words from language A in the dictionary
"language B -> language A" and that we got the following results (of
course, this is a simplified table):

<pre>
A58 - B65 - A58, A43, A21, A63
    - B69 - A58, A28, A21, A38
    - B71 - A58, A43, A21, A88
    - B89 - A58, A43, A21, A63

A21 - B31 - A21, A43, A76, A20
    - B44 - A21, A43, A39, A22
    - B65 - A58, A43, A21, A63
    - B69 - A58, A28, A21, A38
    - B71 - A58, A43, A21, A88
    - B89 - A58, A43, A21, A63
</pre>

We may say that each time a word from language A appears as a
translation of the same language B word as A58 does, this connection
gets one point. So, for the words A58 and A21 we get the following
scores:

<pre>
A58(A21) = 4
A58(A43) = 3
A58(A63) = 2
A58(A28) = 1
A58(A38) = 1
A58(A88) = 1

A21(A43) = 5
A21(A58) = 4
A21(A63) = 2
A21(A28) = 1
A21(A38) = 1
A21(A88) = 1
A21(A76) = 1
A21(A39) = 1
A21(A20) = 1
A21(A22) = 1
</pre>
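
The scores above can be reproduced with a short counting sketch. The `b_to_a` mapping is just the simplified table typed in by hand, and the function name `score` is an assumption made for illustration.

```python
from collections import Counter

# The simplified table: each language B word with its language A translations.
b_to_a = {
    "B31": ["A21", "A43", "A76", "A20"],
    "B44": ["A21", "A43", "A39", "A22"],
    "B65": ["A58", "A43", "A21", "A63"],
    "B69": ["A58", "A28", "A21", "A38"],
    "B71": ["A58", "A43", "A21", "A88"],
    "B89": ["A58", "A43", "A21", "A63"],
}


def score(word, b_to_a):
    """Give one point to every language A word for each language B word
    that it shares with `word` as a translation."""
    counts = Counter()
    for translations in b_to_a.values():
        if word in translations:
            counts.update(t for t in translations if t != word)
    return counts


print(score("A58", b_to_a).most_common(3))
# [('A21', 4), ('A43', 3), ('A63', 2)]
```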

To begin with, this may mean:
*The closest synonym of the word A58 is the word A21.
*The closest synonym of the word A21 is the word A43.
*Words A21, A58, A43 and A63 are synonyms (a group which we may call
"G(As)1").
*It seems that words A28, A38, A88, A76, A39, A20 and A22 are not
related to the group G(As)1. We will keep these connections in memory,
but we will not write them into the dictionary. Imagine that the word
''blood'' literally means "red bird" in some language. Of course,
there are some ''red birds'' in the area where that language is
spoken. So, in this sense, blood will be connected with the word
"bird" and, almost certainly, with some species of bird. However, this
will be its only connection to birds. Its other connections will be
inside the descriptions of erythrocyte, lymphocyte, heart and so on.
Of course, mistakes are possible, but we may analyze the results :)
*This may be very useful for smaller languages which have some
bilingual dictionaries (where language B is English). We may be able
to generate monolingual Wiktionaries for all such languages.
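
One way to separate the group from the weak connections is a simple score threshold. The cut-off of 2 points is only an assumption that happens to reproduce G(As)1 for this example; the page does not fix any particular rule. The names `scores`, `synonym_group` and `min_score` are hypothetical.

```python
# Scores copied from the lists above.
scores = {
    "A58": {"A21": 4, "A43": 3, "A63": 2, "A28": 1, "A38": 1, "A88": 1},
    "A21": {"A43": 5, "A58": 4, "A63": 2, "A28": 1, "A38": 1, "A88": 1,
            "A76": 1, "A39": 1, "A20": 1, "A22": 1},
}


def synonym_group(seeds, scores, min_score=2):
    """Return the seed words plus every word scoring at least
    `min_score` against any seed (assumed threshold, not from the page)."""
    group = set(seeds)
    for seed in seeds:
        group.update(w for w, s in scores[seed].items() if s >= min_score)
    return group


group = synonym_group(["A58", "A21"], scores)
# Weak connections stay in memory but are not written to the dictionary.
kept_in_memory = {w for s in scores.values() for w in s} - group

print(sorted(group))           # ['A21', 'A43', 'A58', 'A63']
print(sorted(kept_in_memory))  # ['A20', 'A22', 'A28', 'A38', 'A39', 'A76', 'A88']
```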

==Stage 2, two languages dictionary==
(To be continued.)

==Stage 3, cross language dictionaries==
(To be continued.)


