My diploma thesis about a system to automatically build a multilingual thesaurus from Wikipedia, "WikiWord", is finally done. I handed it in yesterday. My research will hopefully help to make Wikipedia more accessible for automatic processing, especially for applications in natural language processing, machine translation, and information retrieval. What this could mean for Wikipedia: better search and conceptual navigation, tools for suggesting categories, and more.
Here's the thesis (in German, I'm afraid): http://brightbyte.de/DA/WikiWord.pdf
Daniel Kinzler, "Automatischer Aufbau eines multilingualen Thesaurus durch Extraktion semantischer und lexikalischer Relationen aus der Wikipedia", Diplomarbeit an der Abteilung für Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, 2008.
For the curious, http://brightbyte.de/DA/ also contains source code and data. See http://brightbyte.de/page/WikiWord for more information.
Some more data is available for now at http://aspra27.informatik.uni-leipzig.de/~dkinzler/rdfdumps/. This includes full SKOS dumps for en, de, fr, nl, and no, covering about six million concepts.
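If you want to poke at those dumps, they should be readable with any standard RDF toolkit. Here is a minimal sketch using Python's rdflib -- the file name is just an example, and I have not verified this exact snippet against the dumps:

# Minimal sketch: list some concepts and their preferred labels from a SKOS dump.
# "wikiword-en.rdf" is a made-up file name; use whichever dump you downloaded.
from rdflib import Graph, RDF
from rdflib.namespace import SKOS

g = Graph()
g.parse("wikiword-en.rdf")  # format is guessed from the file extension

for i, concept in enumerate(g.subjects(RDF.type, SKOS.Concept)):
    labels = [str(label) for label in g.objects(concept, SKOS.prefLabel)]
    print(concept, labels)
    if i >= 9:  # only show the first ten concepts
        break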
The thesis ended up being rather large... 220 pages of thesis and 30k lines of code. I'm planning to write a research paper in English soon, which will give an overview of WikiWord and what it can be used for.
The thesis is licensed under the GFDL; WikiWord is GPL software. All data taken or derived from Wikipedia is GFDL.
Enjoy, Daniel
Congratulations! Looks nice (quickly scanning over the pages).
Magnus
Magnus Manske wrote:
Congratulations! Looks nice (quickly scanning over the pages).
thanks!
Oh, I should have mentioned: to get a good impression without reading 200 pages, read pages 26-31. They contain a good overview.
cheers Daniel
BTW Daniel,
Would you be interested in submitting a position paper for the BabelWiki 08 workshop which I am co-organising:
Babelwiki.notlong.com
It will be at the same time as WikiSym 2008, in Porto, Portugal, Sept 8-10. We also had full papers, but the deadline for those has now passed.
I hope you can make it. Your research is smack in the middle of the workshop theme.
Best regards,
Alain Désilets
Desilets, Alain wrote:
BTW Daniel,
Would you be interested in submitting a position paper for the BabelWiki 08 workshop which I am co-organising:
Babelwiki.notlong.com
It will be at the same time as WikiSym 2008, in Porto, Portugal, Sept 8-10. We also had full papers, but the deadline for those has now passed.
That sounds very interesting. I saw the WikiSym CFP and was sorely tempted, but had to let it pass in order to get my thesis done in time.
I hope you can make it. Your research is smack in the middle of the workshop theme.
I'll try, but it's a question of money, family, and exams. I'm in a bit of a tight spot right now. I'd sure like to come :)
-- Daniel
PS: this must be the fastest mail exchange I have had in a while. Do you use IRC? I'm on freenode.net as Duesentrieb frequently (and right now).
Desilets, Alain wrote:
PS: this must be the fastest mail exchange I have had in a while. Do you use IRC? I'm on freenode.net as Duesentrieb frequently (and right now).
I use Skype myself. I am alain_desilets there.
Hm, I dislike Skype... it doesn't play too well with (my) Linux, and they have shifty policies. Anyway, mail will do for now :)
-- Daniel
For the curious, http://brightbyte.de/DA/ also contains source code and data. See http://brightbyte.de/page/WikiWord for more information.
Is there documentation about how to use this code?
Alain
Really interesting! Can you post an HTML or text-only version so I could read it using Google Translate?
At WikiMania 07, I presented a paper that looked at how useful wiki resources like Wikipedia, Wiktionary, and OmegaWiki might be for the needs of translators.
http://wikimania2007.wikimedia.org/wiki/Proceedings:AD1
One of the things we found was that in isolation, each of those resources at best covered ~30% of the translation difficulties typically encountered by professional translators for the English-French pair. But combined, they were able to cover ~50%. We also found that the presentation of information on Wikipedia and Wiktionary was not suited for the needs of translators.
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Is there a web interface to this multilingual resource that I could try?
Alain Désilets
Desilets, Alain wrote:
Really interesting! Can you post an HTML or text-only version so I could read it using Google Translate?
I don't know a good way offhand to get that from TeX, but I'll see what I can do.
At WikiMania 07, I presented a paper that looked at how useful wiki resources like Wikipedia, Wiktionary, and OmegaWiki might be for the needs of translators.
Oh, interesting. Too bad I didn't find your publications before, I could have nicely cited them :)
One of the things we found was that in isolation, each of those resources at best covered ~30% of the translation difficulties typically encountered by professional translators for the English-French pair. But combined, they were able to cover ~50%. We also found that the presentation of information on Wikipedia and Wiktionary was not suited for the needs of translators.
I have not looked at Wiktionary. Thinking about it, I'm afraid I failed to mention my reasons for that in the thesis (losing some points there, I guess). Some of the methods I used are surely applicable there too, though the pages are a bit more difficult to parse, and we quickly get into "structured record" territory, which I generally avoided (Auer and Lehmann did some interesting research there with their DBpedia project).
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Well... I hope I did :) The aim was to automatically generate a multilingual thesaurus, which is surely a good tool for translators. However, the quality of the results is not what a translator would expect from a traditional lexicon or thesaurus. The data is probably most directly useful in the context of information retrieval and language processing, that is, as a basis for computers to link text to conceptual world knowledge ("common sense"). I hope, however, that the data I generate could be useful to translators anyway, as an addition or extension to traditional, manually maintained dictionaries and thesauri. One point that would very much interest me is to try to use my work as a basis for building a bridge between Wikipedia and OmegaWiki.
Is there a web interface to this multilingual resource that I could try?
Sadly, no. I was planning one and started to implement it, but there was no time to get it up and running. Maybe there will be such an interface in the future. But I really see the place of my code more as a backend -- a good interface to that data may in fact be OmegaWiki, if we find a way to integrate it nicely. But I do feel the urge to provide a simple query interface to my data for testing, so maybe it'll still happen :)
Regards, Daniel
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Well... I hope I did :) The aim was to automatically generate a multilingual thesaurus, which is surely a good tool for translators. However, the quality of the results is not what a translator would expect from a traditional lexicon or thesaurus. The data is probably most directly useful in the context of information retrieval and language processing, that is, as a basis for computers to link text to conceptual world knowledge ("common sense"). I hope, however, that the data I generate could be useful to translators anyway, as an addition or extension to traditional, manually maintained dictionaries and thesauri. One point that would very much interest me is to try to use my work as a basis for building a bridge between Wikipedia and OmegaWiki.
One of my goals in the next year or two is to participate in the creation of large, open, wiki-like terminology databases for translators. We call this concept a WikiTerm.
Having observed and interviewed a dozen translators doing their work, I can say without hesitation that they don't worry too much about upstream quality control in terminology databases. Most of the quality control is done downstream, by the translator himself. Translators naturally develop a sort of sixth sense that allows them to very rapidly sift through a list of term equivalents and decide which one (if any) is most appropriate for their current need.
One of the conclusions we came to in our paper was that, while OmegaWiki was currently the wiki resource whose coverage of typical translation difficulties was lowest, its user interface was closest to what translators need. And we suggested that a good way to get to a WikiTerm would be to do exactly what you propose, i.e. write a robot that can extract cross-lingual information from Wikipedia and Wiktionary, and pour that into OmegaWiki.
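Just to make the "robot" idea a bit more concrete, here is a rough Python sketch of the cross-lingual lookup step, using the MediaWiki API's langlinks query. This is only an illustration of the kind of extraction I have in mind, not working robot code; a real robot would have to handle paging, redirects, Wiktionary's different page structure, and so on.

# Illustration only: fetch the interlanguage links for one Wikipedia article.
import requests

def langlinks(title, wiki="en.wikipedia.org"):
    """Return a {language code: article title} mapping for one article."""
    resp = requests.get(
        "https://" + wiki + "/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": title,
            "lllimit": "max",
            "format": "json",
        },
        timeout=30,
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    # each langlink entry has a "lang" code and the linked title under "*"
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

print(langlinks("Thesaurus"))  # maps language codes to article titles in those languages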
So, let's talk! Do you have contacts at OmegaWiki? If not, I can put you in touch with them.
Is there a web interface to this multilingual resource that I could try?
Sadly, no. I was planning one and started to implement it, but there was no time to get it up and running. Maybe there will be such an interface in the future. But I really see the place of my code more as a backend -- a good interface to that data may in fact be OmegaWiki, if we find a way to integrate it nicely. But I do feel the urge to provide a simple query interface to my data for testing, so maybe it'll still happen :)
Are you planning to do more work on this, or are you moving on to other things?
Have a good weekend.
Alain Désilets
Desilets, Alain wrote:
One of my goals in the next year or two is to participate in the creation of large, open, wiki-like terminology databases for translators. We call this concept a WikiTerm.
That sounds quite interesting.
...
So, let's talk! Do you have contacts at OmegaWiki? If not, I can put you in touch with them.
Yes, I have talked with Gerard Meijssen about this several times, and he seemed quite excited :) I have also talked to Barend Mons of Knewco about this; we seem to have pretty similar ideas. So, yes, let's talk :) Ideally, you, me, and them. Too bad I'm probably not going to make it to WikiSym or Wikimania this year, that would be an ideal opportunity.
Is it unit tested?
If so, then I forgive you ;-) .
Yes. Not every single method, but all the important bits are unit tested.
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Regards, Daniel
Is it unit tested?
If so, then I forgive you ;-) .
Yes. Not every single method, but all the important bits are unit tested.
Then the code doesn't need to be clean. People can refactor it to cleanliness without fearing that they are destroying it.
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Which pages of the thesis?
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Might be hard to earn a living working on this, but who knows.
Alain
Desilets, Alain wrote:
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Which pages of the thesis?
Well, it depends on what you mean by "use the code". The command line interface is described on pages 142ff (numbers as shown on the pages; it's actually 152/220, I think). Some of the core classes and methods are described all over part III of the thesis.
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Might be hard to earn a living working on this but who knows.
It's worth a try :)
-- Daniel
Nice job! Please add it to WP:ACST so it can be known to a wider audience.
Han-Teng Liao (OII) wrote:
Dear Mr. Kinzler, could you give me an indication of whether your code is ready for other languages as well? I am asking particularly about the Unicode processing, because I am really interested in trying it out in an East Asian context (e.g. Chinese, Japanese, and Korean).
The code should be fully Unicode-capable, at least as far as the encoding is concerned. The methods and algorithms I used are designed to be mostly language-independent, but some of them will probably have to be adapted for CJK languages. Especially the code for word- and sentence-splitting, as well as for measuring lexicographic similarity/distance, would have to be looked at closely. However, providing a suitable implementation for different languages or scripts should be possible without problems, due to the modular design I used for the text processing classes.
Applying my code to CJK languages would be a great challenge to my design, and I would be very interested to see how it works out. I did not test it, simply because I know next to nothing about those languages. I would be happy to assist you in trying to adapt it to CJK languages and scripts.
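To give you a rough idea of what I mean by the modular text-processing design, here is a small Python sketch. It is purely illustrative -- these are not the actual WikiWord classes, the names are made up, and the CJK part is deliberately naive -- but it shows where a language-specific module would plug in:

# Illustrative sketch of pluggable, per-language text analysis (not WikiWord code).
import re

class PlainTextAnalyzer:
    """Default analyzer: regex tokenization and edit-distance-based similarity."""

    def tokenize(self, text):
        return re.findall(r"\w+", text, re.UNICODE)

    def similarity(self, a, b):
        # normalized Levenshtein similarity in [0, 1]
        m, n = len(a), len(b)
        if m == 0 and n == 0:
            return 1.0
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution
            prev = cur
        return 1.0 - prev[n] / max(m, n)

class NaiveCJKAnalyzer(PlainTextAnalyzer):
    """A CJK analyzer would override tokenization (no whitespace word boundaries)
    and probably the similarity measure too; this placeholder just splits the
    text into single characters."""

    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]

The point is simply that the rest of the pipeline only talks to the analyzer interface, so swapping in a CJK-specific implementation would not touch the extraction code.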
Regards, Daniel
PS: I have to apologize in advance to anyone trying to understand the code. I tried to keep the design clean, but the code is not always pretty, and worst of all, there are close to no comments. The thesis explains the most important bits, but if you don't read German, that does you little good, I'm afraid. I hope I will be able to improve on this over time.
Is it unit tested?
If so, then I forgive you ;-).
Alain Désilets