My diploma thesis about a system to automatically build a multilingual thesaurus from Wikipedia, "WikiWord", is finally done. I handed it in yesterday. My research will hopefully help to make Wikipedia more accessible for automatic processing, especially for applications in natural language processing, machine translation, and information retrieval. What this could mean for Wikipedia: better search and conceptual navigation, tools for suggesting categories, and more.
Here's the thesis (in German, I'm afraid): http://brightbyte.de/DA/WikiWord.pdf
Daniel Kinzler, "Automatischer Aufbau eines multilingualen Thesaurus durch Extraktion semantischer und lexikalischer Relationen aus der Wikipedia", Diplomarbeit an der Abteilung für Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig, 2008.
For the curious, http://brightbyte.de/DA/ also contains source code and data. See http://brightbyte.de/page/WikiWord for more information.
Some more data is available for now at http://aspra27.informatik.uni-leipzig.de/~dkinzler/rdfdumps/. This includes full SKOS dumps for en, de, fr, nl, and no, covering about six million concepts.
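If you want to poke at those dumps, they should be readable with any standard RDF toolkit. Here is a minimal sketch using Python's rdflib -- the file name is just an example, and I have not verified this exact snippet against the dumps:

# Minimal sketch: list some concepts and their preferred labels from a SKOS dump.
# "wikiword-en.rdf" is a made-up file name; use whichever dump you downloaded.
from rdflib import Graph, RDF
from rdflib.namespace import SKOS

g = Graph()
g.parse("wikiword-en.rdf")  # format is guessed from the file extension

for i, concept in enumerate(g.subjects(RDF.type, SKOS.Concept)):
    labels = [str(label) for label in g.objects(concept, SKOS.prefLabel)]
    print(concept, labels)
    if i >= 9:  # only show the first ten concepts
        break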
The thesis ended up being rather large... 220 pages of thesis and 30k lines of code. I'm planning to write a research paper in English soon, which will give an overview of WikiWord and what it can be used for.
The thesis is licensed under the GFDL; WikiWord is GPL software. All data taken or derived from Wikipedia is GFDL.
Enjoy, Daniel
Congratulations! Looks nice (quickly scanning over the pages).
Magnus
Magnus Manske wrote:
Congratulations! Looks nice (quickly scanning over the pages).
thanks!
Oh, I should have mentioned: to get a good impression without reading 200 pages, read pages 26-31. They contain a good overview.
cheers Daniel
BTW Daniel,
Would you be interested in submitting a position paper for the BabelWiki 08 workshop which I am co-organising:
Babelwiki.notlong.com
It will be at the same time as WikiSym 2008, in Porto, Portugal, Sept 8-10. We also had full papers, but the deadline for those has now passed.
I hope you can make it. Your research is smack in the middle of the workshop theme.
Best regards,
Alain Désilets
Desilets, Alain wrote:
BTW Daniel,
Would you be interested in submitting a position paper for the BabelWiki 08 workshop which I am co-organising:
Babelwiki.notlong.com
It will be at the same time as WikiSym 2008, in Porto, Portugal, Sept 8-10. We also had full papers, but the deadline for those has now passed.
That sounds very interesting. I saw the WikiSym CFP and was sorely tempted, but had to let it pass in order to get my thesis done in time.
I hope you can make it. Your research is smack in the middle of the workshop theme.
I'll try, but it's a question of money, family, and exams. I'm in a bit of a tight spot right now. I'd sure like to come :)
-- Daniel
PS: this must be the fastest mail exchange I have had in a while. Do you use IRC? I'm on freenode.net as Duesentrieb frequently (and right now).
Desilets, Alain wrote:
PS: this must be the fastest mail exchange I have had in a while. Do you use IRC? I'm on freenode.net as Duesentrieb frequently (and right now).
I use Skype myself. I am alain_desilets there.
Hm, I dislike Skype... it doesn't play too well with (my) Linux, and they have shifty policies. Anyway, mail will do for now :)
-- Daniel
For the curious, http://brightbyte.de/DA/ also contains source code and data. See http://brightbyte.de/page/WikiWord for more information.
Is there documentation about how to use this code?
Alain
Really interesting! Can you post an HTML or text-only version so I could read it using Google Translate?
At WikiMania 07, I presented a paper that looked at how useful wiki resources like Wikipedia, Wiktionary, and OmegaWiki might be for the needs of translators.
http://wikimania2007.wikimedia.org/wiki/Proceedings:AD1
One of the things we found was that in isolation, each of those resources at best covered ~30% of the translation difficulties typically encountered by professional translators for the English-French pair. But combined, they were able to cover ~50%. We also found that the presentation of information on Wikipedia and Wiktionary was not suited for the needs of translators.
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Is there a web interface to this multilingual resource that I could try?
Alain Désilets
Desilets, Alain wrote:
Really interesting! Can you post an HTML or text-only version so I could read it using Google Translate?
I don't know a good way offhand to get that from TeX, but I'll see what I can do.
At WikiMania 07, I presented a paper that looked at how useful wiki resources like Wikipedia, Wiktionary, and OmegaWiki might be for the needs of translators.
Oh, interesting. Too bad I didn't find your publications before, I could have nicely cited them :)
One of the things we found was that in isolation, each of those resources at best covered ~30% of the translation difficulties typically encountered by professional translators for the English-French pair. But combined, they were able to cover ~50%. We also found that the presentation of information on Wikipedia and Wiktionary was not suited for the needs of translators.
I have not looked at Wiktionary. Thinking about it, I'm afraid I failed to mention my reasons for that in the thesis (losing some points there, I guess). Some of the methods I used are surely applicable there too, though the pages are a bit more difficult to parse, and we quickly get into "structured record" territory, which I generally avoided (Auer and Lehmann did some interesting research there with their DBpedia project).
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Well... I hope I did :) The aim was to automatically generate a multilingual thesaurus, which is surely a good tool for translators. However, the quality of the results is not what a translator would expect from a traditional lexicon or thesaurus. The data is probably most directly useful in the context of information retrieval and language processing, that is, as a basis for computers to link text to conceptual world knowledge ("common sense"). I hope, however, that the data I generate could be useful to translators anyway, as an addition or extension to traditional, manually maintained dictionaries and thesauri. One point that would very much interest me is to try to use my work as a basis for building a bridge between Wikipedia and OmegaWiki.
Is there a web interface to this multilingual resource that I could try?
Sadly, no. I was planning one and started to implement it, but there was no time to get it up and running. Maybe there will be such an interface in the future. But I really see the place of my code more as a backend -- a good interface to that data may in fact be OmegaWiki, if we find a way to integrate it nicely. But I do feel the urge to provide a simple query interface to my data for testing, so maybe it'll still happen :)
Regards, Daniel
Based on those two findings, we proposed the idea of a robot capable of pulling cross-lingual information from those resources and presenting it in a way that is better suited for the needs of translators. Sounds like you may have just done this!
Well... I hope I did :) The aim was to automatically generate a multilingual thesaurus, which is surely a good tool for translators. However, the quality of the results is not what a translator would expect from a traditional lexicon or thesaurus. The data is probably most directly useful in the context of information retrieval and language processing, that is, as a basis for computers to link text to conceptual world knowledge ("common sense"). I hope, however, that the data I generate could be useful to translators anyway, as an addition or extension to traditional, manually maintained dictionaries and thesauri. One point that would very much interest me is to try to use my work as a basis for building a bridge between Wikipedia and OmegaWiki.
One of my goals in the next year or two is to participate in the creation of large, open, wiki-like terminology databases for translators. We call this concept a WikiTerm.
Having observed and interviewed a dozen translators doing their work, I can say without hesitation that they don't worry too much about upstream quality control in terminology databases. Most of the quality control is done downstream, by the translator himself. Translators naturally develop a sort of sixth sense that allows them to very rapidly sift through a list of term equivalents and decide which one (if any) is most appropriate for their current need.
One of the conclusions we came to in our paper was that, while OmegaWiki was currently the wiki resource whose coverage of typical translation difficulties was lowest, its user interface was closest to what translators need. And we suggested that a good way to get to a WikiTerm would be to do exactly what you propose, i.e. write a robot that can extract cross-lingual information from Wikipedia and Wiktionary, and pour that into OmegaWiki.
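Just to make the "robot" idea a bit more concrete, here is a rough Python sketch of the cross-lingual lookup step, using the MediaWiki API's langlinks query. This is only an illustration of the kind of extraction I have in mind, not working robot code; a real robot would have to handle paging, redirects, Wiktionary's different page structure, and so on.

# Illustration only: fetch the interlanguage links for one Wikipedia article.
import requests

def langlinks(title, wiki="en.wikipedia.org"):
    """Return a {language code: article title} mapping for one article."""
    resp = requests.get(
        "https://" + wiki + "/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": title,
            "lllimit": "max",
            "format": "json",
        },
        timeout=30,
    )
    page = next(iter(resp.json()["query"]["pages"].values()))
    # each langlink entry has a "lang" code and the linked title under "*"
    return {ll["lang"]: ll["*"] for ll in page.get("langlinks", [])}

print(langlinks("Thesaurus"))  # maps language codes to article titles in those languages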
So, let's talk! Do you have contacts at OmegaWiki? If not, I can put you in touch with them.
Is there a web interface to this multilingual resource that I could try?
Sadly, no. I was planning one and started to implement it, but there was no time to get it up and running. Maybe there will be such an interface in the future. But I really see the place of my code more as a backend -- a good interface to that data may in fact be OmegaWiki, if we find a way to integrate it nicely. But I do feel the urge to provide a simple query interface to my data for testing, so maybe it'll still happen :)
Are you planning to do more work on this, or are you moving on to other things?
Have a good weekend.
Alain Désilets
Desilets, Alain wrote:
One of my goals in the next year or two is to participate in the creation of large, open, wiki-like terminology databases for translators. We call this concept a WikiTerm.
That sounds quite interesting.
...
So, let's talk! Do you have contacts at OmegaWiki? If not, I can put you in touch with them.
Yes, I have talked with Gerard Meijssen about this several times, and he seemed quite excited :) I have also talked to Barend Mons of Knewco about this; we seem to have pretty similar ideas. So, yes, let's talk :) Ideally, you, me, and them. Too bad I'm probably not going to make it to WikiSym or Wikimania this year, that would be an ideal opportunity.
Is it unit tested?
If so, then I forgive you ;-) .
Yes. Not every single method, but all the important bits are unit tested.
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Regards, Daniel
Is it unit tested?
If so, then I forgive you ;-) .
Yes. Not every single method, but all the important bits are unit tested.
Then the code doesn't need to be clean. People can refactor it to cleanliness without fearing that they are destroying it.
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Which pages of the thesis?
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Might be hard to earn a living working on this, but who knows.
Alain
Desilets, Alain wrote:
Is there documentation about how to use this code?
Some -- in German, in the thesis. Frankly, finishing 30k lines of code and 220 pages of thesis in 7 months proved to be a bit tight :)
Which pages of the thesis?
Well, it depends on what you mean by "use the code". The command line interface is described on pages 142ff (numbers as shown on the pages; it's actually 152/220, I think). Some of the core classes and methods are described all over part III of the thesis.
Are you planning to do more work on this, or are you moving on to other things?
If I can, I will try to continue working on this. Currently I'm planning to finish university by the end of the year, and I don't know yet how I'll be earning my living then. Preferably, by working on this -- or something similarly wiki-related, my head is full of ideas :)
Might be hard to earn a living working on this but who knows.
It's worth a try :)
-- Daniel
Nice job! Please add it to WP:ACST so it can be known to a wider audience.
Han-Teng Liao (OII) wrote:
Dear Mr. Kinzler, could you give me an indication of whether your code is ready for other languages as well? I am asking particularly about the Unicode processing, because I am really interested in trying it out in an East Asian context (e.g. Chinese, Japanese, and Korean).
The code should be fully Unicode-capable, at least as far as the encoding is concerned. The methods and algorithms I used are designed to be mostly language-independent, but some of them will probably have to be adapted for CJK languages. Especially the code for word- and sentence-splitting, as well as for measuring lexicographic similarity/distance, would have to be looked at closely. However, providing a suitable implementation for different languages or scripts should be possible without problems, due to the modular design I used for the text processing classes.
Applying my code to CJK languages would be a great challenge to my design, and I would be very interested to see how it works out. I did not test it, simply because I know next to nothing about those languages. I would be happy to assist you in trying to adapt it to CJK languages and scripts.
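To give you a rough idea of what I mean by the modular text-processing design, here is a small Python sketch. It is purely illustrative -- these are not the actual WikiWord classes, the names are made up, and the CJK part is deliberately naive -- but it shows where a language-specific module would plug in:

# Illustrative sketch of pluggable, per-language text analysis (not WikiWord code).
import re

class PlainTextAnalyzer:
    """Default analyzer: regex tokenization and edit-distance-based similarity."""

    def tokenize(self, text):
        return re.findall(r"\w+", text, re.UNICODE)

    def similarity(self, a, b):
        # normalized Levenshtein similarity in [0, 1]
        m, n = len(a), len(b)
        if m == 0 and n == 0:
            return 1.0
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # deletion
                             cur[j - 1] + 1,      # insertion
                             prev[j - 1] + cost)  # substitution
            prev = cur
        return 1.0 - prev[n] / max(m, n)

class NaiveCJKAnalyzer(PlainTextAnalyzer):
    """A CJK analyzer would override tokenization (no whitespace word boundaries)
    and probably the similarity measure too; this placeholder just splits the
    text into single characters."""

    def tokenize(self, text):
        return [ch for ch in text if not ch.isspace()]

The point is simply that the rest of the pipeline only talks to the analyzer interface, so swapping in a CJK-specific implementation would not touch the extraction code.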
Regards, Daniel
PS: I have to apologize in advance to anyone trying to understand the code. I tried to keep the design clean, but the code is not always pretty, and worst of all, there are close to no comments. The thesis explains the most important bits, but if you don't read German, that does you little good, I'm afraid. I hope I will be able to improve on this over time.
Is it unit tested?
If so, then I forgive you ;-).
Alain Désilets