Re: [Wiki-research-l] thesis: automatically building a multilingual thesaurus from wikipedia

30 May 2008

Han-Teng Liao (OII) wrote:
...
   Dear Mr. Kinzler,
   Could you give me an indication if your code is ready for other
 languages as well? I am asking particularly about the Unicode processing
 because I am really interested in trying it out in East Asian context
 (e.g. Chinese, Japanese, and Korean).

The code should be fully Unicode-capable, at least as far as the encoding is
concerned. The methods and algorithms I used are designed to be mostly
language-independent, but some of them will probably have to be adapted for CJK
languages. In particular, the code for word and sentence splitting, as well as
for measuring lexicographic similarity/distance, would have to be looked at
closely. However, providing a suitable implementation for different languages or
scripts should be possible without problems, thanks to the modular design I used
for the text processing classes.
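
To illustrate the kind of modularity meant here, a minimal Python sketch
follows. It is not taken from the actual code: all names (split_whitespace,
split_cjk_bigrams, SPLITTERS, split_words) are hypothetical, and overlapping
character bigrams are only a common naive stand-in for real CJK word
segmentation.

```python
import re
import unicodedata
from typing import Callable, Dict, List

# Hypothetical names throughout; none of these come from the thesis code.
Splitter = Callable[[str], List[str]]


def split_whitespace(text: str) -> List[str]:
    """Split on runs of non-word characters (fine for space-delimited scripts)."""
    return [w for w in re.split(r"\W+", text) if w]


def split_cjk_bigrams(text: str) -> List[str]:
    """Naive fallback for scripts written without spaces:
    keep only letter characters and emit overlapping character bigrams."""
    chars = [c for c in text if unicodedata.category(c).startswith("L")]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Per-language registry (an assumption): languages not listed here
# fall back to the whitespace-based splitter.
SPLITTERS: Dict[str, Splitter] = {
    "zh": split_cjk_bigrams,
    "ja": split_cjk_bigrams,
}


def split_words(text: str, lang: str) -> List[str]:
    """Pick the splitter registered for `lang` and apply it to `text`."""
    return SPLITTERS.get(lang, split_whitespace)(text)
```

With a registry like this, supporting another script would only mean plugging
in another splitter function; the calling code stays unchanged.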

Applying my code to CJK languages would be a great challenge to my design, and I
would be very interested to see how it works out. I did not test it, simply
because I know next to nothing about those languages. I would be happy to assist
you in trying to adapt it to CJK languages and scripts.

Regards,
Daniel

PS: I have to apologize in advance to anyone trying to understand the code. I
tried to keep the design clean, but the code is not always pretty, and worst of
all, there are close to no comments. The thesis explains the most important
bits, but if you don't read German, that does you little good, I'm afraid. I hope
I will be able to improve on this over time.
