Han-Teng Liao (OII) wrote:
Dear Mr. Kinzler, could you give me an indication of whether your code is ready for other languages as well? I am asking particularly about the Unicode processing, because I am really interested in trying it out in an East Asian context (e.g. Chinese, Japanese, and Korean).
The code should be fully Unicode-capable, at least as far as the encoding is concerned. The methods and algorithms I used are designed to be mostly language-independent, but some of them will probably have to be adapted for CJK languages. In particular, the code for word and sentence splitting, as well as for measuring lexicographic similarity/distance, would have to be looked at closely. However, providing a suitable implementation for different languages or scripts should be possible without problems, thanks to the modular design I used for the text processing classes.
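To illustrate why word splitting in particular needs attention for CJK text, here is a minimal Python sketch of a pluggable splitter. It is not taken from the actual code base: the names `split_on_word_chars`, `split_cjk_bigrams`, `SPLITTERS`, and `split_words` are hypothetical, and overlapping character bigrams are only one common stand-in for real CJK word segmentation.

```python
import re
from typing import Callable, Dict, List

# A splitter turns a text into a list of tokens (hypothetical interface).
Splitter = Callable[[str], List[str]]


def split_on_word_chars(text: str) -> List[str]:
    """Split on runs of Unicode word characters.

    Works reasonably for scripts that separate words with spaces,
    but returns a whole unspaced CJK phrase as a single token.
    """
    return re.findall(r"\w+", text)


def split_cjk_bigrams(text: str) -> List[str]:
    """Emit overlapping character bigrams within each run of word characters.

    Assumption: bigrams are used here as a simple substitute for
    dictionary-based or statistical segmentation.
    """
    tokens: List[str] = []
    for run in re.findall(r"\w+", text):
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


# Per-language registry; the language codes are illustrative only.
SPLITTERS: Dict[str, Splitter] = {
    "en": split_on_word_chars,
    "zh": split_cjk_bigrams,
}


def split_words(text: str, lang: str) -> List[str]:
    """Pick the splitter registered for `lang`, falling back to the default."""
    return SPLITTERS.get(lang, split_on_word_chars)(text)
```

With this sketch, `split_words("中文分词", "en")` yields the whole phrase as one token, while `split_words("中文分词", "zh")` yields three overlapping bigrams. Swapping in a different splitter per language or script is the kind of change a modular design makes easy.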
Applying my code to CJK languages would be a great challenge to my design, and I would be very interested to see how it works out. I have not tested it, simply because I know next to nothing about those languages. I would be happy to assist you in trying to adapt it to CJK languages and scripts.
Regards, Daniel
PS: I have to apologize in advance to anyone trying to understand the code. I tried to keep the design clean, but the code is not always pretty, and worst of all, there are close to no comments. The thesis explains the most important bits, but if you don't read German, that does you little good, I'm afraid. I hope I will be able to improve on this over time.