[Wikimedia-l] The case for supporting open source machine translation

Andrea Zanni zanni.andrea84 at gmail.com
Wed Apr 24 12:39:18 UTC 2013


On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf <
psychoslave at culture-libre.org> wrote:

> I would like to add that (I'm no specialist on this subject) translating
> natural language probably needs at least a large set of existing
> translations, at least to handle "obvious, well-known" idioms like
> "kitchen sink" translated as "usine à gaz" when you are speaking of software,
> for example. In this regard, we probably have such a base with Wikisource.
> What do you think?


Personally, I think this is an awesome idea :-)
The Wikisource corpora could be a huge asset in developing this.
We already host various public domain translations, and we hope that in the
future more and more Wikisources will allow user-generated translations.

At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.
As part of our project (
http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate to study the reinsertion of proofread text
into DjVu files [1], but so far has not found any interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.

Aubrey

[1] We thought about this both for OCR enhancement and for updating files
on Commons and the Internet Archive (which is off topic here).

