On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf <psychoslave(a)culture-libre.org> wrote:
I would like to add (though I'm no specialist on this subject) that
translating natural language probably needs at least a large set of existing
translations, if only to handle "obvious, well-known" idioms such as
"kitchen sink" being translated as "usine à gaz" when you are speaking of
software, for example. In this regard, we probably have such a base with
Wikisource.
What do you think?
Personally, I think this is an awesome idea :-)
Wikisource corpora could be a huge asset in developing this.
We already host different public domain translations, and in the future, we
hope, more and more Wikisources will allow user-generated translations.
At the moment, Wikisource could be an interesting corpus and laboratory for
improving and enhancing OCR,
as the OCR-generated text is always proofread and corrected by humans.
As part of our project
(http://wikisource.org/wiki/Wikisource_vision_development), Micru was
looking for a GSoC candidate to study the reinsertion of proofread text
into DjVu files [1], but so far has not found any interested student. We
have some contacts with people at Google working on Tesseract, and they
were available for mentoring.
Aubrey
[1] We thought about this both for OCR enhancement purposes and for updating
files on Commons and the Internet Archive (which is off topic here).