[Wikimedia-l] The case for supporting open source machine translation

Fred Bauder fredbaud at fairpoint.net
Wed Apr 24 12:30:27 UTC 2013

> On 24/04/13 08:29, Erik Moeller wrote:
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>> Are there open source MT efforts that are close enough to merit
>> scrutiny? In order to be able to provide high-quality results, you
>> would need not only a motivated, well-intentioned group of people, but
>> some of the smartest people in the field working on it.  I doubt we
>> could more than kickstart an effort, but perhaps financial backing at
>> significant scale could at least help a non-profit, open source effort
>> to develop enough critical mass to go somewhere.
> A huge and worthwhile effort on its own, and anyway a necessary step for
> creating free MT software, would be to build a free (as in freedom)
> parallel translation corpus. This corpus could then be used as the
> starting point by people and groups who are producing free MT software,
> either under WMF or on their own.
> This could be done by creating a new project where volunteers could
> compare Wikipedia articles and other free translated texts and mark
> sentences that are translations of other sentences. By the way, I
> believe Google Translate's corpus was created in this way.
> Perhaps this could be best achieved by teaming with www.zooniverse.org
> or www.pgdp.net, who have experience with this kind of project. This would
> require specialized non-wiki software, and I don't think that the
> Foundation has enough experience in developing it.
> (By the way, similar things that could be similarly useful include free
> OCR training data or free fully annotated text.)

The Bible is quite good for this.
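
The reason is that translations generally share book/chapter/verse
numbering, so verses pair up across languages without anyone having to
mark them by hand. A minimal sketch of that idea follows, assuming both
translations use the same versification (some traditions number
differently, e.g. in the Psalms). The names VerseId and align_by_verse
are hypothetical, and the three verses are only illustrative samples
(King James and Luther wording), not a real corpus.

```python
from typing import Dict, List, Tuple

# Hypothetical key type: (book, chapter, verse).
VerseId = Tuple[str, int, int]


def align_by_verse(
    source: Dict[VerseId, str],
    target: Dict[VerseId, str],
) -> List[Tuple[str, str]]:
    """Pair up verses that carry the same reference in both translations.

    Assumes both sides use the same versification; verses present in
    only one translation are dropped.
    """
    shared = sorted(source.keys() & target.keys())
    return [(source[ref], target[ref]) for ref in shared]


# Illustrative sample data only.
english: Dict[VerseId, str] = {
    ("Genesis", 1, 1): "In the beginning God created the heaven and the earth.",
    ("Genesis", 1, 3): "And God said, Let there be light: and there was light.",
    ("John", 11, 35): "Jesus wept.",  # no counterpart in the sample below
}
german: Dict[VerseId, str] = {
    ("Genesis", 1, 1): "Am Anfang schuf Gott Himmel und Erde.",
    ("Genesis", 1, 3): "Und Gott sprach: Es werde Licht! und es ward Licht.",
}

pairs = align_by_verse(english, german)
for en, de in pairs:
    print(f"{en}\t{de}")
```

This gives verse-level pairs only. A verse can hold several sentences,
so a real corpus would still need sentence-level alignment within each
verse.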

