[Wikimedia-l] The case for supporting open source machine translation

Wed Apr 24 11:48:06 UTC 2013

On 24/04/13 08:29, Erik Moeller wrote:
> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.
>
> Are there open source MT efforts that are close enough to merit
> scrutiny? In order to be able to provide high quality result, you
> would need not only a motivated, well-intentioned group of people, but
> some of the smartest people in the field working on it.  I doubt we
> could more than kickstart an effort, but perhaps financial backing at
> significant scale could at least help a non-profit, open source effort
> to develop enough critical mass to go somewhere.

A huge and worthwhile effort on its own, and anyway a necessary step for 
creating free MT software, would be to build a free (as in freedom) 
parallel translation corpus. This corpus could then be used as the 
starting point by people and groups who are producing free MT software, 
either under WMF or on their own.

This could be done by creating a new project where volunteers could 
compare Wikipedia articles and other free translated texts and mark 
sentences that are translations of other sentences. By the way, I 
believe Google Translate's corpus was created in this way.
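
To make the idea a bit more concrete, here is a minimal sketch in Python of 
what one volunteer-marked record and a simple acceptance rule might look 
like. Everything in it is an assumption made for illustration: the names 
(SentencePair, record_vote, accepted, export_corpus), the vote thresholds, 
and the tab-separated output are hypothetical and not taken from any 
existing project.

```python
from dataclasses import dataclass, field


@dataclass
class SentencePair:
    """One candidate alignment between two sentences (hypothetical schema)."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str
    # Maps a volunteer id to True if that volunteer judged the pair
    # to be a translation, False otherwise.
    votes: dict = field(default_factory=dict)

    def record_vote(self, volunteer: str, is_translation: bool) -> None:
        # A later vote by the same volunteer overwrites the earlier one.
        self.votes[volunteer] = is_translation

    def accepted(self, min_votes: int = 3, min_agreement: float = 0.75) -> bool:
        # Thresholds are arbitrary placeholders, not recommended values.
        if len(self.votes) < min_votes:
            return False
        yes = sum(1 for v in self.votes.values() if v)
        return yes / len(self.votes) >= min_agreement


def export_corpus(pairs):
    """Return the accepted pairs as tab-separated 'source<TAB>target' lines."""
    return [f"{p.source_text}\t{p.target_text}" for p in pairs if p.accepted()]
```

Requiring several independent votes per pair is one possible way to keep 
mistaken alignments out of the corpus; a real project would have to pick 
its own rule.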

Perhaps this could best be achieved by teaming up with www.zooniverse.org 
or www.pgdp.net, who have experience with this kind of project. This would 
require specialized non-wiki software, and I don't think that the 
Foundation has enough experience in developing it.

(By the way, other resources that could be similarly useful include free 
OCR training data or free, fully annotated text.)