[Wikimedia-l] The case for supporting open source machine translation

Fri Apr 26 17:57:54 UTC 2013

On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:
> * Erik Moeller wrote:
>>Are there open source MT efforts that are close enough to merit
>>scrutiny?
>
> Wiktionary. If you want to help free software efforts in the area of
> machine translation, then what they seem to need most is high quality
> data about words, word forms, and so on, in a readily machine-usable
> form, and freely licensed.

Yes.  Finding a way to capture and integrate the work OmegaWiki has
done into a new Wikidata-powered Wiktionary would be a useful start.
And we've already sort of claimed the space (though we are neglecting
it) -- the fact that we are *so close* to getting it right is
discouraging to anyone else who might otherwise try to build a
brilliant free structured dictionary.

> [ Andrea's ideas about using Wikisource to improve OCR tools ]
>
> I built various tools that could be fairly easily adapted for this, my
> http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr
> notes are available. One of the tools for instance is a diff tool, see
> image at <http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031>.

I hope the related GSoC project gets support.  Getting mentoring from
Tesseract team members seems like a handy way to keep the projects
connected.

Tim Starling writes:
> We could basically clone the frontend component of Google Translate,
> and use Moses as a backend. The work would be mostly JavaScript...
> the next job would be to develop a corpus sharing site, hosting any
> available freely-licensed output of the frontend tool.

This would be most useful.  There are often short, quick translation
projects that I would like to do through this sort of TM-capturing
interface, and for those the translatewiki prep process is rather
time-consuming.

We can set up a corpus sharing site now, with translatewiki - there is
already a lot of material there that could be part of it.  Different
corpora (say, encyclopedic articles v. dictionary pages v. quotes)
would need to be tagged for context.  And we could start letting
people upload their own freely licensed corpora to include as well.
We would probably want a vetting process before giving users the
import tool, or a quarantine until we have better ways to let editors
revert / bulk-modify entire imports.
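
To make the tagging and quarantine idea a bit more concrete, here is a
rough sketch of how an uploaded corpus might be tracked.  Everything in
it is hypothetical (the class names, the fields, the status values);
it is not an existing translatewiki or MediaWiki API, just one way the
pieces could fit together.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    QUARANTINED = "quarantined"  # uploaded, not yet served to anyone
    APPROVED = "approved"        # vetted by an editor
    REVERTED = "reverted"        # the whole import was rolled back


@dataclass
class SegmentPair:
    source: str
    target: str


@dataclass
class CorpusImport:
    # All field names are assumptions made for this sketch.
    import_id: str
    uploader: str
    license: str       # e.g. "CC-BY-SA-3.0"
    context: str       # e.g. "encyclopedic", "dictionary", "quotes"
    source_lang: str
    target_lang: str
    pairs: list = field(default_factory=list)
    status: Status = Status.QUARANTINED


class CorpusStore:
    """In-memory stand-in for a corpus sharing site."""

    def __init__(self):
        self._imports = {}

    def upload(self, imp):
        # Every new upload starts out quarantined, whatever the caller set.
        imp.status = Status.QUARANTINED
        self._imports[imp.import_id] = imp

    def approve(self, import_id):
        self._imports[import_id].status = Status.APPROVED

    def revert(self, import_id):
        # Bulk revert: one call hides every pair from this import.
        self._imports[import_id].status = Status.REVERTED

    def pairs_for(self, context, source_lang, target_lang):
        # Only approved imports with a matching context tag are returned.
        out = []
        for imp in self._imports.values():
            if (imp.status is Status.APPROVED
                    and imp.context == context
                    and imp.source_lang == source_lang
                    and imp.target_lang == target_lang):
                out.extend(imp.pairs)
        return out
```

Because the status lives on the import rather than on each segment,
reverting a bad upload is a single operation, which is the property we
would want before opening the import tool widely.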

SJ