Wikimedia's mission is to make the sum of all knowledge available to every person on the planet. We do this by enabling communities in all languages to organize and collect knowledge in our projects, removing any barriers that we're able to remove.
In spite of this, there are and will always be large disparities in the amount of locally created and curated knowledge available per language, as is evident from simple statistical comparison (and most beautifully visualized in Erik Zachte's bubble chart [1]).
Google, Microsoft and others have made great strides in developing free-as-in-beer translation tools that can be used to translate from and to many different languages. Increasingly, it is possible to at least make basic sense of content in many different languages using these tools. Machine translation can also serve as a starting point for human translations.
Although these tools are free-as-in-beer for basic usage, integration can be expensive. Google Translate charges $20 per 1M characters of text for API usage. [2] These tools get better as people use them, but I've seen little evidence of open datasets being shared that would help the field improve over time.
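As a rough sense of scale, assuming a purely hypothetical corpus of one billion characters (the corpus size is an assumption for illustration, not a measured figure), the quoted rate works out like this:

# Back-of-the-envelope cost at the quoted Google Translate API rate.
# The corpus size is a hypothetical, illustrative figure.
RATE_USD_PER_MILLION_CHARS = 20        # rate cited from the pricing page [2]
corpus_chars = 1_000_000_000           # assumed: one billion characters of text

cost_usd = corpus_chars / 1_000_000 * RATE_USD_PER_MILLION_CHARS
print(f"Translating {corpus_chars:,} characters once: ${cost_usd:,.0f} per target language")
# -> Translating 1,000,000,000 characters once: $20,000 per target language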
Undoubtedly, building the technology and the infrastructure for these translation services is a very expensive undertaking, and it's understandable that there are multiple commercial reasons that drive the major players' ambitions in this space. But if we look at it from the perspective of "How will billions of people learn in the coming decades", it seems clear that better translation tools should at least play some part in reducing knowledge disparities in different languages, and that ideally, such tools should be "free-as-in-speech" (since they're fundamentally related to speech itself).
If we imagine a world where top-notch open source MT is available, that would be a world where language barriers to accessing human knowledge could increasingly be reduced. True, translation is no substitute for original content creation in a language -- but it could at least powerfully support and enable such content creation, and thereby help hundreds of millions of people. Beyond Wikimedia, high-quality open source MT would likely be integrated in many contexts where it would do good for humanity and allow people to cross into cultural and linguistic spaces they would otherwise not have access to.
While Wikimedia is still only a medium-sized organization, it is not poor. With more than 1M donors supporting our mission and a cash position of $40M, we do now have a greater ability to make strategic investments that further our mission, as communicated to our donors. That's a serious level of trust and not to be taken lightly, either by irresponsibly spending, or by ignoring our ability to do good.
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to provide high-quality results, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could do more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
All best, Erik
[1] http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrow...
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Wikipedia and our other projects reach more than 500 million people every month. The world population is estimated to be >7 billion. Still a long way to go. Support us. Join us. Share: https://wikimediafoundation.org/
Oh yes, this would really be great. Just think about the money the Foundation currently spends on translation, plus the many, many hours of volunteers' work invested in translation. Free and open translation software is long overdue indeed. Great idea, Erik.
Greetings Ting
I agree. This is a timely observation about a major problem which directly affects the Foundation's core goals.
I am unsure how far an effort can go today given the state of the art and science, but I think this is entirely appropriate to think about and investigate, and perhaps either fund or bring attention to, or both.
George William Herbert
A few links:
* 2010 discussion: https://strategy.wikimedia.org/wiki/Proposal:Free_Translation_Memory as one of the https://strategy.wikimedia.org/wiki/List_of_things_that_need_to_be_free (follow links, including)
* http://www.apertium.org : was used by translatewiki.net but isn't any longer https://translatewiki.net/wiki/Technology
* Translate also has a translation memory (of course the current use case is more limited)
** Example exposed to the world: http://translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm (a minimal query sketch follows below)
** Docs: https://www.mediawiki.org/wiki/Help:Extension:Translate/Translation_memories#TTMServer_API
** All Wikimedia projects share one: http://laxstrom.name/blag/2012/09/07/translation-memory-all-wikimedia-wikis/
** We could join forces if more FLOSS projects used Translate: https://translatewiki.net/wiki/Translate_Roll
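As a minimal sketch of querying the TTMServer example above from a script (only the parameters visible in that example URL are used; the exact shape of the JSON response is described in the TTMServer API docs linked above):

import requests

# Query translatewiki.net's TTMServer translation memory, using the same
# parameters as the example URL above (json instead of jsonfm).
params = {
    "action": "ttmserver",
    "sourcelanguage": "en",
    "targetlanguage": "fi",
    "text": "january",
    "format": "json",
}
response = requests.get("https://translatewiki.net/w/api.php", params=params, timeout=30)
response.raise_for_status()
print(response.json())  # prints whatever suggestions the memory returns (see docs above)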
Nemo
Hi all,
On Wed, 24 Apr 2013 08:39:55 +0200 Ting Chen wing.philopp@gmx.de wrote:
Oh yes, this would really be great. Just think about the money the Foundation currently spends on translation, plus the many, many hours of volunteers' work invested in translation. Free and open translation software is long overdue indeed. Great idea, Erik.
Unfortunately, I don't think we can expect, with the current state of the art, that machine translation would do as good a job as a human translator, so don't get your hopes up for that. For example, if we translate http://shlomif.livejournal.com/63847.html to English with Google Translate, we get: http://translate.google.com/translate?sl=iw&tl=en&js=n&prev=_t&a...
<<<<< Yotam and "hifh own and the Geek "
I have been offered several times to participate B"hifh and the Geek "and I refused. Those who have forgotten, this is what is said in the Bible parable of Jotham :
And they told Jotham, he went and stood on a mountain top - Gerizim, and lifted up his voice and called; And said to them - they heard me Shechem, and God will hear you:
The trees went forth anointed king over them. And they said olive Malka us! Olive said unto them: I stopped the - fertilizers, which - I will honor God and man - And go to the - the trees! And the trees said to the fig: Go - the Kings of us! The fig tree said unto them: I stopped the - sweetness, and - good yield - And go to the - the trees! And the trees said to the vine: Go - the Kings of us! Vine said unto them: I stopped the - Tirosh, auspicious God and man - And go to the - the trees! And tell all - the trees to the - bramble: You're the king - on us! And bramble said to the - trees: If in truth ye anoint me king over you - come and take refuge in my shade; If - no - let fire come out - the bramble, and devour the - cedars of Lebanon!
Sounds incredibly awkward and the main text was taken from http://www.heraldmag.org/literature/doc_12.htm .
So it hardly does a good job, and we cannot expect it to replace human translations.
Regards,
Shlomi Fish
Erik Moeller wrote:
[...]
Wikipedia and our other projects reach more than 500 million people every month. The world population is estimated to be >7 billion. Still a long way to go. Support us. Join us. Share: https://wikimediafoundation.org/
Putting aside the worrying focus on questionable metrics, the first part of your new e-mail footer "Wikipedia and our other projects" seems to hint at the underlying issue here: Wikimedia already operates a number of projects (about a dozen), but truly supports only one (Wikipedia). Though the Wikimedia community seems eager to add new projects (Wikidata, Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet another project when the current projects are largely neglected (Wikinews, Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).
There's a general trend currently within the Wikimedia Foundation to "narrow focus," which includes shelling out third-party MediaWiki release support to an outside contractor or group, because there are apparently not enough resources within the Wikimedia Foundation's 160-plus staff to support the Wikimedia software platform for anyone other than Wikimedia.
In light of this, it seems even more unreasonable and against good sense to pursue a new machine translation endeavor, virtuous as it may be. If an outside organization wants Wikimedia's help and support and their values align with ours, it's certainly something to explore. Otherwise, surely we have enough projects in need of support already.
MZMcBride
On Wed, Apr 24, 2013 at 12:06 AM, MZMcBride z@mzmcbride.com wrote:
Though the Wikimedia community seems eager to add new projects (Wikidata, Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet another project when the current projects are largely neglected (Wikinews, Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).
I've stated before why I disagree with this characterization, and I reject this framing. Functionality like the Visual Editor, the mobile site improvements, Lua, and other core engineering initiatives aren't limited in their impact to Wikipedia. The recent efforts on mobile uploading are actually focused on Commons. Deploying new software every two weeks and continually making key usability improvements is not what neglect looks like.
What WMF rarely does is directly focus effort on functionality that primarily serves narrower use cases, which I think is appropriate at this point in the history of our endeavor. My view is that such narrower, more vertically focused efforts should be enabled and supported by creating structures like Labs, where volunteers can meaningfully prototype specialized functionality and work towards deployment on the cluster.
Moreover, the lens of project/domain name is a very arbitrary one to define vertically focused efforts. There are specialized efforts within Wikipedia that have more scale today than some of our sister projects do, such as individual WikiProjects. There are efforts like the partnerships with cultural institutions which have led to hundreds of thousands of images being made available under a free license. Yet I don't see you complaining about lack of support for GLAM tooling, or WikiProject support (both of which are needed). Why should English Wikinews with 15 active editors demand more collective attention than any other specialized efforts?
Historically, we've drawn that project/domain name dividing line because starting a new wiki was the best way to put a flag in the ground and say "We will solve problem X". And we didn't know which efforts would immediately succeed and which ones wouldn't. But in the year 2013, you could just as well argue that instead of slapping the Wikispecies logo on the frontpage of Wikipedia, we should make more prominent mention of "How to contribute video on Wikipedia" or "Work with your local museum" or "Become a campus ambassador" or any other specialized effort which has shown promise but could use that extra visibility. The idea that just because user X proposed project Y sometime back in the early years of Wikimedia, effort Y must forever be part of a first order prioritization lens, is not rationally defensible.
So, even when our goal isn't simply to make general site improvements that benefit everyone but to support specialized new forms of content or collaboration, I wouldn't use the project/domain name division as a tool for assessing impact, but rather frame it in terms of "What problem is being solved here? Who is going to be reached? How many people will be impacted?" And sometimes that does translate well to the lens of a single domain-name-level project, and sometimes it doesn't.
There's a general trend currently within the Wikimedia Foundation to "narrow focus," which includes shelling out third-party MediaWiki release support to an outside contractor or group, because there are apparently not enough resources within the Wikimedia Foundation's 160-plus staff to support the Wikimedia software platform for anyone other than Wikimedia.
It's not a question of whether we have enough resources to support it, but of how best to put a financial boundary around third-party engagement, while also actually enabling third parties to play an important role in the process (including potentially chipping in financial support).
In light of this, it seems even more unreasonable and against good sense to pursue a new machine translation endeavor, virtuous as it may be.
To be clear: I was not proposing that WMF should undertake such an effort directly. But I do think that if there are ways to support an effort that has a reasonable probability of success, with a reasonable structure of accountability around such an engagement, it's worth assessing. And again, that position is entirely consistent with my view that WMF should primarily invest in technologies with broad horizontal impact (which open source MT could have) rather than narrower, vertical impact.
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Erik Moeller, 24/04/2013 10:06:
[...] Moreover, the lens of project/domain name is a very arbitrary one to define vertically focused efforts.
Good and interesting reasoning here. Indeed something to keep in mind, though it adds problems of its own.
There are specialized efforts within Wikipedia that have more scale today than some of our sister projects do, such as individual WikiProjects. There are efforts like the partnerships with cultural institutions which have led to hundreds of thousands of images being made available under a free license. Yet I don't see you complaining about lack of support for GLAM tooling, or WikiProject support (both of which are needed).
You're perhaps right about MZ, but GLAM tooling is certainly something often asked for; however, it arguably falls under Commons development. I've no idea what WikiProject support you have in mind, and arguably WikiProjects are too often dangerous factions that should be disbanded rather than encouraged, but we may agree in principle.
Why should English Wikinews with 15 active editors demand more collective attention than any other specialized efforts?
Historically, we've drawn that project/domain name dividing line because starting a new wiki was the best way to put a flag in the ground and say "We will solve problem X". And we didn't know which efforts would immediately succeed and which ones wouldn't. But in the year 2013, you could just as well argue that instead of slapping the Wikispecies logo on the frontpage of Wikipedia, we should make more prominent mention of "How to contribute video on Wikipedia" or "Work with your local museum" or "Become a campus ambassador" or any other specialized effort which has shown promise but could use that extra visibility.
Again, "how to contribute video" is just Commons promotion, work with museums is usually either Commons or Wikipedia (sometimes Wikisource), campus ambassadors are a program to improve some articles on some Wikipedias. What I mean to say is those are means rather than goals; you're not disagreeing with MZ that we shouldn't expand our goals further.
[...]
To be clear: I was not proposing that WMF should undertake such an effort directly. But I do think that if there are ways to support an effort that has a reasonable probability of success, with a reasonable structure of accountability around such an engagement, it's worth assessing. And again, that position is entirely consistent with my view that WMF should primarily invest in technologies with broad horizontal impact (which open source MT could have) rather than narrower, vertical impact.
In other words, we wouldn't be adding another goal alongside those of creating an encyclopedia, a media repository, a dictionary, a dictionary of quotations, etc., but only a tool, to the extent needed by one or more of them? Currently the only projects using machine translation or translation memory are our backstage wikis, the MediaWiki interface translation, and some highly controversial article creation drives on a handful of small wikis (did they continue in the last couple of years?). Many ways exist to expand the scope of such a tool and the corpus we could provide to it, but the rationale of your proposal is currently a bit lacking and needs some work, that's all.
Nemo
I've stated before why I disagree with this characterization, and I reject this framing. Functionality like the Visual Editor, the mobile site improvements, Lua, and other core engineering initiatives aren't limited in their impact to Wikipedia. The recent efforts on mobile uploading are actually focused on Commons. Deploying new software every two weeks and continually making key usability improvements is not what neglect looks like.
Thank you, Erik, for your response. I don't agree with all of your points, but it's refreshing to see that a lot of thought seems to have gone into this. Often we (the 15 active users of the sister projects) just feel that nobody cares about the sister projects, and attention and thought and answers are sometimes enough.
Anyway, I would just add that one of the major problems, I think, is that when we think of "human knowledge" as in "Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge", we probably just think of "human knowledge in the form of neutral encyclopedic articles", which, in fact, is not the case.
I feel that we could do a lot to boost the idea of a "family of projects", of an integrated, global, comprehensive approach to knowledge. Right now, the fact is that Wikipedia both attracts users to and cannibalizes users from the sister projects, which are practically invisible if you don't know they exist.
Could we promote our sister projects better, making them more visible? For this purpose, user Micru and I just created an RfC on interproject links https://meta.wikimedia.org/wiki/Requests_for_comment/Interproject_links_inte... (I invite you all to propose other solutions), but the underlying question is whether we, as the Wikimedia community, are aware of the "theoretical" shift this implies.
Aubrey
On Wed, Apr 24, 2013 at 11:59 AM, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
There is a compelling need to assess the availability of training corpora of significant breadth and depth for these languages. Most open-source implementations of MT end up hitting this hurdle because content at scale is not easily available. It would be appropriate to decide whether WMF/Wikipedia is well placed to turn on a firehose-like API that would enable MT implementations to use statistical and other methods on the existing content itself.
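As a small illustration of what the existing public APIs already allow for small-scale experiments (this is not a firehose; only the standard MediaWiki query/langlinks/extracts modules are used, and the example titles are arbitrary), one could pull an article's plain text together with its interlanguage counterpart as raw material for a comparable, not parallel, corpus:

import requests

API = "https://{lang}.wikipedia.org/w/api.php"

def plain_extract(lang, title):
    """Plain-text extract of one article (TextExtracts module), or None if missing."""
    r = requests.get(API.format(lang=lang), params={
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "redirects": 1, "format": "json",
    }, timeout=30)
    r.raise_for_status()
    page = next(iter(r.json()["query"]["pages"].values()))
    return page.get("extract")

def interlanguage_title(src_lang, title, dst_lang):
    """Follow the interlanguage link from src_lang:title to dst_lang, if any."""
    r = requests.get(API.format(lang=src_lang), params={
        "action": "query", "prop": "langlinks", "titles": title,
        "lllang": dst_lang, "redirects": 1, "format": "json",
    }, timeout=30)
    r.raise_for_status()
    page = next(iter(r.json()["query"]["pages"].values()))
    links = page.get("langlinks", [])
    return links[0]["*"] if links else None

# Example: an English article plus its Finnish counterpart (titles are arbitrary examples).
en_text = plain_extract("en", "Machine translation")
fi_title = interlanguage_title("en", "Machine translation", "fi")
fi_text = plain_extract("fi", fi_title) if fi_title else None
print(len(en_text or ""), fi_title, len(fi_text or ""))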
-- sankarshan mukhopadhyay https://twitter.com/#!/sankarshan
On Wed, Apr 24, 2013 at 8:29 AM, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny?
http://www.statmt.org/moses/ is alive and kicking. Someone with a background in computational linguistics should have a close look at it.
I would like to mention however that there are a couple of cases in which commercial companies could be convinced to open source some of their software, for example Mozilla. Google has open sourced Tesseract for OCR. Google might see the value of its translation efforts not just in the software itself but also in the actual integration into some of its products (Gmail, Goggles, Glass), so that open sourcing it would not hurt its financial interests. It appears to me that the cost of simply asking a company like Google or Microsoft whether they are willing to negotiate is small compared to the potential gain for everyone.
In any case, I would love to see WMF engage in the topic of machine translation.
Mathias
On Wed, Apr 24, 2013 at 10:49 AM, Mathias Schindler mathias.schindler@gmail.com wrote:
http://www.statmt.org/moses/ is alive and kicking. Someone with a background in computational linguistics should have a close look at it.
I would like to mention however that there are a couple of cases in
...
In any case, I would love to see WMF engage in the topic of machine translation.
thanks a lot erik and mathias for this constructive input! i'd love to see that too, and, from a volunteer standpoint, not only does financing further development seem appealing, but training (e.g. http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining) also seems to be something bite-sized which might fit the wiki-model and the wikimedia volunteer community structure quite well.
rupert.
On 4/24/13 8:29 AM, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to provide high-quality results, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could do more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
I do think this is strategically relevant to Wikimedia. But there is already significant financial backing attempting to kickstart open-source MT, with some results. The goal is strategically relevant to another, much larger organization: the European Union. From 2006 through 2012 they allocated about $10m to kickstart open-source MT, though focused primarily on European languages, via the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects. One of the concrete results [1] of those projects was Moses, which I believe is currently the most actively developed open-source MT system. http://www.statmt.org/moses/
In light of that, I would suggest trying to see if we can adapt or join those efforts, rather than starting a new project or organization. One strategy could be to: 1) fund internal Wikimedia work to see if Moses can already be used for our purposes; and 2) fund improvements in cases where it isn't good enough yet (whether this is best done through grants to academic researchers, payments to contractors, hiring internal staff, or posting open bounties for implementing features, I haven't thought much about).
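For the "can Moses already be used for our purposes" question, one common (if rough) starting point is an automatic metric such as BLEU computed against human reference translations; here is a minimal sketch using NLTK's implementation, with made-up placeholder sentences standing in for real system output and references:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Tokenized human reference translations: one list of references per segment.
references = [
    [["the", "cat", "sits", "on", "the", "mat"]],
    [["wikipedia", "is", "a", "free", "encyclopedia"]],
]
# Tokenized system output for the same segments (placeholder data).
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["wikipedia", "is", "a", "free", "encyclopedia"],
]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"corpus BLEU: {score:.3f}")

BLEU correlates only loosely with human judgements, especially on encyclopedic text and morphologically rich languages, so any serious assessment would need human evaluation alongside an automatic metric like this.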
Best, Mark
[1] They have a nice list of other software and data coming out of the project as well: http://www.euromatrixplus.net/resources/
A brief addendum,
On 4/24/13 12:25 PM, Mark wrote:
From 2006 through 2012 [the ERC] allocated about $10m to kickstart open-source MT, though focused primarily on European languages, via the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects.
Missed some projects. Seems the European Research Council is *really* pushing for this, with more like $20-25m overall. A few FP7 projects that may be useful to us:
* Let's MT! https://www.letsmt.eu/, which is supposed to organize resources to help organizations & companies build their own MT systems on open data and software, reducing reliance on closed-source cloud providers.
* MosesCore http://www.statmt.org/mosescore/index.php?n=Main.HomePage, focused mainly on improving Moses itself.
* The Multilingual Europe Technology Alliance http://www.meta-net.eu/meta-research/overview, a giant consortium that seems to have a commitment to liberal licensing http://www.meta-net.eu/meta-share/licenses
-Mark
Erik, all,
sorry for the long mail.
Incidentally, I have been thinking in this direction myself for a while, and I have come to a number of conclusions:
1) The Wikimedia movement cannot, in its current state, tackle the problem of machine translation of arbitrary text from and to all of our supported languages.
2) The Wikimedia movement is probably the single most important source of training data already. In research that I have done with colleagues, Wikimedia corpora used as training data easily beat other corpora, and others are using Wikimedia corpora routinely already. There is not much we can improve here, actually.
3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few avenues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle.
Looking at the first statement, there are two ways we could constrain it to make it potentially feasible:
a) constrain the number of supported languages. Whereas this would be technically the simpler solution, I think there is agreement that this is not in our interest at all;
b) constrain the kind of input text we want to support.
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages. Even more constraints would still allow us to use Wikidata items for tagging and structuring Commons in a language-independent way (this was suggested by Erik earlier).
Current machine translation research aims at using massive machine-learning-supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not, in general, translations of each other), especially not for many languages, and there is no reason to believe this is going to change. I would question whether we want to build an infrastructure for gathering those corpora from the Web continuously. I do not think we can compete in this arena, or that it is the best use of our resources to support projects in this area. We should use our unique features to our advantage.
How can we use the unique features of the Wikimedia movement to our advantage? What are our unique features? Well, obviously, the awesome community we are. Our technology, as amazing as it is at running our websites on the given budget, is nevertheless not what makes us what we are. Most processes on the Wikimedia projects are developed in the community space, not implemented in bits. To invoke Lessig, if code is law, Wikimedia projects are really good at creating a space that allows a community to live in it and have the freedom to create its own ecosystem.
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now). Wikidata had some design decisions in it that are already geared towards enabling the solution for some of the problems for this kind of wiki. Whatever a structured Wiktionary would look like, it should also be aligned with the requirements of the project outlined here. Basically, we take constraint b), but make it possible to push the constraint further and further through the community - that's how we could scale on this task.
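To make this a little more concrete, here is a deliberately tiny sketch in which both the parsing rules and the serialization rules are plain data of the kind a community could edit; every pattern, template and example sentence is invented for illustration, and a real system would also have to localize the content words themselves (e.g. via a structured Wiktionary) and handle grammar far beyond fill-in templates:

import re

# "Parser rules": constrained English patterns mapped to a statement type.
PARSE_RULES = [
    (re.compile(r"^(?P<subject>.+) is a species of (?P<kind>.+) in the (?P<family>.+) family\.$"),
     "taxon_statement"),
]

# "Serializer rules": per-language templates over the same internal structure.
SERIALIZE_RULES = {
    "taxon_statement": {
        "en": "{subject} is a species of {kind} in the {family} family.",
        "de": "{subject} ist eine {kind}-Art aus der Familie der {family}.",
    },
}

def parse(sentence):
    """Turn a sentence of the constrained language into an internal statement."""
    for pattern, statement_type in PARSE_RULES:
        match = pattern.match(sentence)
        if match:
            return {"type": statement_type, **match.groupdict()}
    raise ValueError("sentence is outside the constrained language")

def serialize(statement, lang):
    """Render the internal statement back into one of the supported languages."""
    return SERIALIZE_RULES[statement["type"]][lang].format(**statement)

statement = parse("Examplefish exampli is a species of fish in the Examplidae family.")
print(statement)
print(serialize(statement, "de"))   # content words stay untranslated in this toy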
This would be far away from solving the problem of automatic translation of text, and even further away from understanding text. But given where we are and the resources we have available, I think it would be a more feasible path towards achieving the mission of the Wikimedia movement than tackling the problem of general machine learning.
In summary, I see four calls for action right now (and for all of them this means to first actually think more and write down a project plan and gather input on that), that could and should be tackled in parallel if possible:
I) develop a structured Wiktionary
II) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
III) develop a multilingual search, tagging, and structuring environment for Commons
IV) develop structured Wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is we could reach them all by 2015 or 16, depending when we actually start with implementing them.
Goal IV carries a considerable risk, but there's a fair chance it could work out. It could also fail utterly, but if it would even partially succeed ...
Cheers, Denny
On 24 April 2013 11:35, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages.
There has historically been a lot of tension around mass-creation of articles because of the maintenance problem - we can create two hundred thousand stubs in Tibetan or Tamil, but who will maintain them? Wikidata gives us the potential of squaring that circle, and in fact you bring it up here...
II ) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
I think this would be amazing. A software hook that says "we know X article does not exist yet, but it is matched to Y topic on Wikidata" and pulls out core information, along with a set of localised descriptions... we gain all the benefit of having stub articles (scope, coverage) without the problems of a small community having to curate a million pages. It's not the same as hand-written content, but it's immeasurably better than no content, or even an attempt at machine-translating free text.
XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos, Vietnam]. It [grows to: 20 cm]. (pictures)
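As a hedged sketch of how such a hook might assemble those bracketed pieces into localized stub text (all claims, labels and templates below are invented placeholders rather than the real Wikidata data model or property IDs, and grammatical agreement is deliberately ignored):

CLAIMS = {
    "class": "Q_fish",                  # language-neutral stand-in for "fish"
    "family": "Examplidae",
    "found_in": ["Laos", "Vietnam"],
    "length_cm": 20,
}

LABELS = {                              # per-language labels for language-neutral items
    "en": {"Q_fish": "fish"},
    "fr": {"Q_fish": "poisson"},
}

TEMPLATES = {
    "en": "{title} is a species of {class_label} in the {family} family. "
          "It is found in {places}. It grows to {length_cm} cm.",
    "fr": "{title} est une espèce de {class_label} de la famille des {family}. "
          "On la trouve en {places}. Elle atteint {length_cm} cm.",
}
LIST_SEPARATOR = {"en": " and ", "fr": " et "}

def render_stub(title, claims, lang):
    """Render a short localized placeholder text from structured claims."""
    return TEMPLATES[lang].format(
        title=title,
        class_label=LABELS[lang][claims["class"]],
        family=claims["family"],
        places=LIST_SEPARATOR[lang].join(claims["found_in"]),
        length_cm=claims["length_cm"],
    )

print(render_stub("Examplefish exampli", CLAIMS, "en"))
print(render_stub("Examplefish exampli", CLAIMS, "fr"))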
Wikidata Phase 4, perhaps :-)
On 2013-04-24 12:35, Denny Vrandečić wrote:
3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few avenues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle.
If you have any information or ideas related to structuring Wiktionary, please share them on https://meta.wikimedia.org/wiki/Wiktionary_future
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure.
I don't understand what you mean. To begin with, I doubt that the sentence is the right scale at which to translate natural language discourse. Sure, sometimes you may translate one word with one word in another language. Sometimes you may translate a sentence with one sentence. Sometimes you need to grab the whole paragraph, or even more, and sometimes you need a whole cultural background to get the meaning of a single word in the current context. To my mind, natural languages deal with more than context-free languages. Could a static "internal structure" deal with such dynamics?
Your editing interface is a constrained, but natural language.
This is really where I don't see how you hope to manage that.
Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now).
Well, I'll be curious to have more information, like references I should read. Otherwise I'm afraid that what you say sounds like Fermat's Last Theorem [1] and the famous margin which was too small to contain Fermat's alleged proof of his "last theorem".
[1] https://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem
I really like Erik's original suggestion, and these ideas, Denny.
Since there are many different possible goals, it's worth having a page just to list them all and compare them - both how they fit with one another and how they fit with existing active projects elsewhere on the web.
SJ
On Wed, Apr 24, 2013 at 6:35 AM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Erik, all,
sorry for the long mail.
Incidentally, I have been thinking in this direction myself for a while, and I have come to a number of conclusions:
- the Wikimedia movement can not, in its current state, tackle the problem
of machine translation of arbitrary text from and to all of our supported languages 2) the Wikimedia movement is probably the single most important source of training data already. Research that I have done with colleagues based on Wikimedia corpora as training data easily beat other corpora, and others are using Wikimedia corpora routinely already. There is not much we can improve here, actually 3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few venues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle
Looking at the first statement, there are two ways we could constrain it to make it possibly feasible: a) constrain the number of supported languages. Whereas this would be technically the simpler solution, I think there is agreement that this is not in our interest at all b) constrain the kind of input text we want to support
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages. Even more constraints would still allow us to use Wikidata items for tagging and structuring Commons in a language-independent way (this was suggested by Erik earlier).
Current machine translation research aims at using massive machine learning supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not translations of each other, in general), especially not for many languages, and there is no reason to believe this is going to change. I would question if we want to build an infrastructure for gathering those corpora from the Web continuously. I do not think we can compete in this arena, or that is the best use of our resources to support projects in this area. We should use our unique features to our advantage.
How can we use the unique features of the Wikimedia movement to our advantage? What are our unique features? Well, obviously, the awesome community we are. Our technology, as amazing as it is, running our Websites on the given budget, is nevertheless not what makes us what we are. Most processes on the Wikimedia projects are developed in the community space, and not implemented in bits. To summon Lessing, if code is law, Wikimedia projects are really good in creating a space that allows for a community to live in this space and have the freedom to create their own ecosystem.
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now). Wikidata had some design decision inside it that are already geared towards enabling the solution for some of the problems for this kind of wiki. Whatever a structured Wiktionary would look like, it should also be aligned with the requirements of the project outlined here. Basically, we take constrain b, but make it possible to push the constraint further and further through the community - that's how we could scale on this task.
This would be far away from solving the problem of automatic translation of text, and even further away from understanding text. But given where we are and the resources we have available, I think it would be a more feasible path towards achieving the mission of the Wikimedia movement than tackling the problem of general machine learning.
In summary, I see four calls for action right now (and for all of them this means to first actually think more and write down a project plan and gather input on that), that could and should be tackled in parallel if possible: I ) develop a structured Wiktionary II ) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic III ) develop a multilingual search, tagging, and structuring environment for Commons IV ) develop structured Wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is we could reach them all by 2015 or 16, depending when we actually start with implementing them.
Goal IV carries a considerable risk, but there's a fair chance it could work out. It could also fail utterly, but if it would even partially succeed ...
Cheers, Denny
Thank you, Denny, for helping me learn where you are going with Wikidata.
I am a Korean Wikipedia contributor. I definitely agree with Erik that we have to tackle the problem of information disparity between languages. But I feel we can make better choices than investing in open source machine translation itself. Wikipedia content can be reused for commercial purposes, and we know that this helps the spread of Wikipedia. I think it is the same here: if proprietary machine translation can help get rid of the language barrier, that would be great too. I hope we can support any machine translation development team, as well as open source machine translation teams. But I believe that, in the end, open source machine translation will prevail.
Wikidata-based approaches are great! But I hope Wikipedia could do more, including providing well-aligned parallel corpora. I had looked into Google's translation workbench, which tried to provide a customized translation tool for Wikipedia, and I tried to translate a few English articles into Korean myself. The tool has a translation memory and a customizable dictionary, but it lacked lots of features needed for practical translation, and the interface was clumsy.
I believe translatewiki.net could do better than Google. I hope translatewiki could provide a translation workbench not just for messages in software but also for Wikipedia articles. Through the workbench, we could get much more great data in addition to a parallel corpus. We could track how a human translator works. If we have more data on the editing activity, we can improve the translation workflow and get new clues for automatic translation. A translator will start from a stub and improve the draft; peer reviewers will put their eyes on the draft and make it better. I mean that logs of collaborative translation on a parallel corpus could provide more things to learn from.
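As a rough sketch of the kind of record I have in mind (the field names are only illustrative, not a proposal for a concrete schema), each translated sentence could be logged roughly like this:

# Rough sketch of a per-sentence translation log entry; all field names
# and values are illustrative only.
sentence_log_entry = {
    "source_lang": "en",
    "target_lang": "ko",
    "source_sentence": "The tower was completed in 1889.",
    "machine_draft": "...",   # optional MT or translation-memory suggestion
    "revisions": [
        # (editor, timestamp, text): the chain shows how humans improve the draft
        ("Translator1", "2013-04-24T10:02:00Z", "first human draft ..."),
        ("Reviewer1",   "2013-04-25T08:30:00Z", "reviewed and corrected draft ..."),
    ],
}

The pair of source sentence and final revision goes into the parallel corpus, and the revision chain itself is the extra signal about how translators and reviewers actually work.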
I think the Wikipedia community could start an initiative to provide raw materials for machine learning for translation. Those would be a common asset for all machine translation systems.
Best regards
RYU Cheol Chair of Wikimedia Korea Preparation Committee
On 24/04/13 12:35, Denny Vrandečić wrote:
Current machine translation research aims at using massive machine learning supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not translations of each other, in general), especially not for many languages, and there is no reason to believe this is going to change.
Could you define "big"? If 10% of Wikipedia articles are translations of each other, we have 2 million translation pairs. Assuming ten sentences per average article, this is 20 million sentence pairs. An average Wikipedia with 100,000 articles would have 10,000 translations and 100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would have 100,000 translations and 1,000,000 sentence pairs - is this not enough to kickstart a massive machine learning supported system? (Consider also that the articles are somewhat similar in structure and less rich than general text - future tense is rarely used for example.)
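In code, the back-of-the-envelope estimate looks like this (the 10% share of translated articles and the ten sentences per article are, of course, just the assumptions made above):

# Back-of-the-envelope estimate of usable sentence pairs, using the
# assumptions from the paragraph above: ~10% of articles have a
# counterpart in another language, ~10 sentences per article.

TRANSLATED_SHARE = 0.10
SENTENCES_PER_ARTICLE = 10

def estimated_sentence_pairs(article_count):
    translation_pairs = article_count * TRANSLATED_SHARE
    return int(translation_pairs * SENTENCES_PER_ARTICLE)

for label, articles in [("average Wikipedia (100,000 articles)", 100000),
                        ("large Wikipedia (1,000,000 articles)", 1000000),
                        ("all Wikipedias (~20,000,000 articles)", 20000000)]:
    print(label, "->", estimated_sentence_pairs(articles), "sentence pairs")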
On 24/04/13 12:35, Denny Vrandečić wrote:
In summary, I see four calls for action right now (and for all of them this means first to actually think more, write down a project plan, and gather input on that), which could and should be tackled in parallel if possible:

I) develop a structured Wiktionary
II) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
III) develop a multilingual search, tagging, and structuring environment for Commons
IV) develop structured wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions of dollars or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is that we could reach them all by 2015 or 2016, depending on when we actually start implementing them.
I fully support this, though! This is well within Wikimedia's current infrastructure, and was generally planned to be done anyway.
Denny,
very good and compelling reasoning as always. I think the argument that we can potentially do a lot for the MT space (including open source efforts) in part by getting our own house in order on the dictionary side of things makes a lot of sense. I don't think it necessarily excludes investing in open source MT efforts, but Mark makes a good point that there are already existing institutions pouring money into promising initiatives. Let me try to understand some of the more complex ideas outlined in your note a bit better.
The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now).
So what would you want to enable with this? Faster bootstrapping of content? How would it work, and how would this be superior to an approach like the one taken in the Translate extension (basically, providing good interfaces for 1:1 translation, tracking differences between documents, and offering MT and translation memory based suggestions)? Are there examples of this approach being taken somewhere else?
Thanks, Erik
Erik,
2013/4/25 Erik Moeller erik@wikimedia.org
So what would you want to enable with this? Faster bootstrapping of content? How would it work, and how would this be superior to an approach like the one taken in the Translate extension (basically, providing good interfaces for 1:1 translation, tracking differences between documents, and offering MT and translation memory based suggestions)? Are there examples of this approach being taken somewhere else?
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
It would be foolish to create any such plan without reusing tools and concepts from the Translate extension, translation memories, etc. There is a lot of UI and conceptual goodness in these tools. The idea would be to make them user-extensible with rules.
If you want examples of that, there are the bots working on some Wikipedias currently, creating text from structured input. They are partially reusing the same structured input, and "merely" need a translation of the way the bots create the text that is saved in the given Wikipedia. I have seen some research in the area; the existing approaches all have one drawback or another, but they can and should be used as an inspiration and to inform the project (like Attempto Controlled English, or a chat program developed at the Open University in Milton Keynes to allow conducting business in different languages, etc.).
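For the parsing direction, a similarly minimal toy (invented names again, not a design) would match a sentence of the constrained input language against a community-edited template and recover the underlying structure:

import re

# Toy sketch of the parsing direction: a constrained input sentence is
# matched against community-edited templates to recover the underlying
# structure. Everything here is invented for illustration.

templates = {
    "city_in_country": r"^(?P<city>.+) is a city in (?P<country>.+)\.$",
}

def parse(sentence):
    for frame, pattern in templates.items():
        match = re.match(pattern, sentence)
        if match:
            return {"frame": frame, **match.groupdict()}
    return None  # the sentence is outside the (current) constrained language

print(parse("San Francisco is a city in the United States."))
# {'frame': 'city_in_country', 'city': 'San Francisco', 'country': 'the United States'}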
I hope this helps a bit.
Cheers, Denny
On Thu, Apr 25, 2013 at 7:26 AM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
You are blowing my mind, dude. :)
I suspect this approach won't serve for everything, but it sounds *awesome*. If we can tie natural-language statements directly to data nodes (rather than merely annotating vague references like we do today), then we'd be much better able to keep language versions in sync. How to make them sane to edit... sounds harder. :)
It would be foolish to create any such plan without reusing tools and concepts from the Translate extension, translation memories, etc. There is a lot of UI and conceptual goodness in these tools. The idea would be to make them user-extensible with rules.
Heck yeah!
If you want examples of that, there are the bots working on some Wikipedias currently, creating text from structured input. They are partially reusing the same structured input, and "merely" need a translation of the way the bots create the text that is saved in the given Wikipedia. I have seen some research in the area; the existing approaches all have one drawback or another, but they can and should be used as an inspiration and to inform the project (like Attempto Controlled English, or a chat program developed at the Open University in Milton Keynes to allow conducting business in different languages, etc.).
Yessss... make them real-time updatable instead of one-time bots producing language which can't be maintained.
-- brion
2013/4/25 Brion Vibber bvibber@wikimedia.org
You are blowing my mind, dude. :)
Glad to do so :)
I suspect this approach won't serve for everything, but it sounds *awesome*. If we can tie natural-language statements directly to data nodes (rather than merely annotating vague references like we do today), then we'd be much better able to keep language versions in sync. How to make them sane to edit... sounds harder. :)
Absolutely correct, it would not serve for everything. And it doesn't have to. For an encyclopedia we should be able to get a useful amount of "frames" in a decent timeframe. For song lyrics, it might take a bit longer.
It would and should start with a restricted set of possible frames, but the trick would be to make them user-extensible. Because that is what we are good at -- users who fill and extend the frameworks we provide. I don't know of much work where the frames and rules themselves are user-editable and extensible, but heck, people said we were crazy when we made the properties user-editable and extensible in Semantic MediaWiki and later Wikidata, and it seems to be working out.
A sane editing interface - both for the rules and the content, and their interaction - would be something that would need to be explored first, just to check whether this is indeed possible or just wishful thinking. Starting without this kind of exploration beforehand would be a bit adventurous, or optimistic.
Cheers, Denny
On 2013-04-25 16:26, Denny Vrandečić wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
What would be the limits you would expect from your solution? You can't expect to just "translate" everything; form may be a part of the meaning. It's clear that you can't translate a poem, for example. Sure, Wikipedia is not primarily concerned with poetry, but it does treat the subject.
2013/4/25 Mathieu Stumpf psychoslave@culture-libre.org
What would be the limits you would expect from your solution? You can't expect to just "translate" everything; form may be a part of the meaning. It's clear that you can't translate a poem, for example. Sure, Wikipedia is not primarily concerned with poetry, but it does treat the subject.
I don't know where the limits would be. Probably further than we think right now, but yes, they would still be there, and severe. The nice thing is that we would be collecting data about the limits constantly, and could thus "feed" the system to further improve and grow. Not automatically (I guess, though bots would obviously also be allowed to work on the rules), but through human intelligence, analyzing the input and trying to refine and extend the rules.
But, considering the already existing bot-created articles, which number in the hundreds of thousands in languages like Swedish, Dutch, or Polish, there seems to be some consensus that this can be considered a useful starting block. It's just that with the current system, even with Wikidata, we cannot really grow further in this direction.
Cheers, Denny
This subthread seems headed out into practical / applied epistemology, if there is such a thing.
I am not sure if we can get from here to there; that said, a new structure with language-independent facts / information points that then get machine-explained or described in a local language would be an interesting thing to build an encyclopedia around. Wikidata is a good idea but not enough here. I'm not sure the state of knowledge theory and practice is good enough to do this, but I am suddenly more interested in IBM's Watson project and some related knowledge / natural language interaction AI work...
This is very interesting, but probably less midterm-practical than machine translation and the existing WP / other project data.
On Thu, Apr 25, 2013 at 7:56 PM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
This is an interesting thought, but I've never heard of a language-independent form. I also question its importance to your core idea versus, say, a primary language of choice. An argument can be made that language independence can't exist on a computer medium: down to the programming language, the instructions, and even the binary bits, there is a language running on top of higher inputs (even transitioning between computer languages isn't at an absolute level). To that extent, I wonder if data can truly be language-independent.
As far as linguistic typology goes, languages are far too distinct and too varied for a language-independent form to develop easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might only have exposure to a limited set of linguistic branches. Machine translations, as someone pointed out, are still not preferred in some languages; even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never reach the level of the absolute, where they are truly language-independent. If you read some of the discussions on linguistic relativity (the Sapir-Whorf hypothesis), there is research to suggest that the language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
Which brings me to the point: why not English? Your idea seems plausible enough even if you remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
Regards Theo
On 2013-04-25 20:56, Theo10011 wrote:
As far as linguistic typology goes, languages are far too distinct and too varied for a language-independent form to develop easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might only have exposure to a limited set of linguistic branches. Machine translations, as someone pointed out, are still not preferred in some languages; even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never reach the level of the absolute, where they are truly language-independent.
To my mind, there's no such thing as "absolute" meaning. It's all about interpretation, in a given context, by a given interpreter. I mean, I do think that MT could probably become as good as a professional translator. But even professional translators can't make "perfect translations". I already gave the example of poetry, but you may also take the example of humour, which asks for some cultural background; otherwise you have to explain why it's funny, and you know that if you have to explain a joke, it's not a joke.
If you read some of the discussions on linguistic relativity (the Sapir-Whorf hypothesis), there is research to suggest that the language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
That's just how the learning process works: you can't "understand" something you haven't experienced. Reading an algorithm won't give you the insight you'll get when you process it mentally (with the help of pencil and paper), and a textual description of "making love" won't give you the feeling it provides.
Which brings me to the point: why not English? Your idea seems plausible enough even if you remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
English has many so-called "non-neutrality" problems. As far as I know, if the goal is to use a syntactically unambiguous human language, Lojban is the best current candidate. English as an international language is a very harmful situation. Believe it or not, I sometimes have to translate into English sentences that were written in French, because the writer was thinking in an English idiom that he translated poorly into French, his native language, in which he doesn't know the corresponding expression. Even worse, I have read people using concepts only under their English wording because they never matched them to the French wording they already knew. And the other way around, I'm not sure that having millions of people speaking a broken English is a wonderful situation for this language either.
Search "why not english as international language" if you need more documentation.
We already have the translation options on the left side of the screen in any Wikipedia article. This choice is generally a smattering of languages, and a long-term goal for many small-language Wikipedias is to be able to translate an article from related languages (say from Dutch into Frisian, where the Frisian Wikipedia has no article at all on the title subject); the even longer-term goal is to translate into some other really-really-really foreign language.
Wouldn't it be easier, however, to start with a project that uses translatewiki and the related-language pairs? Usually there is a big difference in the number of articles (as between the Dutch Wikipedia and the Frisian Wikipedia). Presumably the demand is larger on the destination Wikipedia (because there are fewer articles in that language), and the potential number of human translators is larger (because most editors active in the smaller Wikipedia are versed in both languages).
The Dutch Wikimedia chapter took part in a European multilingual synchronization tool project called CoSyne: http://cosyne.eu/index.php/Main_Page
It was not a success, because it was hard to figure out how this would be beneficial to Wikipedians actually joining the project. Some funding that was granted to the chapter to work on the project will be returned, because it was never spent.
In order to tackle this problem on a large scale, it needs to be broken down into words, sentences, paragraphs, and perhaps other structures (category trees?). I think CoSyne was trying to do this. I think it would be easier to keep the effort one-way: try to offer machine translation from Dutch to Frisian and not the other way around, and then, as you go, define concepts that work both ways, so that eventually it would be possible to translate from Frisian into Dutch.
Thanks to Jane for introducing CoSyne. But I feel that not all wikis want to be synchronized with certain other wikis; rather than having identical articles, I hope they would have their own articles. What I would like is two more tabs, to the right of 'Article' and 'Talk' on the English Wikipedia, for the Korean language: 'Article in Korean' and 'Talk in Korean'. The translations would carry the same information as the originals, and any edit to an article or a talk page on the translation side would go back to the originals. In this case they would need to be synchronized precisely.
I mean that this would be done within the scope of the English Wikipedia, not related to the Korean Wikipedia. But the Korean Wikipedia, linked on the left side of a page, would eventually benefit from the translations in the English Wikipedia, when a Korean Wikipedia editor finds that a good part of an English Wikipedia article could be inserted into the Korean Wikipedia.
You can see the merits of an exact Korean translation of the English Wikipedia, or of a scheme of exact translations of the big Wikipedias. It would help reach more potential contributors. It would lower the language barrier for those who want to contribute to a Wikipedia whose language they do not speak very well. Also, it could provide better aligned corpora, and it could track how human translators or reviewers improve the translations.
Cheol
Just the thought of synchronizing wikis makes me shudder. I think this was also the reason that no Wikipedia editors were attracted to the CoSyne project, though as it was explained to me, the idea was that only those sections of a "source" Wikipedia article that did not yet exist in the target article would be translated. This may be useful in the case of a large article being the source and a stub being the target, but in the case where the source and the target are about equal in size, it could lead to a major mess.
In the example of the Wikipedia article on Haarlem, I noticed many of the things lacking in the English version are things more relevant to local people reading the Dutch version, such as local mass transit information. The other way around, the things in the English version that are lacking in the Dutch version are items that seem obvious to locals.
2013/4/27, Ryu Cheol rcheol@gmail.com:
Thanks to Jane for introducing CoSyne. But I feel all the wikis do not want to be synchronized to certain wikis. Rather than having identical articles, I hope they would have their own articles. I hope I could have two more tabs at right of the 'Article' and 'Talk' on English Wikipedia for Korean language. The two tabs are 'Article in Korean' and 'Talk in Korean'. The translations would have same information in originals and any editing on an article or a talk in translation pages would go back to the originals. In this case they need to be synchronized precisely.
I mean these are done in the scope of English Wikipedia, not related to Korean Wikipedia. But the Korean Wikipedia linked to the left side of a page would be benefited from the translations in English Wikipedia eventually when an Korean Wikipedia editor find a good part of English Wikipedia article could be inserted to Korean Wikipedia.
You can find the merits of the exact Korean translation of English Wikipedia or the scheme of the exact translation of big Wikipedias. It will help you reach to more potential contributors. It will make the language barrier lower for those who want to contribute to a Wikipedia they do not speak very well. Also, It could provide the better aligned corpora and it could could track how human translators or reviewers improve the translations.
Cheol
On 2013. 4. 26., at 오후 9:04, Jane Darnell jane023@gmail.com wrote:
We already have the translation options on the left side of the screen in any Wikipedia article. This choice is generally a smattering of languages, and a long term goal for many small-language Wikipedias is to be able to translate an article from related languages (say from Dutch into Frisian, where the Frisian Wikipedia has no article at all on the title subject) and the even longer-term goal is to translate into some other really-really-really foreign language.
Wouldn't it be easier however, to start with a project that uses translatewiki and the related-language pairs? Usually there is a big difference in numbers of articles (like between the Dutch Wikipedia and the Frisian Wikipedia). Presumably the demand is larger on the destination wikipedia (because there are fewer articles in those languages), and the potential number of human translators is larger (because most editors active in the smaller Wikipedia are versed in both langages).
The Dutch Wikimedia chapter took part in a European multilingual synchronization tool project called CoSyne: http://cosyne.eu/index.php/Main_Page
It was not a success, because it was hard to figure out how this would be beneficial to Wikipedians actually joining the project. Some funding that was granted to the chapter to work on the project will be returned, because it was never spent.
In order to tackle this problem on a large scale, it needs to be broken down into words, sentences, paragraphs and perhaps other structures (category trees?). I think CoSyne was trying to do this. I think it would be easier to keep the effort in one-way-traffic, so try to offer machine translation from Dutch to Frisian and not the other way around, and then as you go, define concepts that work both ways, so that eventually it would be possible to translated from Frisian into Dutch.
2013/4/26, Mathieu Stumpf psychoslave@culture-libre.org:
Le 2013-04-25 20:56, Theo10011 a écrit :
As far as Linguistic typology goes, it's far too unique and too varied to have a language independent form develop as easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might have exposure to a limited set of a linguistic branch. Machine-translations as someone pointed out, are still not preferred in some languages, even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never get to the level of absolute, where they are truly language independent.
To my mind, there's no such thing as "absolute" meaning. It's all about intrepretation in a given a context by a given interpreter. I mean, I do think that MT could probably be as good as a profesional translators. But even profesional translators can't make "perfect translations". I already gave the example of poetry, but you may also take example of humour, which ask for some cultural background, otherwise you have to explain why it's funny and you know that you have to explain a joke, it's not a joke.
If you read some of the discussions in linguistic relativity (Sapir-Whorf hypothesis), there is research to suggest that a language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
That's just how learning process work, you can't "understand" something you didn't experiment. Reading an algorithm won't give you the insight you'll get when you process it mentaly (with the help of pencil and paper) and a textual description of "making love" won't provide you the feeling it provide.
Which brings me to the point, why not English? Your idea seems plausible enough even if your remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
English has many so-called "non-neutral" problems. As far as I know, if the goal is to use a syntactically unambiguous human language, Lojban is the best current candidate. English as an international language is a very harmful situation. Believe it or not, I sometimes have to translate into English sentences that were written in French, because the writer was thinking in an English idiom that he translated poorly into French, his native language, in which he doesn't know the corresponding expression. Even worse, I have read people using concepts through an English expression because they had never matched it with the French expression they already knew. And going the other way, I'm not sure that having millions of people speaking broken English is a wonderful situation for that language either.
Search for "why not English as an international language" if you need more documentation.
-- Association Culture-Libre http://www.culture-libre.org/
Hoi, When we invest in MT it is to convey knowledge, information and primarily Wikipedia articles. They do not have the same problems poetry has. With explanatory articles on a subject there is a web of associated concepts. These concepts are likely to occur in any language if the subject exists in that other language.
Consequently MT can work for Wikipedia and provide quite a solid interpretation of what the article is about. This is helped when the associated concepts are recognised as such and when the translations for these concepts are used in the MT. Thanks, GerardM
On 2013-04-26 17:00, Gerard Meijssen wrote:
Hoi, When we invest in MT it is to convey knowledge, information and primarily Wikipedia articles. They do not have the same problems poetry has. With explanatory articles on a subject there is a web of associated concepts. These concepts are likely to occur in any language if the subject exists in that other language.
Consequently MT can work for Wikipedia and provide quite a solid interpretation of what the article is about. This is helped when the associated concepts are recognised as such and when the translations for these concepts are used in the MT. Thanks, GerardM
I think that poetry is just an easy-to-grasp example of the more general problem of lexical/meaning entanglement, which will appear at some point. Different cultures will have different conceptualizations of what one may perceive. So this is not just a matter of concept sets, but rather of concept network dynamics, of how concepts interact within a representation of the world. And interaction means combinatorial problems, which require enormous resources.
That said, I agree that having MT help "adapt" articles from one language/culture to another would be useful.
On Thu, Apr 25, 2013 at 4:26 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Not just bootstrapping the content. By having the primary content be saved in a language independent form, and always translating it on the fly, it would not merely bootstrap content in different languages, but it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
What we can do is make the Simple English Wikipedia more useful, write rewrite rules from Simple English to a Controlled English, and allow filling the content of the smaller Wikipedias from the Simple English Wikipedia. That's the only way to get anything more useful than Google Translate output.
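To make the rewrite-rules idea concrete, here is a minimal sketch of what mapping Simple English text onto a controlled vocabulary might look like. The rule list and the ControlledRule shape are invented for illustration only; a real controlled language would need a far larger, linguistically reviewed rule base.

interface ControlledRule {
  pattern: RegExp;      // construction to avoid in controlled text
  replacement: string;  // approved controlled-vocabulary form
}

// Illustrative rules only; a real rule base would be curated per language pair.
const rules: ControlledRule[] = [
  { pattern: /\bin order to\b/gi, replacement: "to" },
  { pattern: /\butilize\b/gi, replacement: "use" },
  { pattern: /\bapproximately\b/gi, replacement: "about" },
];

function toControlledEnglish(sentence: string): string {
  // Apply each rewrite rule in turn; later rules see earlier rewrites.
  return rules.reduce((text, rule) => text.replace(rule.pattern, rule.replacement), sentence);
}

// Example: toControlledEnglish("In order to utilize the tool, read approximately one page.")
// yields "to use the tool, read about one page."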
There are serious problems in relation to the "translation of translation" process, and that kind of complexity is not within the range of contemporary science. (Basically, even good machine translation is not within the range of contemporary science. Statistical approaches are useful for getting a basic understanding, but very bad for writing an encyclopedia or anything else which requires correct output in the target language.)
On the much simpler scale of conversion engines, we can see that even 1% errors (or manual interventions) are a serious issue for text integrity, while translations of translations create many more errors, whether or not there are human interventions. And that's not acceptable for the average editor of the project in the target language.
That said, we'd need serious linguistic work for every language added to the system.
On the other hand, I support Erik's intention to make a free software tool for machine translation. But note that it's just the second step (Wikidata was the first) on a long road.
This is closely tied to software which is being developed, some of it secretly, to enable machines to understand and use language. As of now this will be government and corporate owned and controlled. I say closely tied because that is how translation works; only someone or something that understands language can translate perfectly.
That said, crude translations into little used languages are nearly worthless due to syntax issues. Useful work requires at least one person fluent in the language.
Fred
only someone or something that understands language can translate perfectly
Precisely
crude translations into little used languages are nearly worthless due to syntax issues. Useful work requires at least one person fluent in the language
It's very true! Current Google MT tools are reasonably good for readers, as they really provide a chance to grasp the meaning of the text, but they are far from good as a writer's instrument, meaning the translation results are far from good enough to be published.
So it seems reasonable to promote MT as an instrument for visitors (readers) of our projects, but not as a substitute for the Wikimedians who are the contributors.
On 24/04/13 08:29, Erik Moeller wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
A huge and worthwhile effort on its own, and anyway a necessary step for creating free MT software, would be to build a free (as in freedom) parallel translation corpus. This corpus could then be used as the starting point by people and groups who are producing free MT software, either under WMF or on their own.
This could be done by creating a new project where volunteers could compare Wikipedia articles and other free translated texts and mark sentences that are translations of other sentences. By the way, I believe Google Translate's corpus was created in this way.
Perhaps this could be best achieved by teaming with www.zooniverse.org or www.pgdp.net, which have experience in this kind of project. This would require specialized non-wiki software, and I don't think that the Foundation has enough experience in developing it.
(By the way, similar things that could be similarly useful include free OCR training data or free fully annotated text.)
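As a rough sketch of the sentence-marking step described above, the snippet below proposes candidate sentence pairs from two translated articles using a crude length-ratio heuristic. The CandidatePair shape and the 0.5 threshold are invented for illustration; a real aligner would use something like the Gale-Church algorithm or bilingual dictionaries rather than this toy filter.

interface CandidatePair {
  sourceIndex: number;
  targetIndex: number;
  source: string;
  target: string;
  score: number;   // 1.0 = identical character length, lower = more doubtful
}

function splitSentences(text: string): string[] {
  // Naive sentence splitter; good enough for a sketch.
  return text.split(/(?<=[.!?])\s+/).map(s => s.trim()).filter(s => s.length > 0);
}

function proposePairs(sourceText: string, targetText: string): CandidatePair[] {
  const src = splitSentences(sourceText);
  const tgt = splitSentences(targetText);
  const pairs: CandidatePair[] = [];
  const n = Math.min(src.length, tgt.length);
  for (let i = 0; i < n; i++) {
    const a = src[i].length;
    const b = tgt[i].length;
    const score = Math.min(a, b) / Math.max(a, b);
    // Keep only pairs whose lengths are roughly comparable; volunteers confirm or reject them.
    if (score >= 0.5) {
      pairs.push({ sourceIndex: i, targetIndex: i, source: src[i], target: tgt[i], score });
    }
  }
  return pairs;
}

Only pairs confirmed by volunteers would enter the freely licensed corpus; the heuristic merely reduces the amount of clicking.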
The Bible is quite good for this.
Fred
On 2013-04-24 08:29, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
I would like to add that (I'm no specialist in this subject) translating natural language probably needs at least a large set of existing translations, if only to get rid of "obvious, well-known" idioms like "kitchen sink" being translated as "usine à gaz" when you are speaking about software, for example. In this regard, we probably have such a base with Wikisource. What do you think?
On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf <psychoslave@culture-libre.org> wrote:
Personally, I think this is an awesome idea :-) Wikisource corpora could be a huge asset in developing this. We already host different public domain translations, and in the future, we hope, more and more Wikisources will allow user generated translations.
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans. As part of our project (http://wikisource.org/wiki/Wikisource_vision_development), Micru was looking for a GSoC candidate to study the reinsertion of proofread text into DjVu files [1], but so far hasn't found any interested student. We have some contacts with people at Google working on Tesseract, and they were available for mentoring.
Aubrey
[1] We thought about this both for OCR enhancement purposes and files updating on Commons and Internet Archive (which is off topic here).
* Andrea Zanni wrote:
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans. As part of our project (http://wikisource.org/wiki/Wikisource_vision_development), Micru was looking for a GSoC candidate to study the reinsertion of proofread text into DjVu files [1], but so far hasn't found any interested student. We have some contacts with people at Google working on Tesseract, and they were available for mentoring.
[1] We thought about this both for OCR enhancement purposes and files updating on Commons and Internet Archive (which is off topic here).
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
On 26/04/13 19:38, Bjoern Hoehrmann wrote:
- Andrea Zanni wrote:
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans.
Try also Distributed Proofreaders. It is my impression that Wikisource's proofreading standards are not always up to par.
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
This is a very interesting approach :)
(FYI, this is me speaking with my personal hat on; none of these opinions are official in any way, nor are they the opinions of the Foundation as an organization)
<personal_hat>
While Wikimedia is still only a medium-sized organization, it is not poor. With more than 1M donors supporting our mission and a cash position of $40M, we do now have a greater ability to make strategic investments that further our mission, as communicated to our donors. That's a serious level of trust and not to be taken lightly, either by irresponsibly spending, or by ignoring our ability to do good.
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
I think that while supporting open source machine translation is an awesome goal, it is out of scope for our budget, and the engineering budget could be better spent elsewhere, such as on completing existing tools that are in development but not yet deployed/optimized/etc. I think that putting a bunch of money into possibilities isn't the right thing to do when we have a lot of projects that need to be finished and deployed yesterday. Maybe once there's a more concrete project we could support them with text streams, decommissioned machines, and maybe money, but only after it's a pretty sure "investment".
</personal_hat>
Leslie
-- Leslie Carr Wikimedia Foundation AS 14907, 43821 http://as14907.peeringdb.com/
Leslie Carr wrote (personally, not officially):
I think that while supporting open source machine translation is an awesome goal, it is out of scope for our budget, and the engineering budget could be better spent elsewhere, such as on completing existing tools that are in development but not yet deployed/optimized/etc. I think that putting a bunch of money into possibilities isn't the right thing to do when we have a lot of projects that need to be finished and deployed yesterday. Maybe once there's a more concrete project we could support them with text streams, decommissioned machines, and maybe money, but only after it's a pretty sure "investment".
I don't think that it's a good idea to shift resources to it immediately, but I think that every now and then it's very healthy to step back and ask "What is standing between our users and the information they seek? What is standing between our editors and the information they want to update?". Generically, the customers and customer goals problem, applied to WMF's two customer sets (readers, and editors).
Minor UI changes help readers. Most of the other changes are editor-focused: retention, ease of editing, or various other things related to that. A few are related to strategic data organization, which is more of a multiplier effect.
The readers and potential readers ARE however clearly disadvantaged by translation issues.
I see this discussion and consideration as strategic; not planning (year, six month) timescales or tactical (month, week) timescales, but a multi-year "What are our main goals for information access?" timescale.
We can't usefully help with internet access (and that's proceeding at good pace even in the third world), but language will remain a barrier when people get access. In a few situations politics / firewalling is as well (China, primarily), which is another strategic challenge. That, however, is political and geopolitical, and not an easy nut for WMF to crack. Of the three issues - net, firewalling, and language, one of them is something we can work on. We should think about how to work on that. MT seems like an obvious answer, but not the only possible one.
On 2013-04-25 04:49, George Herbert wrote:
We can't usefully help with internet access (and that's proceeding at good pace even in the third world), but language will remain a barrier when people get access. In a few situations politics / firewalling is as well (China, primarily), which is another strategic challenge. That, however, is political and geopolitical, and not an easy nut for WMF to crack. Of the three issues - net, firewalling, and language, one of them is something we can work on. We should think about how to work on that. MT seems like an obvious answer, but not the only possible one.
Do you have specific ideas in mind? Apart from having an "international language" and pedagogical material accessible to everyone, able to teach them from zero prior knowledge, I fail to see many options. Personally, I'm currently learning Esperanto, as I would be happy to participate in such a process. I'm learning Esperanto because it seems to be the currently most successful language for such a project. It's already used on official Chinese sites, and there's a current petition you can sign to make it an official European language [1].
[1] https://secure.avaaz.org/en/petition/Esperanto_langue_officielle_de_lUE/
On 24/04/13 16:29, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
We could basically clone the frontend component of Google Translate, and use Moses as a backend. The work would be mostly JavaScript, which we can do. When VisualEditor wraps up, we'll have several JavaScript developers looking for a project.
Google Translate gathers its own parallel corpus, and does it in a way that's accessible to non-technical bilingual speakers, so I think it's a nice model. The quality of its translations has improved enormously over the years, and I suppose most of that change is due to improved training data.
If we develop it as a public-facing open source product, then other Moses users could start using it. We could host it on GitHub, so that if it turns out to be popular, we could let it gradually evolve away from WMF control.
Once the frontend tool is done, the next job would be to develop a corpus sharing site, hosting any available freely-licensed output of the frontend tool.
-- Tim Starling
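As an illustration of the frontend-plus-Moses split Tim describes, here is a minimal sketch. The /translate endpoint, the JSON field names and the CorpusEntry shape are assumptions (a small HTTP wrapper would have to sit in front of a Moses server); this is not an existing API.

interface TranslationRequest {
  source: string;   // source language code, e.g. "nl"
  target: string;   // target language code, e.g. "fy"
  text: string;     // the segment to translate
}

interface TranslationResponse {
  text: string;     // machine translation produced by the Moses backend
}

async function translateSegment(req: TranslationRequest): Promise<string> {
  // POST the segment to a hypothetical HTTP wrapper in front of Moses.
  const resp = await fetch("https://mt.example.org/translate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!resp.ok) {
    throw new Error(`MT backend returned ${resp.status}`);
  }
  const data = (await resp.json()) as TranslationResponse;
  return data.text;
}

// A post-edited pair is exactly the kind of data the corpus sharing site could collect.
interface CorpusEntry {
  source: string;
  target: string;
  sourceText: string;
  machineText: string;
  humanText: string;   // the translation after human correction
}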
On 24.4.2013, at 9.29, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment?
Great idea. If we think in terms of resources, human languages are definitely a resource that should be kept in the commons.
- Teemu
-------------------------------------------------- Teemu Leinonen http://www2.uiah.fi/~tleinone/ +358 50 351 6796 Media Lab http://mlab.uiah.fi Aalto University School of Arts, Design and Architecture --------------------------------------------------
* Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed. Wiktionary does collect and mark up this data, but there is no easy way to get it out of Wiktionary. Fix that, and people will build machine translation and other tools with it.
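To give a sense of why machine-usable exports matter, here is a rough sketch of pulling translation pairs out of Wiktionary wikitext. It assumes the {{t|...}} and {{t+|...}} translation templates used on the English Wiktionary; the exact markup varies across wikis and over time, so treat it as an illustration of the extraction problem rather than a parser.

interface TranslationEntry {
  headword: string;   // the page title, e.g. "cat"
  lang: string;       // language code from the template, e.g. "fr"
  term: string;       // the translated term, e.g. "chat"
}

function extractTranslations(headword: string, wikitext: string): TranslationEntry[] {
  const entries: TranslationEntry[] = [];
  // {{t|fr|chat|m}} or {{t+|fr|chat}} -> capture the language code and the term.
  const template = /\{\{t\+?\|([a-z-]+)\|([^|}]+)/g;
  let match: RegExpExecArray | null;
  while ((match = template.exec(wikitext)) !== null) {
    entries.push({ headword, lang: match[1], term: match[2].trim() });
  }
  return entries;
}

// Example: extractTranslations("cat", "* French: {{t+|fr|chat|m}}")
// yields [{ headword: "cat", lang: "fr", term: "chat" }].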
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoermi@gmx.net wrote:
- Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny?
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed.
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
[ Andrea's ideas about using Wikisource to improve OCR tools ]
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
I hope the related GSoC project gets support. Getting mentoring from Tesseract team members seems like a handy way to keep the projects connected.
Tim Starling writes:
We could basically clone the frontend component of Google Translate, and use Moses as a backend. The work would be mostly JavaScript... the next job would be to develop a corpus sharing site, hosting any available freely-licensed output of the frontend tool.
This would be most useful. There are often short, quick translation projects that I would like to do through this sort of TM-capturing interface, for which the translatewiki prep-process is rather time-consuming.
We can set up a corpus sharing site now, with translatewiki - there is already a lot of material there that could be part of it. Different corpora (say, encyclopedic articles v. dictionary pages v. quotes) would need to be tagged for context. And we could start letting people upload their own freely licensed corpora to include as well. We would probably want a vetting process before giving users the import tool; or a quarantine until we had better ways to let editors revert / bulk-modify entire imports.
SJ
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta.sj@gmail.com wrote:
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
On the other hand, the OmegaWiki software is from the previous decade and requires major fixes. And, obviously, the WMF should do that.
On 2013-04-26 20:27, Milos Rancic wrote:
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta.sj@gmail.com wrote:
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
Where can I find documentation about this structure, please?
2013/4/29 Mathieu Stumpf psychoslave@culture-libre.org
On 2013-04-26 20:27, Milos Rancic wrote:
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
Where can I find documentation about this structure, please?
Here (planned structure): http://meta.wikimedia.org/wiki/OmegaWiki_data_design
and also there (current structure): http://www.omegawiki.org/Help:OmegaWiki_database_layout
And a gentle reminder that comments are requested ;-) http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
On 2013-04-26 19:57, Samuel Klein wrote:
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoermi@gmx.net wrote:
- Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny?
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed.
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
If you have suggestions about the Wiktionaries' future, please consider sharing them at https://meta.wikimedia.org/wiki/Wiktionary_future
Erik Moeller, 24/04/2013 08:29:
[...] Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
Some info on state of the art: http://laxstrom.name/blag/2013/05/22/on-course-to-machine-translation/
Nemo