[Wikimedia-l] The case for supporting open source machine translation
Denny Vrandečić
denny.vrandecic at wikimedia.de
Wed Apr 24 10:35:33 UTC 2013
Erik, all,
sorry for the long mail.
Incidentally, I have been thinking in this direction myself for a while,
and I have come to a number of conclusions:
1) the Wikimedia movement cannot, in its current state, tackle the problem
of machine translation of arbitrary text from and to all of our supported
languages
2) the Wikimedia movement is probably the single most important source of
training data already. In research I have done with colleagues, systems
trained on Wikimedia corpora easily beat those trained on other corpora,
and others are already using Wikimedia corpora routinely. There is not much
we can improve here, actually
3) Wiktionary could be an even more amazing resource if we finally tackled
the issue of structuring its content more appropriately. I think Wikidata
has opened a few avenues to plan in this direction and to provide some
software; a structured Wiktionary would have the potential to provide more
support for external projects than many other things we could tackle
Looking at the first statement, there are two ways we could constrain it to
make it potentially feasible:
a) constrain the number of supported languages. While this would be the
technically simpler solution, I think there is agreement that this is not
in our interest at all
b) constrain the kind of input text we want to support
If we constrain b) a lot, we could simply develop "pages to display for
pages that do not exist yet, based on Wikidata" in the smaller languages.
That is a far cry from machine-translating the articles, but it would be
low-hanging fruit. And it might address a desire that is evidently being
expressed strongly through the mass creation of articles by bots in a
growing number of languages. Even stronger constraints would still allow us
to use Wikidata items for tagging and structuring Commons in a
language-independent way (as Erik suggested earlier).
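To make the first of these concrete - the generated placeholder pages -
here is a rough sketch of what such a feature could do, built on the
existing wbgetentities API. The item, the target language, the User-Agent
string and the one-line rendering are only illustrative assumptions, not a
design:

    import json
    import urllib.parse
    import urllib.request

    API = "https://www.wikidata.org/w/api.php"

    def placeholder_page(item_id, lang, fallback="en"):
        """Fetch label and description for a Wikidata item and render a stub."""
        params = urllib.parse.urlencode({
            "action": "wbgetentities",
            "format": "json",
            "props": "labels|descriptions",
            "ids": item_id,
            "languages": lang + "|" + fallback,
        })
        req = urllib.request.Request(API + "?" + params,
                                     headers={"User-Agent": "placeholder-sketch/0.0"})
        with urllib.request.urlopen(req) as response:
            entity = json.load(response)["entities"][item_id]
        labels = entity.get("labels", {})
        descriptions = entity.get("descriptions", {})
        label = (labels.get(lang) or labels.get(fallback) or {}).get("value", item_id)
        description = (descriptions.get(lang) or descriptions.get(fallback) or {}).get("value", "")
        # A real feature would render statements, images and sitelinks through
        # proper templates; this only produces a bare one-line stub.
        return label + ": " + description

    # e.g. Q64 (Berlin), rendered for a smaller Wikipedia language
    print(placeholder_page("Q64", "ast"))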
Current machine translation research aims at massive machine-learning-based
systems. These usually require big parallel corpora. We do not have big
parallel corpora (Wikipedia articles are, in general, not translations of
each other), especially not for many languages, and there is no reason to
believe this is going to change. I would question whether we want to build
an infrastructure for continuously gathering such corpora from the Web. I
do not think we can compete in this arena, or that this is the best use of
our resources to support projects in this area. We should use our unique
features to our advantage.
How can we use the unique features of the Wikimedia movement to our
advantage? What are our unique features? Well, obviously, the awesome
community we are. Our technology, amazing as it is to run our websites on
the given budget, is nevertheless not what makes us what we are. Most
processes on the Wikimedia projects are developed in the community space,
not implemented in bits. To invoke Lessig: if code is law, the Wikimedia
projects are really good at creating a space that allows a community to
live in it and have the freedom to create its own ecosystem.
One idea I have been mulling over for years is basically how we can use
this advantage for the task of making content available in many languages.
Wikidata is an obvious attempt at that, but it really only goes so far. The
system I am really aiming at is a different one, and there has been plenty
of related work in this direction: imagine a wiki where you enter or edit
content, sentence by sentence, but the natural language representation is
just a surface syntax for an internal structure. Your editing interface is
a constrained, but natural, language. Now, in order to really make this
fly, both the rules for the parsers (interpreting the input) and for the
serializers (creating the output) would need to be editable by the
community - in addition to the content itself. There are a number of major
challenges involved, but I have by now a fair idea of how to tackle most of
them (and I don't have the time to detail them right now). Wikidata has
some design decisions inside it that are already geared towards enabling
solutions to some of these problems for this kind of wiki. Whatever a
structured Wiktionary will look like, it should also be aligned with the
requirements of the project outlined here. Basically, we take constraint
b), but make it possible for the community to push the constraint further
and further - that's how we could scale on this task.
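To illustrate what I mean by natural language being just a surface syntax
(and this is nothing more than an illustration - the constructor, the rule
format and the tiny lexicon below are made up for this sketch), consider:

    # Toy sketch: the stored content is language-independent; per-language
    # renderer and parser rules - which the community could edit like any
    # other wiki page - turn it into sentences and back.
    CONTENT = ("capital_of", "Q64", "Q183")  # roughly: Berlin, capital of, Germany

    # Serializer rules per language, kept as editable data rather than code.
    RULES = {
        "en": "{0} is the capital of {1}.",
        "de": "{0} ist die Hauptstadt von {1}.",
        "hr": "{0} je glavni grad {1}.",
    }

    # Labels per language (in reality these would come from Wikidata). Note
    # that Croatian needs the genitive form here - exactly the kind of
    # grammatical knowledge the community-edited rules would have to capture.
    LABELS = {
        "Q64":  {"en": "Berlin", "de": "Berlin", "hr": "Berlin"},
        "Q183": {"en": "Germany", "de": "Deutschland", "hr": "Njemačke"},
    }

    def render(statement, lang):
        _constructor, subject, obj = statement  # only one constructor in this toy
        return RULES[lang].format(LABELS[subject][lang], LABELS[obj][lang])

    def parse_en(sentence):
        # A deliberately tiny parser for one constrained English pattern; in
        # the proposed wiki such rules would themselves be community-editable.
        left, right = sentence.rstrip(".").split(" is the capital of ")
        by_label = {entry["en"]: item for item, entry in LABELS.items()}
        return ("capital_of", by_label[left], by_label[right])

    assert parse_en("Berlin is the capital of Germany.") == CONTENT
    for lang in RULES:
        print(lang, "->", render(CONTENT, lang))

The content is entered once and rendered in every language for which rules
and labels exist; the hard part, obviously, is growing such rules from toy
patterns to real sentences, and that is exactly what I would want the
community to be able to do.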
This would be far away from solving the problem of automatic translation of
text, and even further away from understanding text. But given where we are
and the resources we have available, I think it would be a more feasible
path towards achieving the mission of the Wikimedia movement than tackling
the problem of general machine translation.
In summary, I see four calls to action right now (and for all of them, the
first step is to actually think more, write down a project plan, and gather
input on it); they could and should be tackled in parallel if possible:
I ) develop a structured Wiktionary
II ) develop a feature that blends into Wikipedia's search when an article
about a topic does not exist yet but we have data about that topic on
Wikidata
III ) develop a multilingual search, tagging, and structuring environment
for Commons
IV ) develop structured Wiki content using natural language as a surface
syntax, with extensible parsers and serializers
None of these goals would require tens of millions or decades of research
and development. I think we could have an actionable plan developed within
a month or two for all four goals, and my gut feeling is that we could
reach them all by 2015 or 2016, depending on when we actually start
implementing them.
Goal IV carries considerable risk, but there's a fair chance it could work
out. It could also fail utterly, but even if it only partially succeeded ...
Cheers,
Denny
2013/4/24 Erik Moeller <erik at wikimedia.org>
> Wikimedia's mission is to make the sum of all knowledge available to
> every person on the planet. We do this by enabling communities in all
> languages to organize and collect knowledge in our projects, removing
> any barriers that we're able to remove.
>
> In spite of this, there are and will always be large disparities in
> the amount of locally created and curated knowledge available per
> language, as is evident by simple statistical comparison (and most
> beautifully visualized in Erik Zachte's bubble chart [1]).
>
> Google, Microsoft and others have made great strides in developing
> free-as-in-beer translation tools that can be used to translate from
> and to many different languages. Increasingly, it is possible to at
> least make basic sense of content in many different languages using
> these tools. Machine translation can also serve as a starting point
> for human translations.
>
> Although free-as-in-beer for basic usage, integration can be
> expensive. Google Translate charges $20 per 1M characters of text for
> API usage. [2] These tools get better from users using them, but I've
> seen little evidence of sharing of open datasets that would help the
> field get better over time.
>
> Undoubtedly, building the technology and the infrastructure for these
> translation services is a very expensive undertaking, and it's
> understandable that there are multiple commercial reasons that drive
> the major players' ambitions in this space. But if we look at it from
> the perspective of "How will billions of people learn in the coming
> decades", it seems clear that better translation tools should at least
> play some part in reducing knowledge disparities in different
> languages, and that ideally, such tools should be "free-as-in-speech"
> (since they're fundamentally related to speech itself).
>
> If we imagine a world where top notch open source MT is available,
> that would be a world where increasingly, language barriers to
> accessing human knowledge could be reduced. True, translation is no
> substitute for original content creation in a language -- but it could
> at least powerfully support and enable such content creation, and
> thereby help hundreds of millions of people. Beyond Wikimedia, high
> quality open source MT would likely be integrated in many contexts
> where it would do good for humanity and allow people to cross into
> cultural and linguistic spaces they would otherwise not have access
> to.
>
> While Wikimedia is still only a medium-sized organization, it is not
> poor. With more than 1M donors supporting our mission and a cash
> position of $40M, we do now have a greater ability to make strategic
> investments that further our mission, as communicated to our donors.
> That's a serious level of trust and not to be taken lightly, either by
> irresponsibly spending, or by ignoring our ability to do good.
>
> Could open source MT be such a strategic investment? I don't know, but
> I'd like to at least raise the question. I think the alternative will
> be, for the foreseeable future, to accept that this piece of
> technology will be proprietary, and to rely on goodwill for any
> integration that concerns Wikimedia. Not the worst outcome, but also
> not the best one.
>
> Are there open source MT efforts that are close enough to merit
> scrutiny? In order to be able to provide high-quality results, you
> would need not only a motivated, well-intentioned group of people, but
> some of the smartest people in the field working on it. I doubt we
> could more than kickstart an effort, but perhaps financial backing at
> significant scale could at least help a non-profit, open source effort
> to develop enough critical mass to go somewhere.
>
> All best,
> Erik
>
> [1]
> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
> [2] https://developers.google.com/translate/v2/pricing
> --
> Erik Möller
> VP of Engineering and Product Development, Wikimedia Foundation
>
> Wikipedia and our other projects reach more than 500 million people every
> month. The world population is estimated to be >7 billion. Still a long
> way to go. Support us. Join us. Share: https://wikimediafoundation.org/
>
--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a charitable
organization by the Finanzamt für Körperschaften I Berlin, tax number
27/681/51985.