[Wikimedia-l] The case for supporting open source machine translation

Samuel Klein meta.sj at gmail.com
Wed Apr 24 15:01:14 UTC 2013


I really like Erik's original suggestion, and these ideas, Denny.

Since there are many different possible goals, it's worth having a
page just to list them all and compare them - both how they fit with
one another and how they fit with existing active projects elsewhere
on the web.

SJ

On Wed, Apr 24, 2013 at 6:35 AM, Denny Vrandečić
<denny.vrandecic at wikimedia.de> wrote:
> Erik, all,
>
> sorry for the long mail.
>
> Incidentally, I have been thinking in this direction myself for a while,
> and I have come to a number of conclusions:
> 1) the Wikimedia movement cannot, in its current state, tackle the problem
> of machine translation of arbitrary text from and to all of our supported
> languages
> 2) the Wikimedia movement is probably already the single most important
> source of training data. In research I have done with colleagues, Wikimedia
> corpora used as training data easily beat other corpora, and others are
> already using Wikimedia corpora routinely. There is not much we can improve
> here, actually
> 3) Wiktionary could be an even more amazing resource if we finally tackled
> the issue of structuring its content more appropriately. I think Wikidata
> has opened a few avenues for planning in this direction and for providing
> some software, and this has the potential to support any external project
> more than many other things we could tackle (a rough sketch of what such
> structured entries could look like follows right below)
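>
> Purely to illustrate what I mean by 3), here is a rough sketch, in Python,
> of what a single structured entry could look like. The class and field
> names are made up on the spot and are not a proposal for an actual schema:
>
>     from dataclasses import dataclass, field
>
>     @dataclass
>     class Sense:
>         gloss: str        # short definition of this sense
>         translations: dict = field(default_factory=dict)  # lang -> word
>
>     @dataclass
>     class Entry:
>         lemma: str
>         language: str     # language of the lemma itself
>         part_of_speech: str
>         senses: list = field(default_factory=list)
>
>     bridge = Entry(
>         lemma="bridge", language="en", part_of_speech="noun",
>         senses=[Sense(
>             gloss="structure spanning a river or road",
>             translations={"de": "Brücke", "fr": "pont"})])
>
>     # any tool can now answer "how do you say 'bridge' in German?"
>     print(bridge.senses[0].translations["de"])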
>
> Looking at the first statement, there are two ways we could constrain the
> problem to make it feasible:
> a) constrain the number of supported languages. While this would be
> technically the simpler solution, I think there is agreement that it is
> not in our interest at all
> b) constrain the kind of input text we want to support
>
> If we constrain b) a lot, we could just go and develop "pages to display
> for pages that do not exist yet, based on Wikidata" in the smaller
> languages. That's a far cry from machine-translating the articles, but it
> would be low-hanging fruit. And it might help with a desire that is
> evidently strongly expressed through the mass creation of articles by bots
> in a growing number of languages. Even more constraints would still allow
> us to use Wikidata items for tagging and structuring Commons in a
> language-independent way (as Erik suggested earlier).
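>
> To make the idea of such display pages a bit more concrete, here is a
> rough sketch in Python. Only the wbgetentities call against the Wikidata
> API is real; the function names and the plain-text rendering are pure
> illustration, not a description of how the feature would actually be
> built:
>
>     import json
>     import urllib.parse
>     import urllib.request
>
>     WIKIDATA_API = "https://www.wikidata.org/w/api.php"
>
>     def fetch_entity(item_id, language):
>         """Fetch label and description for one item from Wikidata."""
>         params = urllib.parse.urlencode({
>             "action": "wbgetentities", "format": "json",
>             "props": "labels|descriptions",
>             "ids": item_id, "languages": language})
>         with urllib.request.urlopen(WIKIDATA_API + "?" + params) as r:
>             return json.load(r)["entities"][item_id]
>
>     def placeholder_page(item_id, language):
>         """Render a stand-in page for a topic with no local article."""
>         entity = fetch_entity(item_id, language)
>         labels = entity.get("labels", {})
>         descriptions = entity.get("descriptions", {})
>         title = labels.get(language, {}).get("value", item_id)
>         summary = descriptions.get(language, {}).get("value", "")
>         return "%s\n%s\n(generated from Wikidata item %s)" % (
>             title, summary, item_id)
>
>     print(placeholder_page("Q64", "de"))  # Q64 is the item for Berlin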
>
> Current machine translation research aims at massive machine-learning
> systems. These usually require big parallel corpora. We do not have big
> parallel corpora (Wikipedia articles are, in general, not translations of
> each other), especially not for many languages, and there is no reason to
> believe this is going to change. I question whether we want to build an
> infrastructure for continuously gathering such corpora from the Web. I do
> not think we can compete in this arena, or that supporting projects in
> this area is the best use of our resources. We should use our unique
> features to our advantage.
>
> How can we use the unique features of the Wikimedia movement to our
> advantage? What are our unique features? Well, obviously, the awesome
> community we are. Our technology, amazing as it is at running our websites
> on the given budget, is nevertheless not what makes us what we are. Most
> processes on the Wikimedia projects are developed in the community space,
> not implemented in bits. To invoke Lessig: if code is law, the Wikimedia
> projects are really good at creating a space in which a community can
> live and has the freedom to create its own ecosystem.
>
> One idea I have been mulling over for years is how we can use this
> advantage for the task of making content available in many languages.
> Wikidata is an obvious attempt at that, but it really only goes so far.
> The system I am really aiming at is a different one, and there has been
> plenty of related work in this direction: imagine a wiki where you enter
> or edit content, sentence by sentence, but the natural-language
> representation is just a surface syntax for an internal structure. Your
> editing interface is a constrained, but natural, language. Now, in order
> to really make this fly, both the rules for the parser (interpreting the
> input) and for the serializer (creating the output) would need to be
> editable by the community - in addition to the content itself. There are
> a number of major challenges involved, but I have by now a fair idea of
> how to tackle most of them (and I don't have the time to detail them
> right now). Wikidata has some design decisions in it that are already
> geared towards enabling solutions to some of the problems of this kind of
> wiki. Whatever a structured Wiktionary ends up looking like, it should
> also be aligned with the requirements of the project outlined here.
> Basically, we take constraint b), but make it possible for the community
> to push the constraint further and further - that is how we could scale
> on this task.
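>
> To give a very rough feeling for what I mean, here is a toy sketch in
> Python. The rules, labels, and function names below are invented
> stand-ins for what would, in a real system, be community-editable wiki
> content; this is an illustration of the principle, not a design:
>
>     import re
>
>     # language-independent item IDs with per-language labels
>     LABELS = {
>         ("en", "Q64"): "Berlin",   ("de", "Q64"): "Berlin",
>         ("en", "Q183"): "Germany", ("de", "Q183"): "Deutschland",
>     }
>
>     # per-language serializer templates, editable like wiki pages
>     SERIALIZER_RULES = {
>         ("en", "capital-of"):
>             "{subject} is the capital of {object}.",
>         ("de", "capital-of"):
>             "{subject} ist die Hauptstadt von {object}.",
>     }
>
>     # per-language parser rules for the constrained input language
>     PARSER_RULES = {
>         ("en", "capital-of"):
>             r"^(?P<subject>.+) is the capital of (?P<object>.+)\.$",
>     }
>
>     def label(item, language):
>         return LABELS[(language, item)]
>
>     def item_for(text, language):
>         for (lang, item), word in LABELS.items():
>             if lang == language and word == text:
>                 return item
>         raise KeyError(text)
>
>     def serialize(statement, language):
>         """Render a language-independent statement as a sentence."""
>         template = SERIALIZER_RULES[(language, statement["predicate"])]
>         return template.format(
>             subject=label(statement["subject"], language),
>             object=label(statement["object"], language))
>
>     def parse(sentence, language):
>         """Map a constrained-language sentence back to a statement."""
>         for (lang, predicate), pattern in PARSER_RULES.items():
>             if lang != language:
>                 continue
>             m = re.match(pattern, sentence)
>             if not m:
>                 continue
>             return {"predicate": predicate,
>                     "subject": item_for(m.group("subject"), language),
>                     "object": item_for(m.group("object"), language)}
>         raise ValueError("not covered by the constrained grammar yet")
>
>     statement = parse("Berlin is the capital of Germany.", "en")
>     print(serialize(statement, "de"))
>     # prints: Berlin ist die Hauptstadt von Deutschland.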
>
> This would be far away from solving the problem of automatic translation
> of text, and even further away from understanding text. But given where
> we are and the resources we have available, I think it would be a more
> feasible path towards achieving the mission of the Wikimedia movement
> than tackling the problem of general machine translation.
>
> In summary, I see four calls for action right now (for all of them, this
> means first actually thinking more, writing down a project plan, and
> gathering input on it), which could and should be tackled in parallel if
> possible:
> I ) develop a structured Wiktionary
> II ) develop a feature that blends into Wikipedia's search if an article
> about a topic does not exist yet, but we have data on Wikidata about that
> topic
> III ) develop a multilingual search, tagging, and structuring environment
> for Commons (a rough sketch follows after this list)
> IV ) develop structured wiki content using natural language as a surface
> syntax, with extensible parsers and serializers
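>
> For III, to make the idea tangible, here is a rough sketch in Python. The
> tag store and file names are invented, the item IDs are quoted from
> memory, and only the wbsearchentities call is a real API; the point is
> simply that a search for "Brücke", "bridge" or "most" could resolve to
> the same item and therefore to the same pictures:
>
>     import json
>     import urllib.parse
>     import urllib.request
>
>     WIKIDATA_API = "https://www.wikidata.org/w/api.php"
>
>     # invented tag store: Commons file name -> Wikidata item IDs
>     TAGS = {
>         "File:Some_Bridge.jpg": {"Q12280"},   # Q12280: bridge (I think)
>         "File:Berlin_Skyline.jpg": {"Q64"},   # Q64: Berlin
>     }
>
>     def resolve_items(query, language):
>         """Resolve a search term in any language to item IDs."""
>         params = urllib.parse.urlencode({
>             "action": "wbsearchentities", "format": "json",
>             "search": query, "language": language, "type": "item"})
>         with urllib.request.urlopen(WIKIDATA_API + "?" + params) as r:
>             hits = json.load(r).get("search", [])
>         return {hit["id"] for hit in hits}
>
>     def search_files(query, language):
>         """Return tagged files matching the query in any language."""
>         wanted = resolve_items(query, language)
>         return [name for name, ids in TAGS.items() if ids & wanted]
>
>     print(search_files("Brücke", "de"))  # should find File:Some_Bridge.jpg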
>
> None of these goals would require tens of millions of dollars or decades
> of research and development. I think we could have an actionable plan
> developed within a month or two for all four goals, and my gut feeling is
> we could reach them all by 2015 or 2016, depending on when we actually
> start implementing them.
>
> Goal IV carries considerable risk, but there's a fair chance it could
> work out. It could also fail utterly, but if it even partially
> succeeded ...
>
> Cheers,
> Denny
>
>
>
> 2013/4/24 Erik Moeller <erik at wikimedia.org>
>
>> Wikimedia's mission is to make the sum of all knowledge available to
>> every person on the planet. We do this by enabling communities in all
>> languages to organize and collect knowledge in our projects, removing
>> any barriers that we're able to remove.
>>
>> In spite of this, there are and will always be large disparities in
>> the amount of locally created and curated knowledge available per
>> language, as is evident from simple statistical comparison (and most
>> beautifully visualized in Erik Zachte's bubble chart [1]).
>>
>> Google, Microsoft and others have made great strides in developing
>> free-as-in-beer translation tools that can be used to translate from
>> and to many different languages. Increasingly, it is possible to at
>> least make basic sense of content in many different languages using
>> these tools. Machine translation can also serve as a starting point
>> for human translations.
>>
>> Although these tools are free-as-in-beer for basic usage, integrating
>> them can be expensive. Google Translate charges $20 per 1M characters of
>> text for API usage. [2] These tools get better as people use them, but
>> I've seen little evidence of shared open datasets that would help the
>> field improve over time.
>>
>> Undoubtedly, building the technology and the infrastructure for these
>> translation services is a very expensive undertaking, and it's
>> understandable that there are multiple commercial reasons that drive
>> the major players' ambitions in this space. But if we look at it from
>> the perspective of "How will billions of people learn in the coming
>> decades?", it seems clear that better translation tools should at least
>> play some part in reducing knowledge disparities across different
>> languages, and that ideally, such tools should be "free-as-in-speech"
>> (since they're fundamentally related to speech itself).
>>
>> If we imagine a world where top-notch open source MT is available,
>> that would be a world where, increasingly, language barriers to
>> accessing human knowledge could be reduced. True, translation is no
>> substitute for original content creation in a language -- but it could
>> at least powerfully support and enable such content creation, and
>> thereby help hundreds of millions of people. Beyond Wikimedia,
>> high-quality open source MT would likely be integrated into many contexts
>> where it would do good for humanity and allow people to cross into
>> cultural and linguistic spaces they would otherwise not have access
>> to.
>>
>> While Wikimedia is still only a medium-sized organization, it is not
>> poor. With more than 1M donors supporting our mission and a cash
>> position of $40M, we now have a greater ability to make strategic
>> investments that further our mission, as communicated to our donors.
>> That's a serious level of trust, and not to be taken lightly, either by
>> spending irresponsibly or by ignoring our ability to do good.
>>
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>>
>> Are there open source MT efforts that are close enough to merit
>> scrutiny? In order to provide high-quality results, you would need not
>> only a motivated, well-intentioned group of people, but some of the
>> smartest people in the field working on it. I doubt we could do more
>> than kickstart an effort, but perhaps financial backing at significant
>> scale could at least help a non-profit, open source effort to develop
>> enough critical mass to go somewhere.
>>
>> All best,
>> Erik
>>
>> [1]
>> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
>> [2] https://developers.google.com/translate/v2/pricing
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>>
>> Wikipedia and our other projects reach more than 500 million people every
>> month. The world population is estimated to be >7 billion. Still a long
>> way to go. Support us. Join us. Share: https://wikimediafoundation.org/
>>
>> _______________________________________________
>> Wikimedia-l mailing list
>> Wikimedia-l at lists.wikimedia.org
>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>>
>
>
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 B. Recognized as a charitable
> organization by the Finanzamt für Körperschaften I Berlin, tax number
> 27/681/51985.
> _______________________________________________
> Wikimedia-l mailing list
> Wikimedia-l at lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l



-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266


