[Wikimedia-l] The case for supporting open source machine translation

Ryu Cheol rcheol at gmail.com
Thu Apr 25 05:53:55 UTC 2013


Thank you, Denny, for helping me understand what you are working toward with Wikidata.

I am a Korean Wikipedia contributor. I definitely agree with Erik that we have to tackle the problem of information disparity between languages, but I feel there are better choices than investing in open source machine translation itself. Wikipedia content can be reused for commercial purposes, and we know that such reuse helps Wikipedia spread; I think the same logic applies here. If proprietary machine translation can help remove language barriers, that would be great too. I hope we can support any machine translation team, not only open source ones, though I believe that in the end open source machine translation will prevail.

Wikidata-based approaches are great! But I hope Wikipedia could do more, including providing well-aligned parallel corpora. I looked into Google's translation workbench, which tried to provide a customized translation tool for Wikipedia, and translated a few English articles into Korean with it myself. The tool has a translation memory and a customizable dictionary, but it lacked many features needed for practical translation and the interface was clumsy.
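
A translation memory is essentially a store of previously translated segments that is queried by fuzzy matching. As a minimal sketch in Python (the segment pairs and the similarity threshold are invented for illustration, not taken from Google's tool):

    # Minimal translation-memory lookup: fuzzy-match a new source sentence
    # against previously translated segment pairs. Illustrative data only.
    from difflib import SequenceMatcher

    # (source sentence, stored human translation) -- hypothetical entries
    memory = [
        ("Seoul is the capital of South Korea.", "서울은 대한민국의 수도이다."),
        ("The museum opened in 1945.", "그 박물관은 1945년에 개관하였다."),
    ]

    def suggest(sentence, threshold=0.75):
        """Return the stored translation of the most similar past segment, if any."""
        best = max(memory, key=lambda pair: SequenceMatcher(None, sentence, pair[0]).ratio())
        if SequenceMatcher(None, sentence, best[0]).ratio() >= threshold:
            return best[1]
        return None

    print(suggest("Seoul is the capital city of South Korea."))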

I believe translatewiki.net could do better than Google. I hope translatewiki.net could provide a translation workbench not just for software messages but for Wikipedia articles as well. Through such a workbench we could collect much more valuable data than the parallel corpus alone: we could track how a human translator actually works. With more data on editing activity, we could improve the translation workflow and find new clues for automatic translation.
A translator would start from a stub and improve the draft, and peer reviewers would look over the draft and make it better. In other words, the logs of collaborative translation over a parallel corpus would give us much more to learn from.
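
A rough sketch of what such sentence-aligned records with edit history could look like, in Python; the field names and example data are hypothetical, not an existing translatewiki.net schema:

    # Sketch: a sentence-aligned translation segment that also keeps its
    # revision history, so we can study how translators and reviewers
    # refine a draft over time. All names here are invented.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Revision:
        editor: str       # username of the translator or reviewer
        text: str         # target-language sentence after this edit
        timestamp: str    # ISO 8601 time of the edit

    @dataclass
    class AlignedSegment:
        source_lang: str
        target_lang: str
        source_text: str                           # sentence from the source article
        revisions: List[Revision] = field(default_factory=list)

        @property
        def final_text(self) -> str:
            return self.revisions[-1].text if self.revisions else ""

    # A draft improved first by a translator, then by a peer reviewer.
    seg = AlignedSegment("en", "ko", "Seoul is the capital of South Korea.")
    seg.revisions.append(Revision("translator1", "서울은 한국의 수도이다.", "2013-04-25T05:00:00Z"))
    seg.revisions.append(Revision("reviewer1", "서울은 대한민국의 수도이다.", "2013-04-25T06:00:00Z"))
    print(seg.final_text)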

I think the Wikipedia community could start an initiative to provide this kind of raw material for training machine translation. It would be a common asset for all machine translation systems.

Best regards

RYU Cheol
Chair of Wikimedia Korea Preparation Committee


On Apr 24, 2013, at 7:35 PM, Denny Vrandečić <denny.vrandecic at wikimedia.de> wrote:

> Erik, all,
> 
> sorry for the long mail.
> 
> Incidentally, I have been thinking in this direction myself for a while,
> and I have come to a number of conclusions:
> 1) the Wikimedia movement can not, in its current state, tackle the problem
> of machine translation of arbitrary text from and to all of our supported
> languages
> 2) the Wikimedia movement is probably the single most important source of
> training data already. In research that I have done with colleagues,
> Wikimedia corpora used as training data easily beat other corpora, and
> others are already using Wikimedia corpora routinely. There is not much we
> can improve here, actually
> 3) Wiktionary could be an even more amazing resource if we finally tackled
> the issue of structuring its content more appropriately. I think Wikidata
> has opened a few avenues for planning in this direction and for providing
> some software, and this would have the potential to provide more support
> for external projects than many other things we could tackle
> 
> Looking at the first statement, there are two ways we could constrain it to
> make it possibly feasible:
> a) constrain the number of supported languages. Whereas this would be
> technically the simpler solution, I think there is agreement that this is
> not in our interest at all
> b) constrain the kind of input text we want to support
> 
> If we constrain b) a lot, we could just go and develop "pages to display
> for pages that do not exist yet based on Wikidata" in the smaller
> languages. That's a far cry from machine translating the articles, but it
> would be low-hanging fruit. And it might address a desire that is evidently
> expressed strongly in the mass creation of articles through bots in a
> growing number of languages. Even more constraints would still allow
> us to use Wikidata items for tagging and structuring Commons in a
> language-independent way (this was suggested by Erik earlier).
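> 
> As a very rough sketch of what such "pages to display for pages that do not
> exist yet" could draw on, assuming we simply reuse the existing
> wbgetentities API (the article title and target language below are just
> examples):
> 
>     # Sketch: if the Korean Wikipedia lacks an article, fetch the Wikidata
>     # item linked to the English article and show its Korean label and
>     # description as a placeholder page.
>     import requests
> 
>     def wikidata_stub(en_title, lang="ko"):
>         resp = requests.get(
>             "https://www.wikidata.org/w/api.php",
>             params={
>                 "action": "wbgetentities",
>                 "sites": "enwiki",
>                 "titles": en_title,
>                 "props": "labels|descriptions",
>                 "languages": lang,
>                 "format": "json",
>             },
>         )
>         entity = next(iter(resp.json()["entities"].values()))
>         label = entity.get("labels", {}).get(lang, {}).get("value")
>         description = entity.get("descriptions", {}).get(lang, {}).get("value")
>         return label, description
> 
>     print(wikidata_stub("Douglas Adams"))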
> 
> Current machine translation research aims at using massive machine learning
> supported systems. They usually require big parallel corpora. We do not
> have big parallel corpora (Wikipedia articles are not translations of each
> other, in general), especially not for many languages, and there is no
> reason to believe this is going to change. I would question if we want to
> build an infrastructure for gathering those corpora from the Web
> continuously. I do not think we can compete in this arena, or that it is
> the best use of our resources to support projects in this area. We should use
> our unique features to our advantage.
> 
> How can we use the unique features of the Wikimedia movement to our
> advantage? What are our unique features? Well, obviously, the awesome
> community that we are. Our technology, as amazing as it is in running our
> websites on the given budget, is nevertheless not what makes us what we are.
> Most processes on the Wikimedia projects are developed in the community
> space, not implemented in bits. To invoke Lessig, if code is law, the
> Wikimedia projects are really good at creating a space in which a community
> can live and have the freedom to create its own ecosystem.
> 
> One idea I have been mulling over for years is basically how we can use
> this advantage for the task of making content available in many
> languages. Wikidata is an obvious attempt at that, but it really goes only
> so far. The system I am really aiming at is a different one, and there has
> been plenty of related work in this direction: imagine a wiki where you
> enter or edit content, sentence by sentence, but the natural language
> representation is just a surface syntax for an internal structure. Your
> editing interface is a constrained, but natural language. Now, in order to
> really make this fly, both the rules for the parsers (interpreting the
> input) and the serializer (creating the output) would need to be editable
> by the community - in addition to the content itself. There are a number of
> major challenges involved, but I have by now a fair idea of how to tackle
> most of them (and I don't have the time to detail them right now). Wikidata
> has some design decisions in it that are already geared towards enabling
> solutions to some of the problems of this kind of wiki. Whatever a
> structured Wiktionary ends up looking like, it should also be aligned with
> the requirements of the project outlined here. Basically, we take
> constraint b, but make it possible to push the constraint further and
> further through the community - that's how we could scale on this task.
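> 
> As a toy illustration (not Wikidata code, and far simpler than what would
> actually be needed): content is stored as language-independent structures,
> per-language serializer rules produce the surface text, and a matching
> parser turns constrained input back into the structure.
> 
>     # Toy sketch: language-independent facts, community-editable
>     # serializer templates per language, and a tiny parser for the
>     # constrained input language. Everything here is invented.
>     import re
> 
>     content = [("Berlin", "capital_of", "Germany")]   # internal structure
> 
>     # (in a real system entity labels like "Germany" would themselves be
>     # rendered per language, e.g. from Wikidata labels)
>     serializers = {
>         "en": {"capital_of": "{0} is the capital of {1}."},
>         "de": {"capital_of": "{0} ist die Hauptstadt von {1}."},
>     }
> 
>     parsers = {
>         "en": {"capital_of": re.compile(r"^(\w+) is the capital of (\w+)\.$")},
>     }
> 
>     def render(fact, lang):
>         subject, relation, obj = fact
>         return serializers[lang][relation].format(subject, obj)
> 
>     def parse(sentence, lang):
>         for relation, pattern in parsers[lang].items():
>             match = pattern.match(sentence)
>             if match:
>                 return (match.group(1), relation, match.group(2))
>         return None
> 
>     print(render(content[0], "de"))                        # German surface text
>     print(parse("Paris is the capital of France.", "en"))  # back to structure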
> 
> This would be far away from solving the problem of automatic translation of
> text, and even further away from understanding text. But given where we are
> and the resources we have available, I think it would be a more feasible
> path towards achieving the mission of the Wikimedia movement than tackling
> the problem of general machine learning.
> 
> In summary, I see four calls for action right now (for all of them, this
> means first actually thinking more, writing down a project plan, and
> gathering input on it), which could and should be tackled in parallel if
> possible:
> I ) develop a structured Wiktionary
> II ) develop a feature that blends into Wikipedia's search if an article
> about a topic does not exist yet, but we have data on Wikidata about that
> topic
> III ) develop a multilingual search, tagging, and structuring environment
> for Commons
> IV ) develop structured Wiki content using natural language as a surface
> syntax, with extensible parsers and serializers
> 
> None of these goals would require tens of millions or decades of research
> and development. I think we could have an actionable plan developed within
> a month or two for all four goals, and my gut feeling is we could reach
> them all by 2015 or 2016, depending on when we actually start implementing
> them.
> 
> Goal IV carries a considerable risk, but there's a fair chance it could
> work out. It could also fail utterly, but if it even partially
> succeeded ...
> 
> Cheers,
> Denny
> 
> 
> 
> 2013/4/24 Erik Moeller <erik at wikimedia.org>
> 
>> Wikimedia's mission is to make the sum of all knowledge available to
>> every person on the planet. We do this by enabling communities in all
>> languages to organize and collect knowledge in our projects, removing
>> any barriers that we're able to remove.
>> 
>> In spite of this, there are and will always be large disparities in
>> the amount of locally created and curated knowledge available per
>> language, as is evident by simple statistical comparison (and most
>> beautifully visualized in Erik Zachte's bubble chart [1]).
>> 
>> Google, Microsoft and others have made great strides in developing
>> free-as-in-beer translation tools that can be used to translate from
>> and to many different languages. Increasingly, it is possible to at
>> least make basic sense of content in many different languages using
>> these tools. Machine translation can also serve as a starting point
>> for human translations.
>> 
>> Although free-as-in-beer for basic usage, integration can be
>> expensive. Google Translate charges $20 per 1M characters of text for
>> API usage. [2] These tools get better from users using them, but I've
>> seen little evidence of sharing of open datasets that would help the
>> field get better over time.
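>> 
>> (For scale: at that rate, a single article of 10,000 characters -- an
>> assumed, roughly typical length -- would cost about $0.20 to translate
>> through the API, and a million such articles about $200,000 per target
>> language.)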
>> 
>> Undoubtedly, building the technology and the infrastructure for these
>> translation services is a very expensive undertaking, and it's
>> understandable that there are multiple commercial reasons that drive
>> the major players' ambitions in this space. But if we look at it from
>> the perspective of "How will billions of people learn in the coming
>> decades", it seems clear that better translation tools should at least
>> play some part in reducing knowledge disparities in different
>> languages, and that ideally, such tools should be "free-as-in-speech"
>> (since they're fundamentally related to speech itself).
>> 
>> If we imagine a world where top notch open source MT is available,
>> that would be a world where increasingly, language barriers to
>> accessing human knowledge could be reduced. True, translation is no
>> substitute for original content creation in a language -- but it could
>> at least powerfully support and enable such content creation, and
>> thereby help hundreds of millions of people. Beyond Wikimedia, high
>> quality open source MT would likely be integrated in many contexts
>> where it would do good for humanity and allow people to cross into
>> cultural and linguistic spaces they would otherwise not have access
>> to.
>> 
>> While Wikimedia is still only a medium-sized organization, it is not
>> poor. With more than 1M donors supporting our mission and a cash
>> position of $40M, we do now have a greater ability to make strategic
>> investments that further our mission, as communicated to our donors.
>> That's a serious level of trust, not to be taken lightly, either by
>> spending irresponsibly or by ignoring our ability to do good.
>> 
>> Could open source MT be such a strategic investment? I don't know, but
>> I'd like to at least raise the question. I think the alternative will
>> be, for the foreseeable future, to accept that this piece of
>> technology will be proprietary, and to rely on goodwill for any
>> integration that concerns Wikimedia. Not the worst outcome, but also
>> not the best one.
>> 
>> Are there open source MT efforts that are close enough to merit
>> scrutiny? In order to be able to provide high quality results, you
>> would need not only a motivated, well-intentioned group of people, but
>> some of the smartest people in the field working on it.  I doubt we
>> could more than kickstart an effort, but perhaps financial backing at
>> significant scale could at least help a non-profit, open source effort
>> to develop enough critical mass to go somewhere.
>> 
>> All best,
>> Erik
>> 
>> [1]
>> http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrowthWp.html
>> [2] https://developers.google.com/translate/v2/pricing
>> --
>> Erik Möller
>> VP of Engineering and Product Development, Wikimedia Foundation
>> 
>> Wikipedia and our other projects reach more than 500 million people every
>> month. The world population is estimated to be >7 billion. Still a long
>> way to go. Support us. Join us. Share: https://wikimediafoundation.org/
>> 
> 
> 
> 
> -- 
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
> 
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Registered in the register of associations of the Amtsgericht
> Berlin-Charlottenburg under number 23855 B. Recognized as charitable by the
> Finanzamt für Körperschaften I Berlin, tax number 27/681/51985.



