Wikimedia's mission is to make the sum of all knowledge available to every person on the planet. We do this by enabling communities in all languages to organize and collect knowledge in our projects, removing any barriers that we're able to remove.
In spite of this, there are and will always be large disparities in the amount of locally created and curated knowledge available per language, as is evident from simple statistical comparison (and most beautifully visualized in Erik Zachte's bubble chart [1]).
Google, Microsoft and others have made great strides in developing free-as-in-beer translation tools that can be used to translate from and to many different languages. Increasingly, it is possible to at least make basic sense of content in many different languages using these tools. Machine translation can also serve as a starting point for human translations.
Although these tools are free-as-in-beer for basic usage, integration can be expensive. Google Translate charges $20 per 1M characters of text for API usage. [2] These tools get better as people use them, but I've seen little evidence of open datasets being shared that would help the field improve over time.
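As a rough sense of scale, assuming a purely hypothetical corpus of one billion characters (the corpus size is an assumption for illustration, not a measured figure), the quoted rate works out like this:

# Back-of-the-envelope cost at the quoted Google Translate API rate.
# The corpus size is a hypothetical, illustrative figure.
RATE_USD_PER_MILLION_CHARS = 20        # rate cited from the pricing page [2]
corpus_chars = 1_000_000_000           # assumed: one billion characters of text

cost_usd = corpus_chars / 1_000_000 * RATE_USD_PER_MILLION_CHARS
print(f"Translating {corpus_chars:,} characters once: ${cost_usd:,.0f} per target language")
# -> Translating 1,000,000,000 characters once: $20,000 per target language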
Undoubtedly, building the technology and the infrastructure for these translation services is a very expensive undertaking, and it's understandable that there are multiple commercial reasons that drive the major players' ambitions in this space. But if we look at it from the perspective of "How will billions of people learn in the coming decades", it seems clear that better translation tools should at least play some part in reducing knowledge disparities in different languages, and that ideally, such tools should be "free-as-in-speech" (since they're fundamentally related to speech itself).
If we imagine a world where top-notch open source MT is available, that would be a world where language barriers to accessing human knowledge could increasingly be reduced. True, translation is no substitute for original content creation in a language -- but it could at least powerfully support and enable such content creation, and thereby help hundreds of millions of people. Beyond Wikimedia, high-quality open source MT would likely be integrated in many contexts where it would do good for humanity and allow people to cross into cultural and linguistic spaces they would otherwise not have access to.
While Wikimedia is still only a medium-sized organization, it is not poor. With more than 1M donors supporting our mission and a cash position of $40M, we do now have a greater ability to make strategic investments that further our mission, as communicated to our donors. That's a serious level of trust and not to be taken lightly, either by irresponsibly spending, or by ignoring our ability to do good.
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to provide high-quality results, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could do more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
All best, Erik
[1] http://stats.wikimedia.org/wikimedia/animations/growth/AnimationProjectsGrow...
[2] https://developers.google.com/translate/v2/pricing
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Wikipedia and our other projects reach more than 500 million people every month. The world population is estimated to be >7 billion. Still a long way to go. Support us. Join us. Share: https://wikimediafoundation.org/
Oh yes, this would really be great. Just think about the money the Foundation currently spends on translation, plus the many, many hours of volunteers' work invested in translation. Free and open translation software is long overdue indeed. Great idea, Erik.
Greetings Ting
I agree. This is a timely observation about a major problem which directly affects the Foundation's core goals.
I am unsure how far an effort can go today given the state of the art and science, but I think this is entirely appropriate to think about and investigate, and perhaps either fund or bring attention to, or both.
George William Herbert
A few links:
* 2010 discussion: https://strategy.wikimedia.org/wiki/Proposal:Free_Translation_Memory as one of the https://strategy.wikimedia.org/wiki/List_of_things_that_need_to_be_free (follow links, including)
* http://www.apertium.org : was used by translatewiki.net but isn't any longer https://translatewiki.net/wiki/Technology
* Translate also has a translation memory (of course the current use case is more limited)
** Example exposed to the world: http://translatewiki.net/w/api.php?action=ttmserver&sourcelanguage=en&targetlanguage=fi&text=january&format=jsonfm (a minimal query sketch follows below)
** Docs: https://www.mediawiki.org/wiki/Help:Extension:Translate/Translation_memories#TTMServer_API
** All Wikimedia projects share one: http://laxstrom.name/blag/2012/09/07/translation-memory-all-wikimedia-wikis/
** We could join forces if more FLOSS projects used Translate: https://translatewiki.net/wiki/Translate_Roll
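As a minimal sketch of querying the TTMServer example above from a script (only the parameters visible in that example URL are used; the exact shape of the JSON response is described in the TTMServer API docs linked above):

import requests

# Query translatewiki.net's TTMServer translation memory, using the same
# parameters as the example URL above (json instead of jsonfm).
params = {
    "action": "ttmserver",
    "sourcelanguage": "en",
    "targetlanguage": "fi",
    "text": "january",
    "format": "json",
}
response = requests.get("https://translatewiki.net/w/api.php", params=params, timeout=30)
response.raise_for_status()
print(response.json())  # prints whatever suggestions the memory returns (see docs above)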
Nemo
Hi all,
On Wed, 24 Apr 2013 08:39:55 +0200 Ting Chen wing.philopp@gmx.de wrote:
Oh yes, this would really be great. Just think about the money the Foundation currently spends on translation, plus the many, many hours of volunteers' work invested in translation. Free and open translation software is long overdue indeed. Great idea, Erik.
Unfortunately, I don't think we can expect, with the current state of the art, that machine translation would do as good a job as a human translator, so don't get your hopes up for that. For example, if we translate http://shlomif.livejournal.com/63847.html to English with Google Translate, we get: http://translate.google.com/translate?sl=iw&tl=en&js=n&prev=_t&a...
<<<<< Yotam and "hifh own and the Geek "
I have been offered several times to participate B"hifh and the Geek "and I refused. Those who have forgotten, this is what is said in the Bible parable of Jotham :
And they told Jotham, he went and stood on a mountain top - Gerizim, and lifted up his voice and called; And said to them - they heard me Shechem, and God will hear you:
The trees went forth anointed king over them. And they said olive Malka us! Olive said unto them: I stopped the - fertilizers, which - I will honor God and man - And go to the - the trees! And the trees said to the fig: Go - the Kings of us! The fig tree said unto them: I stopped the - sweetness, and - good yield - And go to the - the trees! And the trees said to the vine: Go - the Kings of us! Vine said unto them: I stopped the - Tirosh, auspicious God and man - And go to the - the trees! And tell all - the trees to the - bramble: You're the king - on us! And bramble said to the - trees: If in truth ye anoint me king over you - come and take refuge in my shade; If - no - let fire come out - the bramble, and devour the - cedars of Lebanon!
Sounds incredibly awkward and the main text was taken from http://www.heraldmag.org/literature/doc_12.htm .
So it hardly does a good job, and we cannot expect it to replace human translations.
Regards,
Shlomi Fish
Erik Moeller wrote:
[...]
Wikipedia and our other projects reach more than 500 million people every month. The world population is estimated to be >7 billion. Still a long way to go. Support us. Join us. Share: https://wikimediafoundation.org/
Putting aside the worrying focus on questionable metrics, the first part of your new e-mail footer "Wikipedia and our other projects" seems to hint at the underlying issue here: Wikimedia already operates a number of projects (about a dozen), but truly supports only one (Wikipedia). Though the Wikimedia community seems eager to add new projects (Wikidata, Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet another project when the current projects are largely neglected (Wikinews, Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).
There's a general trend currently within the Wikimedia Foundation to "narrow focus," which includes shelling out third-party MediaWiki release support to an outside contractor or group, because there are apparently not enough resources within the Wikimedia Foundation's 160-plus staff to support the Wikimedia software platform for anyone other than Wikimedia.
In light of this, it seems even more unreasonable and against good sense to pursue a new machine translation endeavor, virtuous as it may be. If an outside organization wants Wikimedia's help and support and their values align with ours, it's certainly something to explore. Otherwise, surely we have enough projects in need of support already.
MZMcBride
On Wed, Apr 24, 2013 at 12:06 AM, MZMcBride z@mzmcbride.com wrote:
Though the Wikimedia community seems eager to add new projects (Wikidata, Wikivoyage), I wonder how it can be sensible or reasonable to focus on yet another project when the current projects are largely neglected (Wikinews, Wikisource, Wikiversity, Wikibooks, Wikiquote, Wiktionary, etc.).
I've stated before why I disagree with this characterization, and I reject this framing. Functionality like the Visual Editor, the mobile site improvements, Lua, and other core engineering initiatives aren't limited in their impact to Wikipedia. The recent efforts on mobile uploading are actually focused on Commons. Deploying new software every two weeks and continually making key usability improvements is not what neglect looks like.
What WMF rarely does is directly focus effort on functionality that primarily serves narrower use cases, which I think is appropriate at this point in the history of our endeavor. My view is that such narrower, more vertically focused efforts should be enabled and supported by creating structures like Labs, where volunteers can meaningfully prototype specialized functionality and work towards deployment on the cluster.
Moreover, the lens of project/domain name is a very arbitrary one to define vertically focused efforts. There are specialized efforts within Wikipedia that have more scale today than some of our sister projects do, such as individual WikiProjects. There are efforts like the partnerships with cultural institutions which have led to hundreds of thousands of images being made available under a free license. Yet I don't see you complaining about lack of support for GLAM tooling, or WikiProject support (both of which are needed). Why should English Wikinews with 15 active editors demand more collective attention than any other specialized efforts?
Historically, we've drawn that project/domain name dividing line because starting a new wiki was the best way to put a flag in the ground and say "We will solve problem X". And we didn't know which efforts would immediately succeed and which ones wouldn't. But in the year 2013, you could just as well argue that instead of slapping the Wikispecies logo on the frontpage of Wikipedia, we should make more prominent mention of "How to contribute video on Wikipedia" or "Work with your local museum" or "Become a campus ambassador" or any other specialized effort which has shown promise but could use that extra visibility. The idea that just because user X proposed project Y sometime back in the early years of Wikimedia, effort Y must forever be part of a first order prioritization lens, is not rationally defensible.
So, even when our goal isn't simply to make general site improvements that benefit everyone but to support specialized new forms of content or collaboration, I wouldn't use the project/domain name division as a tool for assessing impact, but rather frame it in terms of "What problem is being solved here? Who is going to be reached? How many people will be impacted?" And sometimes that does translate well to the lens of a single domain-name-level project, and sometimes it doesn't.
There's a general trend currently within the Wikimedia Foundation to "narrow focus," which includes shelling out third-party MediaWiki release support to an outside contractor or group, because there are apparently not enough resources within the Wikimedia Foundation's 160-plus staff to support the Wikimedia software platform for anyone other than Wikimedia.
It's not a question of whether we have enough resources to support it, but of how best to put a financial boundary around third-party engagement, while also actually enabling third parties to play an important role in the process (including potentially chipping in financial support).
In light of this, it seems even more unreasonable and against good sense to pursue a new machine translation endeavor, virtuous as it may be.
To be clear: I was not proposing that WMF should undertake such an effort directly. But I do think that if there are ways to support an effort that has a reasonable probability of success, with a reasonable structure of accountability around such an engagement, it's worth assessing. And again, that position is entirely consistent with my view that WMF should primarily invest in technologies with broad horizontal impact (which open source MT could have) rather than narrower, vertical impact.
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Erik Moeller, 24/04/2013 10:06:
[...] Moreover, the lens of project/domain name is a very arbitrary one to define vertically focused efforts.
Good and interesting reasoning here. Indeed something to keep in mind, though it adds problems of its own.
There are specialized efforts within Wikipedia that have more scale today than some of our sister projects do, such as individual WikiProjects. There are efforts like the partnerships with cultural institutions which have led to hundreds of thousands of images being made available under a free license. Yet I don't see you complaining about lack of support for GLAM tooling, or WikiProject support (both of which are needed).
You're perhaps right about MZ, but GLAM tooling is certainly something often asked for; however, it arguably falls under Commons development. I've no idea what WikiProject support you have in mind, and arguably WikiProjects are too often dangerous factions that should be disbanded rather than encouraged, but we may agree in principle.
Why should English Wikinews with 15 active editors demand more collective attention than any other specialized efforts?
Historically, we've drawn that project/domain name dividing line because starting a new wiki was the best way to put a flag in the ground and say "We will solve problem X". And we didn't know which efforts would immediately succeed and which ones wouldn't. But in the year 2013, you could just as well argue that instead of slapping the Wikispecies logo on the frontpage of Wikipedia, we should make more prominent mention of "How to contribute video on Wikipedia" or "Work with your local museum" or "Become a campus ambassador" or any other specialized effort which has shown promise but could use that extra visibility.
Again, "how to contribute video" is just Commons promotion, work with museums is usually either Commons or Wikipedia (sometimes Wikisource), campus ambassadors are a program to improve some articles on some Wikipedias. What I mean to say is those are means rather than goals; you're not disagreeing with MZ that we shouldn't expand our goals further.
[...]
To be clear: I was not proposing that WMF should undertake such an effort directly. But I do think that if there are ways to support an effort that has a reasonable probability of success, with a reasonable structure of accountability around such an engagement, it's worth assessing. And again, that position is entirely consistent with my view that WMF should primarily invest in technologies with broad horizontal impact (which open source MT could have) rather than narrower, vertical impact.
In other words, we wouldn't be adding another goal alongside those of creating an encyclopedia, a media repository, a dictionary, a dictionary of quotations, etc., but only a tool, to the extent needed by one or more of them? Currently the only projects using machine translation or translation memory are our backstage wikis, the MediaWiki interface translation, and some highly controversial article creation drives on a handful of small wikis (did they continue in the last couple of years?). Many ways exist to expand the scope of such a tool and the corpus we could provide to it, but the rationale of your proposal is currently a bit lacking and needs some work, that's all.
Nemo
I've stated before why I disagree with this characterization, and I reject this framing. Functionality like the Visual Editor, the mobile site improvements, Lua, and other core engineering initiatives aren't limited in their impact to Wikipedia. The recent efforts on mobile uploading are actually focused on Commons. Deploying new software every two weeks and continually making key usability improvements is not what neglect looks like.
Thank you, Erik, for your response. I don't agree with all of your points, but it's refreshing to see that a lot of thought seems to have gone into this. Often we (the 15 active users of the sister projects) just feel that nobody cares about the sister projects, and attention and thought and answers are sometimes enough.
Anyway, I would just add that one of the major problems, I think, is that when we think of "human knowledge" as in "Imagine a world in which every single person on the planet is given free access to the sum of all human knowledge", we probably just think of "human knowledge in the form of neutral encyclopedic articles", which, in fact, is not the case.
I feel that we could do a lot to boost the idea of a "family of projects", of an integrated, global, comprehensive approach to knowledge. Right now, the fact is that Wikipedia both attracts users to and cannibalizes users from the sister projects, which are practically invisible if you don't know they exist.
Could we promote our sister projects better, making them more visible? For this purpose, user Micru and I just created an RfC on interproject links https://meta.wikimedia.org/wiki/Requests_for_comment/Interproject_links_inte... (I invite you all to propose other solutions), but the underlying question is whether we, as the Wikimedia community, are aware of the "theoretical" shift this implies.
Aubrey
On Wed, Apr 24, 2013 at 11:59 AM, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
There is a compelling need to assess the availability of training corpora of significant breadth and depth for these languages. Most open-source implementations of MT end up hitting this hurdle because content at scale is not easily available. It would be appropriate to decide whether WMF/Wikipedia is well placed to turn on a firehose-like API that would enable MT implementations to use statistical and other methods on the existing content itself.
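As a small illustration of what the existing public APIs already allow for small-scale experiments (this is not a firehose; only the standard MediaWiki query/langlinks/extracts modules are used, and the example titles are arbitrary), one could pull an article's plain text together with its interlanguage counterpart as raw material for a comparable, not parallel, corpus:

import requests

API = "https://{lang}.wikipedia.org/w/api.php"

def plain_extract(lang, title):
    """Plain-text extract of one article (TextExtracts module), or None if missing."""
    r = requests.get(API.format(lang=lang), params={
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "redirects": 1, "format": "json",
    }, timeout=30)
    r.raise_for_status()
    page = next(iter(r.json()["query"]["pages"].values()))
    return page.get("extract")

def interlanguage_title(src_lang, title, dst_lang):
    """Follow the interlanguage link from src_lang:title to dst_lang, if any."""
    r = requests.get(API.format(lang=src_lang), params={
        "action": "query", "prop": "langlinks", "titles": title,
        "lllang": dst_lang, "redirects": 1, "format": "json",
    }, timeout=30)
    r.raise_for_status()
    page = next(iter(r.json()["query"]["pages"].values()))
    links = page.get("langlinks", [])
    return links[0]["*"] if links else None

# Example: an English article plus its Finnish counterpart (titles are arbitrary examples).
en_text = plain_extract("en", "Machine translation")
fi_title = interlanguage_title("en", "Machine translation", "fi")
fi_text = plain_extract("fi", fi_title) if fi_title else None
print(len(en_text or ""), fi_title, len(fi_text or ""))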
-- sankarshan mukhopadhyay https://twitter.com/#!/sankarshan
On Wed, Apr 24, 2013 at 8:29 AM, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny?
http://www.statmt.org/moses/ is alive and kicking. Someone with a background in computational linguistics should have a close look at it.
I would like to mention however that there are a couple of cases in which commercial companies could be convinced to open source some of their software, for example Mozilla. Google has open sourced Tesseract for OCR. Google might see the value of its translation efforts not just in the software itself but also in the actual integration into some of its products (Gmail, Goggles, Glass), so that open sourcing it would not hurt its financial interests. It appears to me that the cost of simply asking a company like Google or Microsoft whether they are willing to negotiate is small compared to the potential gain for everyone.
In any case, I would love to see WMF engage in the topic of machine translation.
Mathias
On Wed, Apr 24, 2013 at 10:49 AM, Mathias Schindler mathias.schindler@gmail.com wrote:
http://www.statmt.org/moses/ is alive and kicking. Someone with a background in computational linguistics should have a close look at it.
I would like to mention however that there are a couple of cases in
...
In any case, I would love to see WMF engage in the topic of machine translation.
thanks a lot erik and mathias for this constructive input! i'd love to see that too, and, from a volunteer standpoint, not only does financing further development seem appealing, but training (e.g. http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining) also seems to be something bite-sized which might fit the wiki-model and the wikimedia volunteer community structure quite well.
rupert.
On 4/24/13 8:29 AM, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to provide high-quality results, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could do more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
I do think this is strategically relevant to Wikimedia. But there is already significant financial backing attempting to kickstart open-source MT, with some results. The goal is strategically relevant to another, much larger organization: the European Union. From 2006 through 2012 they allocated about $10m to kickstart open-source MT, though focused primarily on European languages, via the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects. One of the concrete results [1] of those projects was Moses, which I believe is currently the most actively developed open-source MT system. http://www.statmt.org/moses/
In light of that, I would suggest trying to see if we can adapt or join those efforts, rather than starting a new project or organization. One strategy could be to: 1) fund internal Wikimedia work to see if Moses can already be used for our purposes; and 2) fund improvements in cases where it isn't good enough yet (whether this is best done through grants to academic researchers, payments to contractors, hiring internal staff, or posting open bounties for implementing features, I haven't thought much about).
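For the "can Moses already be used for our purposes" question, one common (if rough) starting point is an automatic metric such as BLEU computed against human reference translations; here is a minimal sketch using NLTK's implementation, with made-up placeholder sentences standing in for real system output and references:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Tokenized human reference translations: one list of references per segment.
references = [
    [["the", "cat", "sits", "on", "the", "mat"]],
    [["wikipedia", "is", "a", "free", "encyclopedia"]],
]
# Tokenized system output for the same segments (placeholder data).
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["wikipedia", "is", "a", "free", "encyclopedia"],
]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"corpus BLEU: {score:.3f}")

BLEU correlates only loosely with human judgements, especially on encyclopedic text and morphologically rich languages, so any serious assessment would need human evaluation alongside an automatic metric like this.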
Best, Mark
[1] They have a nice list of other software and data coming out of the project as well: http://www.euromatrixplus.net/resources/
A brief addendum,
On 4/24/13 12:25 PM, Mark wrote:
From 2006 through 2012 [the ERC] allocated about $10m to kickstart open-source MT, though focused primarily on European languages, via the EuroMatrix (2006-09) and EuroMatrixPlus (2009-12) research projects.
Missed some projects. Seems the European Research Council is *really* pushing for this, with more like $20-25m overall. A few FP7 projects that may be useful to us:
* Let's MT! https://www.letsmt.eu/, which is supposed to organize resources to help organizations & companies build their own MT systems on open data and software, reducing reliance on closed-source cloud providers.
* MosesCore http://www.statmt.org/mosescore/index.php?n=Main.HomePage, focused mainly on improving Moses itself.
* The Multilingual Europe Technology Alliance http://www.meta-net.eu/meta-research/overview, a giant consortium that seems to have a commitment to liberal licensing http://www.meta-net.eu/meta-share/licenses
-Mark
Erik, all,
sorry for the long mail.
Incidentally, I have been thinking in this direction myself for a while, and I have come to a number of conclusions:
1) The Wikimedia movement cannot, in its current state, tackle the problem of machine translation of arbitrary text from and to all of our supported languages.
2) The Wikimedia movement is probably the single most important source of training data already. In research that I have done with colleagues, Wikimedia corpora used as training data easily beat other corpora, and others are using Wikimedia corpora routinely already. There is not much we can improve here, actually.
3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few avenues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle.
Looking at the first statement, there are two ways we could constrain it to make it potentially feasible:
a) constrain the number of supported languages. Whereas this would be technically the simpler solution, I think there is agreement that this is not in our interest at all;
b) constrain the kind of input text we want to support.
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages. Even more constraints would still allow us to use Wikidata items for tagging and structuring Commons in a language-independent way (this was suggested by Erik earlier).
Current machine translation research aims at using massive machine-learning-supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not, in general, translations of each other), especially not for many languages, and there is no reason to believe this is going to change. I would question whether we want to build an infrastructure for gathering those corpora from the Web continuously. I do not think we can compete in this arena, or that it is the best use of our resources to support projects in this area. We should use our unique features to our advantage.
How can we use the unique features of the Wikimedia movement to our advantage? What are our unique features? Well, obviously, the awesome community we are. Our technology, as amazing as it is at running our websites on the given budget, is nevertheless not what makes us what we are. Most processes on the Wikimedia projects are developed in the community space, not implemented in bits. To invoke Lessig, if code is law, Wikimedia projects are really good at creating a space that allows a community to live in it and have the freedom to create its own ecosystem.
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now). Wikidata had some design decisions in it that are already geared towards enabling the solution for some of the problems for this kind of wiki. Whatever a structured Wiktionary would look like, it should also be aligned with the requirements of the project outlined here. Basically, we take constraint b), but make it possible to push the constraint further and further through the community - that's how we could scale on this task.
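To make this a little more concrete, here is a deliberately tiny sketch in which both the parsing rules and the serialization rules are plain data of the kind a community could edit; every pattern, template and example sentence is invented for illustration, and a real system would also have to localize the content words themselves (e.g. via a structured Wiktionary) and handle grammar far beyond fill-in templates:

import re

# "Parser rules": constrained English patterns mapped to a statement type.
PARSE_RULES = [
    (re.compile(r"^(?P<subject>.+) is a species of (?P<kind>.+) in the (?P<family>.+) family\.$"),
     "taxon_statement"),
]

# "Serializer rules": per-language templates over the same internal structure.
SERIALIZE_RULES = {
    "taxon_statement": {
        "en": "{subject} is a species of {kind} in the {family} family.",
        "de": "{subject} ist eine {kind}-Art aus der Familie der {family}.",
    },
}

def parse(sentence):
    """Turn a sentence of the constrained language into an internal statement."""
    for pattern, statement_type in PARSE_RULES:
        match = pattern.match(sentence)
        if match:
            return {"type": statement_type, **match.groupdict()}
    raise ValueError("sentence is outside the constrained language")

def serialize(statement, lang):
    """Render the internal statement back into one of the supported languages."""
    return SERIALIZE_RULES[statement["type"]][lang].format(**statement)

statement = parse("Examplefish exampli is a species of fish in the Examplidae family.")
print(statement)
print(serialize(statement, "de"))   # content words stay untranslated in this toy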
This would be far away from solving the problem of automatic translation of text, and even further away from understanding text. But given where we are and the resources we have available, I think it would be a more feasible path towards achieving the mission of the Wikimedia movement than tackling the problem of general machine learning.
In summary, I see four calls for action right now (and for all of them this means to first actually think more and write down a project plan and gather input on that), that could and should be tackled in parallel if possible:
I) develop a structured Wiktionary
II) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
III) develop a multilingual search, tagging, and structuring environment for Commons
IV) develop structured Wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is we could reach them all by 2015 or 16, depending when we actually start with implementing them.
Goal IV carries a considerable risk, but there's a fair chance it could work out. It could also fail utterly, but if it would even partially succeed ...
Cheers, Denny
On 24 April 2013 11:35, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages.
There has historically been a lot of tension around mass-creation of articles because of the maintenance problem - we can create two hundred thousand stubs in Tibetan or Tamil, but who will maintain them? Wikidata gives us the potential of squaring that circle, and in fact you bring it up here...
II ) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
I think this would be amazing. A software hook that says "we know X article does not exist yet, but it is matched to Y topic on Wikidata" and pulls out core information, along with a set of localised descriptions... we gain all the benefit of having stub articles (scope, coverage) without the problems of a small community having to curate a million pages. It's not the same as hand-written content, but it's immeasurably better than no content, or even an attempt at machine-translating free text.
XXX is [a species of: fish] [in the: Y family]. It [is found in: Laos, Vietnam]. It [grows to: 20 cm]. (pictures)
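As a hedged sketch of how such a hook might assemble those bracketed pieces into localized stub text (all claims, labels and templates below are invented placeholders rather than the real Wikidata data model or property IDs, and grammatical agreement is deliberately ignored):

CLAIMS = {
    "class": "Q_fish",                  # language-neutral stand-in for "fish"
    "family": "Examplidae",
    "found_in": ["Laos", "Vietnam"],
    "length_cm": 20,
}

LABELS = {                              # per-language labels for language-neutral items
    "en": {"Q_fish": "fish"},
    "fr": {"Q_fish": "poisson"},
}

TEMPLATES = {
    "en": "{title} is a species of {class_label} in the {family} family. "
          "It is found in {places}. It grows to {length_cm} cm.",
    "fr": "{title} est une espèce de {class_label} de la famille des {family}. "
          "On la trouve en {places}. Elle atteint {length_cm} cm.",
}
LIST_SEPARATOR = {"en": " and ", "fr": " et "}

def render_stub(title, claims, lang):
    """Render a short localized placeholder text from structured claims."""
    return TEMPLATES[lang].format(
        title=title,
        class_label=LABELS[lang][claims["class"]],
        family=claims["family"],
        places=LIST_SEPARATOR[lang].join(claims["found_in"]),
        length_cm=claims["length_cm"],
    )

print(render_stub("Examplefish exampli", CLAIMS, "en"))
print(render_stub("Examplefish exampli", CLAIMS, "fr"))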
Wikidata Phase 4, perhaps :-)
On 2013-04-24 12:35, Denny Vrandečić wrote:
3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few avenues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle.
If you have any information or ideas related to structuring Wiktionary, please share them on https://meta.wikimedia.org/wiki/Wiktionary_future
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure.
I don't understand what you mean. To begin with, I doubt that the sentence is the right scale at which to translate natural language discourse. Sure, sometimes you may translate one word with one word in another language. Sometimes you may translate a sentence with one sentence. Sometimes you need to grab the whole paragraph, or even more, and sometimes you need a whole cultural background to get the meaning of a single word in the current context. To my mind, natural languages deal with more than context-free languages. Could a static "internal structure" deal with such dynamics?
Your editing interface is a constrained, but natural language.
This is really where I don't see how you hope to manage that.
Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now).
Well, I'll be curious to have more information, like references I should read. Otherwise I'm afraid that what you say sounds like Fermat's Last Theorem [1] and the famous margin which was too small to contain Fermat's alleged proof of his "last theorem".
[1] https://en.wikipedia.org/wiki/Fermat%27s_Last_Theorem
I really like Erik's original suggestion, and these ideas, Denny.
Since there are many different possible goals, it's worth having a page just to list them all and compare them - both how they fit with one another and how they fit with existing active projects elsewhere on the web.
SJ
On Wed, Apr 24, 2013 at 6:35 AM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Erik, all,
sorry for the long mail.
Incidentally, I have been thinking in this direction myself for a while, and I have come to a number of conclusions:
- the Wikimedia movement can not, in its current state, tackle the problem
of machine translation of arbitrary text from and to all of our supported languages 2) the Wikimedia movement is probably the single most important source of training data already. Research that I have done with colleagues based on Wikimedia corpora as training data easily beat other corpora, and others are using Wikimedia corpora routinely already. There is not much we can improve here, actually 3) Wiktionary could be an even more amazing resource if we would finally tackle the issue of structuring its content more appropriately. I think Wikidata opened a few venues to structure planning in this direction and provide some software, but this would have the potential to provide more support for any external project than many other things we could tackle
Looking at the first statement, there are two ways we could constrain it to make it possibly feasible: a) constrain the number of supported languages. Whereas this would be technically the simpler solution, I think there is agreement that this is not in our interest at all b) constrain the kind of input text we want to support
If we constrain b) a lot, we could just go and develop "pages to display for pages that do not exist yet based on Wikidata" in the smaller languages. That's a far cry from machine translating the articles, but it would be a low hanging fruit. And it might help with a desire which is evidently strongly expressed by the mass creation of articles through bots in a growing number of languages. Even more constraints would still allow us to use Wikidata items for tagging and structuring Commons in a language-independent way (this was suggested by Erik earlier).
Current machine translation research aims at using massive machine learning supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not translations of each other, in general), especially not for many languages, and there is no reason to believe this is going to change. I would question if we want to build an infrastructure for gathering those corpora from the Web continuously. I do not think we can compete in this arena, or that is the best use of our resources to support projects in this area. We should use our unique features to our advantage.
How can we use the unique features of the Wikimedia movement to our advantage? What are our unique features? Well, obviously, the awesome community we are. Our technology, as amazing as it is, running our Websites on the given budget, is nevertheless not what makes us what we are. Most processes on the Wikimedia projects are developed in the community space, and not implemented in bits. To summon Lessing, if code is law, Wikimedia projects are really good in creating a space that allows for a community to live in this space and have the freedom to create their own ecosystem.
One idea I have been mulling over for years is basically how can we use this advantage for the task of creating content available in many languages. Wikidata is an obvious attempt at that, but it really goes only so far. The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now). Wikidata had some design decision inside it that are already geared towards enabling the solution for some of the problems for this kind of wiki. Whatever a structured Wiktionary would look like, it should also be aligned with the requirements of the project outlined here. Basically, we take constrain b, but make it possible to push the constraint further and further through the community - that's how we could scale on this task.
This would be far away from solving the problem of automatic translation of text, and even further away from understanding text. But given where we are and the resources we have available, I think it would be a more feasible path towards achieving the mission of the Wikimedia movement than tackling the problem of general machine learning.
In summary, I see four calls for action right now (and for all of them this means to first actually think more and write down a project plan and gather input on that), that could and should be tackled in parallel if possible: I ) develop a structured Wiktionary II ) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic III ) develop a multilingual search, tagging, and structuring environment for Commons IV ) develop structured Wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is we could reach them all by 2015 or 16, depending when we actually start with implementing them.
Goal IV carries a considerable risk, but there's a fair chance it could work out. It could also fail utterly, but if it would even partially succeed ...
Cheers, Denny
Thank you, Denny, for helping me learn where you are going with Wikidata.
I am a Korean Wikipedia contributor. I definitely agree with Erik that we have to tackle the problem of information disparity between languages. But I feel we can make better choices than investing in open source machine translation itself. Wikipedia content can be reused for commercial purposes, and we know that this helps the spread of Wikipedia. I think it is the same here: if proprietary machine translation can help get rid of the language barrier, that would be great too. I hope we can support any machine translation development team, as well as open source machine translation teams. But I believe that, in the end, open source machine translation will prevail.
Wikidata-based approaches are great! But I hope Wikipedia could do more, including providing well-aligned parallel corpora. I had looked into Google's translation workbench, which tried to provide a customized translation tool for Wikipedia, and I tried to translate a few English articles into Korean myself. The tool has a translation memory and a customizable dictionary, but it lacked lots of features needed for practical translation, and the interface was clumsy.
I believe translatewiki.net could do better than Google. I hope translatewiki could provide a translation workbench not just for messages in software but also for Wikipedia articles. Through the workbench, we could get much more great data in addition to a parallel corpus. We could track how a human translator works. If we have more data on the editing activity, we can improve the translation workflow and get new clues for automatic translation. A translator will start from a stub and improve the draft; peer reviewers will put their eyes on the draft and make it better. I mean that logs of collaborative translation on a parallel corpus could provide more things to learn from.
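As a rough sketch of the kind of record I have in mind (the field names are only illustrative, not a proposal for a concrete schema), each translated sentence could be logged roughly like this:

# Rough sketch of a per-sentence translation log entry; all field names
# and values are illustrative only.
sentence_log_entry = {
    "source_lang": "en",
    "target_lang": "ko",
    "source_sentence": "The tower was completed in 1889.",
    "machine_draft": "...",   # optional MT or translation-memory suggestion
    "revisions": [
        # (editor, timestamp, text): the chain shows how humans improve the draft
        ("Translator1", "2013-04-24T10:02:00Z", "first human draft ..."),
        ("Reviewer1",   "2013-04-25T08:30:00Z", "reviewed and corrected draft ..."),
    ],
}

The pair of source sentence and final revision goes into the parallel corpus, and the revision chain itself is the extra signal about how translators and reviewers actually work.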
I think the Wikipedia community could start an initiative to provide raw materials for machine learning for translation. Those would be a common asset for all machine translation systems.
Best regards
RYU Cheol Chair of Wikimedia Korea Preparation Committee
On 24/04/13 12:35, Denny Vrandečić wrote:
Current machine translation research aims at using massive machine learning supported systems. They usually require big parallel corpora. We do not have big parallel corpora (Wikipedia articles are not translations of each other, in general), especially not for many languages, and there is no reason to believe this is going to change.
Could you define "big"? If 10% of Wikipedia articles are translations of each other, we have 2 million translation pairs. Assuming ten sentences per average article, this is 20 million sentence pairs. An average Wikipedia with 100,000 articles would have 10,000 translations and 100,000 sentence pairs; a large Wikipedia with 1,000,000 articles would have 100,000 translations and 1,000,000 sentence pairs - is this not enough to kickstart a massive machine learning supported system? (Consider also that the articles are somewhat similar in structure and less rich than general text - future tense is rarely used for example.)
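In code, the back-of-the-envelope estimate looks like this (the 10% share of translated articles and the ten sentences per article are, of course, just the assumptions made above):

# Back-of-the-envelope estimate of usable sentence pairs, using the
# assumptions from the paragraph above: ~10% of articles have a
# counterpart in another language, ~10 sentences per article.

TRANSLATED_SHARE = 0.10
SENTENCES_PER_ARTICLE = 10

def estimated_sentence_pairs(article_count):
    translation_pairs = article_count * TRANSLATED_SHARE
    return int(translation_pairs * SENTENCES_PER_ARTICLE)

for label, articles in [("average Wikipedia (100,000 articles)", 100000),
                        ("large Wikipedia (1,000,000 articles)", 1000000),
                        ("all Wikipedias (~20,000,000 articles)", 20000000)]:
    print(label, "->", estimated_sentence_pairs(articles), "sentence pairs")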
On 24/04/13 12:35, Denny Vrandečić wrote:
In summary, I see four calls for action right now (and for all of them this means first to actually think more, write down a project plan, and gather input on that), which could and should be tackled in parallel if possible:

I) develop a structured Wiktionary
II) develop a feature that blends into Wikipedia's search if an article about a topic does not exist yet, but we have data on Wikidata about that topic
III) develop a multilingual search, tagging, and structuring environment for Commons
IV) develop structured wiki content using natural language as a surface syntax, with extensible parsers and serializers
None of these goals would require tens of millions of dollars or decades of research and development. I think we could have an actionable plan developed within a month or two for all four goals, and my gut feeling is that we could reach them all by 2015 or 2016, depending on when we actually start implementing them.
I fully support this, though! This is well within Wikimedia's current infrastructure, and was generally planned to be done anyway.
Denny,
very good and compelling reasoning as always. I think the argument that we can potentially do a lot for the MT space (including open source efforts) in part by getting our own house in order on the dictionary side of things makes a lot of sense. I don't think it necessarily excludes investing in open source MT efforts, but Mark makes a good point that there are already existing institutions pouring money into promising initiatives. Let me try to understand some of the more complex ideas outlined in your note a bit better.
The system I am really aiming at is a different one, and there has been plenty of related work in this direction: imagine a wiki where you enter or edit content, sentence by sentence, but the natural language representation is just a surface syntax for an internal structure. Your editing interface is a constrained, but natural language. Now, in order to really make this fly, both the rules for the parsers (interpreting the input) and the serializer (creating the output) would need to be editable by the community - in addition to the content itself. There are a number of major challenges involved, but I have by now a fair idea of how to tackle most of them (and I don't have the time to detail them right now).
So what would you want to enable with this? Faster bootstrapping of content? How would it work, and how would this be superior to an approach like the one taken in the Translate extension (basically, providing good interfaces for 1:1 translation, tracking differences between documents, and offering MT and translation memory based suggestions)? Are there examples of this approach being taken somewhere else?
Thanks, Erik
Erik,
2013/4/25 Erik Moeller erik@wikimedia.org
So what would you want to enable with this? Faster bootstrapping of content? How would it work, and how would this be superior to an approach like the one taken in the Translate extension (basically, providing good interfaces for 1:1 translation, tracking differences between documents, and offering MT and translation memory based suggestions)? Are there examples of this approach being taken somewhere else?
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
It would be foolish to create any such plan without reusing tools and concepts from the Translate extension, translation memories, etc. There is a lot of UI and conceptual goodness in these tools. The idea would be to make them user-extensible with rules.
If you want examples of that, there are the bots working on some Wikipedias currently, creating text from structured input. They are partially reusing the same structured input, and "merely" need a translation of the way the bots create the text that is saved in the given Wikipedia. I have seen some research in the area; the existing approaches all have one drawback or another, but they can and should be used as an inspiration and to inform the project (like Attempto Controlled English, or a chat program developed at the Open University in Milton Keynes to allow conducting business in different languages, etc.).
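For the parsing direction, a similarly minimal toy (invented names again, not a design) would match a sentence of the constrained input language against a community-edited template and recover the underlying structure:

import re

# Toy sketch of the parsing direction: a constrained input sentence is
# matched against community-edited templates to recover the underlying
# structure. Everything here is invented for illustration.

templates = {
    "city_in_country": r"^(?P<city>.+) is a city in (?P<country>.+)\.$",
}

def parse(sentence):
    for frame, pattern in templates.items():
        match = re.match(pattern, sentence)
        if match:
            return {"frame": frame, **match.groupdict()}
    return None  # the sentence is outside the (current) constrained language

print(parse("San Francisco is a city in the United States."))
# {'frame': 'city_in_country', 'city': 'San Francisco', 'country': 'the United States'}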
I hope this helps a bit.
Cheers, Denny
On Thu, Apr 25, 2013 at 7:26 AM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
You are blowing my mind, dude. :)
I suspect this approach won't serve for everything, but it sounds *awesome*. If we can tie natural-language statements directly to data nodes (rather than merely annotating vague references like we do today), then we'd be much better able to keep language versions in sync. How to make them sane to edit... sounds harder. :)
It would be foolish to create any such plan without reusing tools and concepts from the Translate extension, translation memories, etc. There is a lot of UI and conceptual goodness in these tools. The idea would be to make them user-extensible with rules.
Heck yeah!
If you want examples of that, there are the bots working on some Wikipedias currently, creating text from structured input. They are partially reusing the same structured input, and "merely" need a translation of the way the bots create the text that is saved in the given Wikipedia. I have seen some research in the area; the existing approaches all have one drawback or another, but they can and should be used as an inspiration and to inform the project (like Attempto Controlled English, or a chat program developed at the Open University in Milton Keynes to allow conducting business in different languages, etc.).
Yessss... make them real-time updatable instead of one-time bots producing language which can't be maintained.
-- brion
2013/4/25 Brion Vibber bvibber@wikimedia.org
You are blowing my mind, dude. :)
Glad to do so :)
I suspect this approach won't serve for everything, but it sounds *awesome*. If we can tie natural-language statements directly to data nodes (rather than merely annotating vague references like we do today), then we'd be much better able to keep language versions in sync. How to make them sane to edit... sounds harder. :)
Absolutely correct, it would not serve for everything. And it doesn't have to. For an encyclopedia we should be able to get a useful amount of "frames" in a decent timeframe. For song lyrics, it might take a bit longer.
It would and should start with a restricted set of possible frames, but the trick would be to make them user-extensible. Because that is what we are good at -- users who fill and extend the frameworks we provide. I don't know of much work where the frames and rules themselves are user-editable and extensible, but heck, people said we were crazy when we made the properties user-editable and extensible in Semantic MediaWiki and later Wikidata, and it seems to be working out.
A sane editing interface - both for the rules and the content, and their interaction - would be something that would need to be explored first, just to check whether this is indeed possible or just wishful thinking. Starting without this kind of exploration beforehand would be a bit adventurous, or optimistic.
Cheers, Denny
On 2013-04-25 16:26, Denny Vrandečić wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
What would be the limits you would expect from your solution? You can't expect to just "translate" everything; form may be a part of the meaning. It's clear that you can't translate a poem, for example. Sure, Wikipedia is not primarily concerned with poetry, but it does treat the subject.
2013/4/25 Mathieu Stumpf psychoslave@culture-libre.org
What would be the limits you would expect from your solution? You can't expect to just "translate" everything; form may be a part of the meaning. It's clear that you can't translate a poem, for example. Sure, Wikipedia is not primarily concerned with poetry, but it does treat the subject.
I don't know where the limits would be. Probably further than we think right now, but yes, they would still be there, and severe. The nice thing is that we would be collecting data about the limits constantly, and could thus "feed" the system to further improve and grow. Not automatically (I guess, though bots would obviously also be allowed to work on the rules), but through human intelligence, analyzing the input and trying to refine and extend the rules.
But, considering the already existing bot-created articles, which number in the hundreds of thousands in languages like Swedish, Dutch, or Polish, there seems to be some consensus that this can be considered a useful starting block. It's just that with the current system, even with Wikidata, we cannot really grow further in this direction.
Cheers, Denny
This subthread seems headed out into practical / applied epistemology, if there is such a thing.
I am not sure if we can get from here to there; that said, a new structure with language-independent facts / information points that then get machine-explained or described in a local language would be an interesting thing to build an encyclopedia around. Wikidata is a good idea but not enough here. I'm not sure the state of knowledge theory and practice is good enough to do this, but I am suddenly more interested in IBM's Watson project and some related knowledge / natural language interaction AI work...
This is very interesting, but probably less midterm-practical than machine translation and the existing WP / other project data.
On Thu, Apr 25, 2013 at 7:56 PM, Denny Vrandečić <denny.vrandecic@wikimedia.de> wrote:
Not just bootstrapping the content. By having the primary content saved in a language-independent form, and always translating it on the fly, it would not merely bootstrap content in different languages; it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
This is an interesting thought, but I've never heard of a language-independent form. I also question its importance to your core idea versus, say, a primary language of choice. An argument can be made that language independence can't exist on a computer medium: down to the programming language, the instructions, and even the binary bits, there is a language running on top of higher inputs (even transitioning between computer languages isn't at an absolute level). To that extent, I wonder if data can truly be language-independent.
As far as linguistic typology goes, languages are far too distinct and too varied for a language-independent form to develop easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might only have exposure to a limited set of linguistic branches. Machine translations, as someone pointed out, are still not preferred in some languages; even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never reach the level of the absolute, where they are truly language-independent. If you read some of the discussions on linguistic relativity (the Sapir-Whorf hypothesis), there is research to suggest that the language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
Which brings me to the point: why not English? Your idea seems plausible enough even if you remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
Regards Theo
On 2013-04-25 20:56, Theo10011 wrote:
As far as linguistic typology goes, languages are far too distinct and too varied for a language-independent form to develop easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might only have exposure to a limited set of linguistic branches. Machine translations, as someone pointed out, are still not preferred in some languages; even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never reach the level of the absolute, where they are truly language-independent.
To my mind, there's no such thing as "absolute" meaning. It's all about interpretation, in a given context, by a given interpreter. I mean, I do think that MT could probably become as good as a professional translator. But even professional translators can't make "perfect translations". I already gave the example of poetry, but you may also take the example of humour, which asks for some cultural background; otherwise you have to explain why it's funny, and you know that if you have to explain a joke, it's not a joke.
If you read some of the discussions on linguistic relativity (the Sapir-Whorf hypothesis), there is research to suggest that the language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
That's just how the learning process works: you can't "understand" something you haven't experienced. Reading an algorithm won't give you the insight you'll get when you process it mentally (with the help of pencil and paper), and a textual description of "making love" won't give you the feeling it provides.
Which brings me to the point: why not English? Your idea seems plausible enough even if you remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
English has many so-called "non-neutrality" problems. As far as I know, if the goal is to use a syntactically unambiguous human language, Lojban is the best current candidate. English as an international language is a very harmful situation. Believe it or not, I sometimes have to translate into English sentences that were written in French, because the writer was thinking in an English idiom that he translated poorly into French, his native language, in which he doesn't know the corresponding expression. Even worse, I have read people using concepts only under their English wording because they never matched them to the French wording they already knew. And the other way around, I'm not sure that having millions of people speaking a broken English is a wonderful situation for this language either.
Search "why not english as international language" if you need more documentation.
We already have the translation options on the left side of the screen in any Wikipedia article. This choice is generally a smattering of languages, and a long-term goal for many small-language Wikipedias is to be able to translate an article from related languages (say from Dutch into Frisian, where the Frisian Wikipedia has no article at all on the title subject); the even longer-term goal is to translate into some other really-really-really foreign language.
Wouldn't it be easier, however, to start with a project that uses translatewiki and the related-language pairs? Usually there is a big difference in the number of articles (as between the Dutch Wikipedia and the Frisian Wikipedia). Presumably the demand is larger on the destination Wikipedia (because there are fewer articles in that language), and the potential number of human translators is larger (because most editors active in the smaller Wikipedia are versed in both languages).
The Dutch Wikimedia chapter took part in a European multilingual synchronization tool project called CoSyne: http://cosyne.eu/index.php/Main_Page
It was not a success, because it was hard to figure out how this would be beneficial to Wikipedians actually joining the project. Some funding that was granted to the chapter to work on the project will be returned, because it was never spent.
In order to tackle this problem on a large scale, it needs to be broken down into words, sentences, paragraphs, and perhaps other structures (category trees?). I think CoSyne was trying to do this. I think it would be easier to keep the effort one-way: try to offer machine translation from Dutch to Frisian and not the other way around, and then, as you go, define concepts that work both ways, so that eventually it would be possible to translate from Frisian into Dutch.
Thanks to Jane for introducing CoSyne. But I feel that not all wikis want to be synchronized with certain other wikis; rather than having identical articles, I hope they would have their own articles. What I would like is two more tabs, to the right of 'Article' and 'Talk' on the English Wikipedia, for the Korean language: 'Article in Korean' and 'Talk in Korean'. The translations would carry the same information as the originals, and any edit to an article or a talk page on the translation side would go back to the originals. In this case they would need to be synchronized precisely.
I mean that this would be done within the scope of the English Wikipedia, not related to the Korean Wikipedia. But the Korean Wikipedia, linked on the left side of a page, would eventually benefit from the translations in the English Wikipedia, when a Korean Wikipedia editor finds that a good part of an English Wikipedia article could be inserted into the Korean Wikipedia.
You can see the merits of an exact Korean translation of the English Wikipedia, or of a scheme of exact translations of the big Wikipedias. It would help reach more potential contributors. It would lower the language barrier for those who want to contribute to a Wikipedia whose language they do not speak very well. Also, it could provide better aligned corpora, and it could track how human translators or reviewers improve the translations.
Cheol
Just the thought of synchronizing wikis makes me shudder. I think this was also the reason that no Wikipedia editors were attracted to the CoSyne project, though as it was explained to me, the idea was that only those sections of a "source" Wikipedia article that did not yet exist in the target article would be translated. This may be useful in the case of a large article being the source and a stub being the target, but in the case where the source and the target are about equal in size, it could lead to a major mess.
In the example of the Wikipedia article on Haarlem, I noticed many of the things lacking in the English version are things more relevant to local people reading the Dutch version, such as local mass transit information. The other way around, the things in the English version that are lacking in the Dutch version are items that seem obvious to locals.
2013/4/27, Ryu Cheol rcheol@gmail.com:
Thanks to Jane for introducing CoSyne. But I feel all the wikis do not want to be synchronized to certain wikis. Rather than having identical articles, I hope they would have their own articles. I hope I could have two more tabs at right of the 'Article' and 'Talk' on English Wikipedia for Korean language. The two tabs are 'Article in Korean' and 'Talk in Korean'. The translations would have same information in originals and any editing on an article or a talk in translation pages would go back to the originals. In this case they need to be synchronized precisely.
I mean these are done in the scope of English Wikipedia, not related to Korean Wikipedia. But the Korean Wikipedia linked to the left side of a page would be benefited from the translations in English Wikipedia eventually when an Korean Wikipedia editor find a good part of English Wikipedia article could be inserted to Korean Wikipedia.
You can find the merits of the exact Korean translation of English Wikipedia or the scheme of the exact translation of big Wikipedias. It will help you reach to more potential contributors. It will make the language barrier lower for those who want to contribute to a Wikipedia they do not speak very well. Also, It could provide the better aligned corpora and it could could track how human translators or reviewers improve the translations.
Cheol
On 2013. 4. 26., at 오후 9:04, Jane Darnell jane023@gmail.com wrote:
We already have the translation options on the left side of the screen in any Wikipedia article. This choice is generally a smattering of languages, and a long term goal for many small-language Wikipedias is to be able to translate an article from related languages (say from Dutch into Frisian, where the Frisian Wikipedia has no article at all on the title subject) and the even longer-term goal is to translate into some other really-really-really foreign language.
Wouldn't it be easier however, to start with a project that uses translatewiki and the related-language pairs? Usually there is a big difference in numbers of articles (like between the Dutch Wikipedia and the Frisian Wikipedia). Presumably the demand is larger on the destination wikipedia (because there are fewer articles in those languages), and the potential number of human translators is larger (because most editors active in the smaller Wikipedia are versed in both langages).
The Dutch Wikimedia chapter took part in a European multilingual synchronization tool project called CoSyne: http://cosyne.eu/index.php/Main_Page
It was not a success, because it was hard to figure out how this would be beneficial to Wikipedians actually joining the project. Some funding that was granted to the chapter to work on the project will be returned, because it was never spent.
In order to tackle this problem on a large scale, it needs to be broken down into words, sentences, paragraphs and perhaps other structures (category trees?). I think CoSyne was trying to do this. I think it would be easier to keep the effort in one-way-traffic, so try to offer machine translation from Dutch to Frisian and not the other way around, and then as you go, define concepts that work both ways, so that eventually it would be possible to translated from Frisian into Dutch.
2013/4/26, Mathieu Stumpf psychoslave@culture-libre.org:
Le 2013-04-25 20:56, Theo10011 a écrit :
As far as Linguistic typology goes, it's far too unique and too varied to have a language independent form develop as easily. Perhaps it also depends on the perspective. For example, the majority of people commenting here (Americans, Europeans) might have exposure to a limited set of a linguistic branch. Machine-translations as someone pointed out, are still not preferred in some languages, even with years of research and potentially unlimited resources at Google's disposal, they still come out sounding clunky in some ways. And perhaps they will never get to the level of absolute, where they are truly language independent.
To my mind, there's no such thing as "absolute" meaning. It's all about intrepretation in a given a context by a given interpreter. I mean, I do think that MT could probably be as good as a profesional translators. But even profesional translators can't make "perfect translations". I already gave the example of poetry, but you may also take example of humour, which ask for some cultural background, otherwise you have to explain why it's funny and you know that you have to explain a joke, it's not a joke.
If you read some of the discussions in linguistic relativity (Sapir-Whorf hypothesis), there is research to suggest that a language a person is born with dictates their thought processes and their view of the world - there might not be absolutes when it comes to linguistic cognition. There is something inherently unique in the cognitive patterns of different languages.
That's just how learning process work, you can't "understand" something you didn't experiment. Reading an algorithm won't give you the insight you'll get when you process it mentaly (with the help of pencil and paper) and a textual description of "making love" won't provide you the feeling it provide.
Which brings me to the point, why not English? Your idea seems plausible enough even if your remove the abstract idea of complete language universality, without venturing into the science-fiction labyrinth of man-machine collaboration.
English has many so-called "non-neutral" problems. As far as I know, if the goal is to use a syntactically unambiguous human language, Lojban is the best current candidate. English as an international language is a very harmful situation. Believe it or not, I sometimes have to translate into English sentences that were written in French, because the writer was thinking in an English idiom that he translated poorly into French, his native language, in which he doesn't know the corresponding expression. Even worse, I have read people using concepts through an English expression because they had never matched it with the French expression they already knew. And going the other way, I'm not sure that having millions of people speaking broken English is a wonderful situation for that language either.
Search for "why not English as an international language" if you need more documentation.
-- Association Culture-Libre http://www.culture-libre.org/
Hoi, When we invest in MT it is to convey knowledge, information and primarily Wikipedia articles. They do not have the same problems poetry has. With explanatory articles on a subject there is a web of associated concepts. These concepts are likely to occur in any language if the subject exists in that other language.
Consequently MT can work for Wikipedia and provide quite a solid interpretation of what the article is about. This is helped when the associated concepts are recognised as such and when the translations for these concepts are used in the MT. Thanks, GerardM
On 2013-04-26 17:00, Gerard Meijssen wrote:
Hoi, When we invest in MT it is to convey knowledge, information and primarily Wikipedia articles. They do not have the same problems poetry has. With explanatory articles on a subject there is a web of associated concepts. These concepts are likely to occur in any language if the subject exists in that other language.
Consequently MT can work for Wikipedia and provide quite a solid interpretation of what the article is about. This is helped when the associated concepts are recognised as such and when the translations for these concepts are used in the MT. Thanks, GerardM
I think that poetry is just an easy-to-grasp example of the more general problem of lexical/meaning entanglement, which will appear at some point. Different cultures will have different conceptualizations of what one may perceive. So this is not just a matter of concept sets, but rather of concept network dynamics, of how concepts interact within a representation of the world. And interaction means combinatorial problems, which require enormous resources.
That said, I agree that having MT help "adapt" articles from one language/culture to another would be useful.
On Thu, Apr 25, 2013 at 4:26 PM, Denny Vrandečić denny.vrandecic@wikimedia.de wrote:
Not just bootstrapping the content. By having the primary content be saved in a language independent form, and always translating it on the fly, it would not merely bootstrap content in different languages, but it would mean that editors from different languages would be working on the same content. The texts in the different languages are not translations of each other; they are all created from the same source. There would be no primacy of, say, English.
What we can do is make the Simple English Wikipedia more useful, write rewrite rules from Simple English to a Controlled English, and allow filling the content of the smaller Wikipedias from the Simple English Wikipedia. That's the only way to get anything more useful than Google Translate output.
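To make the rewrite-rules idea concrete, here is a minimal sketch of what mapping Simple English text onto a controlled vocabulary might look like. The rule list and the ControlledRule shape are invented for illustration only; a real controlled language would need a far larger, linguistically reviewed rule base.

interface ControlledRule {
  pattern: RegExp;      // construction to avoid in controlled text
  replacement: string;  // approved controlled-vocabulary form
}

// Illustrative rules only; a real rule base would be curated per language pair.
const rules: ControlledRule[] = [
  { pattern: /\bin order to\b/gi, replacement: "to" },
  { pattern: /\butilize\b/gi, replacement: "use" },
  { pattern: /\bapproximately\b/gi, replacement: "about" },
];

function toControlledEnglish(sentence: string): string {
  // Apply each rewrite rule in turn; later rules see earlier rewrites.
  return rules.reduce((text, rule) => text.replace(rule.pattern, rule.replacement), sentence);
}

// Example: toControlledEnglish("In order to utilize the tool, read approximately one page.")
// yields "to use the tool, read about one page."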
There are serious problems in relation to the "translation of translation" process, and that kind of complexity is not within the range of contemporary science. (Basically, even good machine translation is not within the range of contemporary science. Statistical approaches are useful for getting a basic understanding, but very bad for writing an encyclopedia or anything else which requires correct output in the target language.)
On the much simpler scale of conversion engines, we can see that even 1% errors (or manual interventions) are a serious issue for text integrity, while translations of translations create many more errors, whether or not there are human interventions. And that's not acceptable for the average editor of the project in the target language.
That said, we'd need serious linguistic work for every language added to the system.
On the other hand, I support Erik's intention to make a free software tool for machine translation. But note that it's just the second step (Wikidata was the first) on a long road.
This is closely tied to software which is being developed, some of it secretly, to enable machines to understand and use language. As of now this will be government and corporate owned and controlled. I say closely tied because that is how translation works; only someone or something that understands language can translate perfectly.
That said, crude translations into little used languages are nearly worthless due to syntax issues. Useful work requires at least one person fluent in the language.
Fred
only someone or something that understands language can translate perfectly
Precisely
crude translations into little used languages are nearly worthless due to syntax issues. Useful work requires at least one person fluent in the language
It's very true! Current Google MT tools are reasonably good for readers, as they really provide a chance to grasp the meaning of the text, but they are far from good as a writer's instrument, meaning the translation results are far from good enough to be published.
So it seems reasonable to promote MT as an instrument for visitors (readers) of our projects, but not as a substitute for the Wikimedians who are the contributors.
On 24/04/13 08:29, Erik Moeller wrote:
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
A huge and worthwhile effort on its own, and anyway a necessary step for creating free MT software, would be to build a free (as in freedom) parallel translation corpus. This corpus could then be used as the starting point by people and groups who are producing free MT software, either under WMF or on their own.
This could be done by creating a new project where volunteers could compare Wikipedia articles and other free translated texts and mark sentences that are translations of other sentences. By the way, I believe Google Translate's corpus was created in this way.
Perhaps this could be best achieved by teaming with www.zooniverse.org or www.pgdp.net, which have experience in this kind of project. This would require specialized non-wiki software, and I don't think that the Foundation has enough experience in developing it.
(By the way, similar things that could be similarly useful include free OCR training data or free fully annotated text.)
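As a rough sketch of the sentence-marking step described above, the snippet below proposes candidate sentence pairs from two translated articles using a crude length-ratio heuristic. The CandidatePair shape and the 0.5 threshold are invented for illustration; a real aligner would use something like the Gale-Church algorithm or bilingual dictionaries rather than this toy filter.

interface CandidatePair {
  sourceIndex: number;
  targetIndex: number;
  source: string;
  target: string;
  score: number;   // 1.0 = identical character length, lower = more doubtful
}

function splitSentences(text: string): string[] {
  // Naive sentence splitter; good enough for a sketch.
  return text.split(/(?<=[.!?])\s+/).map(s => s.trim()).filter(s => s.length > 0);
}

function proposePairs(sourceText: string, targetText: string): CandidatePair[] {
  const src = splitSentences(sourceText);
  const tgt = splitSentences(targetText);
  const pairs: CandidatePair[] = [];
  const n = Math.min(src.length, tgt.length);
  for (let i = 0; i < n; i++) {
    const a = src[i].length;
    const b = tgt[i].length;
    const score = Math.min(a, b) / Math.max(a, b);
    // Keep only pairs whose lengths are roughly comparable; volunteers confirm or reject them.
    if (score >= 0.5) {
      pairs.push({ sourceIndex: i, targetIndex: i, source: src[i], target: tgt[i], score });
    }
  }
  return pairs;
}

Only pairs confirmed by volunteers would enter the freely licensed corpus; the heuristic merely reduces the amount of clicking.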
The Bible is quite good for this.
Fred
On 2013-04-24 08:29, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
I would like to add that (I'm no specialist in this subject) translating natural language probably needs at least a large set of existing translations, if only to get rid of "obvious, well-known" idioms like "kitchen sink" being translated as "usine à gaz" when you are speaking about software, for example. In this regard, we probably have such a base with Wikisource. What do you think?
On Wed, Apr 24, 2013 at 2:04 PM, Mathieu Stumpf <psychoslave@culture-libre.org> wrote:
Personally, I think this is an awesome idea :-) Wikisource corpora could be a huge asset in developing this. We already host different public domain translations, and in the future, we hope, more and more Wikisources will allow user generated translations.
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans. As part of our project (http://wikisource.org/wiki/Wikisource_vision_development), Micru was looking for a GSoC candidate to study the reinsertion of proofread text into DjVu files [1], but so far hasn't found any interested student. We have some contacts with people at Google working on Tesseract, and they were available for mentoring.
Aubrey
[1] We thought about this both for OCR enhancement purposes and files updating on Commons and Internet Archive (which is off topic here).
* Andrea Zanni wrote:
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans. As part of our project (http://wikisource.org/wiki/Wikisource_vision_development), Micru was looking for a GSoC candidate to study the reinsertion of proofread text into DjVu files [1], but so far hasn't found any interested student. We have some contacts with people at Google working on Tesseract, and they were available for mentoring.
[1] We thought about this both for OCR enhancement purposes and files updating on Commons and Internet Archive (which is off topic here).
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
On 26/04/13 19:38, Bjoern Hoehrmann wrote:
- Andrea Zanni wrote:
At the moment, Wikisource could be an interesting corpus and laboratory for improving and enhancing OCR, as the OCR-generated text is always proofread and corrected by humans.
Try also Distributed Proofreaders. It is my impression that Wikisource's proofreading standards are not always up to par.
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
This is a very interesting approach :)
(FYI, this is me speaking with my personal hat on; none of these opinions are official in any way, nor are they the opinions of the Foundation as an organization)
<personal_hat>
While Wikimedia is still only a medium-sized organization, it is not poor. With more than 1M donors supporting our mission and a cash position of $40M, we do now have a greater ability to make strategic investments that further our mission, as communicated to our donors. That's a serious level of trust and not to be taken lightly, either by irresponsibly spending, or by ignoring our ability to do good.
Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
I think that while supporting open source machine translation is an awesome goal, it is out of scope for our budget, and the engineering budget could be better spent elsewhere, such as on completing existing tools that are in development but not yet deployed/optimized/etc. I think that putting a bunch of money into possibilities isn't the right thing to do when we have a lot of projects that need to be finished and deployed yesterday. Maybe once there's a more concrete project we could support them with text streams, decommissioned machines, and maybe money, but only after it's a pretty sure "investment".
</personal_hat>
Leslie
-- Leslie Carr Wikimedia Foundation AS 14907, 43821 http://as14907.peeringdb.com/
Leslie Carr wrote (personally, not officially):
I think that while supporting open source machine translation is an awesome goal, it is out of scope for our budget, and the engineering budget could be better spent elsewhere, such as on completing existing tools that are in development but not yet deployed/optimized/etc. I think that putting a bunch of money into possibilities isn't the right thing to do when we have a lot of projects that need to be finished and deployed yesterday. Maybe once there's a more concrete project we could support them with text streams, decommissioned machines, and maybe money, but only after it's a pretty sure "investment".
I don't think that it's a good idea to shift resources to it immediately, but I think that every now and then it's very healthy to step back and ask "What is standing between our users and the information they seek? What is standing between our editors and the information they want to update?". Generically, the customers and customer goals problem, applied to WMF's two customer sets (readers, and editors).
Minor UI changes help readers. Most of the other changes are editor-focused: retention, ease of editing, or various other things related to that. A few are related to strategic data organization, which is more of a multiplier effect.
The readers and potential readers ARE however clearly disadvantaged by translation issues.
I see this discussion and consideration as strategic; not planning (year, six month) timescales or tactical (month, week) timescales, but a multi-year "What are our main goals for information access?" timescale.
We can't usefully help with internet access (and that's proceeding at good pace even in the third world), but language will remain a barrier when people get access. In a few situations politics / firewalling is as well (China, primarily), which is another strategic challenge. That, however, is political and geopolitical, and not an easy nut for WMF to crack. Of the three issues - net, firewalling, and language, one of them is something we can work on. We should think about how to work on that. MT seems like an obvious answer, but not the only possible one.
On 2013-04-25 04:49, George Herbert wrote:
We can't usefully help with internet access (and that's proceeding at good pace even in the third world), but language will remain a barrier when people get access. In a few situations politics / firewalling is as well (China, primarily), which is another strategic challenge. That, however, is political and geopolitical, and not an easy nut for WMF to crack. Of the three issues - net, firewalling, and language, one of them is something we can work on. We should think about how to work on that. MT seems like an obvious answer, but not the only possible one.
Do you have specific ideas in mind? Apart from having an "international language" and pedagogical material accessible to everyone, able to teach them from zero prior knowledge, I fail to see many options. Personally, I'm currently learning Esperanto, as I would be happy to participate in such a process. I'm learning Esperanto because it seems to be the currently most successful language for such a project. It's already used on official Chinese sites, and there's a current petition you can sign to make it an official European language [1].
[1] https://secure.avaaz.org/en/petition/Esperanto_langue_officielle_de_lUE/
On 24/04/13 16:29, Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
We could basically clone the frontend component of Google Translate, and use Moses as a backend. The work would be mostly JavaScript, which we can do. When VisualEditor wraps up, we'll have several JavaScript developers looking for a project.
Google Translate gathers its own parallel corpus, and does it in a way that's accessible to non-technical bilingual speakers, so I think it's a nice model. The quality of its translations has improved enormously over the years, and I suppose most of that change is due to improved training data.
If we develop it as a public-facing open source product, then other Moses users could start using it. We could host it on GitHub, so that if it turns out to be popular, we could let it gradually evolve away from WMF control.
Once the frontend tool is done, the next job would be to develop a corpus sharing site, hosting any available freely-licensed output of the frontend tool.
-- Tim Starling
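As an illustration of the frontend-plus-Moses split Tim describes, here is a minimal sketch. The /translate endpoint, the JSON field names and the CorpusEntry shape are assumptions (a small HTTP wrapper would have to sit in front of a Moses server); this is not an existing API.

interface TranslationRequest {
  source: string;   // source language code, e.g. "nl"
  target: string;   // target language code, e.g. "fy"
  text: string;     // the segment to translate
}

interface TranslationResponse {
  text: string;     // machine translation produced by the Moses backend
}

async function translateSegment(req: TranslationRequest): Promise<string> {
  // POST the segment to a hypothetical HTTP wrapper in front of Moses.
  const resp = await fetch("https://mt.example.org/translate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!resp.ok) {
    throw new Error(`MT backend returned ${resp.status}`);
  }
  const data = (await resp.json()) as TranslationResponse;
  return data.text;
}

// A post-edited pair is exactly the kind of data the corpus sharing site could collect.
interface CorpusEntry {
  source: string;
  target: string;
  sourceText: string;
  machineText: string;
  humanText: string;   // the translation after human correction
}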
On 24.4.2013, at 9.29, Erik Moeller erik@wikimedia.org wrote:
Could open source MT be such a strategic investment?
Great idea. If we think in terms of resources, human languages are definitely a resource that should be kept in the commons.
- Teemu
-------------------------------------------------- Teemu Leinonen http://www2.uiah.fi/~tleinone/ +358 50 351 6796 Media Lab http://mlab.uiah.fi Aalto University School of Arts, Design and Architecture --------------------------------------------------
* Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed. Wiktionary does collect and mark up this data, but there is no easy way to get it out of Wiktionary. Fix that, and people will build machine translation and other tools with it.
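To give a sense of why machine-usable exports matter, here is a rough sketch of pulling translation pairs out of Wiktionary wikitext. It assumes the {{t|...}} and {{t+|...}} translation templates used on the English Wiktionary; the exact markup varies across wikis and over time, so treat it as an illustration of the extraction problem rather than a parser.

interface TranslationEntry {
  headword: string;   // the page title, e.g. "cat"
  lang: string;       // language code from the template, e.g. "fr"
  term: string;       // the translated term, e.g. "chat"
}

function extractTranslations(headword: string, wikitext: string): TranslationEntry[] {
  const entries: TranslationEntry[] = [];
  // {{t|fr|chat|m}} or {{t+|fr|chat}} -> capture the language code and the term.
  const template = /\{\{t\+?\|([a-z-]+)\|([^|}]+)/g;
  let match: RegExpExecArray | null;
  while ((match = template.exec(wikitext)) !== null) {
    entries.push({ headword, lang: match[1], term: match[2].trim() });
  }
  return entries;
}

// Example: extractTranslations("cat", "* French: {{t+|fr|chat|m}}")
// yields [{ headword: "cat", lang: "fr", term: "chat" }].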
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoermi@gmx.net wrote:
- Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny?
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed.
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
[ Andrea's ideas about using Wikisource to improve OCR tools ]
I built various tools that could be fairly easily adapted for this, my http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr notes are available. One of the tools for instance is a diff tool, see image at http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031.
I hope the related GSoC project gets support. Getting mentoring from Tesseract team members seems like a handy way to keep the projects connected.
Tim Starling writes:
We could basically clone the frontend component of Google Translate, and use Moses as a backend. The work would be mostly JavaScript... the next job would be to develop a corpus sharing site, hosting any available freely-licensed output of the frontend tool.
This would be most useful. There are often short, quick translation projects that I would like to do through this sort of TM-capturing interface, for which the translatewiki prep-process is rather time-consuming.
We can set up a corpus sharing site now, with translatewiki - there is already a lot of material there that could be part of it. Different corpora (say, encyclopedic articles v. dictionary pages v. quotes) would need to be tagged for context. And we could start letting people upload their own freely licensed corpora to include as well. We would probably want a vetting process before giving users the import tool; or a quarantine until we had better ways to let editors revert / bulk-modify entire imports.
SJ
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta.sj@gmail.com wrote:
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
On the other hand, the OmegaWiki software is from the previous decade and requires major fixes. And, obviously, the WMF should do that.
On 2013-04-26 20:27, Milos Rancic wrote:
On Fri, Apr 26, 2013 at 7:57 PM, Samuel Klein meta.sj@gmail.com wrote:
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
Where can I find documentation about this structure, please?
2013/4/29 Mathieu Stumpf psychoslave@culture-libre.org
On 2013-04-26 20:27, Milos Rancic wrote:
OmegaWiki is a masterpiece from the perspective of one [computational] linguist. Erik made the structure so well that it's the best starting point for creating a contemporary multilingual dictionary. I haven't seen anything better in concept. (And, yes, whenever I thought about creating such software on my own, I always ended up at the dead end of "but OmegaWiki is already that".)
Where can I find documentation about this structure, please?
Here (planned structure): http://meta.wikimedia.org/wiki/OmegaWiki_data_design
and also there (current structure): http://www.omegawiki.org/Help:OmegaWiki_database_layout
And a gentle reminder that comments are requested ;-) http://meta.wikimedia.org/wiki/Requests_for_comment/Adopt_OmegaWiki
On 2013-04-26 19:57, Samuel Klein wrote:
On Fri, Apr 26, 2013 at 1:24 PM, Bjoern Hoehrmann derhoermi@gmx.net wrote:
- Erik Moeller wrote:
Are there open source MT efforts that are close enough to merit scrutiny?
Wiktionary. If you want to help free software efforts in the area of machine translation, then what they seem to need most is high quality data about words, word forms, and so on, in a readily machine-usable form, and freely licensed.
Yes. Finding a way to capture and integrate the work OmegaWiki has done into a new Wikidata-powered Wiktionary would be a useful start. And we've already sort of claimed the space (though we are neglecting it) -- it's discouraging to anyone else who might otherwise try to build a brilliant free structured dictionary that we are *so close* to getting it right.
If you have suggestions about the Wiktionaries' future, please consider sharing them at https://meta.wikimedia.org/wiki/Wiktionary_future
Erik Moeller, 24/04/2013 08:29:
[...] Could open source MT be such a strategic investment? I don't know, but I'd like to at least raise the question. I think the alternative will be, for the foreseeable future, to accept that this piece of technology will be proprietary, and to rely on goodwill for any integration that concerns Wikimedia. Not the worst outcome, but also not the best one.
Are there open source MT efforts that are close enough to merit scrutiny? In order to be able to provide high quality result, you would need not only a motivated, well-intentioned group of people, but some of the smartest people in the field working on it. I doubt we could more than kickstart an effort, but perhaps financial backing at significant scale could at least help a non-profit, open source effort to develop enough critical mass to go somewhere.
Some info on state of the art: http://laxstrom.name/blag/2013/05/22/on-course-to-machine-translation/
Nemo