Google has built in support for using its machine translation technology to help bootstrap human translations of Wikipedia articles.
http://translate.google.com/toolkit/docupload
The benefit to Google is clear - they need sentence-aligned text in multiple languages in order to bootstrap their automated system.
This is a great example of machines helping people help machines help people, etc... I'm sure this is now the most efficient way to produce high quality translations of Wikipedia articles en masse.
We should check the ToS to make sure the translated text can be CC-BY-SA licensed.
/Brian
On Tue, Jun 9, 2009 at 23:42, Brian Brian.Mingus@colorado.edu wrote:
We should check the ToS to make sure the translated text can be CC-BY-SA licensed.
Machine translation in its current state is so useless for anything beyond ordering Opera Garnier tickets that the copyright status of its output is not really relevant, and I don't expect this to change in the next fifty years.
On what basis do you make this extremely negative assessment?
Readability is the same thing as ability to read.
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
On Tue, Jun 9, 2009 at 3:13 PM, Amir E. Aharoni wrote:
Machine translation in its current status is so useless for anything beyond ordering Opera Garnier tickets, that the copyright status of its output is not quite relevant and i don't expect this to change in the next fifty years.
Brian wrote:
On what basis do you make this extremely negative assessment?
Readability is the same thing as ability to read.
No, readability is the ability to BE read.
For the most part machine translation is rarely reliable, and often hilarious.
Ec
Honestly, I should have learned by now to ignore comments like this. Google is the leading world expert on machine translation and they think it's a good idea. I understand why they think it's a good idea, you don't. You're shooting straight from the gut.
For what it's worth, Google's language tools have drastically improved over the years. They're getting really good, honestly.
That all being said, they're not perfect and a machine translation is still no substitute for a human translator.
-Chad
On Wed, Jun 10, 2009 at 00:26, Brian Brian.Mingus@colorado.edu wrote:
Honestly, I should have learned by now to ignore comments like this. Google is the leading world expert on machine translation and they think it's a good idea. I understand why they think it's a good idea, you don't. You're shooting straight from the gut.
Not quite - I am finishing a degree in Linguistics and I work as an NLP programmer, so I know the field a little.
Google is the leading world expert in searching vast amounts of text in English, a language with next to no morphology. They aren't as good at searching in Hebrew, Spanish and Russian. And their translation software doesn't even cover Persian, a language with a relatively simple morphology.
Google appear to assume that the statistical approach to machine translation is the only one that matters and that their leadership in search technologies makes them the leaders in machine translation. They are wrong. The statistical approach helps, but humans don't think only statistically. The grammars of even the best-researched languages - English, French, German - are ridiculously far from being described completely. When I say "grammar", I refer to the whole language system: morphology, syntax, semantics, discourse analysis, typography, prosody, phonology and more. We can't teach computers grammar, because we don't really understand it ourselves, and without teaching computers proper grammar, the statistical approach is very limited.
Google improved their translation software a little in the last couple of years but they are many, many years away from being able to translate a real text. Google translation paired with something like [[Universal Networking Language]] or maybe OmegaWiki may yield better results, but it will take many more years to complete. Of course, something may change and Big Companies may start pouring a lot of money into dictionary and grammar book writers. Until that happens, expect improvements in machine translation to be Very Slow.
This is a theory. Google has a different theory that is backed up by results. The size of the sentence-aligned corpus determines the quality of the translation. The algorithms are entirely secondary.
In the absence of a sentence-aligned corpus, one must be created. People want good machine translations, but such translations require people to first do part of the work. It's a perfectly reasonable symbiotic relationship. There is no reason to expect that this project 1) won't help Google and 2) won't help Wikipedia.
Brian wrote:
In the absence of a sentence aligned corpus one must be created.
It would be nice if such a corpus (or rather, the resulting dictionary of translated words, phrases and sentences) could also be "open content". Are you in talks with Google about this, Brian? Would they be interested in providing open content output in exchange for open content input?
In talks with Google? Oh I wish ;)
There are lots of algorithms that do sentence alignment automatically. The different language articles don't have to be identical for Google to align them. So we've basically already got what they've got in terms of Wikipedia data.
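As a rough illustration of what automatic sentence alignment can look like, here is a minimal Python sketch of a length-based aligner using dynamic programming. It is loosely in the spirit of length-based methods and is not a description of what Google actually runs. The bead set, the penalty values and the function names (`length_cost`, `align`) are all assumptions made for this example.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
# Hypothetical fixed penalties that discourage skips and merges; not tuned on any data.
PENALTY = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative difference in character length between the two sides of a bead."""
    if src_len == 0 or tgt_len == 0:
        return 0.0  # a skipped sentence is charged only the bead penalty
    return abs(src_len - tgt_len) / max(src_len, tgt_len)


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the cheapest sequence of beads covering both sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + length_cost(s_len, t_len) + PENALTY[(di, dj)]
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (di, dj)
    # Walk the back-pointers from the end to recover the chosen beads.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Hello world.", "This is a long sentence about translation.", "Bye."]
    german = ["Hallo Welt.", "Dies ist ein langer Satz", "zum Thema Sprache.", "Ade."]
    for src_part, tgt_part in align(english, german):
        print(src_part, "<->", tgt_part)
```

Real aligners add lexical cues and a probabilistic length model, which is what lets them cope with articles that are not exact translations of each other; the sketch above only uses character counts.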
On Wed, Jun 10, 2009 at 1:05 AM, Lars Aronsson lars@aronsson.se wrote:
Brian wrote:
In the absence of a sentence aligned corpus one must be created.
It would be nice if such a corpus (or rather, the resulting dictionary of translated words, phrases and sentences) could also be "open content". Are you in talks with Google about this, Brian? Would they be interested in providing open content output in exchange for open content input?
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
Machine translations are neither new works nor derivatives, as they are produced by machines and not by humans.
Also, Google will have a hard time claiming that, because some unidentified person added text or a URL to an open service, they now have the right to do whatever they want with the text.
I guess what they are trying to say in the TOS is that the text will be used to build the statistical engine and that you give Google the right to do so. That is, they provide the translation and you provide the corrections, which are then released to them.
John
Compare such text to a photo of a painting changed by some automatic algorithm. The copyright of the painting is unchanged and the algorithm gets no part of any new copyright, yet the person applying the tool _can_ have a part in the copyright for the new derived work.
If you translate a work using a tool, the tool gets no share of the copyright. The person may get a share of the copyright in the derived work, but then they must do something beyond merely running the tool, unless the tool is so difficult to use that running it is itself sufficient.
John
On Wed, Jun 10, 2009 at 11:57 PM, John at Darkstar vacuum@jeb.no wrote:
Machine translations are neither new works nor derivatives, as they are produced by machines and not by humans.
This is probably the correct argument to make.
There are two trends in machine translation: rule-based translation and statistical translation. Both have pros and cons. Rule-based translation seems possible to integrate with Wiktionary in such a way that it can support Wikipedia. Statistical translation seems possible to integrate more directly with Wikipedia. Both methods can use the history of the translated article to identify where the translation engine fails: for a rule-based engine that usually means some transfer rules are missing; for a statistical engine it means the engine has failed to adapt to some type of sentence.
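To make the idea of mining the article history a bit more concrete, here is a small Python sketch that compares a raw machine translation with the human-edited revision and lists the spans the editor changed. It uses only the standard-library `difflib` module. The function name `corrected_spans` and the whitespace tokenization are assumptions for illustration; a real system would work on wiki revisions and use proper tokenization.

```python
import difflib
from typing import List, Tuple


def corrected_spans(machine: str, edited: str) -> List[Tuple[str, str]]:
    """Return (machine text, human text) pairs for every span the editor changed."""
    mt_tokens = machine.split()  # naive whitespace tokenization (an assumption)
    ed_tokens = edited.split()
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=ed_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means a pure insertion or deletion.
            spans.append((" ".join(mt_tokens[i1:i2]), " ".join(ed_tokens[j1:j2])))
    return spans


if __name__ == "__main__":
    raw = "the cat sit on mat"
    fixed = "the cat sits on the mat"
    for before, after in corrected_spans(raw, fixed):
        print(repr(before), "->", repr(after))
```

Collected over many revisions, pairs like these could point a rule-based engine at missing transfer rules, or serve as extra training data for a statistical one.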
Google previously used Systran's engine, but now uses its own. Sort of: there are some rumors that it is using an open source statistical translation engine. http://googlesystem.blogspot.com/2007/10/google-translate-switches-to-google...
Microsoft also uses a statistical translation engine. http://blogs.msdn.com/translation/archive/2008/08/22/statistical-machine-tra...
One very promising free rule based translation engine is Apertium http://wiki.apertium.org/wiki/Main_Page
A very well known free statistical engine is Moses http://www.statmt.org/moses/
Sorry for my English; it's actually not a machine translation, even if it looks like one! ;p
John
On Thu, Jun 11, 2009 at 09:37, John at Darkstar vacuum@jeb.no wrote:
Google previously used Systran's engine, but now uses its own. Sort of: there are some rumors that it is using an open source statistical translation engine. http://googlesystem.blogspot.com/2007/10/google-translate-switches-to-google...
I couldn't find those rumors at the link you gave. Where did you see them?
That would be interesting. If it is open source, Wikipedia can just use it and, more importantly, improve it by itself, without Google's help.
The link is about Google Translate, I'm not sure about the rumor.
A rule-based solution is probably the easiest to get up and running for small wikis, while a statistical solution will work for larger wikis. That will make the system work well enough that users build upon the initial machine translation, thereby enabling the statistical engine to learn from its errors. It's like an automatic classifier with some a priori knowledge.
John
FYI,
Don't know if this is relevant....
Gordo
>>>>>>>>>
From: Allen Gunn gunner@aspirationtech.org
To: icommons@lists.ibiblio.org
Subject: [Icommons] Open Translation Tools 2009 - Call for Participants
Howdy iCommons friends,
If you are involved with the open source tools and distributed processes behind the translation of open content, we'd love you to consider joining us in Amsterdam in late June for Open Translation Tools 2009.
And please help us spread the word to those who might be interested - blog it, post it to other lists, tweet it, Facebook it. We thank you for your help in bringing together people passionate about the translation of open knowledge.
And a shout-out to Ahrash Bissell, who has been wonderfully supportive in helping us shape the vision for the event.
Full event blurbage is pasted below, and also available at
http://www.aspirationtech.org/events/opentranslation/2009
We hope to see you in Amsterdam at the end of June!
thanks & peace, gunner
Open Translation Tools 2009 - Call for Participants!
http://www.aspirationtech.org/events/opentranslation/2009
Aspiration is delighted to announce Open Translation Tools 2009 (OTT09), to be held in Amsterdam, The Netherlands, from 22-24 June, 2009. The event will be followed by an Open Translation "Book Sprint" which will produce a first-of-its-kind volume on tools and best practices in the field of Open Translation. Both events are being co-organized in partnership with FLOSSManuals.net and Translate.org.za, and generously supported by the Open Society Institute.
Agenda partners for the event include Creative Commons, Global Voices Online, WorldWide Lexicon, Meedan, and DotSUB.
OTT09 will build upon the work and collaboration from Open Translation Tools 2007 (http://www.aspirationtech.org/events/opentranslation). The event will convene stakeholders in the field of open content translation to assess the state of software tools that support translation of content that is licensed under free or open content licenses such as Creative Commons or Free Document License. The event will serve to map out what's available, what's missing, who's doing what, and to recommend strategic next steps to address those needs, with a particular focus on delivering value to open education, open knowledge, and human rights blogging communities.
Primary focus will be placed on supporting and enabling distributed human translation of content, but the role of machine translation will also be considered. "Open content" will encompass a range of resource types, from educational materials to books to manuals to documents to blog content to video and multimedia.
We invite all prospective participants to answer the Open Translation 2009 Call for Participants.
The agenda goals of the 2009 event will be several:
- Addressing the Translation Challenges Faced by the Open Education, Open Content, and human rights blogging communities, and mapping requirements to available open solutions.
- Building on the vision and exploring new use cases for the Global Voices Lingua Translation Exchange
- Documenting the state of the art in distributed human translation, and discussing how to further tap the tremendous translation potential of the net
- Making tools talk better: realizing a standards-driven approach to open translation
- Exploring and sketching out Open Translation API Designs, building on existing work and models
- Documenting workflow requirements for missing open translation tools
- Match-making between open source tools and open content projects
- Mapping of available tools to open translation use cases
See the Agenda Overview (http://www.aspirationtech.org/events/opentranslation/2009/agenda/overview) for elaboration and more details about what is being planned.
Most importantly, the agenda will center on the needs and knowledge of the participating projects, structuring sessions and collaborations to focus on designing appropriate processes and selecting appropriate tools to support open content projects and inform further development of open source translation tools.
In addition, OTT09 will continue the knowledge sharing for the open translation community, and continue discussion on other identified needs from OTT07. The agenda for this event will be greatly informed by open education, open content and human rights blogging projects with specific translation needs, and a number of sessions will be structured to both characterize requirements and propose solutions to respective projects' translation requirements.
OTT07 mapped out a hefty list of Open Translation Tools (http://www.aspirationtech.org/papers/ott07/tools). Participants at OTT09 will survey what has changed over the past 18 months, and assess the most pressing remaining gaps.
If OTT09 sounds like your kind of event, we invite you to answer the Open Translation 2009 Call for Participants!
http://www.aspirationtech.org/events/opentranslation/2009
-- Allen Gunn Executive Director, Aspiration +1.415.216.7252 www.aspirationtech.org
Aspiration: "Better Tools for a Better World"
Icommons mailing list Icommons@lists.ibiblio.org http://lists.ibiblio.org/mailman/listinfo/icommons
2009/6/9 Brian Brian.Mingus@colorado.edu:
We should check the ToS to make sure the translated text can be CC-BY-SA licensed.
/Brian
Under Google's TOS you cannot enter CC or GFDL content produced by someone else into the translation tool.
I thought there would be some caveat.
They might be willing to fix this for us. We'd want to contact the translation team directly since they are the ones who created the interface to Wikipedia.
On Wed, Jun 10, 2009 at 00:54, geni geniice@gmail.com wrote:
2009/6/9 Brian Brian.Mingus@colorado.edu:
We should check the ToS to make sure the translated text can be CC-BY-SA licensed.
/Brian
Under Google's TOS you cannot enter CC or GFDL content produced by someone else into the translation tool.
Where exactly do the TOS say that? I couldn't find it.
They would never find out about it anyway. In the current state of things, any machine-translated text has to be edited manually, so it is not very different from translating a text using a dictionary, and I believe that a human translator doesn't have to pay per-word royalties to the dictionary publisher.
An unedited machine-translated text is likely to be speedily deleted as patent nonsense before copyvio is even considered.
I couldn't delve into the TOS, but as I see it you start with a GFDL text and end up uploading a text directly to Wikipedia, which implies that Google is okay with their text being used that way (you don't have to copy-paste; Google uploads the text for you, although it is saved under your username, with the edit summary and the text linking back to the original source article). I guess what's more interesting than adherence to Wikimedia's licensing terms (which is implicit in the process) is what rights Google gains to your improved sentence-by-sentence translations. (They certainly use them as translation suggestions, for one.)
Best regards, Bence Damokos
2009/6/9 Amir E. Aharoni amir.aharoni@gmail.com:
Where exactly do the TOS say it? I couldn't find it.
"By submitting your content through the Service, you grant Google the permission to use your content permanently to promote, improve or offer the Services"
"By submitting, posting or displaying the content you give Google a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive licence to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services."
You can't grant those rights if the copyright is held by a third party. As a result you can't feed someone else's CC or GFDL content into the system. There are probably other issues but the TOS is unclear.
I don't agree with this interpretation. Google provides an interface whereby the user enters the URL to a Wikipedia article and Google imports the text into their own service. The user does no importing.
2009/6/9 Brian Brian.Mingus@colorado.edu:
I don't agree with this interpretation. Google provides an interface whereby the user enters the URL to a Wikipedia article and Google imports the text into their own service. The user does no importing.
I think the odds of you successfully arguing that that does not fall under submitting are pretty much zilch. In any case there are likely other issues but that is just the most straightforward one.
In the absence of a specific argument against my argument, my argument holds - Google imports the data into their own service and there is no contradiction.
Suppose however that my argument did not hold - that when Google downloads data to their own servers on behalf of a user this section of the ToS becomes a legally binding contract between Google and the user. Is there a contradiction between the ToS and Wikipedia's copyright policy?
On the one hand we have Google's ToS which states that when a user imports data they grant Google rights that, legally, the user cannot grant. On the other hand Google has clearly created a service that is meant to assist Wikipedians in translating articles from one language to another so that the data might be imported back into Wikipedia. The very existence of such a service, created for the express purpose of operating on GFDL/CC-BY-SA text, automatically voids the statement in the ToS because it is nonsensical. If Google were to try to make a legal claim on the content, which they would not, they would have no legal basis on which to do so.
Regardless, I started this thread as a PSA. As another user pointed out, no one would ever know if someone used Google Translate to translate a Wikipedia article, so the whole conversation is largely pointless.
On Tue, Jun 9, 2009 at 5:03 PM, geni geniice@gmail.com wrote:
2009/6/9 Brian Brian.Mingus@colorado.edu:
I don't agree with this interpretation. Google provides an interface whereby the user enters the URL to a Wikipedia article and Google imports the text into their own service. The user does no importing.
I think the odds of you successfully arguing that that does not fall under submitting are pretty much zilch. In any case there are likely other issues but that is just the most straightforward one.
-- geni
On Wed, Jun 10, 2009 at 1:14 AM, Brian Brian.Mingus@colorado.edu wrote:
In the absence of a specific argument against my argument, my argument holds - Google imports the data into their own service and there is no contradiction.
Suppose however that my argument did not hold - that when Google downloads data to their own servers on behalf of a user this section of the ToS becomes a legally binding contract between Google and the user. Is there a contradiction between the ToS and Wikipedia's copyright policy?
On the one hand we have Google's ToS which states that when a user imports data they grant Google rights that, legally, the user cannot grant. On the other hand Google has clearly created a service that is meant to assist Wikipedians in translating articles from one language to another so that the data might be imported back into Wikipedia. The very existence of such a service, created for the express purpose of operating on GFDL/CC-BY-SA text, automatically voids the statement in the ToS because it is nonsensical. If Google were to try to make a legal claim on the content, which they would not, they would have no legal basis on which to do so.
I do not see your argument... There is a contract between Google and the user, granting Google certain rights. Why does the fact that the user (and/or Google) intends to use the material for something else void this contract?
Google and the user entered into a completely different contract by agreeing to operate on freely licensed content.
On Tue, Jun 9, 2009 at 5:25 PM, Andre Engels andreengels@gmail.com wrote:
On Wed, Jun 10, 2009 at 1:14 AM, Brian Brian.Mingus@colorado.edu wrote:
In the absence of a specific argument against my argument, my argument holds - Google imports the data into their own service and there is no contradiction.
Suppose however that my argument did not hold - that when Google downloads data to their own servers on behalf of a user this section of the ToS becomes a legally binding contract between Google and the user. Is there a contradiction between the ToS and Wikipedia's copyright policy?
On the one hand we have Google's ToS which states that when a user imports data they grant Google rights that, legally, the user cannot grant. On the other hand Google has clearly created a service that is meant to assist Wikipedians in translating articles from one language to another so that the data might be imported back into Wikipedia. The very existence of such a service, created for the express purpose of operating on GFDL/CC-BY-SA text, automatically voids the statement in the ToS because it is nonsensical. If Google were to try to make a legal claim on the content, which they would not, they would have no legal basis on which to do so.
I do not see your argument... There is a contract between Google and the user, granting Google certain rights. Why does the fact that the user (and/or Google) intends to use the material for something else void this contract?
-- André Engels, andreengels@gmail.com
2009/6/10 Brian Brian.Mingus@colorado.edu:
Google and the user entered into a completely different contract by agreeing to operate on freely licensed content.
Show me exactly where they entered into such an agreement.
Sane, non-evil terms of service are not Google's strong point. Remember the Chrome mess?
You're choosing not to get it. I can't help that.
On Tue, Jun 9, 2009 at 7:44 PM, geni geniice@gmail.com wrote:
2009/6/10 Brian Brian.Mingus@colorado.edu:
Google and the user entered into a completely different contract by agreeing to operate on freely licensed content.
Show me exactly where they entered into such an agreement.
Sane, non-evil terms of service are not Google's strong point. Remember the Chrome mess?
-- geni
2009/6/10 Brian Brian.Mingus@colorado.edu:
You're choosing not to get it. I can't help that.
So you can't actually back up your assertion.
Not only did you not provide a critique of my more general claim (that the user does not enter into a contract with Google regarding Wikipedia's data) but you have not provided any sort of well-founded critique of this one. You've basically said, in both cases, "I don't believe that."
On Tue, Jun 9, 2009 at 8:10 PM, geni geniice@gmail.com wrote:
2009/6/10 Brian Brian.Mingus@colorado.edu:
You're choosing not to get it. I can't help that.
So you can't actually back up your assertion.
-- geni
That's really neat; I'm glad they worked on Wikipedia first. I'm sure they are open to working through the licensing issues. They seem to use a rather restrictive license as their default almost without thinking about it, which I think is what happened with Chrome as well. I'm sure they will be open to changing it.
The tool itself looks really nice too!
2009/6/10 Brian Brian.Mingus@colorado.edu:
Not only did you not provide a critique of my more general claim (that the user does not enter into a contract with Google regarding Wikipedia's data) but you have not provided any sort of well-founded critique of this one. You've basically said, in both cases, "I don't believe that."
That's because you've provided zero evidence to back your position. Have you even read the TOS?
"By using Google Translator Toolkit (the “Service”), you agree to be bound by our Google Terms of Services located at http://www.google.com/accounts/TOS as well as these additional terms."
"1. Your relationship with Google
1.1 Your use of Google’s products, software, services and web sites (referred to collectively as the “Services” in this document and excluding any services provided to you by Google under a separate written agreement) is subject to the terms of a legal agreement between you and Google. "
"2.1 In order to use the Services, you must firstly agree to the Terms. You may not use the Services if you do not accept the Terms."
"2.3 You may not use the Services and may not accept the Terms if (a) you are not of legal age to form a binding contract with Google, or (b) you are a person barred from receiving the Services under the laws of the United States or other countries including the country in which you are resident or from which you use the Services."
". By submitting, posting or displaying the content you give Google a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive licence to reproduce, adapt, modify, translate, publish, publicly perform, publicly display and distribute any Content which you submit, post or display on or through, the Services."
Even if we took your highly non-standard position that providing Google with a URL is not submitting the content, the output is displayed by Google, and you have no way to grant them the above rights over it for third-party CC-BY-SA content.
On Tue, Jun 9, 2009 at 6:01 PM, Amir E. Aharoni amir.aharoni@gmail.com wrote:
An unedited machine-translated text is likely to be speedily deleted as patent nonsense, before copyvio is even considered.
-- אמיר אלישע אהרוני Amir Elisha Aharoni
If it is deleted as nonsense, that will be a gross error by the administrator, at least in enWP. It is usually possible to roughly understand what is meant in a Google translation. That's enough to defeat speedy deletion. What these texts need is revision. I think of them essentially as an automated dictionary.
If I have any understanding of the subject at all, my quite elementary knowledge of French or German lets me compare the translation with the original, and then rewrite the article into acceptable English much more rapidly than if I had only the original text and a conventional dictionary. I treat it essentially as I would a text translated into English by someone with a good knowledge of the original language but only a minimal knowledge of the target language, English, and no grasp of English idiom.
What I usually find in such translations is that only part of the article is translated (sometimes only the lede paragraph, and rarely the references, figure legends or the like), which often causes these articles to be nominated for deletion as non-notable and unsourced by people too lazy to follow the interlanguage link. Almost never is there any search for the correct internal wikilinks.
Even in languages I cannot actually read, such as the other Romance languages or Russian, I can generally at least add the references section and fix some of the internal links, and thus preserve the article for someone who can do better.
David Goodman, Ph.D, M.L.S. http://en.wikipedia.org/wiki/User_talk:DGG
On Wed, Jun 10, 2009 at 06:22, David Goodman dgoodmanny@gmail.com wrote:
On Tue, Jun 9, 2009 at 6:01 PM, Amir E. Aharoni amir.aharoni@gmail.com wrote:
An unedited machine-translated text is likely to be speedily deleted as patent nonsense, before copyvio is even considered.
If it is deleted as nonsense, that will be a gross error by the administrator, at least in enWP. It is usually possible to roughly understand what is meant in a Google translation. That's enough to defeat speedy deletion. What these texts need is revision. I think of them essentially as an automated dictionary.
According to the dry letter of the policies it may be an error, but the deletion logs show that it happens quite often.
At their current level of sophistication, translation tools are completely useless, especially for languages that do not belong to the same group as English, German, French, etc.
Machine translations into Slavic languages are to be deleted from the wiki immediately.
masti
On 09.06.2009 22:42, Brian wrote:
Google has built in support for using its machine translation technology to help bootstrap human translations of Wikipedia articles.
http://translate.google.com/toolkit/docupload
The benefit to Google is clear - they need sentence-aligned text in multiple languages in order to bootstrap their automated system.
This is a great example of machines helping people help machines help people, etc... I'm sure this is now the most efficient way to produce high quality translations of Wikipedia articles en masse.
We should take the ToS to make sure the translated text can be CC-BY-SA licensed.
/Brian
2009/6/9 masti mastigm@gmail.com:
At their current level of sophistication, translation tools are completely useless, especially for languages that do not belong to the same group as English, German, French, etc.
Machine translations into Slavic languages are to be deleted from the wiki immediately.
masti
Slavic is in the same group as English, French and German: they are all Indo-European. There is no closer grouping that links French (a Romance language) and English (a Germanic one).
On Wed, Jun 10, 2009 at 00:54, masti mastigm@gmail.com wrote:
At their current level of sophistication, translation tools are completely useless, especially for languages that do not belong to the same group as English, German, French, etc.
Let me disagree. Hungarian is not in the same group by far, and the results make it possible to understand more than 50% of the text (sometimes I'd say above 90%). While this is far from proper translation it is by no means _useless_, since its obvious use is to understand a completely foreign text to some extent.
And I'd like to second that the quality has been really improving, whether the state of the art linguistic science backs its theory up or not. This is observation, and not theory.
But I see this is an exaggeration contest, so I'll go back to the shadow. :-)
grin
Let me agree with it completely (out of the shadow ;). This feature's aim is obviously to help understand totally "alien" texts to a certain [at least minimal?] extent. This whole thing has absolutely nothing to do with 'translation/interpretation' in its proper sense. It's a pair of crutches for those who are otherwise helpless. ;)
B.
-----Original Message-----
From: foundation-l-bounces@lists.wikimedia.org [mailto:foundation-l-bounces@lists.wikimedia.org] On Behalf Of Peter Gervai
Sent: Wednesday, June 10, 2009 1:28 PM
To: Wikimedia Foundation Mailing List
Subject: Re: [Foundation-l] Google Translate now assists with human translations of Wikipedia articles
On Wed, Jun 10, 2009 at 00:54, masti mastigm@gmail.com wrote:
At their current level of sophistication, translation tools are completely useless, especially for languages that do not belong to the same group as English, German, French, etc.
Let me disagree. Hungarian is not in the same group by far, and the results make it possible to understand more than 50% of the text (sometimes I'd say above 90%). While this is far from proper translation it is by no means _useless_, since its obvious use is to understand a completely foreign text to some extent.
And I'd like to second that the quality has been really improving, whether the state of the art linguistic science backs its theory up or not. This is observation, and not theory.
But I see this is an exaggeration contest, so I'll go back to the shadow. :-)
grin
What I see as a great feature in the toolkit is the translation memory: in practice (after you switch off the machine translation), common phrases in Wikipedia articles - like "external links", "notes", "history", "early life" etc. - are pretranslated once a human has already translated them; if more than one person starts working on the same article separately, they can make use of the other users' translations and build upon them (without having to explicitly 'collaborate' or 'share' for this function to work).
Also, if you were to translate [[Bird species 1]], [[Bird species 2]], [[Bird species 3]], I think you would get some very useful suggestions for translating [[Bird species 4]].
Best, Bence Damokos
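The translation-memory behaviour described in the message above can be illustrated with a minimal sketch. This is not how Google Translator Toolkit is implemented (the thread itself notes that nobody outside Google knows how its algorithms work); the class name `TranslationMemory`, its methods, and the exact-match lookup are all assumptions made for illustration, and the Hungarian section headings are only sample data.

```python
from typing import Dict, Optional


class TranslationMemory:
    """Toy exact-match translation memory (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        # Maps a normalized source segment to a human-approved translation.
        self._segments: Dict[str, str] = {}

    @staticmethod
    def _normalize(segment: str) -> str:
        # Ignore case and whitespace differences when matching segments.
        return " ".join(segment.lower().split())

    def remember(self, source: str, target: str) -> None:
        """Store a translation that a human has already produced."""
        self._segments[self._normalize(source)] = target

    def suggest(self, source: str) -> Optional[str]:
        """Return a stored translation for this segment, or None if unseen."""
        return self._segments.get(self._normalize(source))


# One translator's work on common section headings...
tm = TranslationMemory()
tm.remember("External links", "Külső hivatkozások")
tm.remember("Notes", "Jegyzetek")

# ...becomes a suggestion for the next translator, with no explicit sharing step.
suggestion = tm.suggest("external links")
missing = tm.suggest("Early life")
```

A real translation memory would typically also offer fuzzy matches for near-identical segments; the exact-match lookup here only shows why short, frequently repeated phrases benefit most.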
On Wed, Jun 10, 2009 at 1:38 PM, Bennó benno79@freemail.hu wrote:
and totally "alien" texts to a certain [at least minimal?] extent. This whole thing has absolutely nothing to do with 'translation/interpretation' in its proper sense. It's a pair of crutches for those who are otherwise helpless. ;)
On Wed, Jun 10, 2009 at 14:46, Bence Damokos bdamokos@gmail.com wrote:
What I see as a great feature in the toolkit is the translation memory: in practice (after you switch off the machine translation), common phrases in Wikipedia articles - like "external links", "notes", "history", "early life" etc. - are pretranslated once a human has already translated them; if more than one person starts working on the same article separately, they can make use of the other users' translations and build upon them (without having to explicitly 'collaborate' or 'share' for this function to work).
Maybe, but in the very best case it can work for very short passages. Two or three sentences at most. And it would be taken out of context.
On Wed, Jun 10, 2009 at 1:56 PM, Amir E. Aharoni amir.aharoni@gmail.com wrote:
On Wed, Jun 10, 2009 at 14:46, Bence Damokos bdamokos@gmail.com wrote:
What I see as a great feature in the toolkit is the translation memory: in practice (after you switch off the machine translation), common phrases in Wikipedia articles - like "external links", "notes", "history", "early life" etc. - are pretranslated once a human has already translated them; if more than one person starts working on the same article separately, they can make use of the other users' translations and build upon them (without having to explicitly 'collaborate' or 'share' for this function to work).
Maybe, but in the very best case it can work for very short passages. Two or three sentences at most. And it would be taken out of context.
If you were working on the very same article, it would obviously be in context. And the short phrases tend to be common, especially considering that Google treats the targets of links separately, which allows for creating a sort of glossary.
Best, Bence
Bennó wrote:
Let me agree with it completely (out of the shadow ;). This feature's aim is obviously to help understand totally "alien" texts to a certain [at least minimal?] extent. This whole thing has absolutely nothing to do with 'translation/interpretation' in its proper sense. It's a pair of crutches for those who are otherwise helpless. ;)
Sure, but even with 90% accuracy (which is still very low) one needs to remain aware of the limitations of machine translation. Seeing it as a crutch is a healthy approach. What needs to be discouraged is the dangerous techno-pop attitude that there is a machine solution for every situation, and that machines can find the magic substitute for common sense.
Ec
I would just like to point out that every single critic has ignored the premise that I started this thread with:
"This is a great example of machines helping people help machines help people."
On Wed, Jun 10, 2009 at 10:53 AM, Ray Saintonge saintonge@telus.net wrote:
Bennó wrote:
Let me agree with it completely (out of the shadow ;). This feature's aim is obviously to help understand totally "alien" texts to a certain [at least minimal?] extent. This whole thing has absolutely nothing to do with 'translation/interpretation' in its proper sense. It's a pair of crutches for those who are otherwise helpless. ;)
Sure, but even with 90% accuracy (which is still very low) one needs to remain aware of the limitations of machine translation. Seeing it as a crutch is a healthy approach. What needs to be discouraged is the dangerous techno-pop attitude that there is a machine solution for every situation, and that machines can find the magic substitute for common sense.
Ec
Brian wrote:
I would just like to point out that every single critic has ignored the premise that I started this thread with:
"This is a great example of machines helping people help machines help people."
I don't disagree with that point, but I often note in real life that many people who seek help want to substitute that help for any exercise of their own little grey cells.
I have no problem with using a machine translation as a starting point because these translations are uncopyrightable beyond pre-existing copyrights.
Ec
On Wed, Jun 10, 2009 at 10:53 AM, Ray Saintonge wrote:
Sure, but even with 90% accuracy (which is still very low) one needs to remain aware of the limitations of machine translation. Seeing it as a crutch is a healthy approach. What needs to be discouraged is the dangerous techno-pop attitude that there is a machine solution for every situation, and that machines can find the magic substitute for common sense.
Ec
On Wed, Jun 10, 2009 at 20:01, Brian Brian.Mingus@colorado.edu wrote:
I would just like to point out that every single critic has ignored the premise that I started this thread with:
"This is a great example of machines helping people help machines help people."
That, again, would be Wikipedia, not Google. No one knows how these Google algorithms work, so I can't really know how helpful I am.
Let me disagree. Hungarian is not in the same group by far, and the results make it possible to understand more than 50% of the text (sometimes I'd say above 90%). While this is far from proper translation it is by no means _useless_, since its obvious use is to understand a completely foreign text to some extent.
IMHO automatic translations into Polish are useless, as they only allow rough orientation in the contents of an article. This applies not only to translations from Hungarian (in which some of the words whose Polish counterparts were unknown to the automatic translator were left untranslated or translated into English), but even to translations from German. (I was trying articles on children's literature ;-)
Picus viridis
Hoi, The quality of the translations will vary. There are many reasons for it and one of the things that will make a difference is the number of people using the translate tool as a rough first pass. Once this is done, using the translation functionality will help Google to improve the quality of the code.
This has been said before, there is no news here. What is relevant however is that in order to support the languages that have not been supported so far, there is a need for people actually using this tool to build the translation corpus that gets you this first pass functionality.
Translation is not something where a silver bullet will provide an "instant on - high quality" experience and it is the languages that are currently not supported that have the highest need for tools like this. Thanks, GerardM
2009/6/13 picus-viridis picus-viridis@o2.pl
Let me disagree. Hungarian is not in the same group by far, and the results make it possible to understand more than 50% of the text (sometimes I'd say above 90%). While this is far from proper translation it is by no means _useless_, since its obvious use is to understand a completely foreign text to some extent.
IMHO automatic translations into Polish are useless, as they only allow rough orientation in the contents of an article. This applies not only to translations from Hungarian (in which some of the words whose Polish counterparts were unknown to the automatic translator were left untranslated or translated into English), but even to translations from German. (I was trying articles on children's literature ;-)
Picus viridis
Gerard Meijssen wrote:
Hoi, The quality of the translations will vary. There are many reasons for it and one of the things that will make a difference is the number of people using the translate tool as a rough first pass. Once this is done, using the translation functionality will help Google to improve the quality of the code.
This has been said before, there is no news here. What is relevant however is that in order to support the languages that have not been supported so far, there is a need for people actually using this tool to build the translation corpus that gets you this first pass functionality.
Translation is not something where a silver bullet will provide an "instant on - high quality" experience and it is the languages that are currently not supported that have the highest need for tools like this.
This is interesting. I did not know it's possible to train new languages. Is there any available information on the requirements? What requirements need to be met, to make Google support them (so they can be selected in the drop-down at the translator toolkit)? _How much_ text do they need as a basis to finally enable the translation function?
(My personal experience with the collaborativeness of Google is a bad one. Although Google is a multi-billion dollar company and [in a fair world] should actually _pay_ people for things like translating their interface into as many languages as possible [as Google with its 80% search engine market share is one of the most important internet access vectors and not having a search engine in your language is a big accessibility barrier], they rather choose to go the cheap way and let volunteers translate it. As if that were not enough, they have the chutzpah to _reject_ adding any further languages [no additions since at least 2007, although they still support Elmer Fudd, bork bork bork, Klingon and pirate speak...]. At the moment Google supports the languages of roughly 85 to 90% of the world's population and, it seems, they don't care about the rest.)
Marcus Buck
Hoi, One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus. For general availability, the expectation for the quality is quite high. To me this seems to be one reason why Google did not add more languages. Another reason why many corpora are not big enough is the problem of identifying the language a text is written in. A few years ago I learned that only a small percentage of Internet content carries metadata for the language that is used, and that something like 75% of that metadata is actually wrong.
Given that Google actually supports MediaWiki, it may be that they are willing to support our languages. The problem, however, is that many of our languages have illegal and even wrong codes. The consequence is that it is not straightforward to simply support our "languages". This issue will not be resolved because people are under the impression that the "community" has the final word about the names of our languages. This is naive as well as problematic, because it makes it harder to argue that Google should support our languages. Thanks, GerardM
2009/6/15 Marcus Buck me@marcusbuck.org
Gerard Meijssen wrote:
Hoi, The quality of the translations will vary. There are many reasons for it and one of the things that will make a difference is the number of people using the translate tool as a rough first pass. Once this is done, using the translation functionality will help Google to improve the quality of the code.
This has been said before, there is no news here. What is relevant however is that in order to support the languages that have not been supported so far, there is a need for people actually using this tool to build the translation corpus that gets you this first pass functionality.
Translation is not something where a silver bullet will provide an "instant on - high quality" experience and it is the languages that are currently not supported that have the highest need for tools like this.
This is interesting. I did not know it's possible to train new languages. Is there any available information on the requirements? What requirements need to be met, to make Google support them (so they can be selected in the drop-down at the translator toolkit)? _How much_ text do they need as a basis to finally enable the translation function?
(My personal experience with the collaborativeness of Google is a bad one. Although Google is a multi-billion dollar company and [in a fair world] should actually _pay_ people for things like translating their interface into as many languages as possible [as Google with its 80% search engine market share is one of the most important internet access vectors and not having a search engine in your language is a big accessibility barrier], they rather choose to go the cheap way and let volunteers translate it. As if that were not enough, they have the chutzpah to _reject_ adding any further languages [no additions since at least 2007, although they still support Elmer Fudd, bork bork bork, Klingon and pirate speak...]. At the moment Google supports the languages of roughly 85 to 90% of the world's population and, it seems, they don't care about the rest.)
Marcus Buck
Gerard Meijssen wrote:
Hoi, One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus. For general availability, the expectation for the quality is quite high. To me this seems to be one reason why Google did not add more languages. Another reason why many corpora are not big enough is the problem of identifying the language a text is written in. A few years ago I learned that only a small percentage of Internet content carries metadata for the language that is used, and that something like 75% of that metadata is actually wrong.
Given that Google actually supports MediaWiki, it may be that they are willing to support our languages. The problem, however, is that many of our languages have illegal and even wrong codes. The consequence is that it is not straightforward to simply support our "languages". This issue will not be resolved because people are under the impression that the "community" has the final word about the names of our languages. This is naive as well as problematic, because it makes it harder to argue that Google should support our languages. Thanks, GerardM
Your old ISO code hobby horse ;-) I guess, if Google wanted to, they would be able to recognize the languages of our projects. Just like all our users do too.
One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus.
Yes, that was basically my main question: What is sufficient? How many pages or MB of text? At least the order of magnitude.
Marcus Buck
Hoi, The proper use of language codes is indeed a recurring theme. Calling it a hobby horse gives the impression that it does not have a real world application. It does have a real world application and one of the problems with language is that it is truly hard to recognise languages confidently. Suggesting that Google can because of its size is too easy. I am sure they would have if they could. Thanks, GerardM
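The language-code problem raised in this exchange can be made concrete with a small sketch. It assumes a corpus builder that labels text by Wikipedia subdomain and wants standard ISO 639-3 codes instead; the table name `WIKI_CODE_TO_ISO_639_3` and the function `corpus_language` are hypothetical, and the table lists only a handful of well-known mismatches as of the time of this thread, not a complete mapping.

```python
# Wikipedia subdomain codes that differ from the ISO 639-3 code of the
# language actually used on that wiki (illustrative subset only).
WIKI_CODE_TO_ISO_639_3 = {
    "als": "gsw",         # Alemannic wiki; ISO 639-3 "als" is Tosk Albanian
    "zh-yue": "yue",      # Cantonese
    "zh-min-nan": "nan",  # Min Nan
    "bat-smg": "sgs",     # Samogitian
    "fiu-vro": "vro",     # Võro
}


def corpus_language(wiki_code: str) -> str:
    """Return the code to label a corpus with for a given wiki subdomain.

    Assumption: any code missing from the table is already a valid
    standard code, so it is passed through unchanged.
    """
    return WIKI_CODE_TO_ISO_639_3.get(wiki_code, wiki_code)


alemannic = corpus_language("als")
german = corpus_language("de")
```

A consumer that trusted the subdomain as-is would file Alemannic text under Tosk Albanian, which is the kind of mislabelling that a lookup table like this is meant to prevent.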
2009/6/15 Marcus Buck me@marcusbuck.org
Gerard Meijssen wrote:
Hoi, One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus. For general availability, the expectation for the quality is quite high. To me this seems to be one reason why Google did not add more languages. Another reason why many corpora are not big enough is the problem of identifying the language a text is written in. A few years ago I learned that only a small percentage of Internet content carries metadata for the language that is used, and that something like 75% of that metadata is actually wrong.
Given that Google actually supports MediaWiki, it may be that they are willing to support our languages. The problem, however, is that many of our languages have illegal and even wrong codes. The consequence is that it is not straightforward to simply support our "languages". This issue will not be resolved because people are under the impression that the "community" has the final word about the names of our languages. This is naive as well as problematic, because it makes it harder to argue that Google should support our languages. Thanks, GerardM
Your old ISO code hobby horse ;-) I guess, if Google wanted to, they would be able to recognize the languages of our projects. Just like all our users do too.
One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus.
Yes, that was basically my main question: What is sufficient? How many pages or MB of text? At least the order of magnitude.
Marcus Buck
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Gerard Meijssen wrote:
Hoi, The proper use of language codes is indeed a recurring theme. Calling it a hobby horse gives the impression that it does not have a real world application. It does have a real world application and one of the problems with language is that it is truly hard to recognise languages confidently. Suggesting that Google can because of its size is too easy. I am sure they would have if they could. Thanks, GerardM
Let's assume Google wants to build an Alemannic translation tool. They are searching for an Alemannic text corpus. Will they fail to find the Alemannic Wikipedia because 'als' stands for a form of Albanian? I don't think so.
Don't get me wrong, I am _pro_ the use of correct codes and I would reject the opinion that projects have the right to decide to stick to a wrong code. But I also reject switching projects to codes that don't match the project ('gsw' for example is no proper substitute for 'als') and I reject code switches that do harm to the projects (that means that the old code has to be a redirect to the new code at least for several years). And most importantly, I think that the question of ISO codes is not related to Google's operations. If Google wants to use Wikipedia content to improve their tools, it should be really easy for them to do the code mapping (e.g. 'no'->'nb').
So does anybody know how big a corpus must be to be helpful to Google?
Marcus Buck
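The code mapping mentioned above (e.g. 'no'->'nb') could be as small as a lookup table. The following is only a minimal sketch: the names WIKI_TO_ISO and iso_code are made up for illustration, the table lists just a few well-known mismatches and is not complete, and 'als' is deliberately left out because its proper replacement is disputed in this thread.

```python
# Hypothetical lookup table: Wikipedia subdomain code -> ISO 639 code.
# Only a few well-known mismatches are listed; this is not a complete table.
WIKI_TO_ISO = {
    "no": "nb",           # the 'no' Wikipedia is written in Bokmal
    "zh-yue": "yue",      # Cantonese
    "zh-min-nan": "nan",  # Min Nan
}


def iso_code(wiki_code: str) -> str:
    """Return the mapped ISO code, or the wiki code itself if no mapping is listed."""
    return WIKI_TO_ISO.get(wiki_code, wiki_code)


if __name__ == "__main__":
    for code in ("no", "de", "zh-yue"):
        print(code, "->", iso_code(code))
```

Codes that already match their ISO code simply pass through unchanged, so only the exceptions need to be maintained.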
It depends on how much a priori knowledge you have about the languages. For the moment people tend to fall into two camps: those who want to use statistical engines and those who want to go for rule-based engines. According to one person there is some activity to include rules in statistical engines and vice versa, but it still needs a lot of work.
Identifying a language isn't that difficult in itself; most search engines are quite good at that. Many engines can even be told to interpret the text according to a specific language, so the problem is basically nonexistent for us.
Still, because our articles have a lot of text that isn't part of a single language, and in addition there is also specialized markup, some kind of parsing should be done before the translation engine starts processing the text.
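As a rough illustration of such a pre-parsing step, the sketch below handles only one kind of markup, internal links of the form [[target]] or [[target|label]]. It replaces each link with its visible label and records the targets separately. The function name strip_links is hypothetical, and real wikitext (templates, references, nested markup) would need a proper parser rather than a single regular expression.

```python
import re

# Matches [[target]] and [[target|label]]; nested brackets are not handled.
_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def strip_links(wikitext):
    """Return (plain_text, targets): links replaced by their visible label."""
    targets = []

    def repl(match):
        targets.append(match.group(1))
        # Use the label if one is given, otherwise the target itself.
        return match.group(2) or match.group(1)

    return _LINK.sub(repl, wikitext), targets


if __name__ == "__main__":
    text, links = strip_links("See [[Art critic]] and [[Text linguistics|linguistics]].")
    print(text)
    print(links)
```

Keeping the link targets aside means they could be looked up through interlanguage links instead of being machine-translated, though that step is not shown here.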
After some discussions last winter I am quite sure a rule-based engine works best for small languages, but that a working solution should use some kind of self-learning mechanism to refine the translation or at least identify errors.
Our idea was to use statistics to identify cases where existing rules failed, and let people define the new rules. Failing rules would be detected by checking which translated sentences got changed afterwards. Actually it is a bit more difficult than this... ;)
And no, I'm not a linguist...
John
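The idea of detecting failing rules from later edits can be sketched roughly as follows. This is not the actual design discussed, only an illustration under simplifying assumptions: each translated sentence is attributed to a single rule, the record layout (rule_id, machine_text, edited_text) is invented here, and the threshold values are arbitrary. Similarity is measured with difflib from the Python standard library.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def failing_rules(records, threshold=0.8, min_count=2):
    """Return ids of rules whose output is, on average, heavily edited.

    records: iterable of (rule_id, machine_text, edited_text) tuples.
    A rule is reported when it has at least min_count samples and the mean
    similarity between machine output and edited text is below threshold.
    """
    similarities = defaultdict(list)
    for rule_id, machine_text, edited_text in records:
        ratio = SequenceMatcher(None, machine_text, edited_text).ratio()
        similarities[rule_id].append(ratio)
    return sorted(
        rule_id
        for rule_id, ratios in similarities.items()
        if len(ratios) >= min_count and sum(ratios) / len(ratios) < threshold
    )


if __name__ == "__main__":
    sample = [
        ("A", "the cat", "the cat"),
        ("A", "a dog", "a dog"),
        ("B", "aaaa", "zzzz"),
        ("B", "bbbb", "yyyy"),
    ]
    print(failing_rules(sample))
```

The rules reported this way would then be handed to people to rewrite, which matches the division of labour described above.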
One of the most important things that is needed for adding languages to a technology like this is having a sufficiently sized corpus.
Yes, that was basically my main question: What is sufficient? How many pages or MB of text? At least the order of magnitude.
Marcus Buck
Actually, Google added... Pirate and Montenegrin.
Mark
On Mon, Jun 15, 2009 at 4:43 AM, Marcus Buck me@marcusbuck.org wrote:
Gerard Meijssen wrote:
Hoi, The quality of the translations will vary. There are many reasons for it and one of the things that will make a difference is the number of people using the translate tool as a rough first pass. Once this is done, using the translation functionality will help Google to improve the quality of the code.
This has been said before, there is no news here. What is relevant however is that in order to support the languages that have not been supported so far, there is a need for people actually using this tool to build the translation corpus that gets you this first pass functionality.
Translation is not something where a silver bullet will provide an "instant on - high quality" experience and it is the languages that are currently not supported that have the highest need for tools like this.
This is interesting. I did not know it's possible to train new languages. Is there any available information on the requirements? What requirements need to be met, to make Google support them (so they can be selected in the drop-down at the translator toolkit)? _How much_ text do they need as a basis to finally enable the translation function?
(My personal experience with Google's willingness to collaborate is a bad one. Although Google is a multi-billion dollar company and [in a fair world] should actually _pay_ people for things like translating their interface into as many languages as possible [as Google with its 80% search engine market share is one of the most important internet access vectors and not having a search engine in your language is a big accessibility barrier], they rather choose to go the cheap way and let volunteers translate it. As if that were not enough, they have the chutzpah to _reject_ adding any further languages [no additions since at least 2007, although they still support Elmer Fudd, bork bork bork, Klingon and pirate speak...]. At the moment Google supports the languages of roughly 85 to 90% of the world's population and it seems they don't care about the rest.)
Marcus Buck
Mark Williamson wrote:
Actually, Google added... Pirate and Montenegrin.
Mark
I first asked them in 2007 to add my language. They told me no further languages would be added at the moment and they would inform me if that changed. I asked them again in 2008 and 2009. One time they did not answer at all and the other time they said nothing had changed. Pirate of course is an important addition... And Montenegrin surely was a good measure to endear oneself to the Montenegrin government.
Marcus Buck
On Saturday 13 June 2009 18:20:36 picus-viridis wrote:
IMHO automatic translations into Polish are useless, as they only allow rough orientation in the contents of an article. It concerns not only
How is rough orientation in the contents of an article useless?
2009/6/21 Nikola Smolenski smolensk@eunet.yu:
On Saturday 13 June 2009 18:20:36 picus-viridis wrote:
IMHO automatic translations into Polish are useless, as they only allow rough orientation in the contents of an article. It concerns not only
How is rough orientation in the contents of an article useless?
It's not useless, but it's not all that useful. I find when translating from other Wikipedias to add to the English version of an article that it's the subtle and important details that get mashed to uncertainty.
- d.
It also depends on the language pair. For Chinese to English, I wouldn't even bother with such a process (having a machine translate and then correct the errors); for Spanish to English I do this very frequently and it's a great timesaver.
Mark
skype: node.ue
The current level of sophistication of translation tools, especially for languages that do not belong to the same group as English, German, French, etc., makes them completely useless.
Machine translations into Slavic languages should be deleted from the wiki immediately.
masti
Just to confirm: yesterday I needed to translate a piece of a Bulgarian Wikipedia article into Russian. I ended up translating it manually even though I do not speak a word of Bulgarian (Russian is my mother tongue). The output of Google Language Tools (Bulgarian into English) was substandard.
Cheers Yaroslav
On Tue, Jun 9, 2009 at 23:42, Brian Brian.Mingus@colorado.edu wrote:
Google has built in support for using its machine translation technology to help bootstrap human translations of Wikipedia articles.
http://translate.google.com/toolkit/docupload
The benefit to Google is clear - they need sentence-aligned text in multiple languages in order to bootstrap their automated system.
This is a great example of machines helping people help machines help people, etc... I'm sure this is now the most efficient way to produce high quality translations of Wikipedia articles en masse.
We should take the ToS to make sure the translated text can be CC-BY-SA licensed.
OK, after a bit of drama in this discussion, i actually tried this toolkit.
First i tried to translate the Hebrew article [[שלום גד]] into English (that's Shalom Gad, one of my favorite Israeli musicians). Apparently, it can only translate from English. I am more interested in translating Wikipedia articles from Hebrew into English, so it was quite disappointing, but they'll probably fix it soon enough.
Then i tried to translate [[Art critic]] from English into Hebrew. There were a few pleasant surprises, but on the whole the machine translation was bad to the point of being unusable. It is much easier to translate it using vi.
Google want side by side translations. It is not quite possible. A grammar of a language is not just subjects, objects, tenses and adjectives. Google seem to ignore [[Text linguistics]] - rules which apply way beyond the word and the sentence. And these are *grammar rules*, not just "style". (Disclaimer: The Department of Linguistics in the Hebrew University of Jerusalem, where i study, is very keen on this subject.)
I *had* to make very deep changes to paragraph structure - not to mention sentence structure -, and not just because the Hebrew Wikipedia has a different MOS, but because it's the basis of the Hebrew language. A text without these changes would be next to unreadable. I doubt that a document which is changed so deeply is very useful to Google at this point. I certainly know that it is not useful to me - i gave up after two paragraphs.
So yes, Google can revise the legalese of their TOS, but this is not a very urgent problem. The uselessness of the technology makes the TOS pretty irrelevant.
Amir E. Aharoni wrote:
OK, after a bit of drama in this discussion, i actually tried this toolkit.
Then i tried to translate [[Art critic]] from English into Hebrew. There were a few pleasant surprises, but on the whole the machine translation was bad to the point of being unusable. It is much easier to translate it using vi.
I tried translating [[Astronomy]] and [[Eothyrididae]] (at least, the part of it that is in English) to Serbian and was pleasantly surprised. Sure, literally every sentence needed major corrections, but for me it was still much easier to do that than to translate from scratch.
I *had* to make very deep changes to paragraph structure - not to mention sentence structure -, and not just because the Hebrew Wikipedia has a different MOS, but because it's the basis of the
This is then apparently a case of English→Hebrew translation working worse than English→Serbian (possibly due to Hebrew being a non-Indo-European language)? I have never had to make any changes to paragraph structure, only occasional changes to sentence structure (I'd say I had to change the structure of about 10% of sentences, and another 10% had uncommon structure but I let them slide).
Hebrew language. A text without these changes would be next to unreadable. I doubt that a document which is changed so deeply is very
While I would probably delete an article that would be dumped straight from a machine translation, I still find it fully understandable.
To illustrate:
Then i tried to translate [[Art critic]] from English into Hebrew. There were a few pleasant surprises, but on the whole the machine translation was bad to the point of being unusable. It is much easier to translate it using vi.
translates to:
Tada sam pokušao prevesti [[umetnički kritičar]] sa engleskog na hebrejskom. Bilo je nekoliko ugodnih iznenađenja, nego na ceo mašina prevod je loš do tačke da je neupotrebljiva. To je mnogo lakše prevesti preko VI.
I would retranslate this to broken English like:
Then i tried to translate [[Art critic]] from English into Hebrew's. There were a few pleasant surprises, than on entire machine's translation was bad to the point of being unusably. Much easier translated via VI.
and the correct would be (I highlighted the changes):
Tada sam pokušao prevesti [[umetnički kritičar]] sa engleskog na *hebrejski*. Bilo je nekoliko ugodnih iznenađenja, *ali u celini* *mašinski* prevod je loš do tačke da je *neupotrebljiv*. *Mnogo je* lakše prevesti *ga* *pomoću vi-ja*.
Such an approach has a critical flaw. I don’t know whether this applies to, say, English-French translations, but it is known to be present for languages written in Cyrillic. The statistical approach sometimes discovers false connections that result in factual errors. Examples of “translating”, say, “50 USD” as “50 000 UAH” within a particular context are known; more such things can arise unexpectedly. So a good understanding of both the topic and the source language is a crucial prerequisite, and there should be a warning about it.
I really don’t like the way they write “Wikipedia™” instead of simply “Wikipedia” — do they really have to emphasize the trademark status?
Perhaps, after some time goes by, I will be able to make a tool to select all translations made that way on a wiki, which may help with deleting purely nonsensical ones.
— Kalan
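One cheap guard against factual errors of the "50 USD" to "50 000 UAH" kind is to compare the numbers in the source sentence with those in the translation. The sketch below is only a heuristic with invented function names: it treats a space followed by exactly three digits as a thousands separator, so two unrelated adjacent numbers (for example a year followed by a count) could be merged by mistake, and it does nothing about currencies, units or spelled-out numbers.

```python
import re
from collections import Counter

# One or more digits, optionally followed by space-separated groups of three digits.
_NUMBER = re.compile(r"\d+(?:[ \u00a0]\d{3})*")


def numbers(text):
    """Count the integers in text, joining '50 000' into '50000'."""
    return Counter(re.sub(r"[ \u00a0]", "", m) for m in _NUMBER.findall(text))


def numbers_match(source, translation):
    """True when both sentences contain the same numbers the same number of times."""
    return numbers(source) == numbers(translation)


if __name__ == "__main__":
    print(numbers_match("50 USD", "50 000 UAH"))
    print(numbers_match("In 1999 there were 12 cases", "1999. godine bilo je 12 slučajeva"))
```

A sentence pair that fails this check is not necessarily wrong, but it is a good candidate for a human to look at first.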
Kalan wrote:
present for cyrillic languages. Statistical approach sometimes discovers false connections that result in factual errors. Examples of “translating”, say, “50 USD” as “50 000 UAH” within a particular context are known; more of such things can arise unexpectedly. So, at
The funniest example I noticed is that "flew" was translated to Serbian as "MaudDib" :) (this has been corrected since).
And yet I cannot stress enough how useful I find this service, both for personal use and to ease translation.
Sometimes cities are "translated" - "Koper" was translated to English from Slovene as "Chicago" and "Kranj" as "Miami"... of course Kranj is 100km inland and Miami is largely beachfront and the opposite with Chicago and Koper.
"Ljubljana" was translated to English in earlier phases of the software as "rape"... In Italian to English, "L'Italia" became "Canada"; in Tagalog to English, "Pilipinas" became "Japan" - when they first debuted the Tagalog language capability, I tested it with the tl.wp article on Manila which informed me that Manila is the capital of Japan...
Mark
Of course these are now things that you are able to fix and which can be shared with everyone.
On Wed, Jun 10, 2009 at 19:29, Brian Brian.Mingus@colorado.edu wrote:
Of course these are now things that you are able to fix and which can be shared with everyone.
Unfortunately it's Google, not Wikipedia. There's mysterious Google code behind it all; not MediaWiki, whose code everyone is free to study and fix.
Not evil - just mysterious. And overhyped.
Brian wrote:
Of course these are now things that you are able to fix and which can be shared with everyone.
Sure, the funny errors are the most obvious and most easily fixed. The problematic ones are more subtle, remain unnoticed, and more readily spread misunderstanding.
Ec
On Wednesday 10 June 2009 17:32:00 Mark Williamson wrote:
"Ljubljana" was translated to English in earlier phases of the software as "rape"... In Italian to English, "L'Italia" became
Well that is a correct translation :)
Thanks Nikola, I just laughed enough to last me for the rest of the week.
Mark
Sometimes cities are "translated" - "Koper" was translated to English from Slovene as "Chicago" and "Kranj" as "Miami"... of course Kranj is 100km inland and Miami is largely beachfront and the opposite with Chicago and Koper.
"Ljubljana" was translated to English in earlier phases of the software as "rape"... In Italian to English, "L'Italia" became "Canada"; in Tagalog to English, "Pilipinas" became "Japan" - when they first debuted the Tagalog language capability, I tested it with the tl.wp article on Manila which informed me that Manila is the capital of Japan...
Mark
I have got Нови Трг (Novy Trg) as New York from Bulgarian. Looks like a systematic error.
Cheers Yaroslav