Hi Lars,
Thanks for the detailed feedback. Some comments inline.
Mike
On Thu, Aug 5, 2010 at 1:39 PM, Lars Aronsson lars@aronsson.se wrote:
On 08/05/2010 03:12 PM, Michael Galvez wrote:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
This is an unusual and most welcome step for Google. When I first learned about GTTK in June 2009, I used it to translate a handful of articles from English to Swedish. I'm glad that it's now also possible to translate into English, but some of the errors are still there.
It's a great tool, and should be used more. We have a common interest in improving it. But for this, wikipedians need feedback. Which language pairs are most active? What words or phrases does GTTK find problematic, and can we somehow improve that? Google could benefit so much from collaborating with wikipedians. Ultimately, Google could share some translation dictionaries, so we could include them in Wiktionary, the free dictionary.
Several points here:
1. Re: language pairs, let me check with comms to see what we can share and how. One possibility is to periodically post the statistics and link to them from the Google Translate blog. Will keep you posted.
2. I'm not sure what you mean by "words or phrases" that are problematic. Can you clarify?
3. We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
Users of Gmail or Google Apps want their privacy, but users who translate Wikipedia articles are already sharing their results, so Google could help us to find each other and make us collaborate. Translations that start from a Wikipedia article could by default be put in a shared pool where other wikipedians can find them.
Yes, this is one of the things we'd love to do and we're working on it.
To some details:
I need a way to mark in the original text that a phrase is a quote, book title or proper noun that shouldn't be translated, but copied literally. And in the statistics, those words should not be counted as untranslated, nor should they block me from publishing the result. Optimally, GTTK would learn over time where such literal phrases occur, e.g. text in italics under the ==Bibliography== section.
For HTML files, both Translate and Translator Toolkit support the attribute
class="notranslate"
to exclude text from translation. (http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=1...)
If you tell us what MediaWiki tags you'd like for us to treat the same way, we can do the same for Wikipedia.
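To make the notranslate behaviour concrete, here is a minimal Python sketch of how text inside an element carrying class="notranslate" could be separated from text that should be translated. It is only an illustration built on the standard library's html.parser, not how Translate or Translator Toolkit actually work; the class name TranslatableText, the VOID_TAGS set, and the sample sentence are all made up for the example.

```python
from html.parser import HTMLParser

# Elements without an end tag; an illustrative subset, not a full list.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TranslatableText(HTMLParser):
    """Split text into translatable and literal (notranslate) pieces."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0     # > 0 while inside a notranslate element
        self.translatable = []  # text that would be sent for translation
        self.literal = []       # text that would be copied unchanged

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never enclose text
        classes = (dict(attrs).get("class") or "").split()
        # Everything nested inside a notranslate element is skipped too.
        if self.skip_depth or "notranslate" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            target = self.literal if self.skip_depth else self.translatable
            target.append(text)


parser = TranslatableText()
parser.feed('<p>He wrote <i class="notranslate">Röda rummet</i> in 1879.</p>')
parser.close()
print("translate:", parser.translatable)
print("keep as-is:", parser.literal)
```

A MediaWiki equivalent would need its own marker (a template or tag agreed on by the community) in place of the HTML class; the sketch above does not attempt to parse wikitext.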
English ==References== corresponds to Swedish ==Källor==, even though the two words are not direct translations. GTTK was pretty quick to pick this up. However, the different styles we use for the opening paragraph of biographic articles, using parentheses around the birth and death dates in the English Wikipedia, but not in some other languages, are something GTTK has not yet learned.
At the most basic level, this is how the Translator Toolkit "learns" from translations:
1. When a translator uploads a Wikipedia article into Translator Toolkit, we divide the article into segments (sentences, section headings, etc.).
2. For each segment, we look for the highest-rated translation in the global, shared translation memory (TM), a big database of human translations previously shared by other users.
   a. If we find a translation for that segment in the TM, we will "pre-translate" the segment with the highest-rated translation.
   b. If we don't find a translation for that segment in the TM, we will "pre-translate" that segment with machine translation (MT).
3. When the translator corrects these pre-translated segments, we save their corrections into the global, shared TM.
4. When a new user asks for the translation of a segment previously corrected by another user, we will recall that previous, human translation and prefer it over MT.
At a higher level, we also incorporate these previous, human translations into our MT engine, improving its quality over time.
In this case, it just so happened that someone had translated ==References== into ==Källor== --- that's why we're surfacing that corrected translation from the TM. In contrast, the other segments are probably coming back as MT.
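The four steps above can be sketched in a few lines of Python. This is a toy model under loud assumptions, not the Toolkit's real code: the TranslationMemory class, the line-based split_segments, the numeric rating, and the machine_translate placeholder are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    """Toy shared TM: source segment -> {translation: rating}."""
    entries: dict = field(default_factory=dict)

    def best(self, segment):
        """Return the highest-rated stored translation, or None."""
        candidates = self.entries.get(segment)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

    def save(self, segment, translation, rating=1):
        """Step 3: store a translator's correction in the shared TM."""
        self.entries.setdefault(segment, {})[translation] = rating


def split_segments(article):
    """Step 1, simplified: one segment per non-empty line."""
    return [line.strip() for line in article.splitlines() if line.strip()]


def machine_translate(segment):
    """Placeholder for an MT engine; it only tags the text."""
    return "[MT] " + segment


def pretranslate(article, tm):
    """Steps 2 and 4: prefer a human translation from the TM, else MT."""
    result = []
    for segment in split_segments(article):
        hit = tm.best(segment)
        if hit is not None:
            result.append((segment, hit, "TM"))
        else:
            result.append((segment, machine_translate(segment), "MT"))
    return result


tm = TranslationMemory()
tm.save("==References==", "==Källor==")  # an earlier user's correction
for row in pretranslate("An untranslated sentence.\n==References==", tm):
    print(row)
```

In this sketch the heading comes back from the TM because an earlier correction exists for it, while the sentence falls through to the MT placeholder, which mirrors the ==References== case described above.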
Categories should not be translated, but GTTK should follow the interwiki links for categories. If none exist, perhaps suggest a parent category.
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
Even for articles that already exist in the target language, we often need to translate another section. For example, the Swedish Wikipedia might have an article about Afghanistan with a good section about its geography, but the history section needs improvement, and could be translated from another language. The work-around is to begin a translation of the whole article, but only translate the relevant part and then cut-and-paste into the target without submitting through GTTK. Perhaps GTTK could bring up both articles side by side and suggest which sections are in most dire need of improvement?
The solution that we had previously discussed with the WMF is section-level translations. However, we haven't gotten started on this yet.
At a high level, we're working on two things:
1. We can help users seed Wikipedia content in small languages. In this case, entire articles are translated from one Wikipedia into another --- not sections.
2. Once the articles exist in multiple languages, the articles take on a life of their own and become out of sync. If Wikipedians want to keep those articles in sync, we would like to help them by enabling section-level translation.
Right now, we are busy with feature requests and bugs for the first objective --- helping users seed Wikipedia in small languages. Once we get the problems with unwanted red links, community collaboration, etc., out of the way, we can move to the second problem with section-level translations.
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l