Thanks for the detailed feedback. Some comments inline.
On Thu, Aug 5, 2010 at 1:39 PM, Lars Aronsson <lars(a)aronsson.se> wrote:
On 08/05/2010 03:12 PM, Michael Galvez wrote:
Sorry for coming into this discussion a bit late.
I'm one of the members of
Google's translation team, and I wanted to
make myself available for
This is an unusual and most welcome step for Google. When I first
learned about GTTK in June 2009, I used it to translate a handful of
articles from English to Swedish. I'm glad that it's now also possible
to translate into English, but some of the errors are still there.
It's a great tool, and should be used more. We have a common
interest in improving it. But for this, wikipedians need feedback.
Which language pairs are most active? What words or phrases
does GTTK find problematic, and can we somehow improve that?
Google could benefit so much from collaborating with wikipedians.
Ultimately, Google could share some translation dictionaries, so
we could include them in Wiktionary, the free dictionary.
Several points here:
1. Re: language pairs, let me check with comms to see what we can share and
how. One possibility is to periodically post the statistics and link to
them from the Google Translate blog. Will keep you posted.
2. I'm not sure what you mean by "words or phrases" that are problematic.
Can you clarify?
3. We acquire dictionaries on limited licenses from other parties. In
general, while we can surface this content on our own sites (e.g., Google
Translate, Google Dictionary, Google Translator Toolkit), we don't have
permission to donate that data to other sites.
Users of Gmail or Google Apps want their privacy, but users who
translate Wikipedia articles are already sharing their results, so
Google could help us to find each other and make us collaborate.
Translations that start from a Wikipedia article could by default
be put in a shared pool where other wikipedians can find them.
Yes, this is one of the things we'd love to do and we're working on it.
To some details:
I need a way to mark in the original text that a phrase is a quote,
book title or proper noun that shouldn't be translated, but copied
literally. And in the statistics, those words should not be counted
as untranslated and block me from publishing the result. Optimally,
GTTK would learn over time where such literal phrases occur, e.g.
text in italics under the ==Bibliography== section.
For HTML files, both Translate and Translator Toolkit support the tag
to exclude text from translation. (
If you tell us which MediaWiki tags you'd like us to treat the same way,
we can do the same for Wikipedia.
English ==References== corresponds to Swedish
even though the two words are not direct translations. GTTK was
pretty quick to pick this up. However, the different styles we use
for the opening paragraph of biographical articles, using parentheses
around the birth and death dates in the English Wikipedia, but not
in some other languages, are something GTTK has not yet learned.
At the most basic level, this is how the Translator Toolkit "learns" from
1. When a translator uploads a Wikipedia article into Translator Toolkit, we
divide the article into segments (sentences, section headings, etc.).
2. For each segment, we look for the highest-rated translation in the
global, shared translation memory (TM), a big database of human
translations previously shared by other users.
a. If we find a translation for that segment in the TM, we will
"pre-translate" the segment with the highest-rated translation.
b. If we don't find a translation for that segment in the TM, we will
"pre-translate" that segment with machine translation (MT).
3. When the translator corrects these pre-translated segments, we save their
corrections into the global, shared TM.
4. When a new user asks for the translation of a segment previously
corrected by another user, we will recall that previous, human translation
and prefer it over MT.
At a higher level, we also incorporate these previous, human translations
into our MT engine, improving its quality over time.
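
To make the lookup order concrete, here is a minimal sketch of the basic flow described above (segment, look up in the TM, fall back to MT, save corrections). It is an illustration only, not Translator Toolkit's actual code: the names TranslationMemory, split_segments, machine_translate and pre_translate are hypothetical, segmentation is simplified to one segment per line, and the MT step is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TranslationMemory:
    """Toy shared TM (hypothetical): source segment -> rated human translations."""
    entries: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def best(self, segment: str) -> Optional[str]:
        """Return the highest-rated stored translation, or None if there is none."""
        candidates = self.entries.get(segment)
        if not candidates:
            return None
        return max(candidates, key=lambda c: c[0])[1]

    def save(self, segment: str, translation: str, rating: int = 1) -> None:
        """Step 3: store a translator's correction in the shared TM."""
        self.entries.setdefault(segment, []).append((rating, translation))


def split_segments(article: str) -> List[str]:
    """Step 1, simplified: treat each non-empty line as one segment."""
    return [line.strip() for line in article.splitlines() if line.strip()]


def machine_translate(segment: str) -> str:
    """Placeholder for the MT engine; it only tags the text."""
    return "[MT] " + segment


def pre_translate(article: str, tm: TranslationMemory) -> List[Tuple[str, str, str]]:
    """Steps 2 and 4: prefer a human translation from the TM, else use MT."""
    result = []
    for segment in split_segments(article):
        human = tm.best(segment)
        if human is not None:
            result.append((segment, human, "TM"))
        else:
            result.append((segment, machine_translate(segment), "MT"))
    return result
```

With a TM that holds only "==References==" mapped to "==Källor==", calling pre_translate on an article returns the heading from the TM and marks every other segment as MT. Once a translator's correction for one of those segments is stored with save, the next call returns it from the TM instead.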
In this case, it just so happened that someone had translated ==References==
into ==Källor== --- that's why we're surfacing that corrected translation
from the TM. In contrast, the other segments are probably coming back as
Categories should not be translated, but GTTK should
interwiki links for categories. If none exist, perhaps suggest a
Following interwiki links and suggesting parent categories is a bit of work
and unlikely to be implemented soon. We can disable category translation if
that helps - can you confirm if that's OK?
Even for articles that already exist in the target language, we often
need to translate another section. For example, the Swedish Wikipedia
might have an article about Afghanistan with a good section about its
geography, but the history section needs improvement, and could
be translated from another language. The work-around is to begin
a translation of the whole article, but only translate the relevant
part and then cut-and-paste into the target without submitting
through GTTK. Perhaps GTTK could bring up both articles side by
side and suggest which sections are in most dire need of improvement?
The solution that we had previously discussed with the WMF is section-level
translations. However, we haven't gotten started on this yet.
At a high level, we're working on two things:
1. We can help users seed Wikipedia content in small languages. In this
case, entire articles are translated from one Wikipedia into another --- not
2. Once the articles exist in multiple languages, the articles take on a
life of their own and become out of sync. If Wikipedians want to keep those
articles in sync, we would like to help them by enabling section-level
Right now, we are busy with feature requests and bugs for the first
objective --- helping users seed Wikipedia in small languages. Once we get
the problems with unwanted red links, community collaboration, etc., out of
the way, we can move to the second problem with section-level translations.
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
foundation-l mailing list