[Foundation-l] Push translation

Michael Galvez michaelcg at gmail.com
Fri Aug 6 17:47:07 UTC 2010


Hi Lars,

Thanks for the detailed feedback.  Some comments inline.

Mike

On Thu, Aug 5, 2010 at 1:39 PM, Lars Aronsson <lars at aronsson.se> wrote:

> On 08/05/2010 03:12 PM, Michael Galvez wrote:
> > Sorry for coming into this discussion a bit late.  I'm one of the members
> > of Google's translation team, and I wanted to make myself available for
> > feedback/questions.
>
> This is an unusual and most welcome step for Google. When I first
> learned about GTTK in June 2009, I used it to translate a handful of
> articles from English to Swedish. I'm glad that it's now also possible
> to translate into English, but some of the errors are still there.
>
> It's a great tool, and should be used more. We have a common
> interest in improving it. But for this, wikipedians need feedback.
> Which language pairs are most active? What words or phrases
> does GTTK find problematic, and can we somehow improve that?
> Google could benefit so much from collaborating with wikipedians.
> Ultimately, Google could share some translation dictionaries, so
> we could include them in Wiktionary, the free dictionary.
>

Several points here:
1. Re: language pairs, let me check with comms to see what we can share and
how.  One possibility is to periodically post the statistics and link to
them from the Google Translate blog.  Will keep you posted.

2. I'm not sure what you mean by "words or phrases" that are problematic.
Can you clarify?

3. We acquire dictionaries on limited licenses from other parties.  In
general, while we can surface this content on our own sites (e.g., Google
Translate, Google Dictionary, Google Translator Toolkit), we don't have
permission to donate that data to other sites.


>
> Users of Gmail or Google Apps want their privacy, but users who
> translate Wikipedia articles are already sharing their results, so
> Google could help us to find each other and make us collaborate.
> Translations that start from a Wikipedia article could by default
> be put in a shared pool where other wikipedians can find them.
>

Yes, this is one of the things we'd love to do and we're working on it.


>
> To some details:
>
> I need a way to mark in the original text that a phrase is a quote,
> book title or proper noun that shouldn't be translated, but copied
> literally. And in the statistics, those words should not be counted
> as untranslated and block me from publishing the result. Optimally,
> GTTK would learn over time where such literal phrases occur, e.g.
> text in italics under the ==Bibliography== section.
>

For HTML files, both Translate and Translator Toolkit support the attribute

class="notranslate"

on any element to exclude its text from translation.  (
http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=147838
)
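
As an illustration only (not Google's actual code), here is a minimal Python
sketch of how a tool could honor class="notranslate" when pulling text out of
an HTML file.  The class name TranslatableTextExtractor and the helper
split_translatable are hypothetical; only the standard-library HTMLParser is
real.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TranslatableTextExtractor(HTMLParser):
    """Hypothetical helper: sorts text into 'translate' and 'keep as-is' runs."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a notranslate element
        self.translate = []   # text that may be sent for translation
        self.keep = []        # text that must be copied literally

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside a notranslate element, every nested element is skipped too.
        if self.skip_depth or "notranslate" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.keep if self.skip_depth else self.translate).append(text)


def split_translatable(html):
    """Return (translatable_text_runs, literal_text_runs) for an HTML string."""
    parser = TranslatableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.translate, parser.keep


if __name__ == "__main__":
    sample = '<p>See <i class="notranslate">Röda rummet</i> by Strindberg.</p>'
    print(split_translatable(sample))
```

The sketch assumes well-formed HTML, where every non-void start tag has a
matching end tag.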

If you tell us what MediaWiki tags you'd like for us to treat the same way,
we can do the same for Wikipedia.


> English ==References== corresponds to Swedish ==Källor==,
> even though the two words are not direct translations. GTTK was
> pretty quick to pick this up. However, the different styles we use
> for the opening paragraph of biographic articles, using parentheses
> around the birth and death dates in the English Wikipedia, but not
> in some other languages, are something GTTK has not yet learned.
>

At the most basic level, this is how the Translator Toolkit "learns" from
translations:

1. When a translator uploads a Wikipedia article into Translator Toolkit, we
divide the article into segments (sentences, section headings, etc.).

2. For each segment, we look for the highest-rated translation in the
global, shared translation memory or TM --- a big database of human
translations previously shared by other users.
 a. If we find a translation for that segment in the TM, we will
"pre-translate" the segment with the highest-rated translation.
 b. If we don't find a translation for that segment in the TM, we will
"pre-translate" that segment with machine translation (MT).

3. When the translator corrects these pre-translated segments, we save their
corrections into the global, shared TM.

4. When a new user asks for the translation of a segment previously
corrected by another user, we will recall that previous, human translation
and prefer it over MT.

At a higher level, we also incorporate these previous, human translations
into our MT engine, improving its quality over time.

In this case, it just so happened that someone had translated ==References==
into ==Källor== --- that's why we're surfacing that corrected translation
from the TM.  In contrast, the other segments are probably coming back as
MT.
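
The four steps above can be sketched in a few lines of Python.  This is a toy
model under stated assumptions, not the Toolkit's implementation: the names
TranslationMemory, split_into_segments, machine_translate and pretranslate
are hypothetical, the segmentation is deliberately naive, and the "rating" is
assumed to be a simple count of how often a translation was saved.

```python
import re
from collections import defaultdict


def split_into_segments(text):
    """Naive segmentation: one segment per heading line or sentence."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("=="):          # wiki section heading
            segments.append(line)
        else:                              # split prose after ., ! or ?
            segments.extend(s for s in re.split(r"(?<=[.!?])\s+", line) if s)
    return segments


class TranslationMemory:
    """Toy shared TM: source segment -> {translation: rating}."""

    def __init__(self):
        self._entries = defaultdict(dict)

    def save(self, source, translation, rating=1):
        # Assumption: saving the same translation again raises its rating.
        current = self._entries[source].get(translation, 0)
        self._entries[source][translation] = current + rating

    def best(self, source):
        candidates = self._entries.get(source)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


def machine_translate(segment):
    """Stand-in for an MT engine; it only tags the segment."""
    return f"[MT] {segment}"


def pretranslate(text, tm):
    """Return (source, pre-translation, origin) for each segment."""
    result = []
    for segment in split_into_segments(text):
        hit = tm.best(segment)
        if hit is not None:
            result.append((segment, hit, "TM"))    # step 2a: human translation
        else:
            result.append((segment, machine_translate(segment), "MT"))  # 2b
    return result


if __name__ == "__main__":
    tm = TranslationMemory()
    tm.save("==References==", "==Källor==")        # step 3: a saved correction
    for row in pretranslate("==References==\nFirst sentence. Second one.", tm):
        print(row)
```

In this toy run the heading comes back from the TM, while the two sentences
fall through to the MT stand-in, which mirrors the ==References== case.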


> Categories should not be translated, but GTTK should follow the
> interwiki links for categories. If none exist, perhaps suggest a
> parent category.
>

Following interwiki links and suggesting parent categories are a bit of work
and unlikely to be implemented soon.  We can disable category translation if
that helps - can you confirm that's OK?
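
For illustration, disabling category translation could be as simple as
pulling the [[Category:...]] links out of the wikitext before segmentation
and leaving them for the translator to fill in by hand.  The sketch below is
an assumption about how that might look, not a description of the Toolkit;
the function name split_out_categories is hypothetical, and it only handles
the English "Category" namespace name.

```python
import re

# Matches category assignments such as [[Category:Afghanistan]].
# A leading colon ([[:Category:...]]) is an ordinary link and is not matched.
CATEGORY_LINK = re.compile(r"\[\[\s*Category\s*:[^\]]*\]\]", re.IGNORECASE)


def split_out_categories(wikitext):
    """Return (body_without_categories, list_of_category_links)."""
    categories = CATEGORY_LINK.findall(wikitext)
    body = CATEGORY_LINK.sub("", wikitext).strip()
    return body, categories


if __name__ == "__main__":
    text = ("Kabul is the capital.\n"
            "[[Category:Afghanistan]]\n"
            "[[Category:Capitals in Asia]]")
    print(split_out_categories(text))
```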


>
> Even for articles that already exist in the target language, we often
> need to translate another section. For example, the Swedish Wikipedia
> might have an article about Afghanistan with a good section about its
> geography, but the history section needs improvement, and could
> be translated from another language. The work-around is to begin
> a translation of the whole article, but only translate the relevant
> part and then cut-and-paste into the target without submitting
> through GTTK. Perhaps GTTK could bring up both articles side by
> side and suggest which sections are in most dire need of improvement?
>
>
The solution that we had previously discussed with the WMF is section-level
translations.  However, we haven't gotten started on this yet.

At a high level, we're working on two things:

1. We can help users seed Wikipedia content in small languages.  In this
case, entire articles are translated from one Wikipedia into another --- not
sections.

2. Once the articles exist in multiple languages, the articles take on a
life of their own and become out of sync.  If Wikipedians want to keep those
articles in sync, we would like to help them by enabling section-level
translation.

Right now, we are busy with feature requests and bugs for the first
objective --- helping users seed Wikipedia in small languages.  Once we get
the problems with unwanted red links, community collaboration, etc., out of
the way, we can move to the second problem with section-level translations.



> --
>   Lars Aronsson (lars at aronsson.se)
>   Aronsson Datateknik - http://aronsson.se
>

