[Foundation-l] Push translation

Lars Aronsson lars at aronsson.se
Sun Aug 8 03:30:10 UTC 2010


On 08/06/2010 07:47 PM, Michael Galvez wrote:
> 3. We acquire dictionaries on limited licenses from other parties.  In
> general, while we can surface this content on our own sites (e.g., Google
> Translate, Google Dictionary, Google Translator Toolkit), we don't have
> permission to donate that data to other sites.

Google, like any large company, uses many sources. For example,
Google Maps used to buy all its maps, but later started to drive
around to build its own maps (and street images). With time, I'm
certain you will use Google Books as a parallel corpus and derive
translations of words and phrases from translated books, and
some day you might be able to build Google Translate without
relying on external dictionary sources. I don't know if this is one
month or one year away, but it should take less than one decade.
Expecting this development, you could keep collaboration with open
content movements, such as Wikipedia/Wiktionary, in mind.

> For HTML files, both Translate and Translator Toolkit support the tag
>
> class="notranslate"
>
> to exclude text from translation.  (
> http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=147838
> )
>
> If you tell us what MediaWiki tags you'd like for us to treat the same way,
> we can do the same for Wikipedia.

There is no such tag, unfortunately. But in the GTTK user interface,
it would be useful to have a way to mark where in the original text
(left-hand side) such tags should have been. If it is any help to the
pretranslator, other kinds of marks could also be manually added,
such as whether a phrase is a figure of speech or should be read
literally. If the text says "kill two birds with one stone", that should
be translated into Swedish as "hit two flies with one swat". But if
David slays Goliath with a stone, that should remain a stone.
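
As a rough sketch of what such manual marks could look like, here is
one way to turn translator-typed marks into HTML spans. Only
class="notranslate" comes from the message quoted above. The
{{kind|text}} marker syntax, the function name render_marks and the
data-hint attribute are all invented here for illustration, and
nothing suggests that GTTK supports them.

```python
import html
import re

# Hypothetical marker syntax typed by the translator: {{kind|text}}
# where kind is "nt" (do not translate), "idiom" or "literal".
MARK = re.compile(r"\{\{(nt|idiom|literal)\|(.*?)\}\}")


def render_marks(segment):
    """Convert hypothetical {{kind|text}} marks in a segment into HTML spans."""
    parts = []
    pos = 0
    for m in MARK.finditer(segment):
        # Plain text between marks is escaped and passed through.
        parts.append(html.escape(segment[pos:m.start()]))
        kind, text = m.group(1), html.escape(m.group(2))
        if kind == "nt":
            # The class named in the quoted message.
            parts.append('<span class="notranslate">%s</span>' % text)
        else:
            # data-hint is an invented attribute, not a documented one.
            parts.append('<span data-hint="%s">%s</span>' % (kind, text))
        pos = m.end()
    parts.append(html.escape(segment[pos:]))
    return "".join(parts)


print(render_marks("{{idiom|kill two birds with one stone}}"))
print(render_marks("David slays Goliath with a {{literal|stone}}."))
```

The point is only that the translator's knowledge (idiom or literal)
travels along with the segment, so the pretranslator could use it.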

>   a. If we find a translation for that segment in the TM, we will
> "pre-translate" the segment with the highest-rated translation.

But when you have two or more candidates, each with a reasonable
probability, the choice could be presented to the human translator.
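
A minimal sketch of that idea, assuming the translation memory
returns (translation, score) pairs with positive scores. The function
name, the 0.8 ratio and the limit of three are my own assumptions,
not anything GTTK is known to do.

```python
def candidates_to_show(matches, ratio=0.8, limit=3):
    """Return the translations whose score is close to the best one.

    matches: list of (translation, score) pairs, score > 0.
    ratio and limit are arbitrary illustration values.
    """
    if not matches:
        return []
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    best = ranked[0][1]
    # Keep every candidate scoring at least `ratio` times the best score.
    return [t for t, s in ranked if s >= ratio * best][:limit]


tm_hits = [("överste", 0.9), ("kolonel", 0.3), ("översten", 0.8)]
print(candidates_to_show(tm_hits))
```

With a single clear winner the list has one element and the segment
can be pre-translated as today. With several close candidates the
translator gets to choose.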

> 1. When a translator uploads a Wikipedia article into Translator Toolkit, we
> divide the article into segments.  (sentences, section headings, etc.)

This means you do recognize some wiki markup, such as [[links]]
and ==headings==. But recognition of that markup is apparently
hard-wired and takes place before any learning. Now, consider
the case when

'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel

is translated, according to our manual of style, as:

'''John Doe,''' född 1 maj 1733, död 5 april 1799, var en brittisk överste

where the parentheses are replaced with commas and the words född
(born) and död (died) have been added. It would be nice if the
translation memory could learn not only the words (colonel = överste)
but also to recognize this transformation of style. It is very
context-sensitive (this example only applies to the opening paragraph
of biographical articles) and would need lots of translations to
provide good results. And including dashes, commas and parentheses
along with words as the elements of translated phrases is perhaps
a major shift in what machine translation is supposed to do.
(But it could open the door to translating template calls.)
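
To make the target transformation concrete, here is a hand-written
rule for the example above. This is not how a translation memory
would learn it. It is only a sketch of the mapping that would have to
be learned, and it rewrites the date parenthesis while leaving the
rest of the sentence untranslated. The names LEAD, MONTHS and
restyle_lead are made up for this example.

```python
import re

# English month names mapped to Swedish ones.
MONTHS = {
    "January": "januari", "February": "februari", "March": "mars",
    "April": "april", "May": "maj", "June": "juni", "July": "juli",
    "August": "augusti", "September": "september", "October": "oktober",
    "November": "november", "December": "december",
}

DATE = r"([A-Z][a-z]+) (\d{1,2}), (\d{4})"
# Matches: '''Name''' (Month D, YYYY - Month D, YYYY) at the start of a sentence.
LEAD = re.compile(r"^'''(.+?)''' \(" + DATE + r" - " + DATE + r"\) ")


def _sv_date(month, day, year):
    """Format a date in Swedish order: day, month name, year."""
    return "%s %s %s" % (day, MONTHS[month], year)


def restyle_lead(sentence):
    """Rewrite the lead of a biography in the Swedish style shown above."""
    m = LEAD.match(sentence)
    if m is None or m.group(2) not in MONTHS or m.group(5) not in MONTHS:
        return sentence  # not a biography lead in this exact form
    born = _sv_date(m.group(2), m.group(3), m.group(4))
    died = _sv_date(m.group(5), m.group(6), m.group(7))
    rest = sentence[m.end():]
    return "'''%s,''' född %s, död %s, %s" % (m.group(1), born, died, rest)


print(restyle_lead(
    "'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel"))
```

A rule like this breaks as soon as the source deviates from the
pattern, which is why learning it from many translated examples
would be so much more useful.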

> Following interwiki links and suggesting parent categories is a bit of work
> and unlikely to be implemented soon.  We can disable category translation if
> that helps - can you confirm if that's OK?

I think you should keep it as it is, until you get around to doing
that "bit of work".


-- 
   Lars Aronsson (lars at aronsson.se)
   Aronsson Datateknik - http://aronsson.se




