We're tracking source/destination pairs generated by the ContentTranslation tool, right? Could someone point me to that dataset? (I'm playing around with some machine translation stuff to see if I can prototype a suggester tool that would translate edits on wiki A into corresponding edits on wiki B.) --scott
PS. There's some cool work being done on "zero-shot translation", i.e. bootstrapping translation tools for small languages by pre-training them on a related language (or even an unrelated one). Apparently that works! (cf. https://arxiv.org/pdf/1611.04558.pdf) It can greatly reduce the amount of data required to build a translation model for the small language.
Is there a candidate "small wiki" that's been wanting to use ContentTranslation which would be a good candidate for experimentation?
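(For the curious: the core trick in that paper is to train a single multilingual model on the combined data of several language pairs, marking each source sentence with an artificial token naming the desired target language. A minimal sketch of that preprocessing step, with made-up example sentences:)

```python
def add_target_token(source_sentence, target_lang):
    """Prepend the artificial target-language token that steers a single
    multilingual translation model (the trick from the paper above)."""
    return "<2{}> {}".format(target_lang, source_sentence)

# Training examples from several directions are simply mixed together;
# zero-shot pairs (e.g. pt->es) then need no direct parallel data at all.
mixed_corpus = [
    (add_target_token("How are you?", "es"), "¿Cómo estás?"),
    (add_target_token("¿Cómo estás?", "pt"), "Como você está?"),
]
```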
Hi C. Scott,
Information about the APIs to get the list of translations and the parallel corpora (which includes examples of human translation, machine translation, and the corrections people made to them) is available at https://www.mediawiki.org/wiki/Content_translation/Published_translations. People on the team more familiar with the technical details may be able to provide more if needed.
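For example, here is a rough sketch of querying the cxpublishedtranslations list module with Python's standard library; check the documentation page above for the exact parameters and response format, which I may be misremembering:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def published_translations_url(source, target, limit=10, offset=0):
    """Build a query URL for the cxpublishedtranslations list module."""
    params = {
        "action": "query",
        "list": "cxpublishedtranslations",
        "from": source,   # source language code, e.g. "en"
        "to": target,     # target language code, e.g. "es"
        "limit": limit,
        "offset": offset,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def fetch_published_translations(source, target, limit=10):
    """Fetch one batch of published translation records (network call)."""
    url = published_translations_url(source, target, limit)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```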
With "a candidate "small wiki" that's been wanting to use ContentTranslation", do you mean a wiki with heavy use of Content Translation but lacking machine translation support? (I'm asking because Content Translation is available on all wikis, although some lack automatic translation support.) The CX Stats page https://en.wikipedia.org/wiki/Special:ContentTranslationStats can give you an idea of how much Content Translation has been used for translation on each wiki, and automatic translation support is listed at https://www.mediawiki.org/wiki/Content_translation/Machine_Translation.
--Pau
On Fri, Sep 15, 2017 at 6:14 PM C. Scott Ananian cananian@wikimedia.org wrote:
-- (http://cscott.net)
_______________________________________________
Mediawiki-i18n mailing list
Mediawiki-i18n@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n
On Fri, Sep 15, 2017 at 12:30 PM, Pau Giner pginer@wikimedia.org wrote:
I was thinking of a wiki without machine translation support but whose community has been lobbying for it, with an added bonus if the language in question is related in some way to a larger language family. For example, since Catalan is a Romance language, a model trained on French and Spanish might be able to pretrain for Catalan. (On the other hand, Latvian is a fairly isolated language, so it would only be worth cross-training with Lithuanian.)
The data is probably buried in those two pages you cited; I've just got to dig for it a bit. One odd thing that jumps out: why do we support en->zh but not zh->en? --scott
Well, the method seems interesting, now I would be interested to see some concrete translations, if you have some links.
What do you call small languages?
On 15/09/2017 at 18:14, C. Scott Ananian wrote:
I'm using "small languages" here to describe languages without much available training data (manually created translation pairs). Nothing is implied about the size of the speaker base, the language's vocabulary, or the wiki project. For instance, until recently I would have called both Mandarin Chinese and Latvian "small languages", because training data for them was absent from most of the academic translation work. Only this year were standard training data for these languages included in http://www.statmt.org/wmt17/translation-task.html, for instance, after collaboration with the University of Latvia and "Nanjing University, Xiamen University, The Institutes of Computing Technology and of Automation, Chinese Academy of Science, Northeastern University (China) and Datum Data Co., Ltd". --scott
On Fri, Sep 15, 2017 at 6:04 PM, mathieu stumpf guntz < psychoslave@culture-libre.org> wrote: