I'm using "small languages" here to describe languages without much available training data (manually created translation pairs).  Nothing is implied about the size of the speaker base, language vocabulary, or wiki project.  For instance, until recently I would have called both Mandarin Chinese and Latvian "small languages", because training data for them was absent from most of the academic translation work.  Standard training data for these languages was included in http://www.statmt.org/wmt17/translation-task.html only this year, after collaboration with the University of Latvia and "Nanjing University, Xiamen University, The Institutes of Computing Technology and of Automation, Chinese Academy of Science, Northeastern University (China) and Datum Data Co., Ltd".
  --scott

On Fri, Sep 15, 2017 at 6:04 PM, mathieu stumpf guntz <psychoslave@culture-libre.org> wrote:

Well, the method seems interesting. Now I would be interested to see some concrete translations, if you have some links.

What do you call small languages?


On 15/09/2017 at 18:14, C. Scott Ananian wrote:
We're tracking source/destination pairs generated by the ContentTranslation tool, right? Could someone point me to that dataset?  (I'm playing around with some machine translation stuff to see if I can prototype a suggester tool that would translate edits on wiki A to corresponding edits on wiki B.)
  --scott

PS. There's some cool work being done on "zero-shot translation", aka bootstrapping translation tools for small languages by pre-training them on a related language (or even an unrelated language).  Apparently that works! (Cf. https://arxiv.org/pdf/1611.04558.pdf) It can greatly reduce the amount of data required to build a translation model for the small language.
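
For what it's worth, the data-side trick in that paper is simple: one model is trained on several language pairs at once, and an artificial token naming the target language is prepended to each source sentence. Here is a minimal sketch of that preprocessing step only. The `<2xx>` token format follows the paper; the helper names `tag_source` and `build_training_examples` and the sample sentences are my own illustration, not anything from ContentTranslation or an existing toolkit.

```python
def tag_source(source_sentence, target_lang):
    """Prepend an artificial target-language token, e.g. '<2lv>' for Latvian."""
    return "<2{}> {}".format(target_lang, source_sentence)


def build_training_examples(pairs):
    """Turn (source_text, target_lang, target_text) triples into
    (model_input, expected_output) tuples for a single shared model."""
    return [(tag_source(src, lang), tgt) for src, lang, tgt in pairs]


# Illustrative data only: pairs for a well-resourced language are mixed
# with the few pairs available for the small one.
pairs = [
    ("Hello", "es", "Hola"),
    ("Hello", "lv", "Sveiki"),
]
examples = build_training_examples(pairs)
```

Because the target language is just another input token, the same model can be asked for a language pair it never saw together in training, which is where the "zero-shot" part comes from.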

Is there a "small wiki" that's been wanting to use ContentTranslation which would be a good candidate for experimentation?

--


_______________________________________________
Mediawiki-i18n mailing list
Mediawiki-i18n@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n




--
(http://cscott.net)