I'm using "small languages" here to describe languages without much
available training data (manually created translation pairs). Nothing is
implied about the size of the speaker base, language vocabulary, or wiki
project. For instance, until recently I would have called both Mandarin
Chinese and Latvian "small languages", because training data for them were
absent from most of the academic translation work. Only this year were
standard training data for these languages included in
http://www.statmt.org/wmt17/translation-task.html, after
collaboration with the University of Latvia and "Nanjing University, Xiamen
University, The Institutes of Computing Technology and of Automation,
Chinese Academy of Science, Northeastern University (China) and Datum Data
Co., Ltd".
--scott
On Fri, Sep 15, 2017 at 6:04 PM, mathieu stumpf guntz <
psychoslave(a)culture-libre.org> wrote:
Well, the method seems interesting. Now I would be interested to see some
concrete translations, if you have some links.
What do you call small languages?
On 15/09/2017 at 18:14, C. Scott Ananian wrote:
We're tracking source/destination pairs generated by the
ContentTranslation tool, right? Could someone point me to that dataset?
(I'm playing around with some machine translation stuff to see if I can
prototype a suggester tool that would translate edits on wiki A to
corresponding edits on wiki B.)
--scott
PS. There's some cool work being done on "zero-shot translation", aka
bootstrapping translation tools for small languages by pre-training them on
a related language (or even an unrelated language). Apparently that works!
(Cf. https://arxiv.org/pdf/1611.04558.pdf) It can greatly reduce the
amount of data required to build a translation model for the small language.
Is there a "small wiki" that's been wanting to use ContentTranslation
which would be a good candidate for experimentation?
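
To make the zero-shot idea a bit more concrete: in the paper linked above, a
single model is trained on several language pairs at once, and each source
sentence is prefixed with an artificial token naming the desired target
language. Below is a minimal sketch of just that data-preparation step. The
function names, the toy sentences, and the language pairs are illustrative
assumptions, not taken from ContentTranslation or from any real dataset, and
the model itself is not shown.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def tag_for_target(source: str, target_lang: str) -> str:
    """Prefix a source sentence with an artificial target-language token."""
    return f"<2{target_lang}> {source}"


def build_training_set(corpora: Dict[Tuple[str, str], List[Pair]]) -> List[Pair]:
    """Merge several parallel corpora into one tagged training set.

    `corpora` maps (source_lang, target_lang) to a list of sentence pairs.
    Only the target language is encoded in the token; the source language
    is left for the model to infer.
    """
    examples: List[Pair] = []
    for (_src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            examples.append((tag_for_target(src, tgt_lang), tgt))
    return examples


# Toy placeholder data (hypothetical): English->Latvian and Lithuanian->English.
toy_corpora = {
    ("en", "lv"): [("hello", "sveiki")],
    ("lt", "en"): [("labas", "hello")],
}
training_set = build_training_set(toy_corpora)

# A zero-shot request: Lithuanian->Latvian never appears as a training pair,
# but the input is tagged in exactly the same way.
zero_shot_input = tag_for_target("labas", "lv")
```

The point of the sketch is that the "small" pair needs no special handling:
it is requested with the same kind of token as the pairs that do have data.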
--
(http://cscott.net)
_______________________________________________
Mediawiki-i18n mailing list
Mediawiki-i18n@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-i18n