In my opinion the only thing that will work in the short term is a
guided rule-based system. We need one to be able to reuse values from
Wikidata in running text. That is, a template text must be transformed
according to gender, plurality, etc., and the values must also be
adjusted to genitive, locative, illative, and other forms.
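As a minimal sketch of what such a guided rule-based system could look like (the placeholder syntax, the toy Finnish lexicon, and all function names here are illustrative assumptions, not an existing API):

```python
# Illustrative sketch: a template placeholder names a Wikidata value
# plus the grammatical form it must take; a small per-language lexicon
# supplies the inflected surface forms. Everything here is a toy.
import re

# Toy Finnish lexicon: lemma -> {form name: inflected surface form}
LEXICON = {
    "Helsinki": {"nominative": "Helsinki",
                 "genitive": "Helsingin",
                 "illative": "Helsinkiin"},
}

def render(template, values):
    """Replace {key:form} placeholders with the inflected value."""
    def repl(match):
        key, form = match.group(1), match.group(2)
        lemma = values[key]
        # Fall back to the uninflected lemma if no rule is known.
        return LEXICON.get(lemma, {}).get(form, lemma)
    return re.sub(r"\{(\w+):(\w+)\}", repl, template)

print(render("Hän muutti {city:illative}.", {"city": "Helsinki"}))
# → Hän muutti Helsinkiin.
```

Volunteers would then improve the lexicon entries and rules rather than correcting each generated sentence by hand.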
On Sat, Jul 27, 2013 at 5:40 PM, David Cuenca <dacuetu(a)gmail.com> wrote:
On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian
<cananian(a)wikimedia.org> wrote:
My main point was just that there is a
chicken-and-egg problem here. You
assume that machine translation can't work because we don't have enough
parallel texts. But, to the extent that machine-aided translation of WP is
successful, it creates a large amount of parallel text. I agree that
there are challenges. I simply disagree, as a matter of logic, with the
blanket dismissal of the chickens because there aren't yet any eggs.
I think we both agree about the need and usefulness of having a copious
amount of parallel text. The main difficulty is how to get there from
scratch. As I see it, there are several possible paths:
- volunteers creating the corpus manually (some work has been done, but
it is not properly tagged)
- using a statistical approach to create the base text, with volunteers
improving that text only
- using rules and statistics to create the base text, with volunteers
improving the text and optionally the rules
The end result of all options is the creation of a parallel corpus that can
be reused for statistical translation. In my opinion, giving users the
option to improve/select the rules is far more effective than having them
improve the text only. It complements statistical analysis rather than
replacing it, and it provides a good starting point to solve the
chicken-and-egg conundrum, especially in small Wikipedias.
Currently translatewiki relies on external tools over which we don't have
much control; besides being proprietary, they risk being disabled at any
time.
I think you're attributing the faults of a single implementation/UX to the
technique as a whole. (Which is why I felt that
"step 1" should be to
create better tools for maintaining information about parallel structures
in the wikidata.)
Good call. Now that you mention it, yes, it would be great to have a place
to keep a parallel corpus, and it would be even more useful if it could
be annotated with wikidata-wiktionary senses. A wikibase repo might be the
way to go. No idea if Wikidata or Translatewiki are the right places to
store/display it. Maybe it would be a good time to discuss it during
Wikimania. I have added it to the "elements" section.
In a world with an active Moore's law, WP *does* have the computing power
to approximate this effort. Again, the beauty of the statistical approach
is that it scales.
My main concern about statistics-based machine translation is that it needs
volume to be effective, hence the proposal to use rule-based translation to
reach the critical point faster than by relying on statistics on existing
text alone.
I'm sure we can agree to disagree here. Probably our main differences are
in answers to the question, "where should we start work"? I think
annotating parallel texts is the most interesting research question
("research" because I agree that wiki editing by volunteers makes the UX
problem nontrivial). I think your suggestion is to start work on the
"semantic multilingual dictionary"?
It is quite possible to have multiple developments in parallel. That a
semantic dictionary is in development doesn't hinder the creation of a
parallel corpus or an interface for annotating. The same applies to
statistics/rules, they are not incompatible, in fact they complement each
other pretty well.
ps. note that the inter-language links in the
sidebar of wikipedia articles
already comprise a very interesting corpus of noun translations. I don't
think this dataset is currently exploited fully.
I couldn't agree more. I would ask to take a close look at CoSyne. I'm sure
some of it can be reused:
http://www.cosyne.eu/index.php/Main_Page
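To illustrate the point about the sidebar inter-language links: since they
are backed by sitelinks on Wikidata items, title translation pairs can be
extracted mechanically. A rough sketch (the sitelinks dict is hard-coded
here for illustration; in practice it would come from the Wikidata API,
and the `title_pairs` helper is a hypothetical name):

```python
# Sketch: treat one Wikidata item's sitelinks (the data behind the
# sidebar inter-language links) as a set of title translation pairs.

def title_pairs(sitelinks, pivot="enwiki"):
    """Yield (pivot_title, language, title) pairs from one item's sitelinks."""
    src = sitelinks.get(pivot)
    if src is None:
        return
    for wiki, title in sitelinks.items():
        # Site IDs like "dewiki" encode the language code as a prefix.
        if wiki != pivot and wiki.endswith("wiki"):
            yield (src, wiki[:-4], title)

# Abbreviated, hand-written example resembling the sitelinks of one item.
item = {
    "enwiki": "Dog",
    "dewiki": "Haushund",
    "fiwiki": "Koira",
}
print(sorted(title_pairs(item)))
# → [('Dog', 'de', 'Haushund'), ('Dog', 'fi', 'Koira')]
```

Run over all items, this would yield exactly the kind of noun-translation
corpus mentioned above, though disambiguators and redirects would need
filtering in practice.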
Cheers,
David
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l