In my opinion the only thing that will work in the short term is a
guided rule-based system. We need one to be able to reuse values from
Wikidata in running text. That is, a template text must be transformed
according to gender, plurality, etc., and the values must also be
adjusted to genitive, locative, illative, and other forms.
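As a minimal sketch of what such a guided rule-based system could look like (the placeholder syntax, the toy Finnish lexicon, and all function names here are illustrative assumptions, not an existing API):

```python
# Illustrative sketch: a template placeholder names a Wikidata value
# plus the grammatical form it must take; a small per-language lexicon
# supplies the inflected surface forms. Everything here is a toy.
import re

# Toy Finnish lexicon: lemma -> {form name: inflected surface form}
LEXICON = {
    "Helsinki": {"nominative": "Helsinki",
                 "genitive": "Helsingin",
                 "illative": "Helsinkiin"},
}

def render(template, values):
    """Replace {key:form} placeholders with the inflected value."""
    def repl(match):
        key, form = match.group(1), match.group(2)
        lemma = values[key]
        # Fall back to the uninflected lemma if no rule is known.
        return LEXICON.get(lemma, {}).get(form, lemma)
    return re.sub(r"\{(\w+):(\w+)\}", repl, template)

print(render("Hän muutti {city:illative}.", {"city": "Helsinki"}))
# → Hän muutti Helsinkiin.
```

Volunteers would then improve the lexicon entries and rules rather than correcting each generated sentence by hand.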
On Sat, Jul 27, 2013 at 5:40 PM, David Cuenca <dacuetu(a)gmail.com> wrote:
On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian
<cananian(a)wikimedia.org> wrote:
My main point was just that there is a
chicken-and-egg problem here. You
assume that machine translation can't work because we don't have enough
parallel texts. But, to the extent that machine-aided translation of WP is
successful, it creates a large amount of parallel text. I agree that
there are challenges. I simply disagree, as a matter of logic, with the
blanket dismissal of the chickens because there aren't yet any eggs.
I think we both agree about the need and usefulness of having a copious
amount of parallel text. The main difficulty is how to get there from
scratch. As I see it, there are several possible paths:
- volunteers creating the corpus manually (some work has been done, but
it is not properly tagged)
- using a statistical approach to create the base text, with volunteers
improving that text only
- using rules and statistics to create the base text, with volunteers
improving the text and optionally the rules
The end result of all options is the creation of a parallel corpus that can
be reused for statistical translation. In my opinion, giving users the
option to improve/select the rules is far more effective than having them
improve the text only. It complements statistical analysis rather than
replacing it, and it provides a good starting point to solve the
chicken-and-egg conundrum, especially in small Wikipedias.
Currently translatewiki relies on external tools over which we don't have
much control; besides being proprietary, they risk being disabled at any
time.
I think you're attributing the faults of a single implementation/UX to the
technique as a whole. (Which is why I felt that
"step 1" should be to
create better tools for maintaining information about parallel structures
in the wikidata.)
Good call. Now that you mention it, yes, it would be great to have a place
to keep a parallel corpus, and it would be even more useful if it could
be annotated with wikidata-wiktionary senses. A wikibase repo might be the
way to go. No idea if Wikidata or Translatewiki are the right places to
store/display it. Maybe it would be a good time to discuss it during
Wikimania. I have added it to the "elements" section.
In a world with an active Moore's law, WP *does* have the computing power
to approximate this effort. Again, the beauty of the statistical approach
is that it scales.
My main concern about statistics-based machine translation is that it needs
volume to be effective, hence the proposal to use rule-based translation to
reach the critical point faster than by relying on statistics on existing
text alone.
I'm sure we can agree to disagree here. Probably our main differences are
in answers to the question, "where should we start work"? I think
annotating parallel texts is the most interesting research question
("research" because I agree that wiki editing by volunteers makes the UX
problem nontrivial). I think your suggestion is to start work on the
"semantic multilingual dictionary"?
It is quite possible to have multiple developments in parallel. That a
semantic dictionary is in development doesn't hinder the creation of a
parallel corpus or an interface for annotating. The same applies to
statistics/rules, they are not incompatible, in fact they complement each
other pretty well.
ps. note that the inter-language links in the
sidebar of wikipedia articles
already comprise a very interesting corpus of noun translations. I don't
think this dataset is currently exploited fully.
I couldn't agree more. I would ask to take a close look at CoSyne. I'm sure
some of it can be reused:
http://www.cosyne.eu/index.php/Main_Page
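To illustrate the point about the sidebar inter-language links: since they
are backed by sitelinks on Wikidata items, title translation pairs can be
extracted mechanically. A rough sketch (the sitelinks dict is hard-coded
here for illustration; in practice it would come from the Wikidata API,
and the `title_pairs` helper is a hypothetical name):

```python
# Sketch: treat one Wikidata item's sitelinks (the data behind the
# sidebar inter-language links) as a set of title translation pairs.

def title_pairs(sitelinks, pivot="enwiki"):
    """Yield (pivot_title, language, title) pairs from one item's sitelinks."""
    src = sitelinks.get(pivot)
    if src is None:
        return
    for wiki, title in sitelinks.items():
        # Site IDs like "dewiki" encode the language code as a prefix.
        if wiki != pivot and wiki.endswith("wiki"):
            yield (src, wiki[:-4], title)

# Abbreviated, hand-written example resembling the sitelinks of one item.
item = {
    "enwiki": "Dog",
    "dewiki": "Haushund",
    "fiwiki": "Koira",
}
print(sorted(title_pairs(item)))
# → [('Dog', 'de', 'Haushund'), ('Dog', 'fi', 'Koira')]
```

Run over all items, this would yield exactly the kind of noun-translation
corpus mentioned above, though disambiguators and redirects would need
filtering in practice.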
Cheers,
David
_______________________________________________
Wikitech-l mailing list
Wikitech-l(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l