On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian <cananian@wikimedia.org> wrote:
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
I think we both agree on the need for, and the usefulness of, a copious amount of parallel text. The main difficulty is how to get there from scratch. As I see it, there are several possible paths:
- volunteers create the corpus manually (some work has been done, although it is not properly tagged)
- a statistical approach creates the base text, and volunteers improve that text only
- rules and statistics create the base text, and volunteers improve the text and, optionally, the rules
The end result of all these options is a parallel corpus that can be reused for statistical translation. In my opinion, giving users the option to improve or select the rules is far more effective than having them improve the text only. It complements statistical analysis rather than replacing it, and it provides a good starting point for solving the chicken-and-egg conundrum, especially in the small Wikipedias (see the sketch below).
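To make that last path concrete, here is a rough Python sketch of the bootstrap loop I have in mind; rule_based_translate() stands in for a real rule-based engine (Apertium-like) and post_edit() for the volunteer correction step, so everything here is a made-up placeholder rather than existing Wikimedia code:

    # Toy bilingual lexicon standing in for real transfer rules.
    TOY_RULES = {"the": "la", "chicken": "gallina", "lays": "pone", "eggs": "huevos"}

    def rule_based_translate(sentence):
        """Naive word-for-word stand-in for a rule-based draft translation."""
        return " ".join(TOY_RULES.get(w, w) for w in sentence.lower().split())

    def bootstrap_corpus(sentences, post_edit):
        """Pair each source sentence with a human-corrected machine draft.

        The (source, target) pairs feed later statistical training, while the
        draft/correction diff can be mined to refine the rules themselves.
        """
        corpus = []
        for sentence in sentences:
            draft = rule_based_translate(sentence)
            corrected = post_edit(draft)  # a volunteer fixes the draft
            corpus.append({"source": sentence, "draft": draft, "target": corrected})
        return corpus

    # Usage: the lambda simulates a volunteer accepting the draft unchanged.
    pairs = bootstrap_corpus(["The chicken lays eggs"], post_edit=lambda draft: draft)
    print(pairs)

The point of keeping the draft next to the correction is that the same data serves both goals at once: the pairs grow the statistical corpus, and the diffs tell us which rules volunteers keep having to fix.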
Currently, translatewiki relies on external tools over which we don't have much control; besides being proprietary, they run the risk of being disabled at any time.
I think you're attributing the faults of a single implementation/UX to the technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
Good call. Now that you mention it, yes, it would be great to have a place to keep a parallel corpus, and it would be even more useful if it could be annotated with Wikidata/Wiktionary senses. A Wikibase repo might be the way to go. I have no idea whether Wikidata or Translatewiki is the right place to store/display it. Wikimania might be a good time to discuss this. I have added it to the "elements" section.
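To make that slightly more concrete, here is a rough sketch of what a single sense-annotated entry could look like; the field names are hypothetical (not an existing Wikibase schema) and the Q-id is only illustrative:

    from dataclasses import dataclass, field

    @dataclass
    class AlignedSegment:
        source_lang: str   # e.g. "en"
        target_lang: str   # e.g. "es"
        source_text: str
        target_text: str
        # source token index -> Wikidata/Wiktionary sense identifier
        senses: dict = field(default_factory=dict)

    segment = AlignedSegment(
        source_lang="en",
        target_lang="es",
        source_text="The bank was closed.",
        target_text="El banco estaba cerrado.",
        senses={1: "Q22687"},  # illustrative Q-id for the financial-institution sense of "bank"
    )
    print(segment)

Whatever the final home turns out to be (Wikidata, Translatewiki or a new Wikibase repo), the sense layer can stay optional, so plain aligned text could be stored first and annotated later.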
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
My main concern about statistical machine translation is that it needs volume to be effective, hence the proposal to use rule-based translation to reach that critical point faster than statistics on the existing text alone would.
I'm sure we can agree to disagree here. Probably our main differences are in answers to the question, "where should we start work"? I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
It is quite possible to have multiple developments in parallel. That a semantic dictionary is in development doesn't hinder the creation of a parallel corpus or of an interface for annotating it. The same applies to statistics and rules: they are not incompatible; in fact, they complement each other pretty well.
ps. note that the inter-language links in the sidebar of wikipedia articles already comprise a very interesting corpus of noun translations. I don't think this dataset is currently exploited fully.
I couldn't agree more. I would also suggest taking a close look at CoSyne; I'm sure some of it can be reused: http://www.cosyne.eu/index.php/Main_Page
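As a quick illustration of how easily that inter-language-link dataset can be harvested, here is a small sketch against the public MediaWiki API (the helper name and the sample title are just examples, not anything that exists today):

    # Sketch: collect noun translation pairs from Wikipedia inter-language links.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def langlink_pairs(title, languages=("es", "de", "fr")):
        """Yield (language, translated title) pairs for one English article."""
        params = {
            "action": "query",
            "titles": title,
            "prop": "langlinks",
            "lllimit": "500",
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for link in page.get("langlinks", []):
                if link["lang"] in languages:
                    yield link["lang"], link["*"]

    # Usage: print a handful of translations of one article title.
    for lang, translation in langlink_pairs("Chicken"):
        print("Chicken ->", lang, ":", translation)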
Cheers, David