After Erik's email about supporting open source machine translation [1], I've been researching options and talking with several machine translation researchers about the best way to integrate MT into Wikipedia. Unfortunately I couldn't find a single solution that, on its own, would fulfill all requirements (especially being open source). On the plus side, there is a set of technologies that, if integrated, could provide a positive and reliable experience. It would be a hard road to get there, but even so, it might be worth exploring.
This is the preliminary draft: https://meta.wikimedia.org/wiki/Collaborative_Machine_Translation_for_Wikipe...
It is a shame that the talk about "Supporting translation of Wikipedia content" [2] has not been accepted for WM13. Hopefully there will be enough interest to discuss this topic there anyway.
Micru
[1] http://thread.gmane.org/gmane.org.wikimedia.foundation/65605 [2] https://wikimania2013.wikimedia.org/wiki/Submissions/Supporting_translation_...
On Fri, Jul 26, 2013 at 3:25 PM, David Cuenca dacuetu@gmail.com wrote:
This is the preliminary draft:
https://meta.wikimedia.org/wiki/Collaborative_Machine_Translation_for_Wikipe...
The linked page says:
For this kind of project it is preferred to use a rule-based machine translation (https://en.wikipedia.org/wiki/en:Rule-based_machine_translation) system, because total control is wanted over the whole process and minority languages should be accounted for (not that easy with statistical-based (https://en.wikipedia.org/wiki/en:Statistical_machine_translation) MT, where parallel corpora may be non-existent).
This statement seems rather defeatist to me. Step one of a machine translation effort should be to provide tools to annotate parallel texts in the various wikis, and to edit and maintain their parallelism. Once this is done, you have a substantial parallel corpus, which is then suitable for growing the set of translated articles. That is, minority languages ought to be accounted for by progressively expanding the number of translated articles in their encyclopedia, as we do now. As this is done, machine translation incrementally improves. If there is not enough of an editor community to translate articles, I don't see how you will succeed in the much more technically-demanding tasks of creating rules for a rule-based translation system. The beauty of the statistical approach is that little special ability is needed. --scott
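For illustration only, here is a minimal sketch (Python, with invented field names; this does not describe any existing extension) of what such an annotation layer could store per aligned sentence pair, so that the parallel corpus can grow and survive later edits:

from dataclasses import dataclass, field

@dataclass
class AlignedSegment:
    """One sentence pair in a hypothetical parallel-text store."""
    source_lang: str       # e.g. "en"
    target_lang: str       # e.g. "es"
    source_rev_id: int     # revision of the source article the text was taken from
    target_rev_id: int     # revision of the translated article
    source_text: str
    target_text: str
    fuzzy: bool = False    # set once either side is edited after alignment

@dataclass
class ParallelArticle:
    """All aligned segments for one article pair, grown as translators work."""
    source_title: str
    target_title: str
    segments: list = field(default_factory=list)

    def mark_fuzzy(self, index: int) -> None:
        """Flag a segment instead of discarding it when one side changes."""
        self.segments[index].fuzzy = True

# A tiny English-Spanish example that could later feed statistical MT.
pair = ParallelArticle("Cat", "Gato")
pair.segments.append(AlignedSegment("en", "es", 1001, 2002,
                                    "The cat is a small domesticated mammal.",
                                    "El gato es un pequeño mamífero domesticado."))
pair.mark_fuzzy(0)   # e.g. after a later edit to the English sentence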
On Fri, Jul 26, 2013 at 11:30 PM, C. Scott Ananian cananian@wikimedia.org wrote:
This statement seems rather defeatist to me. Step one of a machine translation effort should be to provide tools to annotate parallel texts in the various wikis, and to edit and maintain their parallelism.
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you want to implement it and then you realize that you have to freeze changes both in the source text and in the target language for it to happen, which is, IMHO against the very nature of wikis. Translate:Extension already does that in a way. I see it useful only for texts acting as a central hub for translations, like official communication. If that were to happen for all kind of content you would have to sacrifice the plurality of letting each wiki to do their own version.
Once this is done, you have a substantial parallel corpus, which is then suitable for growing the set of translated articles. That is, minority languages ought to be accounted for by progressively expanding the number of translated articles in their encyclopedia, as we do now. As this is done, machine translation incrementally improves.
The most popular statistical machine translation system created its engine using texts extracted from *the whole internet*; it requires huge processing power, and that is without mentioning the resources that went into research and development. Even with all those resources, they managed to create a system that only sort of works. Wikipedia has neither enough text nor the resources to follow that route, and the target number of languages is even higher. Of course, statistical approaches should be used as well (point 8 of the proposed workflow), but more as a supporting technology than as the main one.
If there is not enough of an editor community to translate articles, I don't see how you will succeed in the much more technically-demanding tasks of creating rules for a rule-based translation system. The beauty of the statistical approach is that little special ability is needed.
A single researcher can create working transfer rules for a language pair in three months or less if there is previous work to build on (see these GSoC projects: [1], [2], [3]). Whatever problems the translation has, they can be understood and corrected. With statistics, you rely on bulk numbers and on the hope that you have enough coverage, which makes correcting its defects even harder. It is true that writing transfer rules is technically demanding, but so is writing MediaWiki software, which keeps being developed anyway. After seeing how their system works, I think there is room for simplifying transfer rules (first storing them as MediaWiki templates, then as linked data, then adding a user interface). That could lower the entry barrier for linguists and translators alike, while enabling the triangulation of rules between language pairs that share a common language.
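To make the triangulation idea concrete, here is a toy sketch (Python; the rule format is invented and far simpler than real Apertium transfer rules, which also handle agreement and morphology): given an en-es rule and an es-ca rule that match the same pattern, an en-ca candidate rule can be derived by composing their reorderings.

# A hypothetical, heavily simplified transfer-rule format: a rule matches a
# sequence of part-of-speech tags in the source language and says how to
# reorder them in the target language.
rules = [
    # English "adjective noun" becomes Spanish "noun adjective".
    {"pair": ("en", "es"), "match": ["adj", "n"], "reorder": [1, 0]},
    # Spanish "noun adjective" keeps its order in Catalan.
    {"pair": ("es", "ca"), "match": ["n", "adj"], "reorder": [0, 1]},
]

def triangulate(rules, src, pivot, tgt):
    """Derive candidate src->tgt rules by chaining src->pivot and pivot->tgt rules."""
    derived = []
    for r1 in rules:
        if r1["pair"] != (src, pivot):
            continue
        # The pivot-side tag sequence produced by r1.
        pivot_seq = [r1["match"][i] for i in r1["reorder"]]
        for r2 in rules:
            if r2["pair"] == (pivot, tgt) and r2["match"] == pivot_seq:
                # Compose the two reorderings.
                reorder = [r1["reorder"][i] for i in r2["reorder"]]
                derived.append({"pair": (src, tgt), "match": r1["match"],
                                "reorder": reorder})
    return derived

print(triangulate(rules, "en", "es", "ca"))
# -> [{'pair': ('en', 'ca'), 'match': ['adj', 'n'], 'reorder': [1, 0]}]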
As said before, there is no single tool that can do everything; it is the combination of them that will bring the best results. The good thing is that there is no need to "marry" a technology: several can be developed in parallel and brought to a point of convergence where they work together for optimal results.
I appreciate that you took time to read the proposal :)
Thanks, David
[1] http://www.google-melange.com/gsoc/project/google/gsoc2013/akindalki/3001 [2] http://www.google-melange.com/gsoc/project/google/gsoc2013/jcentelles/20001 [3] http://www.google-melange.com/gsoc/project/google/gsoc2013/jonasfromseier/50...
On Sat, Jul 27, 2013 at 10:18 AM, David Cuenca dacuetu@gmail.com wrote:
Scott, "edit and maintain" parallelism sounds wonderful on paper, until you want to implement it and then you realize that you have to freeze changes both in the source text and in the target language for it to happen, which is, IMHO against the very nature of wikis.
Certainly not. As you yourself linked, there are 'fuzzy annotation' tools and other techniques. Just because one word in one language is changed, it shouldn't invalidate the entire parallelism. And the beauty of the statistical approach is that, if the changes are minor, you can still view the changed copy as a 'roughly parallel' text. After all, if I just replaced 'white' with 'pale', it doesn't necessarily mean that the translation 'blanco' is invalid. In fact, it adds more data points.
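A sketch of that idea (Python, standard library only; the 0.8 threshold is an arbitrary illustration, not a tested value): a small edit keeps the pair as 'roughly parallel' instead of throwing it away.

import difflib

def still_roughly_parallel(old_source: str, new_source: str,
                           threshold: float = 0.8) -> bool:
    """Treat an aligned pair as 'roughly parallel' if the edited source
    sentence is still similar enough to the one the alignment was made for."""
    ratio = difflib.SequenceMatcher(None, old_source, new_source).ratio()
    return ratio >= threshold

aligned_en = "The white house stood on the hill."
aligned_es = "La casa blanca estaba en la colina."
edited_en = "The pale house stood on the hill."

print(still_roughly_parallel(aligned_en, edited_en))  # True: keep the pair, perhaps flagged fuzzy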
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
The Translate extension already does that in a way. I see it as useful only for texts acting as a central hub for translations, like official communications. If that were to happen for all kinds of content, you would have to sacrifice the plurality of letting each wiki do its own version.
I think you're attributing the faults of a single implementation/UX to the technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
The most popular statistical machine translation system created its engine using texts extracted from *the whole internet*; it requires
...but its genesis used the UN corpora *only*. And in fact, the last paper I read (and please correct me if I'm wrong, I'm a dilettante, not an expert) still claimed that the UN parallel text corpora were orders of magnitude more useful than the "whole internet" data, because they had more reliable parallelism and were done by careful translators.
This is what WP has the potential to be.
huge processing power, and that is without mentioning the resources that went into research and development. Even with all those resources, they managed to create a system that only sort of works. Wikipedia has neither enough text nor the resources to follow that route, and the target number of languages is even higher.
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
Of course, statistical approaches should be used as well (point 8 of the proposed workflow), but more as a supporting technology than as the main one.
I'm sure we can agree to disagree here. Probably our main differences are in answers to the question, "where should we start work"? I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
I appreciate that you took time to read the proposal :)
And I certainly appreciate your effort to write the proposal and to work on the topic! --scott
ps. note that the inter-language links in the sidebar of wikipedia articles already comprise a very interesting corpus of noun translations. I don't think this dataset is currently exploited fully.
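For example, those title translations are already retrievable programmatically through the standard MediaWiki API (action=query, prop=langlinks). A rough sketch in Python (the `requests` dependency is an assumption, not a requirement of the API):

import requests

def title_translations(title: str, lang: str = "en") -> dict:
    """Return {language code: title} for one article's interlanguage links,
    via the MediaWiki API (action=query, prop=langlinks)."""
    resp = requests.get(
        "https://%s.wikipedia.org/w/api.php" % lang,
        params={"action": "query", "prop": "langlinks", "titles": title,
                "lllimit": "max", "format": "json"},
        timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    links = next(iter(pages.values())).get("langlinks", [])
    return {ll["lang"]: ll["*"] for ll in links}

# e.g. {'de': 'Hauskatze', 'fr': 'Chat', ...} -- a crude multilingual noun lexicon.
print(title_translations("Cat"))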
On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian cananian@wikimedia.org wrote:
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
I think we both agree about the need for and usefulness of having a copious amount of parallel text. The main difficulty is how to get there from scratch. As I see it there are several possible paths:
- volunteers creating the corpus manually (some work done, however not properly tagged)
- using a statistical approach to create the base text, with volunteers improving that text only
- using rules and statistics to create the base text, with volunteers improving the text and optionally the rules
The end result of all options is the creation of a parallel corpus that can be reused for statistical translation. In my opinion, the effectiveness of giving users the option to improve/select the rules is much greater than having them improve the text only. It complements statistical analysis rather than replacing it, and it provides a good starting point to solve the chicken-and-egg conundrum, especially in small Wikipedias.
Currently translatewiki relies on external tools over which we don't have much control, besides their being proprietary and carrying the risk that they can be disabled at any time.
I think you're attributing the faults of a single implementation/UX to the technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
Good call. Now that you mention it, yes, it would be great to have a place to keep a parallel corpus, and it would be even more useful if it could be annotated with wikidata-wiktionary senses. A wikibase repo might be the way to go. No idea if Wikidata or Translatewiki is the right place to store/display it. Maybe Wikimania will be a good time to discuss it. I have added it to the "elements" section.
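To make the idea tangible, here is a hypothetical shape for one sense-annotated segment in such a store (nothing here reflects an existing schema; Q146 is the Wikidata item for the house cat, the lexeme-style identifier is invented):

# A hypothetical record for a sense-annotated parallel segment.
segment = {
    "source": {"lang": "en", "text": "The cat sleeps."},
    "target": {"lang": "es", "text": "El gato duerme."},
    "alignments": [
        # (source token indices, target token indices, sense/item identifier)
        {"source_tokens": [1], "target_tokens": [1], "sense": "Q146"},     # cat <-> gato
        {"source_tokens": [2], "target_tokens": [2], "sense": "L:sleep"},  # sleeps <-> duerme (invented id)
    ],
}

def senses_used(seg: dict) -> set:
    """Collect the sense identifiers referenced by a segment's alignments."""
    return {a["sense"] for a in seg["alignments"]}

print(senses_used(segment))   # {'Q146', 'L:sleep'}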
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
My main concern about statistical machine translation is that it needs volume to be effective, hence the proposal to use rule-based translation to reach that critical point faster than by using statistics on the existing text alone.
I'm sure we can agree to disagree here. Probably our main differences are in answers to the question, "where should we start work"? I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
It is quite possible to have multiple developments in parallel. That a semantic dictionary is in development doesn't hinder the creation of a parallel corpus or an interface for annotating. The same applies to statistics and rules: they are not incompatible; in fact, they complement each other pretty well.
ps. note that the inter-language links in the sidebar of wikipedia articles already comprise a very interesting corpus of noun translations. I don't think this dataset is currently exploited fully.
I couldn't agree more. I would also suggest taking a close look at CoSyne; I'm sure some of it can be reused: http://www.cosyne.eu/index.php/Main_Page
Cheers, David
In my opinion the only thing that is going to work in the short term is a guided rule-based system. We need that to be able to reuse values from Wikidata in running text. That is, a template text must be transformed according to gender, plurality, etc., and the values must also be adjusted to genitive, locative, illative, etc. forms.
https://meta.wikimedia.org/wiki/Wikimedia_Fellowships/Project_Ideas/Tools_fo...
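A bare-bones sketch of that idea (Python; the placeholder syntax and the hand-written Finnish inflection table are both invented for illustration, not taken from the linked proposal): a template names the value and the case it needs, and the renderer inserts the inflected form.

# The inflection table is hypothetical and hand-written; a real system would
# need per-language morphological rules or a generator.
INFLECTIONS = {
    "Helsinki": {"nominative": "Helsinki",
                 "genitive": "Helsingin",
                 "inessive": "Helsingissä",   # "in Helsinki"
                 "illative": "Helsinkiin"},   # "into Helsinki"
}

def inflect(value: str, case: str) -> str:
    """Return the requested case form of a value, falling back to the value itself."""
    return INFLECTIONS.get(value, {}).get(case, value)

def render(template: str, **values) -> str:
    """Fill a template whose placeholders name both the value and the case,
    e.g. '{city:inessive}'."""
    out = template
    for name, value in values.items():
        for case in ("nominative", "genitive", "inessive", "illative"):
            out = out.replace("{%s:%s}" % (name, case), inflect(value, case))
    return out

print(render("Museo sijaitsee {city:inessive}.", city="Helsinki"))
# -> "Museo sijaitsee Helsingissä."  ("The museum is located in Helsinki.")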
On Sat, Jul 27, 2013 at 5:40 PM, David Cuenca dacuetu@gmail.com wrote:
On Sat, Jul 27, 2013 at 10:39 AM, C. Scott Ananian cananian@wikimedia.orgwrote:
My main point was just that there is a chicken-and-egg problem here. You assume that machine translation can't work because we don't have enough parallel texts. But, to the extent that machine-aided translation of WP is successful, it creates a large amount of parallel text. I agree that there are challenges. I simply disagree, as a matter of logic, with the blanket dismissal of the chickens because there aren't yet any eggs.
I think we both agree about the need and usefulness of having a copious amount of parallel text. The main difficulty is how to get there from scratch. As I see it there are several possible paths
- volunteers creating the corpus manually (some work done, however not
properly tagged)
- use a statistic approach to create the base text and volunteers would
improve that text only
- use rules and statistics to create the base text and volunteers would
improve the text and optionally the rules
The end result of all options is the creation of a parallel corpus that can be reused for statistic translation. In my opinion, the efectivity of giving users the option to improve/select the rules is much larger than improving the text only. It complements statistic analysis rather than replacing it and it provides a good starting point to solve the egg-chicken conundrum, specially in small Wikipedias.
Currently translatewiki is relying on external tools where we don't have much control, besides of being propietary and with the risk that they can be disabled any time.
I think you're attributing the faults of a single implementation/UX to the
technique as a whole. (Which is why I felt that "step 1" should be to create better tools for maintaining information about parallel structures in the wikidata.)
Good call. Now that you mention it, yes, it would be great to have a place where to keep a parallel corpus, and it would be even more useful if it can be annotated with wikidata-wiktionary senses. A wikibase repo might be the way to go. No idea if Wikidata or Translatewiki are the right places to store/display it. Maybe it will be a good time to discuss it during Wikimania. I have added it to the "elements" section.
In a world with an active Moore's law, WP *does* have the computing power to approximate this effort. Again, the beauty of the statistical approach is that it scales.
My main concern about statistic-based machine translation is that it needs volume to be effective, hence the proposal to use rule-based translation to reach the critical point faster than just using statistics on existing text alone.
I'm sure we can agree to disagree here. Probably our main differences are in answers to the question, "where should we start work"? I think annotating parallel texts is the most interesting research question ("research" because I agree that wiki editing by volunteers makes the UX problem nontrivial). I think your suggestion is to start work on the "semantic multilingual dictionary"?
It is quite possible to have multiple developments in parallel. That a semantic dictionary is in development doesn't hinder the creation of a parallel corpus or an interface for annotating. The same applies to statistics/rules, they are not incompatible, in fact they complement each other pretty well.
ps. note that the inter-language links in the sidebar of wikipedia articles already comprise a very interesting corpus of noun translations. I don't think this dataset is currently exploited fully.
I couldn't agree more. I would ask to take a close look to CoSyne. I'm sure some of it can be reused: http://www.cosyne.eu/index.php/Main_Page
Cheers, David _______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
I think we're mostly agreed now. And I agree that rule-based systems can provide valuable bootstrapping, if the requisite language experts can be found. I suspect we will find that different language pairs will favor different techniques. --scott
On 07/26/2013 09:25 PM, David Cuenca wrote:
This is the preliminary draft: https://meta.wikimedia.org/wiki/Collaborative_Machine_Translation_for_Wikipe...
Apertium, the GNU GPL software project for rule-based translation that you mention, seems quite promising. In the near term, it would make sense for WMF chapters to support its existing subprojects for language pairs, http://wiki.apertium.org/wiki/Language_and_pair_maintainer
For example, the Swedish and Danish wiki communities could team up with the existing three maintainers for the Swedish-Danish translation pair within Apertium. We could also build on our international community to help set up Apertium teams for other language pairs, e.g. Swedish-Russian or Polish-Italian.
Has this been tried?
Following this Apertium path, using their existing technology, does not preclude us from also using and studying other software, or developing our own.
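For what it's worth, wiring a locally installed Apertium pair into a script is already straightforward. A sketch in Python calling the apertium command-line wrapper (the 'swe-dan' mode name is an assumption; the exact code depends on how the pair is packaged, e.g. sv-da in older releases):

import subprocess

def apertium_translate(text: str, mode: str = "swe-dan") -> str:
    """Run a locally installed Apertium translation pair on a piece of text.
    The mode name here is an assumption; check which pairs are installed."""
    result = subprocess.run(
        ["apertium", mode],
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

print(apertium_translate("Katten sover på stolen."))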