Translation between wikis currently operates largely on a "pull" paradigm: someone on the target wiki finds an article in another language (English, for example) and then pulls it over to their own language's wiki.
These days Google and other translation tools are good enough to use as the starting basis for a translated article, and we can consider how to make use of them in an active way. What is largely a "pull" paradigm can also be a "push" paradigm - we can use translation tools to "push" articles to other wikis.
If there are issues, they can be overcome. The fact of the matter is that the vast majority of articles in English can be "pushed" over to other languages, and fill a need for those topics in those languages.
-SC
These days Google and other translation tools are good enough to use as the starting basis for a translated article
No, that's far from true - at least for target languages such as Ukrainian.
So any attempt at "push" translation would be almost a disaster...
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
As far as push translation goes, there are languages where it could almost work and languages where it couldn't. (Consider the experience of the Google team with the Bengali Wikipedia - http://googletranslate.blogspot.com/2010/07/translating-wikipedia.html )
Bence
"If there are issues, they can be overcome. The fact of the matter is that the vast majority of articles in English can be "pushed" over to other languages, and fill a need for those topics in those languages." - if there are vast swathes of topics in another language that aren't filled, that's normally indicative of a small userbase - a small userbase that would then have to cope with copyediting hundreds or thousands of new articles, all referenced in a foreign language. In addition, wikis with vast gaps and small numbers of users are likely to be those in "small" languages; an effective Chuvash translation tool, say, is hardly a massive priority for most online translation services.
Bence, that's a different topic - MAT (Machine Aided Translation), and in the case of Bengali, I believe simply the use of a translation memory system. Some of the comments on that page seem to be quite misinformed, ranging from people who thought Google was inserting unrevised machine translations into Wikipedia articles (that would be a disaster), to people suggesting (begging?) Google allow the user community to localize their UI (they already do - Facebook took the idea from Google!). Oh, also, somebody protesting the fact that the Spanish language was not mentioned in the post and suggesting that such an omission must mean Google hates Spain. I only saw one comment on that page that didn't make me want to bang my head on the keyboard (but such is the Internet, right?)
-m.
On Sat, Jul 24, 2010 at 4:11 PM, Pavlo Shevelo pavlo.shevelo@gmail.com wrote:
These days Google and other translation tools are good enough to use as the starting basis for a translated article
No, that's far from true - at least for target languages such as Ukrainian.
So any attempt at "push" translation would be almost a disaster...
...and we need to remember that most articles are *not* translations of the English article, but are home-grown on the wiki and use their own sources in their own language.
Also don't forget that the same subject can be treated very differently among different cultures (even cultures that are not distant - think of French and English).
An article in the English Wikipedia can be a very good basis to start a new article, but I don't think that an automated "flooding" of the other Wikipedias is a good thing in *any* way.
Cristian
Agreed. There's one wiki which artificially inflated its article count via a bot (I forget the specific language). That's not a way to increase a wiki's strength. There's an old phrase used on en-wiki: "Africa is not a redlink". It means that because we already have articles on a lot of common things, the ways in which people can contribute have been reduced - they can't write an article on Africa, say - and as a result community growth is slowing (one theory, not one I subscribe to). Sustained, non-artificial growth that can respond to the concerns of readers only comes from an active userbase. And for that, there has to be something people can write; there has to be a redlink for Africa, or physics, or Britain. There has to be something that makes a reader go "I could fix that" and become an editor.
Wikipedias are not for _cultures_, they are for languages. If I and 1,000 other Americans suddenly learnt French (to the point of native-level fluency) and decided to read and edit the French Wikipedia, it would "belong" to us just as much as to anybody else. This came up recently in the debate about the Acehnese Wikipedia. Some people said that all Acehnese were Muslim (not true - there is a small community of Acehnese Christians). They said that if anyone is Christian, they'd be ejected from Acehnese society and therefore no longer Acehnese. However, they'd not stop speaking the Acehnese language.
Nobody claims the English WP is for US/Commonwealth cultures only... a cultural focus may be understandable when a wiki is tiny, but as it grows large it's important that NPOV mean "neutral point of view for EVERYBODY", not just "a point of view that everybody in OUR country can agree upon", etc.
-m.
On Sun, Jul 25, 2010 at 1:39 AM, Mark Williamson node.ue@gmail.com wrote:
Wikipedias are not for _cultures_, they are for languages. If I and
I'm surprised to hear that coming from someone who I thought to be a student of languages. I think you might want to read an article from today's Wall Street Journal, about how language influences culture (and, one would extrapolate, Wikipedia articles). http://online.wsj.com/article/SB10001424052748703467304575383131592767868.html
1,000 other Americans suddenly learnt French (to the point of native-level fluency) and decided to read and edit the French Wikipedia, it would "belong" to us just as much as to anybody else. This came up recently in the debate about the Acehnese Wikipedia. Some people said that all Acehnese were Muslim (not true - there is a small community of Acehnese Christians). They said that if anyone is Christian, they'd be ejected from Acehnese society and therefore no longer Acehnese. However, they'd not stop speaking the Acehnese language.
Nobody claims the English WP is for US/Commonwealth cultures only... a cultural focus may be understandable when a wiki is tiny, but as it grows large it's important that NPOV mean "neutral point of view for EVERYBODY", not just "a point of view that everybody in OUR country can agree upon", etc.
No one suggested that it was about "a point of view that everyone in OUR country can agree upon". No one's suggesting that anyone "owns" a wiki or that you're not welcome to contribute. It's just that different wikis/languages are different and have different articles. Some focus on different topics based on what their communities usually do, some try to tackle subjects in a scholarly way, some probably don't focus on blame (see the article's commentary on Japanese/Spanish views of accidental events), etc.
On Sat, Jul 24, 2010 at 11:03 PM, Casey Brown lists@caseybrown.org wrote:
I'm surprised to hear that coming from someone who I thought to be a student of languages. I think you might want to read an article from today's Wall Street Journal, about how language influences culture (and, one would extrapolate, Wikipedia articles). http://online.wsj.com/article/SB10001424052748703467304575383131592767868.html
Casey, that's nothing new, nor is it anything I was unaware of. Whether language influences thought (or vice versa) has long been debated within the scholarly community. Please see http://en.wikipedia.org/wiki/Linguistic_relativity for a more detailed treatment of the subject - there's still no consensus.
Nobody's arguing here that language and culture have no relationship. What I'm saying is that language does not equal culture. Many people speak French who are not part of the culture of France, for example the cities of Libreville and Abidjan in Africa. Many (many!) people who speak English are not part of the culture of England (or even the rest of the UK, the United States, Canada, Australia or New Zealand), including hundreds of thousands, if not millions, of native speakers.
Languages are certainly cultural artifacts, but that does not mean the two are equivalent. Imagine that tomorrow morning everybody in Japan spoke French and only French, and that all Japanese literature and text was suddenly printed only in French. Would Japanese culture cease to exist? Not at all. The customs, attitudes, rituals, beliefs and even the food would not change (attitudes is debatable, perhaps, but I'm not a believer in the Sapir-Whorf hypothesis). Yes, something great would be lost, an irreplaceable _part_ of Japanese culture, but cultures have sometimes persisted in spite of language death. Ritual prayers are sometimes translated into the new language, other times fossilized in a language rendered incomprehensible by time; the same goes for geographic and personal names...
I think it's pretty clear at this point that, for example, all 4 of the regular users of ace.wp are offended by certain images on en.wp. I don't think it would be a stretch to say that many - probably the vast majority - of Acehnese speakers would find those images similarly offensive. Now let's say I've got a toddler and he has an Acehnese caretaker. This caretaker is monolingual in Acehnese, but they've been expressly forbidden from mentioning religion.
When this toddler grows up, he'll probably be good enough at Acehnese, and have spoken it early enough in life, to be considered a native speaker... but will he automatically have any inclination one way or the other about the pictures? Of course not. So the fact that the vast majority of speakers of a language share a cultural background does NOT mean that the language could only ever be spoken by people who belong to that culture. Wikipedia versions are very clearly for languages: the Estonian Wikipedia is the Wikipedia in the Estonian language, not the Wikipedia for Estonian culture.
As an example of this, I have a good friend who grew up speaking Akan, having had a nanny from West Africa. Is my friend a member of the Akan culture? Not really... does that mean she couldn't be a productive member of the Akan Wikipedia (if she wanted to be :-( )? No.
If Wikipedias were for cultures, the edits of Macedonians, Chinese, Italians or Congolese people to en.wp would be somehow less valid than those of native speakers of English in predominantly Anglophone societies. Of course, this is not the case.
That's one of the things I like about en.wp - the fact that people who do not speak English as their primary language form a large portion of our editors means that things are likely to come out a bit more balanced. Argentine editors can edit [[Falkland Islands]], for example. In my humble opinion, this is the way it should be. Language is a troublesome barrier. Who is to ensure that Turkish or Greek articles about Cyprus are neutral? I'm not an advocate of a one-world language, but if we had perfect MT tech, I would be in favor of everybody collaborating on one massive international WP.
No one suggested that it was about "a point of view that everyone in OUR country can agree upon". No one's suggesting that anyone "owns" a wiki or that you're not welcome to contribute. It's just that different wikis/languages are different and have different articles. Some focus on different topics based on what their communities usually do, some try to tackle subjects in a scholarly way, some probably don't focus on blame (see the article's commentary on Japanese/Spanish views of accidental events), etc.
I'll ignore your "probably" here for the time being. However, I don't see any good reason that different Wikis need to have different contents. Different communities have agreed on different policies, articles are allowed on en.wp that would be speedily deleted on de.wp, but these policies are not necessarily based on some intrinsic aspect of the language, but rather the opinions of the people who make up that particular community. Please, please, please note that I am not saying everybody should follow the same policy, I'm just saying that policy differences or differences in article content are not a result of some sort of intrinsic property of the language in which they are written. A _good_ translation of any _good_ article from any language into any other language should be perfectly acceptable, with allowances for local policy differences (which again are set by the Wiki community, NOT by every single speaker of that language).
-m.
stevertigo wrote:
Translation between wikis currently operates largely on a "pull" paradigm: someone on the target wiki finds an article in another language (English, for example) and then pulls it over to their own language's wiki.
These days Google and other translation tools are good enough to use as the starting basis for a translated article, and we can consider how to make use of them in an active way. What is largely a "pull" paradigm can also be a "push" paradigm - we can use translation tools to "push" articles to other wikis.
If there are issues, they can be overcome. The fact of the matter is that the vast majority of articles in English can be "pushed" over to other languages, and fill a need for those topics in those languages.
This is well suited for the dustbin of terrible ideas. It ranks right up there with the notion that the European colonization of Africa was for the sole purpose of civilizing the savages.
Key to the growth of Wikipedias in minority languages is respect for the cultures that they encompass, not flooding them with the First-World Point of View. What might be a Neutral Point of View on the English Wikipedia is limited by the contributions of English writers. Those who do not understand English may arrive at a different neutrality. We have not yet arrived at a Metapedia that would synthesize a single neutrality from all projects.
In addition to bludgeoning these cultures with an imposed neutrality, there is also the risk of overwhelming them with sheer volume. I remember only too well the uproar when a large quantity of articles on every small community in the United States was botted into en-wp. Neutrality was not an issue in that case, but the quantity of unchecked material was, even though it came from a reliable source.
It's important for the minority-language projects to choose what is important to them, and what is relevant to their culture. As useful and uncontroversial as many English articles may be in our eyes, they may still not be notable for minority-language projects.
Ray
I would like to add to this that I think the worst part of this idea is the assumption that other languages should take articles from en.wp.
I would be in favor of an international, language-free Wikipedia if/when perfect (or 99.99% accurate) MT software exists, but that is not currently the case. My point here is that rather than forcing English articles on other languages, everybody everywhere speaking any language should be able to modify the same article and view it in their native language.
-m.
Mark Williamson node.ue@gmail.com wrote:
I would like to add to this that I think the worst part of this idea is the assumption that other languages should take articles from en.wp.
The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'
Mark Williamson node.ue@gmail.com wrote:
Nobody's arguing here that language and culture have no relationship. What I'm saying is that language does not equal culture. Many people speak French who are not part of the culture of France, for example the cities of Libreville and Abidjan in Africa.
Africa is an unusual case, given that it was so linguistically diverse to begin with, and that it's even more so in the post-colonial era, when Arabic, French, English, and Dutch remain prominent marks of imperialistic influence.
Ray Saintonge saintonge@telus.net wrote:
This is well suited for the dustbin of terrible ideas. It ranks right up there with the notion that the European colonization of Africa was for the sole purpose of civilizing the savages.
This is the 'encyclopedic imperialism' counterargument. I thought I'd throw it out there. As Bence noted above, Google has already been working on this for two years and has had both successes and failures. It bears mentioning that their tools have been improving quite steadily. A simple round-trip test such as /English -> Arabic -> English/ will show that.
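The round-trip test mentioned above can be sketched in a few lines. Note that `translate` here is a hypothetical stand-in for whatever MT service is being evaluated - backed by a toy dictionary so the sketch runs offline, not a real API:

```python
# Round-trip MT quality check: translate English -> pivot -> English and
# measure how much of the original text survives the round trip.
from difflib import SequenceMatcher

def translate(text, src, tgt):
    """Stand-in for a real MT service call. A toy lookup table is used here
    so the harness runs without network access; swap in a real client."""
    toy = {
        ("en", "ar"): {"the cat sat on the mat": "جلس القط على الحصيرة"},
        ("ar", "en"): {"جلس القط على الحصيرة": "the cat sat on the mat"},
    }
    # Fall back to returning the input unchanged for unknown sentences.
    return toy.get((src, tgt), {}).get(text, text)

def round_trip_similarity(text, pivot="ar"):
    """Translate text to the pivot language and back, then score the
    result against the original (1.0 = identical)."""
    forward = translate(text, "en", pivot)
    back = translate(forward, pivot, "en")
    return SequenceMatcher(None, text, back).ratio()

print(f"similarity: {round_trip_similarity('the cat sat on the mat'):.2f}")
```

A round-trip score is only a rough proxy (errors can cancel out on the way back), but it is a cheap first signal of how lossy a language pair still is.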
Note that colonialism isn't the issue. It remains a high priority, for example, to teach English in Africa, for the simple reason that language is almost entirely a tool for communication, and English is quite good for that purpose. It's notable that the smaller colonial powers such as the French were never going to succeed at linguistic imperialism in Africa, for the simple reason that French has not actually been the lingua franca for a long time now.
Key to the growth of Wikipedias in minority languages is respect for the cultures that they encompass, not flooding them with the First-World Point of View. What might be a Neutral Point of View on the English Wikipedia is limited by the contributions of English writers. Those who do not understand English may arrive at a different neutrality. We have not yet arrived at a Metapedia that would synthesize a single neutrality from all projects.
I strongly disagree. Neutral point of view has worked on en.wp because it's a universalist concept. The cases where other-language wikis reject English content appear to stem from POV, and thus a violation of NPOV - not because, as you seem to suggest, the POV in such countries must be considered "NPOV."
Casey Brown lists@caseybrown.org wrote:
I'm surprised to hear that coming from someone who I thought to be a student of languages. I think you might want to read an article from today's Wall Street Journal, about how language influences culture (and, one would extrapolate, Wikipedia articles).
I had read Boroditsky's piece in Edge just a few days ago, and it covers a lot of interesting little bits of evidence. As Mark was saying, linguistic relativity (the Sapir-Whorf hypothesis) has been around for most of a century, and its wider conjectures were strongly contradicted by Chomsky et al. Yes, there is compelling evidence that language does "channel" certain kinds of thought, but this should not be overstated. Like other sciences, linguistics can sometimes make the mistake of drawing *qualitative* judgments from a field of *quantitative* evidence. That caution was especially important back in the 40s and 50s, when people were still putting down certain quasi-scientific conjectures from the late 1800s.
Still, there are cultures which claim their languages to be superior in certain ways simply because they are more sonorous, emotive, or otherwise expressive, and that's the essential paradigm some linguists are working within.
-SC
"The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'" - really? It's a) not particularly well written, for the most part, and b) referenced overwhelmingly to English-language sources, most of which are, you guessed it... Western in nature.
really? It's a) not particularly well written, for the most part, and b) referenced overwhelmingly to English-language sources, most of which are, you guessed it... Western in nature.
Very much true. Now English Wikipedians want someone to translate and use an exact copy of en:wp in all the other language Wikipedias. And they have the support of Google for that.
On Tue, Jul 27, 2010 at 5:52 AM, Oliver Keyes scire.facias@gmail.comwrote:
"The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'" - really? It's a) not particularly well-written, mostly and b) referenced overwhelmingly to English-language sources, most of which are, you guessed it.. Western in nature.
On Mon, Jul 26, 2010 at 3:43 AM, stevertigo stvrtg@gmail.com wrote:
Mark Williamson node.ue@gmail.com wrote:
I would like to add to this that I think the worst part of this idea is the assumption that other languages should take articles from en.wp.
The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'
Mark Williamson node.ue@gmail.com wrote:
Nobody's arguing here that language and culture have no relationship. What I'm saying is that language does not equal culture. Many people speak French who are not part of the culture of France, for example the cities of Libreville and Abidjan in Africa.
Africa is an unusual case, given that it was so linguistically diverse to begin with, and that it's even more so in the post-colonial era, when Arabic, French, English, and Dutch remain prominent marks of imperialistic influence.
Ray Saintonge saintonge@telus.net wrote:
This is well suited for the dustbin of terrible ideas. It ranks right up there with the notion that the European colonization of Africa was for the sole purpose of civilizing the savages.
This is the 'encyclopedic imperialism' counterargument. I thought I'd throw it out there. As Bendt noted above, Google has already been working on it for two years and has had both success and failure. It bears mentioning that their tools have been improving quite steadily. A simple test such as /English -> Arabic -> English/ will show that.
Note that colonialism isn't the issue. It still remains, for example, a high priority to teach English in Africa, for the simple reason that language is almost entirely a tool for communication, and English is quite good for that purpose. It's notable that the smaller colonial powers such as the French were never going to be successful at linguistic imperialism in Africa, for the simple reason that French has not actually been the lingua franca for a long time now.
Key to the growth of Wikipedias in minority languages is respect for the cultures that they encompass, not flooding them with the First-World Point of View. What might be a Neutral Point of View on the English Wikipedia is limited by the contributions of English writers. Those who do not understand English may arrive at a different neutrality. We have not yet arrived at a Metapedia that would synthesize a single neutrality from all projects.
Shiju Alex,
Stevertigo is just one en.wikipedian.
As far as using exact copies goes, I don't know about the policy at your home wiki, but in many Wikipedias this sort of back-and-forth translation and trading and sharing of articles has been going on since day one, not just with English but with other languages as well. If I see a good article on any Wikipedia in a language I understand that is lacking in another, I'll happily translate it. I have never seen this cause problems provided I use proper spelling and grammar and do not use templates or images that leave red links.
I started out at en.wp in 2001, so I don't think it's unreasonable to call myself an English Wikipedian (although I'd prefer to think of myself as an international Wikipedian, with lots of edits at wikis such as Serbo-Croatian, Spanish, Navajo, Haitian and Moldovan). I am not at all in favor of pushing any sort of articles on anybody. If a community discusses and reaches consensus to disallow translations (even ones made by humans, including professionals), that is absolutely their right, although I don't think it's wise to disallow people from using material from other Wikipedias.
Google Translator Toolkit is particularly problematic because it messes up the existing article formatting (for example, it breaks internal links by putting punctuation marks before the double brackets when they should come after them), and it includes incompatible formatting such as redlinked templates. It also doesn't help that many editors don't stick around to fix their articles afterwards.
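The misplaced-punctuation problem described above is mechanical enough that a cleanup pass could catch the common cases before an article is saved. A minimal sketch in Python - the two patterns are my guesses at what the Toolkit emits, not documented behavior, and this is nowhere near a full wikitext parser:

```python
import re

def fix_link_punctuation(wikitext):
    """Heuristic cleanup for misplaced punctuation around [[...]] links:
    move a punctuation mark that landed just before an opening [[ to
    after the matching ]], and pull punctuation that ended up inside
    the link out behind it. Guessed patterns, not a wikitext parser."""
    # ",[[Paris]]" -> "[[Paris]],"
    wikitext = re.sub(r'([,.;:!?])\[\[([^\[\]]+)\]\]', r'[[\2]]\1', wikitext)
    # "[[Lyon.]]" -> "[[Lyon]]."
    wikitext = re.sub(r'\[\[([^\[\]]*?)([,.;:!?])\]\]', r'[[\1]]\2', wikitext)
    return wikitext
```

Links that are already well-formed pass through unchanged, so a pass like this could in principle run automatically on Toolkit output.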
-m.
Mark Williamson wrote:
Google Translator Toolkit is particularly problematic because it messes up the existing article formatting (one example, it messes up internal links by putting punctuation marks before double brackets when they should be after) and it includes incompatible formatting such as redlinked templates. It also doesn't help that many editors don't stick around to fix their articles afterwards.
The key word is "Toolkit". The formatting anomalies, like the strange errors in OCR, should be expected by those using machine translations. I have no objection to "pull" translations; these suggest that someone *in the target project* considered the topic worth including at the current stage of the project. Speedy-deleting these articles when they first appear would be too hasty. Fixing them up within a reasonable time after the machine translation is added is an important part of the contribution.
Ray
Yes, this is one of the main issues with the *Google Translator Toolkit* (GTTK). Ravi raised many points about GTTK in his presentation at Wikimania: http://docs.google.com/present/view?id=ddpg3qwc_279ghm7kbhs
Not all Wikipedias work and create articles the same way that the English Wikipedia or some of the other big Wikipedias do. Many of the active wiki communities do not like word-for-word translations of English Wikipedia articles. But that doesn't mean that, while developing an article, they won't refer to the English Wikipedia; its article is their first point of reference most of the time. The problem starts when someone starts forcing English Wikipedia articles into a language Wikipedia. Here it is Google, using the Google Translator Toolkit (GTTK).
Most of the active wiki communities (especially non-Latin-script wikis) are not interested in word-for-word translations of English Wikipedia articles. Many of them are also not willing to go through the big articles (with lots of issues) created using GTTK and rewrite them entirely to bring them into wiki style. They would rather start the article from scratch.
One of the main issues is that Google and its translators do not communicate with the wiki community of each language before starting the project on that Wikipedia. For example, the Tamil Wikipedia community came to know about Google's efforts only six months after the project started on that wiki.
Wiki communities like the organic growth of the articles in their wiki. Why didn't the English Wikipedia start by building articles from the *Encyclopedia Britannica 1911* edition, which was available in the public domain?
Personally, I am not against GTTK or against Google. At least this effort is good for the online presence of a language (even if some argue that it is not good for Wikipedia). But the effort needs to be executed in a different way, so that the Wikipedia of that language benefits from it. Some of the solutions that come to my mind:
1. Ban Google's project, as the Bengali wiki community did. (A bad solution, and I am personally against it.)
2. Ask Google to engage the wiki community (as happened in the case of Tamil) to find a working solution. But if there is no active wiki community, what can Google do? And does that mean Google can continue with the project as they want? (A very difficult solution if there is no active wiki community.)
3. Find some other solution. For example, is it possible to upload the translated articles into a separate namespace, for example "Google:"? Let the community decide what gets moved to the main/article namespace.
4. .........
If a solution is not found soon, Google's effort is going to create problems in many language Wikipedias. The worst result of this effort would be a rift between the wiki community and the Google translators (speakers of the same language) :(
Shiju
Shiju,
I think you have made some great suggestions here. I'd like to add a couple of my own:
1) Fix some of the formatting errors in GTTK. Would this really be so difficult? It seems to me that the breaking of links is a bug that needs fixing by Google.
2) Implement an automatic spelling and punctuation check within GTTK before articles are posted.
3) Have GTTK automatically remove broken templates and images, or require users to translate any templates before a page may be posted.
4) Include a list of most-needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation, since in my experience they tend to generate more interest from speakers of a language.
Three of these are things for Google to work on; one is something for us to work on. I think this is a potentially valuable resource; the problem is channeling the efforts and energies of these well-meaning people in the right direction so that local Wikipedias don't end up full of low-quality, unreadable articles with little hope for improvement. I'm curious to hear your thoughts.
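Point 3 - dropping templates that don't exist on the target wiki - is also scriptable on our side while we wait for Google. A naive sketch: the function below handles only flat {{...}} calls (no nesting, no case normalization), and the known-template set would have to come from the target wiki's own template namespace; both are illustrative assumptions:

```python
import re

def strip_unknown_templates(wikitext, known_templates):
    """Drop flat {{Template|...}} calls whose names are not in
    known_templates (an illustrative version of suggestion 3;
    nested templates are ignored for brevity)."""
    def repl(match):
        name = match.group(1).split('|')[0].strip()
        return match.group(0) if name in known_templates else ''
    return re.sub(r'\{\{([^{}]*)\}\}', repl, wikitext)
```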
-m.
On Tue, Jul 27, 2010 at 11:42 AM, Mark Williamson node.ue@gmail.com wrote:
- Include a list of most needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation since they tend to generate more interest from speakers of a language in my experience.
This list will come automatically if Google engages the wiki community of a particular language for its project. But some Wikipedias have no active wiki community, so how can this issue be solved there?
Selection of the articles for translation is an important part of this project. http://en.wikipedia.org/wiki/Wikipedia:Vital_articles is definitely a good choice for this. The community might also be interested in some other important articles (important with respect to the society, culture, or geography of the speakers of that language). So engaging the local wiki community is most important.
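Finding which of the vital articles are still missing in a given language can even be automated against the MediaWiki API: query `prop=langlinks` for each title and see whether a link to the target language comes back. A rough sketch - the helper works on the JSON shape that `api.php?action=query&prop=langlinks&format=json` returns (interwiki titles under the "*" key), and the sample response is made up for illustration:

```python
def missing_in_language(api_response, lang):
    """Given parsed JSON from an en.wp query with prop=langlinks,
    return the titles that have no interwiki link to `lang`,
    i.e. topics likely still missing in that language."""
    missing = []
    for page in api_response["query"]["pages"].values():
        if not any(l["lang"] == lang for l in page.get("langlinks", [])):
            missing.append(page["title"])
    return missing

# canned response for illustration: "Water" has a Tamil link, "Foo" doesn't
sample = {"query": {"pages": {
    "1": {"title": "Water", "langlinks": [{"lang": "ta", "*": "நீர்"}]},
    "2": {"title": "Foo"},
}}}
```

Run over the vital-articles list, this would give each community a concrete worklist instead of the random selection Mark describes.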
~Shiju
GT fails. At least for Japanese, it sucks, and that is why I don't support it. GT may fit SVO languages, but for SOV languages it is nothing but crap.
Imagine fixing a 4000-word document in which every line is some sort of "all your base are belong to us". It's not the simple matter of "spelling and punctuation" that you imagine. I admit it has improved (Free Tibet from English to Japanese is now "Furi Tibetto", not the "muryo Tibetto" (Tibet for gratis) of two years ago), but crap is still crap, and I don't want to spend my hours working for the for-profit giant.
Aphaia, Shiju Alex and I are referring to Google Translator Toolkit, not Google Translate. If the person using the Toolkit uses it as it was _meant_ to be used, the results should be as good as a human translation because they've been reviewed and corrected by a human.
-m.
On Tue, Jul 27, 2010 at 12:42 PM, Aphaia aphaia@gmail.com wrote:
GT fails. At least for Japanese, it sucks. And that is why I don't support it. GT may fit to SVO languages, but for SOV languages, it is nothing but a crap.
Imagine to fix a 4000 words of documents whose all lines are sort of "all your base is belong to us". It's not a simple thing as you imagine - "spelling and punctuation". I admit it has been improved (now Free Tibet from English to Japanese is "Furi Tibetto", not former "muryo tibetto" (Tibet for gratis) in two years ago - but craps are still craps and I don't want to spend my hours for the for-profit giant.
On Tue, Jul 27, 2010 at 6:42 PM, Mark Williamson node.ue@gmail.com wrote:
On Tue, Jul 27, 2010 at 1:36 AM, Shiju Alex shijualexonline@gmail.com wrote:
1. Ban the project of Google as done by the Bengali wiki community (Bad solution, and I am personally against this solution) 2. Ask Google to engage wiki community (As happened in the case of Tamil) to find out a working solution. But if there is no active wiki community what Google can do. But does this mean that Google can continue with the project as they want? (Very difficult solution if there is no active wiki community) 3. Find some other solution. For example, Is it possible to upload the translated articles in a separate name space, for example, Google: Let the community decides what needs to be taken to the main/article namespace. 4. .........
If some solution is not found soon, Google's effort is going to create problem in many language wikipedias. The worst result of this effort would be the rift between the wiki community and the Google translators (speakers of the same language) :(
Shiju
Shiju,
I think you have made some great suggestions here. I'd like to add a couple of my own:
- Fix some of the formatting errors with GTTK. Would this really be
so difficult? It seems to me that the breaking of links is a bug that needs fixing by Google. 2) Implement spelling and punctuation check automatically within GTTK before posting of the articles. 3) Have GTTK automatically remove broken templates and images, or require users to translate any templates before a page may be posted. 4) Include a list of most needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation since they tend to generate more interest from speakers of a language in my experience.
3 of these are things for Google to work on, one is something for us to work on. I think this is a potentially valuable resource, the problem is channeling the efforts and energies of these well-meaning people in the right direction so that local Wikipedias don't end up full of low-quality, unreadable articles with little hope for improvement. I'm curious to hear your thoughts.
-m.
-- KIZU Naoko http://d.hatena.ne.jp/Britty (in Japanese) Quote of the Day (English): http://en.wikiquote.org/wiki/WQ:QOTD
On Tue, Jul 27, 2010 at 3:44 PM, Mark Williamson node.ue@gmail.com wrote:
Aphaia, Shiju Alex and I are referring to Google Translator Toolkit, not Google Translate. If the person using the Toolkit uses it as it was _meant_ to be used, the results should be as good as a human translation because they've been reviewed and corrected by a human.
But if the program were being used by a human who speaks the language, wouldn't it be *pull* translation and not *push* translation?
Ah, I omitted a T; I meant the Toolkit. A toolkit filled with garbage can still be called a toolkit, but that doesn't change the fact that it's useless: it cannot deal with syntax properly (conjugation and so on) at this moment. Being intended to be "reviewed and corrected by a human" doesn't assure that it really was reviewed and corrected by a human to a sufficient extent. It could be enough for your target language, but not for mine. Thanks.
Aphaia wrote:
Ah, I omitted a T; I meant the Toolkit. A toolkit filled with garbage can still be called a toolkit, but that doesn't change the fact that it's useless: it cannot deal with syntax properly (conjugation and so on) at this moment. Being intended to be "reviewed and corrected by a human" doesn't assure that it really was reviewed and corrected by a human to a sufficient extent. It could be enough for your target language, but not for mine. Thanks.
I think then it's not just about the capabilities of the tool or the qualities of the language, but also the abilities of the human being who is counted on to "intervene" in the translation. As with Wikipedia editing generally, we don't really have a good mechanism to ensure that a given individual has a particular skill level, we rely on their mistakes being corrected by others. The only guarantee that the editor of an article understands its subject matter (or even, in this case, knows the language in which it is written) is for each of us to be aware of our own limitations.
It's quite likely that for some languages, current translation tools are not usable. It's possible that in some cases they never will be usable. Speakers of a given language should evaluate and decide for themselves. But it's certain that some people shouldn't be using these tools, if they're not doing enough to clean up the machine translation word salad. I know that I'd hesitate to use them in languages that I've studied but am not particularly fluent in, like Spanish or Italian (not that those Wikipedias need this kind of contribution from me anyway). If the tools are being used indiscriminately, it might be best to persuade people that they should work in areas they understand, not simply reject the tool outright.
--Michael Snow
Mass machine translation ("pushing" articles onto other projects that may or may not want them) is a very bad idea.
Beginning in 2004-05, a non-native speaker on en.wp decided that he should import slightly-cleaned babelfish translations of foreign language articles that did not have articles on the English wikipedia. They were almost uniformly horrid, and required many volunteer hours to clean up (I believe some were simply deleted). The user had to be restrained from importing additional articles in this manner.
I would not want to impose cleanup jobs upon users who did not volunteer for them. In other words, I think the "pull" method of translation is not a bug--it's a feature--it ensures that a competent native speaker is willing and able to satisfactorily port an article to a different language.
That said, if someone wants to develop a tool to aid "pullers" by making a first-pass translation of the wikitext of an article, I think that would be an unambiguously good thing.
Frank
Having tried it tonight, I don't find the Google translator toolkit all that useful, at least not at this present level of development. To sum up:
First you read their translation.
Then you scratch your head: What the deuce is that supposed to mean ...?
Then you check the original language version.
Then you compare the two.
Then you start wondering: How did *this* turn into *that*?
Then you shake your head.
(Note: everything up to this point is unproductive time.)
Then you look at the original again and try to translate it.
As you do, you invariably end up leaving the Google shite where it is and writing your own text.
In the end, you delete the Google shite, and then, as you do so, you kick yourself because there were two words in there that you needn't have typed yourself.
Epic fail.
A.
On Friday 30 July 2010 02:31:44, Andreas Kolbe wrote:
Having tried it tonight, I don't find the Google translator toolkit all that useful, at least not at this present level of development. To sum up:
Interestingly, I have had a completely opposite experience. When reading a Google translation, it is easy for me to decipher what it means even if it is not grammatically correct. When translating, I often get stuck deciding what sentence structure to use, or remembering how a specific word translates. GTT solves both problems. My estimate is that I retain half and rewrite half of every sentence it produces.
Nikola Smolenski smolensk@eunet.rs wrote:
Interestingly, I have had a completely opposite experience. When reading a Google translation, it is easy for me to decipher what it means even if it is not grammatically correct. When translating, I often get stuck deciding what sentence structure to use, or remembering how a specific word translates. GTT solves both problems. My estimate is that I retain half and rewrite half of every sentence it produces.
Good points. The real purpose and functionality of GTT and MTT is not just to serve as a basis for more formal translations, but to give people a quick look into what people in another language are saying. If people learn to write more formally and less idiomatically in their own respective languages, they can expect that at least the basic gist of their writing will be readable by almost anyone.
-SC
--- On Sat, 31/7/10, Nikola Smolenski smolensk@eunet.rs wrote:
Interestingly, I have had a completely opposite experience. When reading a Google translation, it is easy for me to decipher what it means even if it is not grammatically correct. When translating, I often get stuck deciding what sentence structure to use, or remembering how a specific word translates. GTT solves both problems. My estimate is that I retain half and rewrite half of every sentence it produces.
I'm afraid that if that is how you proceed, you are already up the creek without a paddle. You say, "When reading a Google translation, it is easy for me to decipher what it means even if it is not grammatically correct."
If you are translating, you should not be able to decipher what the Google output means, you should be able to decipher what the *original* says, *from looking at the original*.
Because the Google Translator Toolkit at times translates "there is one such system" as "there is no such system", or translates "A is governed by B" as "A governs B". Don't ask me why; it does, even in mainstream language pairs like English and German. I shudder to think what it does in Hindi and Tamil.
So when you are working on a text about maths, or physics, that is supposed to go into an encyclopedia, deducing the meaning of the original from the Google translation is really quite fatal.
And to someone who is fully fluent in the source language, and wants to compose a text in the target language, Google Translator Toolkit is, at present, worthless. Word-processing the Google output to arrive at a readable, written text creates more work than it saves.
Remember, we are talking translation here: that means composing a well-written, correctly formatted text for others to read. We are not talking about "figuring out what it probably means."
If Google want to build up their translation memory, I suggest they pay publishers for permission to analyse existing, published translations, and read those into their memory. This will give them a database of translations that the market judged good enough to publish, written by people who (presumably) understood the subject matter they were working in.
This seems a much better idea than paying for and collecting memories from haphazard Wikipedia translations done by amateurs, which, judging by the feedback from the relevant Wikipedia communities, are garbage. Why feed that garbage into the system?
There should be alarm bells ringing at Google here.
A.
Andreas Kolbe, 07/08/2010 02:23:
If Google want to build up their translation memory, I suggest they pay publishers for permission to analyse existing, published translations, and read those into their memory. This will give them a database of translations that the market judged good enough to publish, written by people who (presumably) understood the subject matter they were working in.
Good idea. The EU would be a good start; see their great dictionary/internal translation memory, one of the most useful language tools out there: http://iate.europa.eu/iatediff/ But I suppose they won't release such data to one company. We may ask them to release a database of paired translations under a copyleft license, use it to improve free translation memories, and suggest that Google adopt a free license.
Currently, if you want to translate something, machine translation is useful only for translating a bunch of /words/ (especially nouns and adjectives) at once, if you don't want to check a dictionary multiple times to remember their general meanings. And anyway, on GT or GTT you won't find many of the translations that you can find e.g. on IATE, so it's often completely useless even for this "simple" task.
Nemo
On 08/07/2010 02:23 AM, Andreas Kolbe wrote:
Word-processing the Google output to arrive at a readable, written text creates more work than it saves.
This is where our experience differs. I'm working faster with the Google Translator Toolkit than without.
If Google want to build up their translation memory, I suggest they pay publishers for permission to analyse existing, published translations, and read those into their memory. This will give them a database of translations that the market judged good enough to publish, written by people who (presumably) understood the subject matter they were working in.
If we forget Google for a while, this is actually something that we could do on our own. There are enough texts in Wikisource (out of copyright books) that are available in more than one language. In some cases, we will run into old spelling and use of language, but it will be better than nothing. The result could be good input to Wiktionary.
Here is the Norwegian original of Nansen's Eskimoliv, http://no.wikisource.org/wiki/Indeks:Nansen-Eskimoliv.djvu
And here is the Swedish translation, both from 1891, http://sv.wikisource.org/wiki/Index:Eskim%C3%A5lif.djvu
Norwegian: Grønland er paa en eiendommelig vis knyttet til vort land og folk.
Swedish: Grönland är på ett egendomligt sätt knutet till vårt land och vårt folk.
As you can see, there is one difference already in this first sentence: The original ends "to our country and people", while the translation ends "to our country and our people".
Is there any good free software for aligning parallel texts and extracting translations? Looking around, I found NAtools, TagAligner, and Bitextor, but they require texts to be marked up already. Are these the best and most modern tools available?
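For a sense of what such aligners do internally, here is a toy length-based sentence aligner in the spirit of Gale and Church's 1993 method, sketched in Python. It is illustrative only: the function name, the skip penalty, and the restriction to 1-1 matches (real aligners also handle 2-1 and 1-2 merges and use a proper probabilistic length model) are choices made for this example, not taken from NAtools, TagAligner, or Bitextor.

```python
def align(src, tgt, skip_penalty=3.0):
    """Align two lists of sentences by character length.

    Dynamic programming over 1-1 matches and skips; a match is scored
    by the relative difference in sentence length, a skip by a fixed
    penalty. Returns a list of (src_index, tgt_index) pairs.
    """
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 match
                c = cost[i][j] + (abs(len(src[i]) - len(tgt[j]))
                                  / max(len(src[i]), len(tgt[j]), 1))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = c
                    back[i + 1][j + 1] = (i, j, True)
            if i < n and cost[i][j] + skip_penalty < cost[i + 1][j]:
                cost[i + 1][j] = cost[i][j] + skip_penalty  # skip a source sentence
                back[i + 1][j] = (i, j, False)
            if j < m and cost[i][j] + skip_penalty < cost[i][j + 1]:
                cost[i][j + 1] = cost[i][j] + skip_penalty  # skip a target sentence
                back[i][j + 1] = (i, j, False)
    # Backtrack from the end, collecting the matched pairs.
    pairs, (i, j) = [], (n, m)
    while back[i][j] is not None:
        pi, pj, matched = back[i][j]
        if matched:
            pairs.append((pi, pj))
        i, j = pi, pj
    return pairs[::-1]
```

On the Nansen pair above, the two opening sentences would align 1-1; the skip moves let the aligner step over sentences that were added or dropped in translation.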
--- On Sun, 8/8/10, Lars Aronsson lars@aronsson.se wrote:
This is where our experience differs. I'm working faster with the Google Translator Toolkit than without.
Whether "faster" or not is a function of a number of variables:
- How well do you know the languages and subject matter concerned? Do you rely on GTTK to give you some words you would otherwise have to look up? Do you trust GTTK in these cases, or do you double-check?
- How important is it to you that the target text reads well? How many compromises in terms of style and readability are you prepared to make, in order to be able to keep more Google output unchanged?
- How fast can you type? Is it faster for you to type five words at the current cursor position, or is it faster for you to move your cursor position, take existing words in the GTTK output into the clipboard, and move them from the place where they are to the place where they belong?
- What kind of text are you dealing with? How complex is the sentence structure? GTTK will do better on simple sentences: "Take two eggs. Add 200g of flour and 150g of margarine. Mix everything in a bowl, using an electric mixer." However, the typical WP article does not have such simple sentence structures.
In the tests I made, I found I delete more than half, and even what I keep often has to be moved to a different place, does not have the right word order, or has the wrong inflectional endings, making it quicker for me to type it from scratch than to copy it to the right place.
You won't find many professional translators using GTTK for their work.
Where WP communities have complained about poor sentence structure and stilted expressions, this is likely due to translators editing the GTTK output as little as possible, so they can do more words per day, and, frankly, earn more money.
If they are paid according to the number of words by Google, rather than the end customer, there is very little incentive for them to produce a text that the end customer will like, because the end customer isn't the one who's paying them.
A.
--- On Sun, 8/8/10, Mark Williamson node.ue@gmail.com wrote:
You won't find many professional
translators using GTTK for their work.
[citation needed]
Professional translators and translation agencies tend to use systems like Trados or Wordfast, building their own translation memories relevant to the (repetitive) work they are doing. Machine translation of any kind, as included in GTTK, is not much used, except in niches.
Another thing: GTTK, even if it were more developed than it is, would not be an option for most professionals, because using it would involve a breach of client confidentiality.
Here is an illustrative article that may be of interest:
http://www.proz.com/translation-articles/articles/270/1/Reflections-of-a-Hum...;
It includes an example of a machine translation "obtained from an online search service offering among other things machine translation to its customers". Compare the finished text against the initial output.
Here are some discussions from translator forums: http://urd.proz.com/forum/poll_discussion/155006-poll:_have_you_tried_google...
http://bel.proz.com/forum/internet_for_translators/137161-attention_all_tran...
A.
I read that thread and noticed a lot of confusion. One translator admitted she never even tried it, but still had lots of negative stuff to say; more than one person said they found it useful (see Esperantisto's response), and other people seemed to not realize there was a difference between Google Translate and Google Translator Toolkit.
As I said before, I make my living as a translator, these are not foreign concepts to me. I use GTTK every day in my work and find it to be very useful. Perhaps you don't find it useful, alright, but some of us have used it to increase efficiency.
-m.
I read that thread and noticed a lot of confusion. One translator admitted she never even tried it, but still had lots of negative stuff to say; more than one person said they found it useful (see Esperantisto's response), and other people seemed to not realize there was a difference between Google Translate and Google Translator Toolkit.
GTTK allows you to create your own translation memories, much like Trados or Wordfast. If there is nothing in your memory to correspond, however, you get pretty much the same translation that you get in Google Translate.
In that sense, Google Translate gives us a good indication of what Google's translators get when they start on a Wikipedia article.
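To make that distinction concrete, a translation memory with machine-translation fallback can be sketched in a few lines. This is an illustrative toy, not Google's implementation; the function name, the similarity measure, and the 0.75 threshold are invented for the example.

```python
import difflib

def translate(sentence, memory, machine_translate, threshold=0.75):
    """Look up `sentence` in a translation memory; fall back to MT.

    `memory` maps previously translated source sentences to their
    human-approved translations. If the best fuzzy match scores at
    least `threshold`, reuse it; otherwise call the MT engine.
    """
    best, best_score = None, 0.0
    for src, tgt in memory.items():
        score = difflib.SequenceMatcher(None, sentence, src).ratio()
        if score > best_score:
            best, best_score = tgt, score
    if best_score >= threshold:
        return best  # reuse the memory entry (a "fuzzy match")
    return machine_translate(sentence)  # nothing corresponds: raw MT
```

With an empty or unrelated memory, every sentence falls through to the machine-translation branch, which is exactly the "same as Google Translate" situation described above.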
You can all try this: go to a random Japanese or German or Hindi WP article, and paste the text into Google Translate to have it translated into your language.
This is what the translator will have to start from.
A.
P.S. Chatting to Mark off-list about GTTK, and having experimented with other languages, it appears that GTTK quality varies widely depending on the language pair, and probably the source/target direction.
German, Hindi and Japanese are definitely handled poorly; some other language combinations seem to do much better.
A.
On Sun, Aug 8, 2010 at 2:10 PM, Lars Aronsson lars@aronsson.se wrote:
... Is there any good free software for aligning parallel texts and extracting translations? Looking around, I found NAtools, TagAligner, and Bitextor, but they require texts to be marked up already. Are these the best and most modern tools available?
There is a MediaWiki extension which is supposed to provide this:
http://wikisource.org/wiki/Wikisource:DoubleWiki_Extension
It is enabled on all wikisource subdomains.
http://en.wikisource.org/wiki/Crito?match=el
It doesn't work very well because our Wikisource projects have different layouts, esp. templates such as the header on each page.
-- John Vandenberg
On Wed, Jul 28, 2010 at 7:26 AM, Michael Snow wikipedia@verizon.net wrote:
It's quite likely that for some languages, current translation tools are not usable. It's possible that in some cases they never will be usable. Speakers of a given language should evaluate and decide for themselves. But it's certain that some people shouldn't be using these tools, if they're not doing enough to clean up the machine translation word salad. I know that I'd hesitate to use them in languages that I've studied but am not particularly fluent in, like Spanish or Italian (not that those Wikipedias need this kind of contribution from me anyway). If the tools are being used indiscriminately, it might be best to persuade people that they should work in areas they understand, not simply reject the tool outright.
True, but isn't this thread about "pushing" articles with machine translation? That implies having others clean it up, not having people "work in areas they understand" as you suggested, so I'd like to point out that it should never happen, at least at this moment.
I don't oppose node_ue or others using those Google products for their own purposes (that's up to them anyway), but if they recommend them (either Google Translator Toolkit or Google Translate), I would like to stress that they are no snake oil for every language at this moment; for people like stevertigo, who think Google Translate is good enough, it's quite the opposite of the truth. It may happen to work in some cases, but generally, cleaning up Google Translate output is nothing to recommend to volunteers. Note that even Google themselves don't use plain Google Translate for their Wikipedia translation project.
Cheers,
Yes, of course if it's not actually reviewed and corrected by a human it's going to be bad. What I said was that if it's used "as it was meant to be used", the results should be indistinguishable from a normal human translation, regardless of the language involved because all mistakes would be fixed by a person. People often neglect to do that, but that doesn't make the tool inherently evil.
-m.
Is anyone from Google reading this thread?
Because of this thread I tried to play with the Google Translator Toolkit a little and found some technical problems. When I tried to send bug reports about them through the "Contact us" form, I received after a few minutes a "bounce" message from the translation-editor-support@google.com address.
I love reporting bugs, and developers are supposed to love reading them, but it looks like I'm stuck here...
--
אָמִיר אֱלִישָׁע אַהֲרוֹנִי Amir Elisha Aharoni
"We're living in pieces, I want to live in peace." - T. Moore
Google is, in my experience, very difficult for "regular" people to get in touch with. Sometimes, when a product is in beta, they give you a way to contact them. They used to have an e-mail to contact them at if you had information about bilingual corpora (I found one online from the Nunavut parliament for English and Inuktitut, but now it looks like they've removed the address) so they could use it to improve Google Translate.
I think they intentionally have a relatively small support staff. I read somewhere that that had turned out to be a huge problem for the mobile phone they produced - people might not expect great support for a huge website like Google, but when they buy electronics, they certainly do expect to have someone they can call and talk to within 24 hours.
I don't think that's completely unwise, though. I'm sure they get tons of crackpot e-mails all the time. I was reading an official blog about Google Translate, and in the post about their Wikipedia contests, someone wrote an angry comment that google "must hate Spain" because the Spanish language wasn't mentioned in that particular post. Now multiply that by millions, and that is part of the reason (or so I imagine) that Google makes it difficult to contact them.
-m.
2010/7/29 Mark Williamson node.ue@gmail.com
I don't think that's completely unwise, though. I'm sure they get tons of crackpot e-mails all the time. I was reading an official blog about Google Translate, and in the post about their Wikipedia contests, someone wrote an angry comment that google "must hate Spain" because the Spanish language wasn't mentioned in that particular post. Now multiply that by millions, and that is part of the reason (or so I imagine) that Google makes it difficult to contact them.
Bugzilla is difficult enough :) The thing that I love most about Free Software projects is that when you report a bug, you know exactly where it went.
At OTRS a small bunch of volunteer Wikipedians deals quite successfully with tons of pretty delusional email. Surely Google could at least provide an email address that doesn't bounce.
This problem may be one of the things that makes this translation project misunderstood.
My 2c:
- I don't know where everyone came up with the notion that the tool produces good results. Most of the articles from both of Google's projects on the Arabic Wikipedia are barely intelligible, with broken sentences and weird terminology, and can generally be spotted right away (see my reply to the other thread).
- Even if GTTK is improving, the idea of "push" contradicts GTTK. Push means that someone with no knowledge of Arabic will 'push' the en.wp (or any other) article to ar.wp. GTTK supposedly requires a translator who will revise and rephrase what the machine translation couldn't handle.
- NPOV falls victim to systemic bias, on en.wp or any other wiki - if not in the representation of differences of opinion, then in the arrangement of the article and the highlighting and ordering of different events. The wording of paragraphs also usually gravitates towards a Western way of neutral expression, which may be considered biased when read by someone whose first language is not English.
- Let's suppose none of the above mattered, and that GTTK works perfectly fine. If this idea is taken to the extreme, it would be: take the largest x Wikipedias, clone all articles to language x, wash, rinse, repeat. Where is the community? Where is the involvement and exchange of ideas and continuous evolution of articles? Where's the wiki in Wikipedia?
- I see it as POV to assume that wiki x has the 'perfect' article on a certain subject such that everyone in the world needs to read that version only.
Muhammad Yahia shipmaster@gmail.com wrote:
Where is the community? Where is the involvement and exchange of ideas and continuous evolution of articles? Where's the wiki in Wikipedia?
- I see it as POV to assume that wiki x has the 'perfect' article on a certain subject such that everyone in the world needs to read that version only.
You raise some important points. The understanding of community-building is a key insight, and I think everyone here appreciates it. I understand that it's not enough to just "push" an article over to Swahili, for example, if 1) the translation is not sufficiently understandable on its own, if 2) the receiving language editors aren't practiced in how to handle such content, or if 3) the sender doesn't leave a note in a "lingua franca" explaining what its purpose is.
Note that the idea is not that only English language articles will "push" over to other languages - the idea is that other languages may have articles about topics which could be "pushed" over to English as well.
There has been perhaps a natural sense that there is some kind of encyclopedic imperialism inherent to the idea. There is also an assumption among many here that language is always relativistic and of cultural essence, or that English Wikipedia's 3.4 million articles are just out of the scope of relevance to other languages. I'd like to dispel those notions, but that would be out of scope for this particular email.
-SC
On 27 July 2010 09:36, Shiju Alex shijualexonline@gmail.com wrote:
Wiki communities like the biological growth of Wikipedia articles in their wiki. Why did English Wikipedia not start building its articles using the *Encyclopedia Britannica 1911* edition, which was available in the public domain?
Er, are you saying that it didn't, and that it not doing so proves your point?
Because such a statement is factually inaccurate - en:wp *did* use the 1911EB as starter material.
- d.
On Tue, Jul 27, 2010 at 11:42 AM, David Gerard dgerard@gmail.com wrote:
On 27 July 2010 09:36, Shiju Alex shijualexonline@gmail.com wrote:
Wiki communities like the biological growth of Wikipedia articles in their wiki. Why did English Wikipedia not start building its articles using the *Encyclopedia Britannica 1911* edition, which was available in the public domain?
Er, are you saying that it didn't, and that it not doing so proves your point?
Because such a statement is factually inaccurate - en:wp *did* use the 1911EB as starter material.
At least for some entries...
http://en.wikipedia.org/wiki/Special:WhatLinksHere/Template:1911
On Tue, Jul 27, 2010 at 8:42 PM, David Gerard dgerard@gmail.com wrote:
Because such a statement is factually inaccurate - en:wp *did* use the 1911EB as starter material.
...and for [[Accius]], with 150 views per month, not even a single word has been added in three years.
-- John Vandenberg
stevertigo wrote:
Mark Williamson node.ue@gmail.com wrote:
I would like to add to this that I think the worst part of this idea is the assumption that other languages should take articles from en.wp.
The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'
Suppose for a minute that your proposal were implemented, and all the machine translation problems were overcome. Would English NPOV be so good that community members in the target language would be incapable of making substantive improvements? And if they did make substantive change, how would you reconcile the divergence when both versions were subsequently edited?
Ray Saintonge saintonge@telus.net wrote
Key to the growth of Wikipedias in minority languages is respect for the cultures that they encompass, not flooding them with the First-World Point of View. What might be a Neutral Point of View on the English Wikipedia is limited by the contributions of English writers. Those who do not understand English may arrive at a different neutrality. We have not yet arrived at a Metapedia that would synthesize a single neutrality from all projects.
I strongly disagree. Neutral point of view has worked on en.wp because it's a universalist concept. The cases where other language wikis reject English content appear to stem from POV, and thus a violation of NPOV, not because - as you seem to suggest - the POV in such countries must be considered "NPOV."
I'm disinclined to accept your universalist conjecture. It sounds too much like intelligent design for linguistics. When I visit the bookstores in another country I am struck by the difference in emphasis that they put on different topics. This alone is bound to lead to different neutralities.
Ray
Ray Saintonge saintonge@telus.net wrote:
Suppose for a minute that your proposal were implemented, and all the machine translation problems were overcome. Would English NPOV be so good that community members in the target language would be incapable of making substantive improvements? And if they did make substantive change, how would you reconcile the divergence when both versions were subsequently edited?
It's a good question, Ray. I'll avoid treating it as a technical one, because there is no simple technical solution for it. Suffice it to say that if another language has an article which en.wp lacks, then there should be no problem with them "pushing" it to en.wp.
There are a couple of implicit assumptions in your term "English NPOV": 1) that English NPOV is some kind of acceptable POV localized to English-language contexts, or 2) that there are qualitative differences in NPOV depending on the language being used. It may be interesting to see if we can do some testing for NPOV on those languages which Google can translate. We could pick a field of articles, read them over for completeness, etc., and grade them for neutrality.
I'm disinclined to accept your universalist conjecture. It sounds too much like intelligent design for linguistics. When I visit the bookstores in another country I am struck by the difference in emphasis that they put on different topics. This alone is bound to lead to different neutralities.
Careful with cultural relativity: There is a difference between "emphasis" and "meaning." If you look over time at a single bookstore in your own country, you will find emphasis rapidly changing based on the local mood or interests. There's no way one can expect such things as mood and interest to be correlated.
There are a couple of implicit notions in this idea of a universal NPOV: that people are in fact intelligent, and that regardless of the language system they use, they can design their articles according to NPOV. I don't see what's controversial about that, or how different cultures require "different neutralities."
-SC
Hello all,
I am a heavy translator on Wikimedia projects. I would say more than 95% of my content contributions are translations. But I am against blind translation. For example, I would mostly translate British or North American related content from en-wp to zh-wp, and not from other languages. I would translate German or general European related articles mostly from de-wp to zh-wp. French related articles I try to translate from fr-wp to zh-wp; I usually test beforehand that the language in the article is comprehensible enough for me, since my French is by far not at the level of my German or English. But I would never translate a Japan related article, because I don't quite trust what German or other English-speaking people write about Japan. I never translate a China related article, because I am quite sure that at some point the context would be incorrect. Even if there is no article in zh-wp about that particular entity, I would rather wait for a zh-wp user with more understanding to start an article on it. (Or, occasionally, I would do the research myself and start an article from scratch.)
Natural science related articles are sort of universal, although reading through different languages you can also find tremendous differences between language versions here. I used to select the (in my subjective opinion) better version (before 2006 more often en-wp; in the last two or three years de-wp and en-wp about half and half), or I would combine both versions.
Conversely, I would only translate China related articles from zh-wp into de-wp, because I think the deficiency in de-wp is biggest in this area. And some zh-wp articles, especially the excellent articles, are really very well written, researched, and sourced. I don't see any sense in translating a Germany related article from zh-wp to de-wp.
Translation should never be blind. For example, en-wp articles about American towns and cities can get very detailed, down to the neighbourhoods. I mostly omit the overly detailed parts, first because I would be doing original research by inventing Chinese transcriptions for the names, and second because I think they are not really interesting for a Chinese reader. You can also find all kinds of errors in the original article. If you are doing a blind translation, in the best case the errors remain in the original article; in the worst case you propagate those errors into language versions that may have no resources to check those facts. For example, in the article I am currently working on, [[:en:Mumbai culture]], there was the error that the city has three World Heritage Sites (actually there are only two on its territory). Most errors are wrong links, where articles got moved or disambiguated and the links didn't get corrected. For example, in the article mentioned above, in the Cinema section, Multiplex was linked to [[:en:Multiplexing]]; the correct link should be [[:en:Multiplex (movie theater)]]. Such errors are quite common.
Doing a more careful and considered translation can help both language versions, more than simply translating.
Greetings Ting
I've noticed that many English Wikipedia articles cite only English-language sources, even when the topics concern the non-English world. And normally, especially in the developing world, the most comprehensive sources are found in local languages - how can those articles be assured of NPOV when they ignore the majority of reliable sources?
Your logic simply looks flawed to me.
And Google translation still fails, even after being "steadily" improved.
On Mon, Jul 26, 2010 at 11:43 AM, stevertigo stvrtg@gmail.com wrote:
Mark Williamson node.ue@gmail.com wrote:
I would like to add to this that I think the worst part of this idea is the assumption that other languages should take articles from en.wp.
The idea is that most of en.wp's articles are well-enough written, and written in accord with NPOV to a sufficient degree to overcome any such criticism of 'imperial encyclopedism.'
Mark Williamson node.ue@gmail.com wrote:
Nobody's arguing here that language and culture have no relationship. What I'm saying is that language does not equal culture. Many people speak French who are not part of the culture of France, for example the cities of Libreville and Abidjan in Africa.
Africa is an unusual case, given that it was so linguistically diverse to begin with, and that it's even more so in the post-colonial era, when Arabic, French, English, and Dutch remain prominent marks of imperialistic influence.
Ray Saintonge saintonge@telus.net wrote:
This is well suited for the dustbin of terrible ideas. It ranks right up there with the notion that the European colonization of Africa was for the sole purpose of civilizing the savages.
This is the 'encyclopedic imperialism' counterargument. I thought I'd throw it out there. As Bendt noted above, Google has already been working on it for two years and has had both success and failure. It bears mentioning that their tools have been improving quite steadily. A simple test such as /English -> Arabic -> English/ will show that.
Note that colonialism isn't the issue. It still remains, for example, a high priority to teach English in Africa, for the simple reason that language is almost entirely a tool for communication, and English is quite good for that purpose. It's notable that smaller colonial powers such as the French were never going to be successful at linguistic imperialism in Africa, for the simple reason that French has not actually been the lingua franca for a long time now.
Key to the growth of Wikipedias in minority languages is respect for the cultures that they encompass, not flooding them with the First-World Point of View. What might be a Neutral Point of View on the English Wikipedia is limited by the contributions of English writers. Those who do not understand English may arrive at a different neutrality. We have not yet arrived at a Metapedia that would synthesize a single neutrality from all projects.
I strongly disagree. Neutral point of view has worked on en.wp because it's a universalist concept. The cases where other language wikis reject English content appear to stem from POV, and thus a violation of NPOV, not because - as you seem to suggest - the POV in such countries must be considered "NPOV."
Casey Brown lists@caseybrown.org wrote:
I'm surprised to hear that coming from someone who I thought to be a student of languages. I think you might want to read an article from today's Wall Street Journal, about how language influences culture (and, one would extrapolate, Wikipedia articles).
I had just a few days ago read Boroditsky's piece in Edge, and it covers a lot of interesting little bits of evidence. As Mark was saying, linguistic relativity (or the Sapir-Whorf hypothesis) has been around for most of a century, and its wider conjectures were strongly contradicted by Chomsky et al. Yes, there is compelling evidence that language does "channel" certain kinds of thought, but this should not be overstated. Like other sciences, linguistics can sometimes make the mistake of drawing *qualitative* judgments from a field of *quantitative* evidence. This was especially important back in the 40s and 50s, when people were still putting down certain quasi-scientific conjectures from the late 1800s.
Still, there are cultures which claim their languages to be superior in certain ways simply because they are more sonorous, emotive, or otherwise expressive, and that's the essential paradigm that some linguists are working in.
-SC
Aphaia, 27/07/2010 21:33:
I've noticed that many English Wikipedia articles cite only English-language sources, even when the topics concern the non-English world. And normally, especially in the developing world, the most comprehensive sources are found in local languages - how can those articles be assured of NPOV when they ignore the majority of reliable sources?
It's not only a matter of NPOV. There's even a policy for this: http://en.wikipedia.org/wiki/Wikipedia:Verifiability#Non-English_sources Obviously you can expect other language versions to want the same for their languages.
Nemo
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
Quoting some suggestions from Mark earlier in the thread:
1) Fix some of the formatting errors with GTTK. Would this really be so difficult? It seems to me that the breaking of links is a bug that needs fixing by Google.
We're working on various formatting errors based on our conversations with members of the Tamil and Telugu Wikipedia. We're hoping to push those out soon (in the coming weeks).
2) Implement spelling and punctuation check automatically within GTTK before posting of the articles.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What's the proposal, though - would you like for us to prevent publishing of articles if they have too many spelling errors, or simply warn the user that there are X spelling errors? Any input you can provide on preferred behavior would be great.
3) Have GTTK automatically remove broken templates and images, or require users to translate any templates before a page may be posted.
Templates are a bit tricky. Sometimes a template in one Wikipedia does not exist in another Wikipedia. Other times, a template in one language maps to a template in another language, but the parameters are different.
Removing broken templates automatically may not work because some templates come between words. If we remove them, the sentences or paragraph may become invalid. We've also considered creating a custom interface for localizing templates, but this requires a lot of work.
In the interim, the approach we've taken is to have translators fix the templates in Wikipedia when they post the article from Translator Toolkit. When a user clicks on Share > Publish to source page in Translator Toolkit, the Wikipedia article is in preview mode --- it's not live. The idea is that if there are any errors, the translator can fix them before saving the article.
4) Include a list of most needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation since they tend to generate more interest from speakers of a language in my experience.
The articles we selected actually weren't really random. Here's how we selected them:
1. We looked at the top Google searches in the region (e.g., for Tamil, we looked at searches in India and, I believe, Sri Lanka as well).
2. From the top Google searches in the region, we looked at the top clicked Wikipedia articles --- regardless of language (so we wound up with Wikipedia source articles in English, Hindi, and other languages).
3. From the top clicked Wikipedia articles, we looked for articles that were either stubs or unavailable in the local language - these are the articles that we sent for translation.
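For illustration only, the three-step selection heuristic above could be sketched roughly as follows. All names, thresholds, and data structures here are invented; Google's actual pipeline is not public.

```python
# Hypothetical sketch of the three-step article selection described above.
# The stub threshold and all data structures are invented for illustration.

STUB_THRESHOLD = 2000  # bytes; assumed cutoff for calling an article a "stub"

def select_for_translation(top_searches, clicked_articles, local_wiki):
    """Return source articles whose local-language version is a stub or missing."""
    candidates = []
    for query in top_searches:                          # 1. top regional searches
        for title in clicked_articles.get(query, []):   # 2. top clicked Wikipedia results
            local = local_wiki.get(title)
            # 3. keep articles that are stubs or unavailable locally
            if local is None or local["size"] < STUB_THRESHOLD:
                candidates.append(title)
    return candidates

# Toy data: one query, two clicked articles, one already well covered locally.
searches = ["cricket"]
clicked = {"cricket": ["Cricket", "Sachin Tendulkar"]}
local = {"Cricket": {"size": 15000}}  # well-developed local article

print(select_for_translation(searches, clicked, local))
# prints ['Sachin Tendulkar']
```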
This selection isn't perfect. For example, it assumes that the top clicked Wikipedia articles by all users in India/Sri Lanka --- who may be searching in English, Hindi, Tamil, or some other language --- are relevant to the Tamil community. To improve this, last month we met with members of the Tamil and Telugu Wikipedias to refine the article selection. The main changes that we agreed on were:
1. The local Wikipedia community gives Google the final OK on which articles should or should not be translated.
2. The local Wikipedia community can add articles to Google's list.
3. The local Wikipedia community can suggest titles for the articles.
4. Google's translators will post the articles under their user names, and they will monitor community feedback on their user pages until the translation meets the community's standards.
We're just getting started on this new process, and we'll keep refining this with the Tamil and Telugu communities as we move forward. If it's successful, we'll use it as our template for other projects.
As always, any feedback or suggestions are welcome. Also, while I plan to follow the foundation-l list periodically, if you have bugs, you can also file them in our bug queue: translator-toolkit-support at google.com. While the eng team may not monitor this list, they do look at the support queue.
Regards,
Mike
On Wed, Aug 4, 2010 at 5:17 PM, Federico Leva (Nemo) nemowiki@gmail.comwrote:
Aphaia, 27/07/2010 21:33:
I've noticed that many English Wikipedia articles cite only English-language sources, even when the topics concern the non-English world. And normally, especially in the developing world, the most comprehensive sources are found in local languages - how can those articles be assured of NPOV when they ignore the majority of reliable sources?
It's not only a matter of NPOV. There's even a policy for this: http://en.wikipedia.org/wiki/Wikipedia:Verifiability#Non-English_sources Obviously you can expect other language versions to want the same for their languages.
Nemo
On 08/05/2010 03:12 PM, Michael Galvez wrote:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
This is an unusual and most welcome step for Google. When I first learned about GTTK in June 2009, I used it to translate a handful of articles from English to Swedish. I'm glad that it's now also possible to translate into English, but some of the errors are still there.
It's a great tool, and should be used more. We have a common interest in improving it. But for this, Wikipedians need feedback. Which language pairs are most active? Which words or phrases does GTTK find problematic, and can we somehow improve that? Google could benefit so much from collaborating with Wikipedians. Ultimately, Google could share some translation dictionaries, so we could include them in Wiktionary, the free dictionary.
Users of Gmail or Google Apps want their privacy, but users who translate Wikipedia articles are already sharing their results, so Google could help us find each other and collaborate. Translations that start from a Wikipedia article could by default be put in a shared pool where other Wikipedians can find them.
To some details:
I need a way to mark in the original text that a phrase is a quote, book title, or proper noun that shouldn't be translated, but copied literally. And in the statistics, those words should not be counted as untranslated and block me from publishing the result. Optimally, GTTK would learn over time where such literal phrases occur, e.g. text in italics under the ==Bibliography== section.
English ==References== corresponds to Swedish ==Källor==, even though the two words are not direct translations. GTTK was pretty quick to pick this up. However, the different styles we use for the opening paragraph of biographic articles, using parenthesis around the birth and death dates in the English Wikipedia, but not in some other languages, is something GTTK has not yet learned.
Categories should not be translated, but GTTK should follow the interwiki links for categories. If none exist, perhaps suggest a parent category.
Even for articles that already exist in the target language, we often need to translate another section. For example, the Swedish Wikipedia might have an article about Afghanistan with a good section about its geography, but the history section needs improvement, and could be translated from another language. The work-around is to begin a translation of the whole article, but only translate the relevant part and then cut-and-paste into the target without submitting through GTTK. Perhaps GTTK could bring up both articles side by side and suggest which sections are in most dire need of improvement?
Hi Lars,
Thanks for the detailed feedback. Some comments inline.
Mike
On Thu, Aug 5, 2010 at 1:39 PM, Lars Aronsson lars@aronsson.se wrote:
On 08/05/2010 03:12 PM, Michael Galvez wrote:
Sorry for coming into this discussion a bit late. I'm one of the members
of
Google's translation team, and I wanted to make myself available for feedback/questions.
This is an unusual and most welcome step for Google. When I first learned about GTTK in June 2009, I used it to translate a handful of articles from English to Swedish. I'm glad that it's now also possible to translate into English, but some of the errors are still there.
It's a great tool, and should be used more. We have a common interest in improving it. But for this, wikipedians need feedback. Which language pairs are most active? What words or phrases does GTTK find problematic, and can we somehow improve that? Google could benefit so much from collaborating with wikipedians. Ultimately, Google could share some translation dictionaries, so we could include them in Wiktionary, the free dictionary.
Several points here: 1. Re: language pairs, let me check with comms to see what we can share and how. One possibility is to periodically post the statistics and link to them from the Google Translate blog. Will keep you posted.
2. I'm not sure what you mean by "words or phrases" that are problematic. Can you clarify?
3. We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
Users of Gmail or Google Apps want their privacy, but users who translate Wikipedia articles are already sharing their results, so Google could help us to find each other and make us collaborate. Translations that start from a Wikipedia article could by default be put in a shared pool where other wikipedians can find them.
Yes, this is one of the things we'd love to do and we're working on it.
To some details:
I need a way to mark in the original text that a phrase is a quote, book title or noun proper that shouldn't be translated, but copied literally. And in the statistics, those words should not be counted as untranslated and block me from publishing the result. Optimally, GTTK would learn over time where such literal phrases occur, e.g. text in italics under the ==Bibliography== section.
For HTML files, both Translate and Translator Toolkit support the attribute class="notranslate" to exclude text from translation. ( http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=1... )
If you tell us what MediaWiki tags you'd like for us to treat the same way, we can do the same for Wikipedia.
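As a rough illustration of what honoring that convention involves on the tool side, here is a toy filter that translates text nodes but leaves anything inside a class="notranslate" element untouched. This is not how Translator Toolkit is actually implemented, and fake_translate is just a stand-in for a real MT call.

```python
# Illustrative sketch only: one way a client tool could honor the
# class="notranslate" convention. fake_translate stands in for a real
# machine translation call.
from html.parser import HTMLParser

def fake_translate(text):
    return text.upper()  # placeholder for machine translation

class NoTranslateFilter(HTMLParser):
    """Rebuild the document, translating text except inside notranslate elements."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # >0 while inside a notranslate subtree

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or "notranslate" in classes:
            self.skip_depth += 1  # keep nesting balanced inside the subtree
        rendered = "".join(f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data if self.skip_depth else fake_translate(data))

f = NoTranslateFilter()
f.feed('<p>hello <span class="notranslate">Källor</span> world</p>')
print("".join(f.out))
# prints <p>HELLO <span class="notranslate">Källor</span> WORLD</p>
```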
English ==References== corresponds to Swedish ==Källor==, even though the two words are not direct translations. GTTK was pretty quick to pick this up. However, the different styles we use for the opening paragraph of biographic articles, using parenthesis around the birth and death dates in the English Wikipedia, but not in some other languages, is something GTTK has not yet learned.
At the most basic level, this is how the Translator Toolkit "learns" from translations:
1. When a translator uploads a Wikipedia article into Translator Toolkit, we divide the article into segments (sentences, section headings, etc.).
2. For each segment, we look for the highest-rated translation in the global, shared translation memory (TM) --- a big database of human translations previously shared by other users.
  a. If we find a translation for that segment in the TM, we "pre-translate" the segment with the highest-rated translation.
  b. If we don't find a translation for that segment in the TM, we "pre-translate" that segment with machine translation (MT).
3. When the translator corrects these pre-translated segments, we save their corrections into the global, shared TM.
4. When a new user asks for the translation of a segment previously corrected by another user, we will recall that previous, human translation and prefer it over MT.
At a higher level, we also incorporate these previous, human translations into our MT engine, improving its quality over time.
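The four steps above amount to a translation memory with MT fallback. A toy sketch, omitting the ratings and the shared infrastructure; machine_translate is a stand-in for a real MT engine, not Google's implementation:

```python
# Toy model of the TM-with-MT-fallback flow described above. Ratings and
# sharing infrastructure are omitted for brevity.

def machine_translate(segment):
    return f"MT({segment})"  # placeholder for machine translation

class TranslationMemory:
    def __init__(self):
        self.store = {}  # segment -> previously shared human translation

    def pretranslate(self, segments):
        # Step 2: prefer a prior human translation, else fall back to MT.
        return [self.store.get(s, machine_translate(s)) for s in segments]

    def save_correction(self, segment, human_translation):
        # Step 3: a translator's correction is saved for future lookups (step 4).
        self.store[segment] = human_translation

tm = TranslationMemory()
article = ["==References==", "He was born in 1921."]

print(tm.pretranslate(article))
# prints ['MT(==References==)', 'MT(He was born in 1921.)']

tm.save_correction("==References==", "==Källor==")
print(tm.pretranslate(article))
# prints ['==Källor==', 'MT(He was born in 1921.)']
```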
In this case, it just so happened that someone had translated ==References== into ==Källor== --- that's why we're surfacing that corrected translation from the TM. In contrast, the other segments are probably coming back as MT.
Categories should not be translated, but GTTK should follow the interwiki links for categories. If none exist, perhaps suggest a parent category.
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
Even for articles that already exist in the target language, we often need to translate another section. For example, the Swedish Wikipedia might have an article about Afghanistan with a good section about its geography, but the history section needs improvement, and could be translated from another language. The work-around is to begin a translation of the whole article, but only translate the relevant part and then cut-and-paste into the target without submitting through GTTK. Perhaps GTTK could bring up both articles side by side and suggest which sections are in most dire need of improvement?
The solution that we had previously discussed with the WMF is section-level translations. However, we haven't gotten started on this yet.
At a high level, we're working on two things:
1. We can help users seed Wikipedia content in small languages. In this case, entire articles are translated from one Wikipedia into another --- not sections.
2. Once the articles exist in multiple languages, the articles take on a life of their own and become out of sync. If Wikipedians want to keep those articles in sync, we would like to help them by enabling section-level translation.
Right now, we are busy with feature requests and bugs for the first objective --- helping users seed Wikipedia in small languages. Once we get the problems with unwanted red links, community collaboration, etc., out of the way, we can move to the second problem with section-level translations.
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
Michael Galvez wrote:
- Once the articles exist in multiple languages, the articles take on a
life of their own and become out of sync. If Wikipedians want to keep those articles in sync, we would like to help them by enabling section-level translation.
I'm guessing that few communities will find it particularly valuable to keep a translation in sync, except possibly for language pairs that have close affinity and parallel evolution (approaching the point that some people would regard them as "merely" dialects). So maybe for a situation like translating French articles into Picard, at most. But even supposing the Tamil community, as an example, might find it helpful to boost their content with translations of English articles, once that's been done I can't imagine them wanting those articles perpetually reharmonized with changes in the English version. The point of seed content is to provide a basis for new life and growth, which by necessity must outgrow and cast off the shell in which the seed came. At that point, trying to maintain or recreate the shell doesn't particularly help further development.
--Michael Snow
On Fri, Aug 6, 2010 at 2:13 PM, Michael Snow wikipedia@verizon.net wrote:
Michael Galvez wrote:
- Once the articles exist in multiple languages, the articles take on a life of their own and become out of sync. If Wikipedians want to keep those articles in sync, we would like to help them by enabling section-level translation.
I'm guessing that few communities will find it particularly valuable to keep a translation in sync, except possibly for language pairs that have close affinity and parallel evolution (approaching the point that some people would regard them as "merely" dialects). So maybe for a situation like translating French articles into Picard, at most. But even supposing the Tamil community, as an example, might find it helpful to boost their content with translations of English articles, once that's been done I can't imagine them wanting those articles perpetually reharmonized with changes in the English version. The point of seed content is to provide a basis for new life and growth, which by necessity must outgrow and cast off the shell in which the seed came. At that point, trying to maintain or recreate the shell doesn't particularly help further development.
Agreed. If, later, Wikipedians decide they want help in keeping the articles in sync, then we can create something like sectional translation and other features that users may find useful.
--Michael Snow
On 6 August 2010 18:47, Michael Galvez michaelcg@gmail.com wrote:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
And the data that GTTK gathers from its use in Wikipedia translations? What would need to happen for that to start coming back, in a usable form?
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
As I vaguely recall from someone posting study results on the question, not all interwiki links are good, but ones that are a 1:1 match generally seem to be.
(I frequently use Wikipedia interwiki links as a guide to translating single words casually.)
- d.
On Fri, Aug 6, 2010 at 5:56 PM, David Gerard dgerard@gmail.com wrote:
On 6 August 2010 18:47, Michael Galvez michaelcg@gmail.com wrote:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
And the data that GTTK gathers from its use in Wikipedia translations? What would need to happen for that to start coming back, in a usable form?
The translated segments are available to all translators in Translator Toolkit. When other volunteers use Translator Toolkit to translate other Wikipedia articles, the segments will be available to them.
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
As I vaguely recall from someone posting study results on the question, not all interwiki links are good, but ones that are a 1:1 match generally seem to be.
(I frequently use Wikipedia interwiki links as a guide to translating single words casually.)
- d.
On Fri, Aug 13, 2010 at 4:28 PM, Michael Galvez michaelcg@gmail.com wrote:
On Fri, Aug 6, 2010 at 5:56 PM, David Gerard dgerard@gmail.com wrote:
And the data that GTTK gathers from its use in Wikipedia translations? What would need to happen for that to start coming back, in a usable form?
The translated segments are available to all translators in Translator Toolkit. When other volunteers use Translator Toolkit to translate other Wikipedia articles, the segments will be available to them.
If GTTK goes away next year, will the data gathered from these translations go away also?
I had thought the translation memory from Wikipedia translations was freely available for reuse (at least in principle). However, this isn't yet the case. As I understand it, Wikipedia translations are bundled together with all other public submissions to Google's global public translation memory, which produces the default translations you see online. This TM is not currently available for query, export or download.
As I vaguely recall from someone posting study results on the question, not all interwiki links are good, but ones that are a 1:1 match generally seem to be.
That is my impression as well.
SJ
On 17 August 2010 03:22, Samuel Klein meta.sj@gmail.com wrote:
On Fri, Aug 13, 2010 at 4:28 PM, Michael Galvez michaelcg@gmail.com wrote:
On Fri, Aug 6, 2010 at 5:56 PM, David Gerard dgerard@gmail.com wrote:
And the data that GTTK gathers from its use in Wikipedia translations? What would need to happen for that to start coming back, in a usable form?
The translated segments are available to all translators in Translator Toolkit. When other volunteers use Translator Toolkit to translate other Wikipedia articles, the segments will be available to them.
If GTTK goes away next year, will the data gathered from these translations go away also? I had thought the translation memory from Wikipedia translations was freely available for reuse (at least in principle). However, this isn't yet the case. As I understand it, Wikipedia translations are bundled together with all other public submissions to Google's global public translation memory, which produces the default translations you see online. This TM is not currently available for query, export or download.
Yes, that's the question I was asking.
What would it take for the Wikipedia-gathered data to be freely reusable outside Google and its tools?
- d.
On 08/06/2010 07:47 PM, Michael Galvez wrote:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
Google, as any large company, uses many sources. For example, Google Maps used to buy all its maps, but later started to drive around to build its own maps (and street images). With time, I'm certain you will use Google Books as a parallel corpus and derive translations of words and phrases from translated books, and some day you might be able to build Google Translate without relying on external dictionary sources. I don't know if this is one month or one year away, but it should take less than one decade. Expecting this development, you could keep collaboration with open content movements, such as Wikipedia/Wiktionary in mind.
For HTML files, both Translate and Translator Toolkit support the class="notranslate" attribute to exclude text from translation. ( http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=1... )
If you tell us what MediaWiki tags you'd like for us to treat the same way, we can do the same for Wikipedia.
There is no such tag, unfortunately. But in the GTTK user interface, it would be useful to have a way to mark where in the original text (left-hand side) those tags should have been. If it is any help to the pretranslator, other kinds of marks could also be manually added, such as whether a phrase is a figure of speech or should be read literally. If the text says "kill two birds with one stone", that should be translated into Swedish as "hit two flies with one swat". But if David slays Goliath with a stone, that should remain a stone.
a. If we find a translation for that segment in the TM, we will "pre-translate" the segment with the highest-rated translation.
But when you have two or more candidates, each with a reasonable probability, the choice could be presented to the human translator.
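Mechanically, the TM lookup described in (a) plus Lars's suggestion of surfacing the alternatives could be sketched like this. This is a purely illustrative model: the segment texts, ratings, and dictionary-based TM structure are invented here and say nothing about Translator Toolkit's actual internals.

```python
# Illustrative sketch: pick the highest-rated translation-memory candidate
# for a segment, and keep the rest as suggestions for the human translator.
# The TM structure and ratings below are invented for this example.
def pretranslate(segment, tm):
    """Return (best candidate, list of alternatives) for a segment,
    or (None, []) if the TM has no match."""
    candidates = tm.get(segment, [])
    if not candidates:
        return None, []
    # Sort by rating, best first; the remainder become suggestions.
    ranked = sorted(candidates, key=lambda c: c["rating"], reverse=True)
    return ranked[0]["text"], [c["text"] for c in ranked[1:]]

tm = {
    "British colonel": [
        {"text": "brittisk överste", "rating": 0.9},
        {"text": "brittisk officer", "rating": 0.6},
    ]
}
best, alternatives = pretranslate("British colonel", tm)
```

In this model the translator would see the top candidate pre-filled and the alternatives offered on demand, which matches how Michael later describes the "Show Toolkit" panel.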
- When a translator uploads a Wikipedia article into Translator Toolkit, we divide the article into segments (sentences, section headings, etc.).
This means you do recognize some wiki markup, such as [[links]] and ==headings==. But recognition of that markup is apparently hard-wired and takes place before any learning. Now, consider the case when
'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel
is translated, according to our manual of style, as:
'''John Doe,''' född 1 maj 1733, död 5 april 1799, var en brittisk överste
where the parentheses are replaced with commas and the words född (born) and död (died) have been added. It would be nice if the translation memory could learn not only the words (colonel = överste) but also to recognize this transformation of style. It is very context sensitive (this example only applies to the opening paragraph of biographic articles) and would need lots of translations to provide good results. And including dashes, commas and parentheses along with words as the elements of translated phrases is perhaps a major shift in what machine translation is supposed to do. (But it could open the door to translating template calls.)
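As a concrete illustration of the transformation Lars describes, a hand-written rule for the Swedish biography lead could look like the sketch below. This is not how statistical MT works - the point of Lars's suggestion is that such patterns would have to be *learned* - and the rule here only restyles the structure (parentheses to commas, född/död, was to var), leaving word-by-word translation of the rest to a separate step.

```python
import re

# Hypothetical, hand-coded restyling rule for the opening line of a
# Swedish biography, per the example in the email:
#   '''John Doe''' (May 1, 1733 - April 5, 1799) was ...
# becomes
#   '''John Doe,''' född 1 maj 1733, död 5 april 1799, var ...
MONTHS = {"January": "januari", "February": "februari", "March": "mars",
          "April": "april", "May": "maj", "June": "juni", "July": "juli",
          "August": "augusti", "September": "september",
          "October": "oktober", "November": "november", "December": "december"}

def swedish_date(date):
    # "May 1, 1733" -> "1 maj 1733"
    month, day, year = re.match(r"(\w+) (\d+), (\d+)", date).groups()
    return f"{day} {MONTHS[month]} {year}"

def restyle_lead(line):
    m = re.match(r"'''(.+?)''' \((.+?) - (.+?)\) was", line)
    if not m:
        return line  # not a biography lead; leave untouched
    name, born, died = m.groups()
    rest = line[m.end():]  # everything after "was", still untranslated
    return (f"'''{name},''' född {swedish_date(born)}, "
            f"död {swedish_date(died)}, var" + rest)

lead = "'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel"
restyled = restyle_lead(lead)
```

Even this toy rule shows why the transformation is hard for a TM to learn: punctuation and inserted words (född, död) become part of the "translation", exactly the shift in scope Lars points out.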
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
I think you should keep it as it is, until you get around to do that "bit of work".
On Sat, Aug 7, 2010 at 11:30 PM, Lars Aronsson lars@aronsson.se wrote:
On 08/06/2010 07:47 PM, Michael Galvez wrote:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
Google, as any large company, uses many sources. For example, Google Maps used to buy all its maps, but later started to drive around to build its own maps (and street images). With time, I'm certain you will use Google Books as a parallel corpus and derive translations of words and phrases from translated books, and some day you might be able to build Google Translate without relying on external dictionary sources. I don't know if this is one month or one year away, but it should take less than one decade. Expecting this development, you could keep collaboration with open content movements, such as Wikipedia/Wiktionary in mind.
For HTML files, both Translate and Translator Toolkit support the class="notranslate" attribute to exclude text from translation. ( http://translate.google.com/support/toolkit/bin/answer.py?hl=en&answer=1... )
If you tell us what MediaWiki tags you'd like for us to treat the same way, we can do the same for Wikipedia.
There is no such tag, unfortunately. But in the GTTK user interface, it would be useful to have a way to mark where in the original text (left-hand side) those tags should have been. If it is any help to the pretranslator, other kinds of marks could also be manually added, such as whether a phrase is a figure of speech or should be read literally. If the text says "kill two birds with one stone", that should be translated into Swedish as "hit two flies with one swat". But if David slays Goliath with a stone, that should remain a stone.
Is there a way to introduce this type of tag into MediaWiki? If we can come up with a generic MediaWiki tag for this, we can interpret this as the equivalent for "notranslate" in MediaWiki text.
When we "pretranslate" the document, we can indicate to Google Translate that this text should not be translated. In addition, we can also lock this text in Translator Toolkit so that the translator cannot edit it during translation.
a. If we find a translation for that segment in the TM, we will "pre-translate" the segment with the highest-rated translation.
But when you have two or more candidates, each with a reasonable probability, the choice could be presented to the human translator.
Yes. This choice shows up as a translation memory entry when the translator clicks, "Show Toolkit".
- When a translator uploads a Wikipedia article into Translator Toolkit, we divide the article into segments (sentences, section headings, etc.).
This means you do recognize some wiki markup, such as [[links]] and ==headings==. But recognition of that markup is apparently hard-wired and takes place before any learning. Now, consider the case when
'''John Doe''' (May 1, 1733 - April 5, 1799) was a British colonel
is translated, according to our manual of style, as:
'''John Doe,''' född 1 maj 1733, död 5 april 1799, var en brittisk överste
where the parentheses are replaced with commas and the words född (born) and död (died) have been added. It would be nice if the translation memory could learn not only the words (colonel = överste) but also to recognize this transformation of style. It is very context sensitive (this example only applies to the opening paragraph of biographic articles) and would need lots of translations to provide good results. And including dashes, commas and parentheses along with words as the elements of translated phrases is perhaps a major shift in what machine translation is supposed to do. (But it could open the door to translating template calls.)
Following interwiki links and suggesting parent categories is a bit of work and unlikely to be implemented soon. We can disable category translation if that helps - can you confirm if that's OK?
I think you should keep it as it is, until you get around to do that "bit of work".
-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se
- Implement spelling and punctuation check automatically within GTTK before posting of the articles.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What's the proposal, though - would you like for us to prevent publishing of articles if they have too many spelling errors, or simply warn the user that there are X spelling errors? Any input you can provide on preferred behavior would be great.
I would say to force spellcheck before publication, which does not seem to be the case currently. I think this would be enough - perhaps a warning as well. I don't know about preventing publication, although that might work too.
- Have GTTK automatically remove broken templates and images, or require users to translate any templates before a page may be posted.
Templates are a bit tricky. Sometimes, a template in one Wikipedia does not exist in another Wikipedia. Other times, a template in one language maps to a template in another language but the parameters are different.
Removing broken templates automatically may not work because some templates come between words. If we remove them, the sentences or paragraph may become invalid. We've also considered creating a custom interface for localizing templates, but this requires a lot of work.
In the interim, the approach we've taken is to have translators fix the templates in Wikipedia when they post the article from Translator Toolkit. When a user clicks on Share > Publish to source page in Translator Toolkit, the Wikipedia article is in preview mode --- it's not live. The idea is that if there are any errors, the translator can fix them before saving the article.
Well, many translators do fix such problems, but I was just thinking of some of the problems that I've heard so far with people who do "drive-by" translations, dropping it on a project and then disappearing. If translators are careful and do all the work themselves, templates are an annoyance rather than a real problem.
- Include a list of most needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation since they tend to generate more interest from speakers of a language in my experience.
The articles we selected actually weren't really random. Here's how we selected them:
1. We looked at the top Google searches in the region (e.g., for Tamil, we looked at searches in India and I believe Sri Lanka, as well).
2. From the top Google searches in the region, we looked at the top clicked Wikipedia articles, regardless of the language (so we wound up with Wikipedia source articles in English, Hindi, and other languages).
3. From the top clicked Wikipedia articles, we looked for articles that were either stubs or unavailable in the local language - these are the articles that we sent for translation.
This selection isn't perfect. For example, it assumes that the top clicked Wikipedia articles by all users in India/Sri Lanka --- who may be searching in English, Hindi, Tamil, or some other language --- are relevant to the Tamil community. Last month, we met with members of the Tamil and Telugu Wikipedias to improve this article selection. The main changes that we agreed on were:
I'm not sure if this project was separate from the Swahili Wikipedia Challenge, but I'm assuming it was after seeing articles such as http://sw.wikipedia.org/wiki/Maduka_ya_United_Cigar_Stores (about a defunct chain of cigar stores in the US) which I doubt were popular searches in East Africa.
One more idea: automatically add existing interwiki links to the new article.
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
-m.
Hi Mark,
Responses inline.
Mike
On Thu, Aug 5, 2010 at 2:22 PM, Mark Williamson node.ue@gmail.com wrote:
- Implement spelling and punctuation check automatically within GTTK before posting of the articles.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What's the proposal, though - would you like for us to prevent publishing of articles if they have too many spelling errors, or simply warn the user that there are X spelling errors? Any input you can provide on preferred behavior would be great.
I would say to force spellcheck before publication, which does not seem to be the case currently. I think this would be enough - perhaps a warning as well. I don't know about preventing publication, although that might work too.
How about this: we pop up a window that says, "Your translation has misspelled words: X. Publish anyway?"
Does that work?
- Have GTTK automatically remove broken templates and images, or require users to translate any templates before a page may be posted.
Templates are a bit tricky. Sometimes, a template in one Wikipedia does not exist in another Wikipedia. Other times, a template in one language maps to a template in another language but the parameters are different.
Removing broken templates automatically may not work because some templates come between words. If we remove them, the sentences or paragraph may become invalid. We've also considered creating a custom interface for localizing templates, but this requires a lot of work.
In the interim, the approach we've taken is to have translators fix the templates in Wikipedia when they post the article from Translator Toolkit. When a user clicks on Share > Publish to source page in Translator Toolkit, the Wikipedia article is in preview mode --- it's not live. The idea is that if there are any errors, the translator can fix them before saving the article.
Well, many translators do fix such problems, but I was just thinking of some of the problems that I've heard so far with people who do "drive-by" translations, dropping it on a project and then disappearing. If translators are careful and do all the work themselves, templates are an annoyance rather than a real problem.
- Include a list of most needed articles for people to create, rather than random articles that will be of little use to local readers. Some articles, such as those on local topics, have the added benefit of encouraging more edits and community participation since they tend to generate more interest from speakers of a language in my experience.
The articles we selected actually weren't really random. Here's how we selected them:
1. We looked at the top Google searches in the region (e.g., for Tamil, we looked at searches in India and I believe Sri Lanka, as well).
2. From the top Google searches in the region, we looked at the top clicked Wikipedia articles, regardless of the language (so we wound up with Wikipedia source articles in English, Hindi, and other languages).
3. From the top clicked Wikipedia articles, we looked for articles that were either stubs or unavailable in the local language - these are the articles that we sent for translation.
This selection isn't perfect. For example, it assumes that the top clicked Wikipedia articles by all users in India/Sri Lanka --- who may be searching in English, Hindi, Tamil, or some other language --- are relevant to the Tamil community. Last month, we met with members of the Tamil and Telugu Wikipedias to improve this article selection. The main changes that we agreed on were:
I'm not sure if this project was separate from the Swahili Wikipedia Challenge, but I'm assuming it was after seeing articles such as http://sw.wikipedia.org/wiki/Maduka_ya_United_Cigar_Stores (about a defunct chain of cigar stores in the US) which I doubt were popular searches in East Africa.
It's the same set of projects although at times, there were some variations in the approach. For the Swahili project, for example, in addition to translating content (selected from search data), the students also created content from scratch.
Re: Cigar Stores, I'm actually not sure where this article comes from. You're right that it's not terribly popular --- it doesn't show up as one of the top, clicked articles from search data. It may have been added to the list later by a volunteer.
One more idea: automatically add existing interwiki links to the new article.
We already include existing interwiki links into the new article. If you find a bug in this, please let us know and we'll fix it.
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
Oriya is one of the languages we'd love to work on. We don't have any activity on this today but if you have some Wikipedians who'd like to help us get this off the ground, we'd love to get their contact info and we can follow up from there.
-m.
Dear Michael, I also thank you for joining the discussion. See my question below.
2010/8/6 Michael Galvez michaelcg@gmail.com:
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
Oriya is one of the languages we'd love to work on. We don't have any activity on this today but if you have some Wikipedians who'd like to help us get this off the ground, we'd love to get their contact info and we can follow up from there.
How do you decide, in general, with which languages to work? If I understand correctly, until now you worked with Arabic, Swahili and several Indian languages. But there are also languages in other parts of the world whose Wikipedias could profit from such a project.
For example, the Greek Wikipedia is surprisingly small with only 54,500 articles (13 million speakers); Armenian has only 10,000 articles (6.7 million speakers); Georgian has 42,000 articles (4 million speakers). AFAIK, these language communities are largely monolingual, that is, speakers of these languages may know English or Russian, but they usually prefer to speak and write their own, unlike, for example, speakers of Native American languages, many of whom use English, Spanish or Portuguese online.
What has to happen so that a collaboration with Google Translation will begin in these languages? Do their representatives have to approach Google or is it usually Google's decision?
Hi Amir,
Apologies for the late reply. Replies inline below.
Mike
On Fri, Aug 6, 2010 at 3:14 PM, Amir E. Aharoni < amir.aharoni@mail.huji.ac.il> wrote:
Dear Michael, I also thank you for joining the discussion. See my question below.
2010/8/6 Michael Galvez michaelcg@gmail.com:
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
Oriya is one of the languages we'd love to work on. We don't have any activity on this today but if you have some Wikipedians who'd like to help us get this off the ground, we'd love to get their contact info and we can follow up from there.
How do you decide, in general, with which languages to work? If I understand correctly, until now you worked with Arabic, Swahili and several Indian languages. But there are also languages in other parts of the world whose Wikipedias could profit from such a project.
For example, the Greek Wikipedia is surprisingly small with only 54,500 articles (13 million speakers); Armenian has only 10,000 articles (6.7 million speakers); Georgian has 42,000 articles (4 million speakers). AFAIK, these language communities are largely monolingual, that is, speakers of these languages may know English or Russian, but they usually prefer to speak and write their own, unlike, for example, speakers of Native American languages, many of whom use English, Spanish or Portuguese online.
To decide which languages to target, we looked at several sets of metrics:
- We looked at the size of each Wikipedia based on words, articles, non-stub articles (measured by articles over 2Kb), and non-stub words (extrapolated), from here: http://stats.wikimedia.org/EN/
- We looked at the number of Internet users in each of those languages from here: http://www.internetworldstats.com/stats.htm
We also considered doing more refined measurements by accounting for Google activity and mobile, but we ultimately went for the simple metrics above.
We took these numbers and calculated the number of words/articles/non-stub articles/non-stub words per Internet user and normalized it with the English Wikipedia = 1. We then focused on the largest languages that had deficits vis-a-vis English.
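The normalization Michael describes fits in a few lines. All the counts below are invented placeholders, not the figures Google actually used; only the arithmetic (content per Internet user, scaled so English = 1) follows the description above.

```python
# Sketch of the "deficit vs. English" metric: content per Internet user,
# normalized so that the English Wikipedia scores 1.0.
# All counts are made-up placeholders for illustration.
stats = {
    # language: (non-stub articles, Internet users in that language)
    "en": (3_000_000, 500_000_000),
    "ta": (15_000, 10_000_000),
    "sw": (8_000, 5_000_000),
}

def normalized_density(lang, stats, baseline="en"):
    articles, users = stats[lang]
    base_articles, base_users = stats[baseline]
    return (articles / users) / (base_articles / base_users)

# The lowest-scoring languages have the largest deficit vis-a-vis English,
# so they would be the first candidates for a translation project.
ranking = sorted(stats, key=lambda lang: normalized_density(lang, stats))
```

The same calculation works for any of the four content measures (words, articles, non-stub articles, non-stub words); only the numerator changes.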
(A few folks in the audience of our talk at Wikimania asked us to leave a soft copy of the slides that we presented that show this. I haven't forgotten about this --- I am still working with PR to make that deck publicly available.)
What has to happen so that a collaboration with Google Translation will begin in these languages? Do their representatives have to approach Google or is it usually Google's decision?
We can do either (Google-initiated or community-initiated).
If you'd like for us to work with a particular language, feel free to reach out to us directly. Please email translator-toolkit-support at google.com.
-- אָמִיר אֱלִישָׁע אַהֲרוֹנִי Amir Elisha Aharoni
"We're living in pieces, I want to live in peace." - T. Moore
On Thu, Aug 5, 2010 at 2:22 PM, Mark Williamson node.ue@gmail.com wrote:
- Implement spelling and punctuation check automatically within GTTK before posting of the articles.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What's the proposal, though - would you like for us to prevent publishing of articles if they have too many spelling errors, or simply warn the user that there are X spelling errors? Any input you can provide on preferred behavior would be great.
I would say to force spellcheck before publication, which does not seem to be the case currently. I think this would be enough - perhaps a warning as well. I don't know about preventing publication, although that might work too.
How about this: we pop up a window that says, "Your translation has misspelled words: X. Publish anyway?"
Does that work?
That sounds great to me.
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
Oriya is one of the languages we'd love to work on. We don't have any activity on this today but if you have some Wikipedians who'd like to help us get this off the ground, we'd love to get their contact info and we can follow up from there.
Unfortunately, there is currently not even an Oriya Wikipedia community. I think such a project would need to be managed a bit differently - seeking to either create a community (no reason participants can't start a community), or to be relatively limited in scope, or else to have more stringent controls on content quality. I would love to help with that myself in any way possible.
Another option with a bit more community but still very underdeveloped is the Punjabi Wikipedia, with 1919 pages. I would recommend contacting Gman124 or Sukh at that project on their user talk pages or through the e-mail user function.
-m.
On Sat, Aug 7, 2010 at 1:38 AM, Mark Williamson node.ue@gmail.com wrote:
On Thu, Aug 5, 2010 at 2:22 PM, Mark Williamson node.ue@gmail.com wrote:
- Implement spelling and punctuation check automatically within GTTK before posting of the articles.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What's the proposal, though - would you like for us to prevent publishing of articles if they have too many spelling errors, or simply warn the user that there are X spelling errors? Any input you can provide on preferred behavior would be great.
I would say to force spellcheck before publication, which does not seem to be the case currently. I think this would be enough - perhaps a warning as well. I don't know about preventing publication, although that might work too.
How about this: we pop up a window that says, "Your translation has misspelled words: X. Publish anyway?"
Does that work?
That sounds great to me.
OK - I'll file this with engineering. The red links fix is going in the next release (this month). Not sure if we can rush this through the current release but we'll shoot for a subsequent release (maybe in September?).
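The warn-but-allow behavior agreed on above could be sketched roughly as follows. This is purely illustrative; the function name and the prompt callback are made up here and are not Translator Toolkit's actual code:

```python
def confirm_publish(misspelled_words, prompt):
    """Warn about misspellings before publishing, but leave the
    decision to the user. `prompt` shows a message and returns True
    if the user chooses to publish anyway."""
    if not misspelled_words:
        return True  # nothing to warn about; publish directly
    message = ("Your translation has misspelled words: "
               + ", ".join(misspelled_words) + ". Publish anyway?")
    return prompt(message)
```

A UI would pass a dialog callback as `prompt`; a batch pipeline could instead log the message and return False to hold the article back.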
Also, as far as Indic languages go, I would ask if there's any chance you have any Oriya speakers - with 637 articles, the Oriya Wikipedia is by far the most anemic of Indic-language Wikipedias, in spite of a speaker population of 31 million.
Oriya is one of the languages we'd love to work on. We don't have any activity on this today but if you have some Wikipedians who'd like to help us get this off the ground, we'd love to get their contact info and we can follow up from there.
Unfortunately, there is currently not even an Oriya Wikipedia community. I think such a project would need to be managed a bit differently - seeking to either create a community (no reason participants can't start a community), or to be relatively limited in scope, or else to have more stringent controls on content quality. I would love to help with that myself in any way possible.
I spoke with our India team and it looks like Oriya, unfortunately, is not one of our target languages this year. If you're willing to help, though, I'll put you in touch with our project manager in India to see if there's a way to push this through.
Another option with a bit more community but still very underdeveloped is the Punjabi Wikipedia, with 1919 pages. I would recommend contacting Gman124 or Sukh at that project on their user talk pages or through the e-mail user function.
I'll add this to the thread with the India-based project manager.
-m.
foundation-l mailing list foundation-l@lists.wikimedia.org Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
Michael Galvez michaelcg@gmail.com wrote:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
Thanks for stopping by. A few questions: 1) Does GTTK have a specific API for Mediawiki markup ("wikitext") itself? 2) Can it be opened up for viewing/editing? 3) Does it include the original-language wikitext as <!-- hidden--> in the output wikitext?
1. the local Wikipedia community should give Google final OK on what articles should or should not be translated
2. the local Wikipedia community can add articles to Google's list
3. the local Wikipedia community can suggest titles for the articles
4. Google's translators will post the articles with their user names, and they will monitor community feedback on their user pages until the translation meets the community's standards
0) The main issue here is 'concepts of community,' interaction between communities, and the creation of "community" between different language communities. The more open Google is with us about what it wants and what it proposes to do, the more we can be on board with helping organize community support.
-SC
On Thu, Aug 5, 2010 at 9:45 PM, stevertigo stvrtg@gmail.com wrote:
Michael Galvez michaelcg@gmail.com wrote:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
Thanks for stopping by. A few questions: 1) Does GTTK have a specific API for Mediawiki markup ("wikitext") itself? 2) Can it be opened up for viewing/editing? 3) Does it include the original-language wikitext as <!-- hidden--> in the output wikitext?
Translator Toolkit processes generic MediaWiki text, although this is not an officially supported feature and is largely untested. If you upload a UTF-8 file with extension ".mediawiki", Translator Toolkit will try to render the file in the same way that it renders Wikipedia files.
Re: original-language wikitext, I'm not familiar with that markup. Can you clarify?
1. the local Wikipedia community should give Google final OK on what articles should or should not be translated
2. the local Wikipedia community can add articles to Google's list
3. the local Wikipedia community can suggest titles for the articles
4. Google's translators will post the articles with their user names, and they will monitor community feedback on their user pages until the translation meets the community's standards
0) The main issue here is 'concepts of community,' interaction between communities, and the creation of "community" between different language communities. The more open Google is with us about what it wants and what it proposes to do, the more we can be on board with helping organize community support.
Understood. I'm working with PR to release the deck that I presented at Wikimania. Will send that along as soon as it's available.
-SC
Michael Galvez michaelcg@gmail.com wrote:
Translator Toolkit processes generic MediaWiki text, although this is not an officially supported feature and is largely untested. If you upload a UTF-8 file with extension ".mediawiki", Translator Toolkit will try to render the file in the same way that it renders Wikipedia files.
Re: original-language wikitext, I'm not familiar with that markup. Can you clarify?
I mean that, given that the output is essentially plain wikitext, it might be a good idea to include the original source paragraphs hidden within the body of the output:
<machine translated text in language x> <!-- hidden original text in source language -->
The output files of course would be larger, and having large amounts of different texts might confuse autodetect.. But the issue is again to aid users in keeping a high fidelity between input and translated output, and keeping source paragraphs close.
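The proposed layout could be sketched like this, assuming paragraph-aligned source and translated text. The helper below is hypothetical, not a Translator Toolkit feature, and note that a source paragraph containing "--" would break the HTML comment and would need escaping:

```python
def interleave_hidden_source(source_paragraphs, translated_paragraphs):
    """Pair each machine-translated paragraph with its original,
    hiding the original in an HTML comment so it stays in the
    wikitext but is invisible to readers."""
    chunks = []
    for src, dst in zip(source_paragraphs, translated_paragraphs):
        chunks.append(dst + "\n<!-- " + src + " -->")
    return "\n\n".join(chunks)
```

An editor polishing the translation would then have each source paragraph directly beneath the machine output - the fidelity aid described above.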
stevertigo wrote:
0) The main issue here is 'concepts of community,' interaction between communities, and the creation of "community" between different language communities. The more open Google is with us about what it wants and what it proposes to do, the more we can be on board with helping organize community support.
michael galvez wrote:
Understood. I'm working with PR to release the deck that I presented at Wikimania. Will send that along as soon as it's available.
Are you keeping a page about this on meta.wiki?
-SC
Michael Galvez, 05/08/2010 15:12:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
Thank you, you've explained some important things.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What are the supported languages? I see that OOo has some 100 dictionaries (and some are really great). http://wiki.services.openoffice.org/wiki/Dictionaries
Michael Galvez, 06/08/2010 19:47:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
I see (this is crucial). How many dictionaries did you acquire? Does this also affect "small" languages? What about out-of-print bilingual dictionaries (especially in the "biggest" languages, where there are lots of them)? Do you acquire them on limited licenses too, or do you try to acquire them outright instead?
Nemo
On Sat, Aug 7, 2010 at 4:52 AM, Federico Leva (Nemo) nemowiki@gmail.comwrote:
Michael Galvez, 05/08/2010 15:12:
Sorry for coming into this discussion a bit late. I'm one of the members of Google's translation team, and I wanted to make myself available for feedback/questions.
Thank you, you've explained some important things.
There is spell check in Translator Toolkit, although it's not available for all languages. We don't have any punctuation checks today and I doubt that we can release this anytime soon. (If it's not available in Google Docs or Gmail, then it's unlikely that we'll have it for Translator Toolkit, as well, since we use the same infrastructure.)
What are the supported languages? I see that OOo has some 100 dictionaries (and some are really great). http://wiki.services.openoffice.org/wiki/Dictionaries
Here's the list of supported languages: http://mail.google.com/support/bin/answer.py?hl=en&answer=19933. Not sure if the Gmail team can use those dictionaries but will pass on the link.
Michael Galvez, 06/08/2010 19:47:
- We acquire dictionaries on limited licenses from other parties. In general, while we can surface this content on our own sites (e.g., Google Translate, Google Dictionary, Google Translator Toolkit), we don't have permission to donate that data to other sites.
I see (this is crucial). How many dictionaries did you acquire? Does this also affect "small" languages? What about out-of-print bilingual dictionaries (especially in the "biggest" languages, where there are lots of them)? Do you acquire them on limited licenses too, or do you try to acquire them outright instead?
The languages for the acquired dictionaries are the ones listed in http://www.google.com/dictionary. This includes monolingual and bilingual dictionaries. I believe all of them are on limited licenses, although the specifics vary depending on the deal.
Nemo
On Sat, Jul 24, 2010 at 2:57 AM, stevertigo stvrtg@gmail.com wrote:
Translation between wikis currently exists as a largely pulling paradigm: Someone on the target wiki finds an article in another language (English for example) and then pulls it to their language wiki.
These days Google and other translate tools are good enough to use as the starting basis for a translated article, and we can consider how we make use of them in an active way. What is largely a "pull" paradigm can also be a "push" paradigm - we can use translation tools to "push" articles to other wikis.
I don't know whether other Wikipedias have similar policies, but on the Italian Wikipedia an article which is just a machine translation can be speedy deleted under our policies. The reason is that machine translations are not good enough and the autotranslated text is too difficult to read, at least for Italian. It is true that because Italian is not used as a foreign language as widely as other languages, native speakers are not used to people writing in bad Italian (bad English is far more common), so it is natural to set a higher threshold. I agree that machine translations are a good starting point, but that means that someone who knows the target language (it doesn't matter whether as a native speaker or not) must fix the translation, correcting the typical machine mistakes (such as translated personal names).
If there are issues, they can be overcome. The fact of the matter is that the vast majority of articles in English can be "pushed" over to other languages, and fill a need for those topics in those languages.
I see a big risk that this may be perceived as cultural colonialism, but that's something that already happens (some parts of the world write more on Wikipedia than others). But somehow pushing from the small wikis to the big ones is one of the best ways to get local topics globally known.
Cruccone
I don't know whether other Wikipedias have similar policies, but on the Italian Wikipedia an article which is just a machine translation can be speedy deleted under our policies. The reason is that machine translations are not good enough and the autotranslated text is too difficult to read, at least for Italian. It is true that because Italian is not used as a foreign language as widely as other languages, native speakers are not used to people writing in bad Italian (bad English is far more common), so it is natural to set a higher threshold.
Same in Ukrainian Wikipedia
I agree that machine translations are a good starting point,
For the time being, machine translations are good only as an aid to comprehending articles pointed to by interwiki links.