News from the front.
A very bad and unfair imbalance of power was established in favor of English on Wikimedia Commons in 2005-2006, requiring people from all over the world to work for the benefit of the English-language community.
In that ocean of unfairness, there was a small island where you could find comfort and grace: biological taxa, the names of animals and plants. For centuries the scientific community had been used to using Latin, creating a space where scientists from all over the world are nearer to being equals, everybody needing to leave her/his native tongue and use a foreign language. Wikimedia Commons had decided to name categories accordingly.
I discovered a few days ago that someone, probably in good faith and unaware of this language policy, created [[:Category:Animals by common named groups]], a container for English-named biological taxa, at the end of 2008.
Now I find people pushing for this container and for English-named wild animal species. So the front line is broken.
More reading at:
http://commons.wikimedia.org/wiki/Commons:Deletion_requests/Category:Wolves
http://fr.wikipedia.org/wiki/Projet:Biologie/Le_caf%C3%A9_des_biologistes#Ca...
teofilowiki@gmail.com wrote:
I discovered a few days ago that someone, probably in good faith and unaware of this language policy, created [[:Category:Animals by common named groups]], a container for English-named biological taxa, at the end of 2008.
There is a major problem with Latin names in a number of taxa. It seems that if there are 5 consecutive wet days in summer, a couple of researchers put their heads together and concoct new names, move things about, split, or combine species. As such, whilst the Latin names are useful as a link between languages, they are not stable enough for the lay person to keep up with. That is why a number of the most useful sites on the web provide xrefs for common names, which is where I'd go if I wanted to know the common name of a moth in German, French or Italian:
http://www.lepidoptera.pl/show.php?ID=539&country=PL
I can't be arsed to argue it because there are alternate resources (at least at the species level), but frankly both the common and Latin names should be given in all the languages that have a common name for a particular species. Also for the genus, tribe, and family where appropriate.
On 22 June 2010 14:06, wiki-list@phizz.demon.co.uk wrote:
There is a major problem with Latin names in a number of taxa. It seems that if there are 5 consecutive wet days in summer, a couple of researchers put their heads together and concoct new names, move things about, split, or combine species.
And the actual problem here is that "species" as biology now understands it is more than a little fluid, which is why researchers look forward to those five consecutive wet days in summer, to sort out the mess ... the problem you describe is how to make rigid descriptions of something at the fluid level.
- d.
dgerard@gmail.com wrote:
On 22 June 2010 14:06, wiki-list@phizz.demon.co.uk wrote:
There is a major problem with Latin names in a number of taxa. It seems that if there are 5 consecutive wet days in summer, a couple of researchers put their heads together and concoct new names, move things about, split, or combine species.
And the actual problem here is that "species" as biology now understands it is more than a little fluid, which is why researchers look forward to those five consecutive wet days in summer, to sort out the mess ... the problem you describe is how to make rigid descriptions of something at the fluid level.
Of course, but then some national organisations adopt the new classifications, and others do not, or are tardy in their adoption. Meanwhile someone is using an identification key or guidebook from say 1973, or knows the species from its previous Latin name.
The common name in any language has more stability as far as the lay person is concerned. The lay person shouldn't have to first find the Latin name of an organism when looking it up: http://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial%3ARecherche&searc...
On 22 June 2010 15:20, wiki-list@phizz.demon.co.uk wrote:
The common name in any language has more stability as far as the lay person is concerned. The lay person shouldn't have to first find the Latin name of an organism when looking it up: http://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial%3ARecherche&searc...
Definitely. Category redirects would help here.
- d.
On 22 June 2010 15:45, David Gerard dgerard@gmail.com wrote:
On 22 June 2010 15:20, wiki-list@phizz.demon.co.uk wrote:
The common name in any language has more stability as far as the lay person is concerned. The lay person shouldn't have to first find the Latin name of an organism when looking it up: http://fr.wikipedia.org/w/index.php?title=Sp%C3%A9cial%3ARecherche&searc...
Definitely. Category redirects would help here.
I think redirects is the obvious solution. If you can't agree on what a category should be called, choose one of the options at random and set up redirects for the rest. It really doesn't matter which name the category is actually at, as long as users can find the images they want by whatever reasonable name they search by.
I'd think the category can be renamed as common names (English), and similar ones made for the other languages. It's not just a matter of redirection--there are many instances where some languages do, and some do not, have a common name. I think there are also cases where in one language a common name refers to a group of species, and in another to an overlapping but not identical group of species.
In English at least, even academic journals aimed at non-taxonomists (e.g. PNAS, for an Open Access example) almost always use common names in the title and give the formal Latin equivalent somewhere later in the paper.
Hoi, When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English. Commons has the advantage that many Wikipedias refer to a category where pictures of, for instance, a horse can be found.
There has been a demonstration project showing that it is possible to associate a concept in a translation dictionary with the names of categories on Commons. It showed not only that this is possible, but also that the text of the category can be changed to show a word in the language selected by the user.
At OmegaWiki.org we have the ability to bring translations and Commons categories together. We have been doing that for quite some time now.
Given that such support is thought to be difficult and expensive, what can be done to improve the support for both information and articles is to have "referral" pages on Wikipedias. These are pages that contain little more than a definition of the concept and refer to Wikipedias that do have an article on the subject. When a link to a Commons category is available, it is possible to refer to Commons as well.
This does not require much investment and it will make the Wikipedias with few articles more useful. It will grow our traffic, and when we learn which "referral" pages are in demand, we will know which articles will make a difference once they become genuine articles. Thanks, GerardM
On Tue, Jun 22, 2010 at 9:21 AM, Gerard Meijssen gerard.meijssen@gmail.comwrote:
When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English.
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia (http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
On 22 June 2010 17:32, Anthony wikimail@inbox.org wrote:
On Tue, Jun 22, 2010 at 9:21 AM, Gerard Meijssen gerard.meijssen@gmail.comwrote:
When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English.
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia (http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
In practice, pulling up the Wikipedia article on "horse" in your language will cover most cases. There is a fairly good argument to be made that Wikipedia is Commons' best search engine.
On Tue, Jun 22, 2010 at 7:42 PM, geni geniice@gmail.com wrote:
In practice, pulling up the Wikipedia article on "horse" in your language will cover most cases. There is a fairly good argument to be made that Wikipedia is Commons' best search engine.
I would consider this state as a poor reflection on Commons' accessibility.
Especially as Google image search (imho, the likeliest avenue of searching for images) gives 130 000 pictures of horses on Commons if searched in English, zero if searched in Estonian ("hobu"), and while it gives 160 000 results for a Hungarian search ("ló"), only one image on the first page resembles a horse.
Best regards, Bence
On Tue, Jun 22, 2010 at 6:52 PM, Bence Damokos bdamokos@gmail.com wrote:
On Tue, Jun 22, 2010 at 7:42 PM, geni geniice@gmail.com wrote:
In practice, pulling up the Wikipedia article on "horse" in your language will cover most cases. There is a fairly good argument to be made that Wikipedia is Commons' best search engine.
I would consider this state as a poor reflection on Commons' accessibility.
Especially as Google image search (imho, the likeliest avenue of searching for images) gives 130 000 pictures of horses on Commons if searched in English, zero if searched in Estonian ("hobu"), and while it gives 160 000 results for a Hungarian search ("ló"), only one image on the first page resembles a horse.
Here's a thought: Enter "hobu" into translate.google.com, leave "source language" on automatic and target on "English", and it will happily translate it into "horse". Could we offer a "translation" link in search? As in, "translate my query into English and try again"? I'm sure we can come to an arrangement with Google (or someone else).
Magnus
On Tue, Jun 22, 2010 at 8:07 PM, Magnus Manske magnusmanske@googlemail.comwrote:
I would consider this state as a poor reflection on Commons' accessibility.
Especially as Google image search (imho, the likeliest avenue of searching for images) gives 130 000 pictures of horses on Commons if searched in English, zero if searched in Estonian ("hobu"), and while it gives 160 000 results for a Hungarian search ("ló"), only one image on the first page resembles a horse.
Here's a thought: Enter "hobu" into translate.google.com, leave "source language" on automatic and target on "English", and it will happily translate it into "horse". Could we offer a "translation" link in search? As in, "translate my query into English and try again"? I'm sure we can come to an arrangement with Google (or someone else).
Sorry if I misunderstood your suggestion.
I'm sure power users can find any number of ways to do this (I think Google already offers a similar service somewhere, hidden away), though they probably speak English as well. To reach those who do not speak English or aren't power users, it has to be super obvious, I'm afraid. Google will probably reach that point sometime, but while they usually support a couple of dozen languages, we do so with a couple of hundred.
I would be happy, though, to see some translation magic applied to Commons' category system the way templates now autotranslate - given that we have a huge translation community and that interwiki links and links from Wikipedias to Commons can be used to guess the meanings (which could then be confirmed by a human in some addictive "game"). I am not sure if Google would take the hint of the localized category names in their image search, but it would be a start.
(Having an easy, special interface to translate image descriptions -- one that cuts away all the wikicode confusion, leaving just the image, the existing translations and a next button, and adds some AJAXy background magic, maybe suggestions through the Google Translate API -- might also help drive the localisation of image descriptions. There are probably some userscripts that do this, but they could be turned on by default or at least made more prominent.)
Best regards, Bence
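One way to seed such guesses, sketched in Python against the public MediaWiki API (purely an illustration of the interwiki-link idea above, not an existing Commons feature; the helper name is made up):

import requests

# Sketch of "guess meanings from interwiki links": take the Wikipedia article
# that corresponds to a Commons category and read its interlanguage links as
# candidate localized names for that category. Illustration only.
def candidate_names(en_title):
    api = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": en_title,
        "lllimit": "max",
        "format": "json",
    }
    pages = requests.get(api, params=params).json()["query"]["pages"]
    page = next(iter(pages.values()))
    names = {"en": en_title}
    for link in page.get("langlinks", []):
        names[link["lang"]] = link["*"]
    return names

# e.g. candidate translations for Category:Horses, via the article [[Horse]]
print(candidate_names("Horse"))

A human reviewer (the addictive "game") would still need to confirm each guess before it is shown as a category label.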
On 06/22/2010 08:07 PM, Magnus Manske wrote:
I would consider this state as a poor reflection on Commons' accessibility.
Especially as Google image search (imho, the likeliest avenue of searching for images) gives 130 000 pictures of horses on Commons if searched in English, zero if searched in Estonian ("hobu"), and while it gives 160 000 results for a Hungarian search ("ló"), only one image on the first page resembles a horse.
Here's a thought: Enter "hobu" into translate.google.com, leave "source language" on automatic and target on "English", and it will happily translate it into "horse". Could we offer a "translation" link in search? As in, "translate my query into English and try again"? I'm sure we can come to an arrangement with Google (or someone else).
I already made something similar: http://toolserver.org/~nikola/mis.php
On Wed, Jun 23, 2010 at 6:40 AM, Nikola Smolenski smolensk@eunet.rs wrote:
On 06/22/2010 08:07 PM, Magnus Manske wrote:
I would consider this state as a poor reflection on Commons' accessibility.
Especially as Google image search (imho, the likeliest avenue of searching for images) gives 130 000 pictures of horses on Commons if searched in English, zero if searched in Estonian ("hobu"), and while it gives 160 000 results for a Hungarian search ("ló"), only one image on the first page resembles a horse.
Here's a thought: Enter "hobu" into translate.google.com, leave "source language" on automatic and target on "English", and it will happily translate it into "horse". Could we offer a "translation" link in search? As in, "translate my query into English and try again"? I'm sure we can come to an arrangement with Google (or someone else).
I already made something similar: http://toolserver.org/~nikola/mis.php
Nice! Now it needs language auto-detect, and Estonian for the example (unless I didn't see it), and, of course, integration into Commons...
On Wednesday 23 June 2010 10:13:39, Magnus Manske wrote:
On Wed, Jun 23, 2010 at 6:40 AM, Nikola Smolenski smolensk@eunet.rs wrote:
On 06/22/2010 08:07 PM, Magnus Manske wrote:
Here's a thought: Enter "hobu" into translate.google.com, leave "source language" on automatic and target on "English", and it will happily translate it into "horse". Could we offer a "translation" link in search? As in, "translate my query into English and try again"? I'm sure we can come to an arrangement with Google (or someone else).
I already made something similar: http://toolserver.org/~nikola/mis.php
Nice! Now it needs language auto-detect, and Estonian for the example (unless I didn't see it), and, of course, integration into Commons...
All done, and I leave the integration to someone who knows how to navigate the community's labyrinths.
On Tue, Jun 22, 2010 at 6:32 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 22, 2010 at 9:21 AM, Gerard Meijssen gerard.meijssen@gmail.comwrote:
When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English.
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia ( http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
If I read the data in the article correctly, most means 35%. If we consider that current English native speakers mostly already have internet, and those without internet are likelier than not to be non-English speakers, I would be careful to advocate the unilateral use of English.
Best regards, Bence
On Tue, Jun 22, 2010 at 1:47 PM, Bence Damokos bdamokos@gmail.com wrote:
On Tue, Jun 22, 2010 at 6:32 PM, Anthony wikimail@inbox.org wrote:
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia ( http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
If I read the data in the article correctly, most means 35%.
Since "most" means more than 50%, I don't think you read it correctly. The 35% figure seems to be only native English speakers.
If we consider that current English native speakers mostly already have internet and those without internet are likelier than not to be non-English speakers I would be careful to advocate the unilateral use of English.
As would I, though I don't think you mean what you said.
If we consider that current English native speakers mostly already have internet and those without internet are likelier than not to be non-English speakers I would be careful to advocate the unilateral use of English.
As would I, though I don't think you mean what you said.
Why not? To me, it means that we're widening the digital divide by making it so that people who don't have the internet would have little use for it anyways if it's all written in a language they don't understand.
m.
Since I'm a fairly active programmer, I have some code sitting around. If I can get some support on Commons with regards to templates (something that gives me nightmares) I could probably get a translation matrix program up and running within 24-48 hours. I would just need to figure out a good method for tracking what needs to be translated, what has been machine translated and needs review, and what has already been translated.
John
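A minimal sketch of the kind of bookkeeping described above (the state names and the data layout are hypothetical, not anything that exists on Commons):

from enum import Enum

# Illustrative only: one way to track per-language translation status for a
# Commons category label as it moves through the proposed workflow.
class Status(Enum):
    NEEDS_TRANSLATION = "needs translation"
    MACHINE_TRANSLATED = "machine translated, needs review"
    REVIEWED = "translated and reviewed"

# One row of the "translation matrix": category -> language -> (label, status)
matrix = {
    "Category:Equus caballus": {
        "en": ("horse", Status.REVIEWED),
        "de": ("Pferd", Status.MACHINE_TRANSLATED),
        "et": (None, Status.NEEDS_TRANSLATION),
    }
}

def pending_review(matrix):
    # Yield entries that a human still has to translate or check.
    for category, langs in matrix.items():
        for lang, (label, status) in langs.items():
            if status is not Status.REVIEWED:
                yield category, lang, label, status

for item in pending_review(matrix):
    print(item)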
The basic translation matrix is in place; here is how you say "horse" in as many languages as you can: http://commons.wikimedia.org/w/index.php?title=User:%CE%94/Sandbox&oldid...
John
Very nice.
I'd like to see such translation tools used to enhance the tags used to identify an image, so that all internet searches can find images by those tags.
SJ
Samuel Klein <meta.sj@...> writes:
I'd like to see such translation tools used to enhance the tags used to identify an image, so that all internet searches can find images by those tags.
I think this stuff should be left for Google. A clever search engine should be able to figure out that if you are looking for "Pferd" images, "horse" images will also be of interest; and Google is getting clever quickly in this regard. (For example, recently Google web search has been offering to translate the search phrase to English, and translate the results back to you.)
OTOH, it would be a nice feature to show translated page and category names when someone looks at the page with the interface language set to non-English.
On Wed, Jun 23, 2010 at 11:17 AM, Tisza Gergo gtisza@gmail.com wrote:
Samuel Klein <meta.sj@...> writes:
I'd like to see such translation tools used to enhance the tags used to identify an image, so that all internet searches can find images by those tags.
I think this stuff should be left for Google. A clever search engine should be able to figure out that if you are looking for "Pferd" images, "horse" images will also be of interest; and Google is getting clever quickly in this regard. (For example, recently Google web search has been offering to translate the search phrase to English, and translate the results back to you.)
OTOH, it would be a nice feature to show translated page and category names when someone looks at the page with the interface language set to non-English.
OK, technical solution (hackish as usual, but with potential IMHO):
http://commons.wikimedia.org/w/index.php?title=Special:Search&search=Pfe...
Basically, this will (on the search page only!) look at the last query run (the one currently in the edit box), check several language editions of Wikipedia for articles from the individual words (in this case, "Pferd" and "Schach"), count how many exist, pick the language with the most hits (in this case, German), and put a link to Nikola's tool under the search box. The link pre-fills the source language and query in the tool, which automatically opens the appropriate search page.
In essence, clicking on the link gets you to the toolserver and back to the search, this time in English, without you noticing.
I am checking all the languages Nikola's tool offers (so no Estonian), except English (no point, really).
Experimenting, I noticed that even if your original query got you some results (e.g. "Schaufel"=47), the translation in English will give you more ("Shovel"=484).
I tried to restrict the language search to the languages accepted by the browser (so, using 1 or 2 queries instead of 32), but there appears to be no way in JavaScript to get that information. MediaWiki could pass it on, though...
Feel free to improve!
Cheers, Magnus
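For readers who want to picture the guessing step Magnus describes, here is a rough Python sketch of the same logic against the standard MediaWiki API (an illustration only, not the actual JavaScript gadget; the language list and function names are made up):

import requests

LANGS = ["de", "fr", "hu", "pl", "sr"]  # any subset of Wikipedia languages

def count_existing_articles(lang, words):
    # How many of the query words exist as articles on this Wikipedia?
    api = f"https://{lang}.wikipedia.org/w/api.php"
    params = {"action": "query", "titles": "|".join(words), "format": "json"}
    pages = requests.get(api, params=params).json()["query"]["pages"]
    # Pages that do not exist are reported with a "missing" key.
    return sum(1 for p in pages.values() if "missing" not in p)

def guess_language(query):
    # Pick the language whose Wikipedia recognises the most query words.
    words = query.split()
    hits = {lang: count_existing_articles(lang, words) for lang in LANGS}
    return max(hits, key=hits.get)

print(guess_language("Pferd Schach"))  # expected to pick "de"

The real script would then link to the translation tool with that language pre-filled, as described above.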
Magnus Manske <magnusmanske@...> writes:
Basically, this will (on the search page only!) look at the last query run (the one currently in the edit box), check several language editions of Wikipedia for articles from the individual words (in this case, "Pferd" and "Schach"), count how many exist, pick the language with the most hits (in this case, German), and put a link to Nikola's tool under the search box. The link pre-fills the source language and query in the tool, which automatically opens the appropriate search page.
Again, I would suggest using Google (or an alternative with open data, if one exists) instead of trying to reinvent the wheel:
http://translate.google.com/#auto%7Cen%7CPferd%20Schach http://code.google.com/apis/ajaxlanguage/documentation/#Detect
It might support fewer languages than we have Wikipedias for, but I'm pretty sure it would give better results for the major ones.
On Wed, Jun 23, 2010 at 3:17 PM, Tisza Gergo gtisza@gmail.com wrote:
Magnus Manske <magnusmanske@...> writes:
Basically, this will (on the search page only!) look at the last query run (the one currently in the edit box), check several language editions of Wikipedia for articles from the individual words (in this case, "Pferd" and "Schach"), count how many exist, pick the language with the most hits (in this case, German), and put a link to Nikola's tool under the search box. The link pre-fills the source language and query in the tool, which automatically opens the appropriate search page.
Again, I would suggest using Google (or an alternative with open data, if one exists) instead of trying to reinvent the wheel:
http://translate.google.com/#auto%7Cen%7CPferd%20Schach http://code.google.com/apis/ajaxlanguage/documentation/#Detect
It might support fewer languages than we have Wikipedias for, but I'm pretty sure it would give better results for the major ones.
Well, that's what I suggested a few mails ago in this very thread. However, people didn't seem to want it.
On 23 June 2010 15:34, Magnus Manske magnusmanske@googlemail.com wrote:
On Wed, Jun 23, 2010 at 3:17 PM, Tisza Gergo gtisza@gmail.com wrote:
Magnus Manske <magnusmanske@...> writes:
Basically, this will (on the search page only!) look at the last query run (the one currently in the edit box), check several language
Again, I would suggest using Google (or an alternative with open data, if one exists) instead of trying to reinvent the wheel:
Well, that's what I suggested a few mails ago in this very thread. However, people didn't seem to want it.
Reliance on Google for what is really an essential function for those who aren't native English speakers is problematic because it's (a) third-party (b) closed. Same reason we don't use reCaptcha.
- d.
On 23 Jun 2010, at 16:23, David Gerard wrote:
Reliance on Google for what is really an essential function for those who aren't native English speakers is problematic because it's (a) third-party (b) closed. Same reason we don't use reCaptcha.
I always think that not using reCaptcha is a shame, as it's a nice way to get people to proofread text in a reasonably efficient way. It would be really nice if someone could create something similar that proofreads OCR'd text from Wikisource... <hint, hint>.
Mike
On 23 Jun 2010, at 16:23, David Gerard wrote:
Reliance on Google for what is really an essential function for those who aren't native English speakers is problematic because it's (a) third-party (b) closed. Same reason we don't use reCaptcha.
On the other hand, do we really have to _rely_ on reCaptcha? If their servers aren't working, use the ordinary captcha. We could proofread books and still not rely on any external servers.
--- On Wed, 23 Jun 2010, Michael Peel email@mikepeel.net wrote:
I always think that not using reCaptcha is a shame, as it's a nice way to get people to proofread text in a reasonably efficient way. It would be really nice if someone could create something similar that proofreads OCR'd text from Wikisource... <hint, hint>.
And how do you decide that what was entered is wrong or right?
Better take a look at Project Gutenberg's Distributed Proofreaders[1].
Cheers, MarianoC.-
[1] http://pgdp.net
(Renaming the subject as we've changed topic)
On 23 Jun 2010, at 21:31, Mariano Cecowski wrote:
--- On Wed, 23 Jun 2010, Michael Peel email@mikepeel.net wrote:
I always think that not using reCaptcha is a shame, as it's a nice way to get people to proofread text in a reasonably efficient way. It would be really nice if someone could create something similar that proofreads OCR'd text from Wikisource... <hint, hint>.
And how do you decide that what was entered is wrong or right?
Better take a look at Project Gutenberg's Distributed Proofreaders[1].
Cheers, MarianoC.-
[1] http://pgdp.net
My understanding is that original text within the reCAPTCHA is shown to several different people; if they agree then the word is counted as correct. Looking at the Wikipedia article, it's a little more complex than that: http://en.wikipedia.org/wiki/ReCAPTCHA There's a reason why there are two words to solve during a reCAPTCHA.
What Distributed Proofreaders can do, Wikisource can do - but in a Wiki environment. If you haven't checked out the proofreading features that Wikisource now has, I would encourage you to give them a go, e.g. at: http://en.wikisource.org/wiki/Page:Frederic_Shoberl_-_Persia.djvu/92
Mike
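A toy model of the agreement step described above, for anyone who wants to picture it (heavily simplified: real reCAPTCHA pairs a known control word with the unknown one and weights answers, which this does not attempt):

from collections import Counter

def accept_word(answers, min_votes=3, min_share=0.75):
    # Accept a transcription once enough independent answers agree on it.
    if len(answers) < min_votes:
        return None  # not enough data yet
    word, votes = Counter(a.strip().lower() for a in answers).most_common(1)[0]
    return word if votes / len(answers) >= min_share else None

print(accept_word(["morning", "morning", "mourning", "morning"]))  # 'morning'
print(accept_word(["cat", "cot"]))  # None - still undecided

This is loosely analogous to how Wikisource's "validated" status asks a second editor to confirm an already proofread page.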
I love those proofreading features, and the new default layout for a book's pages and TOC. Wikisource is becoming AWESOME.
Do we have PGDP contributors who can weigh in on how similar the processes are? Is there a way for us to actually merge workflows with them?
Prof. Greg Crane of The Perseus Project @ Tufts is looking to upload a few score classical manuscripts, and perhaps eventually their whole corpus, into Wikisource -- but they want better multilingual proofreading and annotation tools (which they are also considering developing. Hear, hear!) All of this work needs a bit more visibility.
SJ
On 24 June 2010 15:37, Samuel Klein meta.sj@gmail.com wrote:
I love those proofreading features, and the new default layout for a book's pages and TOC. Wikisource is becoming AWESOME.
Ahem. Even more awesome, you mean. :-)
Do we have PGDP contributors who can weigh in on how similar the processes are? Is there a way for us to actually merge workflows with them?
Disclaimer - my PGDP account dates from 2004, but I only get involved in fits every couple of years. This should be seen mostly as an "outsider's" viewpoint. :-)
IME, PGDP's processes are /seriously/ heavy-weight, burning lots of worker time on 2nd or even 3rd-level passes, and multiple tiers of work (Proofreading, Formatting, and all the special management levels for people running projects). The pyramid of processes has grown so great that it seems to have crashed in on itself - there's a huge dearth of people at the "higher" levels (you need to qualify at the lower levels before the system will let you contribute to the activities at the end). It's generally quite "unwiki".
I think Wikisource's model is a great deal more lightweight than PGDP's - and that we really don't want to push Wikisource down that route. :-) Unfortunately I think that this means linking the two up might prove challenging - and there's also a danger that people may jump ship, damaging PGDP still further and making them upset with us.
J.
--- On Thu, 6/24/10, James Forrester james@jdforrester.org wrote:
IME, PGDP's processes are /seriously/ heavy-weight, burning lots of worker time on 2nd or even 3rd-level passes, and multiple tiers of work (Proofreading, Formatting, and all the special management levels for people running projects). The pyramid of processes has grown so great that it seems to have crashed in on itself - there's a huge dearth of people at the "higher" levels (you need to qualify at the lower levels before the system will let you contribute to the activities at the end). It's generally quite "unwiki".
I think Wikisource's model is a great deal more lightweight than PGDP's - and that we really don't want to push Wikisource down that route. :-) Unfortunately I think that this means linking the two up might prove challenging - and there's also a danger that people may jump ship, damaging PGDP still further and making them upset with us.
I definitely wouldn't want to see Wikisource move to a more heavyweight structure. Right now it is easy for anyone completely unfamiliar with the nuts and bolts of setting up a text to show up at the Proofread of the Month, validate a single page, and then have nothing further to do with the text. Seldom do you even need to deal with formatting when you are validating an already proofread page. I think it is important to keep this very simple. I would really encourage anyone who has never participated to try it out [1]
Of course, we don't really have any push to focus on a "finished" release like PGDP must have. And this eventualism has the usual results even as it keeps the structure lightweight.
Linking up with PGDP texts is mostly avoided at en.WS because it is so often impossible to match their texts with a specific edition, which we need in order to attach scanned images. It has become easier to just start from scratch with a file we can more easily put through the Proofread Page extension. Their more rigid structure makes edition verification after release unnecessary for them, but it is very important for us since our structure is so open. It is difficult to see how we might help one another given such basic incompatibilities in structure.
Birgitte SB
[1] http://en.wikisource.org/wiki/Index:Frederic_Shoberl_-_Persia.djvu Click on any yellow-highlighted number. Validate the wikitext against the image. Edit the page to make changes (if necessary) and to move the radio button to validated.
On Thu, Jun 24, 2010 at 11:16 AM, James Forrester james@jdforrester.org wrote:
On 24 June 2010 15:37, Samuel Klein meta.sj@gmail.com wrote:
I love those proofreading features, and the new default layout for a book's pages and TOC. Wikisource is becoming AWESOME.
Ahem. Even more awesome, you mean. :-)
It used to be just lowercase awesome... THINGS HAVE CHANGED. >:-)
Disclaimer - my PGDP account dates from 2004, but I only get involved in fits every couple of years.
Could you ask some of the wiki-savvy continuously active proofreaders to join this discussion for a little while? I like the work PGDP does, and bet we can find a way to support and amplify it.
SJ
On Thu, Jun 24, 2010 at 4:37 PM, Samuel Klein meta.sj@gmail.com wrote:
I love those proofreading features, and the new default layout for a book's pages and TOC. Wikisource is becoming AWESOME.
Do we have PGDP contributors who can weigh in on how similar the processes are? Is there a way for us to actually merge workflows with them?
I am quite active on PGDP, but not on Wikisource, so I can tell you how things work there, but not how similar it is to Wikisource.
Typical of the PGDP workflow are an emphasis on quality above quantity (exemplified in running not 1 or 2 but 3 rounds of human checking of the OCR result - correctness in copying is well above 99.99% for most books) and work being done in page-size chunks rather than whole books, chapters, paragraphs, sentences, words or whatever else one could think of.
There are a number of people involved, although people can and often do fill several roles for one book.
First, there is the Content Provider (CP).
He or she first contacts Project Gutenberg to get a clearance. This is basically a statement from PG that they believe the work is out of copyright. In general, US copyright is what is taken into account for this, although there are also servers in other countries (Canada and Australia as far as I know), which publish some material that is out of copyright in those countries even if it is not in the US. Such works do not go through PGDP, but may go through its sister projects DPCanada or DPEurope.
Next, the CP will scan the book, or harvest the scans from the web, and run OCR on them. They will usually also write a description of the book for the proofreaders, so those can see whether they are interested. The scans and the OCR are uploaded to the PGDP servers, and the project is handed over to the Project Manager (PM) (although in most cases CP and PM are the same person).
The Project Manager is responsible for the project in the next stages. This means:
* specifying the rules and guidelines that are to be followed when proofreading the book, at least where those differ from the standard guidelines
* answering questions from proofreaders
* keeping the good and bad words lists up to date; these are used in wordcheck (a kind of spellchecker) to decide which words it considers correct or incorrect
The project then goes through a number of rounds. The standard number is 5 rounds, of which 3 are proofreading and 2 are formatting, but it is possible for the PM to make a request to skip one or more rounds or go through a round twice.
In the first three (proofreading) rounds, a proofreader requests one page at a time, compares the OCR output (or the previous proofreader's output) with the scan, and changes the text to correspond to the scan. In the first round (P1) everyone can do this, the second round (P2) is only accessible to those who have been at the site some time and done a certain number of pages (21 days and 300 pages, if I recall correctly), and for the third round (P3) one has to qualify. For qualification one's P2 pages are checked (using the subsequent edits of P3). The norm is that one should not leave more than one error per five pages.
After the three (or two or four) rounds of proofing, the foofing (formatting) rounds are gone through. In these, again a proofreader (now called a formatter) requests and edits one page at a time, but where the proofreaders dealt with copying the text as precisely as possible, the formatter deals with all other aspects of the work. They denote when text is italic, bold or otherwise in a special format, which texts are chapter headers, how tables are laid out, etcetera. Here there are two rounds, although the second one can be skipped or a round duplicated, like before. The first formatting round (F1) has the same entrance restrictions as P2; F2 has a qualification system comparable to P3.
After this, the PM passes the book on to the Post-Processor (PP). Again, this is often the same person, but not always. In some cases the PP has already been appointed; in others the book will sit in a pool until picked up by a willing PP. The PP does all that is needed to get from the F2 output to something that can be put on Project Gutenberg: they recombine the pages into one work, move stuff around where needed, change the formatters' mark-up into something that's more appropriate for reading, in most cases generate an HTML version, etcetera.
A PP who has already post-processed several books well can then send the result to PG. In other cases, the book will go to the PPV (Post-Processing Verifier), an experienced PP, who checks the PP's work and gives them hints on what should be improved or makes those improvements themselves.
Finally, if the PP or PPV sends the book to PG, there is a whitewasher who checks the book once again; however, that is outside the scope of this (already too long) description, because it belongs to PG's process rather than PGDP's.
To stop the rounds from overcrowding with books, there are queues for each round, containing books that are ready to enter the round, but have not yet done so. To keep some variety, there are different queues by language and/or subject type. A problem with this has been that the later rounds, having less manpower because of the higher standards required, could not keep up with P1 and F1. There has been work to do something about it, and the P2 queues have been brought down to decent size, but in P3 and F2 books can literally sit in the queues for years, and PP still is a bottleneck as well.
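For readers who would like the pipeline at a glance, the standard sequence described above can be summarised roughly like this (a simplification: round skipping and repeating, the per-language queues and the DPCanada/DPEurope variants are left out):

# Rough outline of the standard PGDP pipeline described above. Python is used
# here only as a convenient notation; the stage codes follow the text.
PIPELINE = [
    ("CP",  "obtain clearance, scan or harvest pages, run OCR"),
    ("P1",  "proofread against the scan (open to everyone)"),
    ("P2",  "second proofreading pass (time and page-count requirement)"),
    ("P3",  "third proofreading pass (qualification required)"),
    ("F1",  "formatting: italics, headers, tables, etc."),
    ("F2",  "second formatting pass (qualification required)"),
    ("PP",  "post-processing: recombine pages, produce text/HTML"),
    ("PPV", "verification of post-processing, if the PP is not yet trusted"),
    ("PG",  "whitewasher check and publication on Project Gutenberg"),
]

for stage, description in PIPELINE:
    print(f"{stage:>3}: {description}")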
Andre, this is a great summary -- I've linked to it from the english ws Scriptorium.
Do you see opportunities for the two projects to coordinate their wofklows better?
SJ
On Wed, Jun 30, 2010 at 5:49 AM, Samuel Klein sjklein@hcs.harvard.edu wrote:
Andre, this is a great summary -- I've linked to it from the english ws Scriptorium.
Do you see opportunities for the two projects to coordinate their wofklows better?
^^^^^^^ Clearly this email needed 1 more round of human checking.
SJ
On Wed, Jun 30, 2010 at 7:49 PM, Samuel Klein sjklein@hcs.harvard.edu wrote:
Andre, this is a great summary -- I've linked to it from the english ws Scriptorium.
Do you see opportunities for the two projects to coordinate their wofklows better?
I don't understand your use of 'coordinate' in this context.
Wikisource has a very lax workflow (it's a wiki): it publishes the scans & text immediately, irrespective of whether it is verified, OCR quality, or if it is vandalism. However, Wikisource keeps the images and the text unified from day 0 to eternity.
PGDP has a very strict and arduous workflow, big projects end up stuck in the rounds (the remaining EB projects are a great example), and they are not published until they make it out of the rounds. The result is quality; however, only the text is sent downstream.
Wikisource and PGDP don't interoperate. We *could*, but when I looked at importing a PGDP project into Wikisource, I put it in the too hard basket.
Wikisource is trying to become a credible competitor to PGDP. However, this isn't a zero-sum game. If the Wikisource projects succeed in demonstrating the wiki way is a viable approach, the result is different people choosing to work in different workflows/projects, and more reliable etexts being produced.
-- John Vandenberg
On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg jayvdb@gmail.com wrote:
irrespective of whether it is verified, OCR quality, or if it is vandalism. However, wikisource keeps the images and the text unified from day 0 to eternity.
Some works become verified, and reach high OCR quality.
PGDP has a very strict and arduous workflow... The result is quality; however, only the text is sent downstream.
Why not send images and text downstream?
Wikisource and PGDP don't interoperate. We *could*, but when I looked at importing a PGDP project into Wikisource, I put it in the too hard basket.
That's what I mean by 'coordinate'. "hard" here seems like a one-time hardship followed by a permanent useful coordination.
Wikisource is trying to become a credible competitor to PGDP.
Perhaps we have competing interfaces / workflows. But I expect we would be glad to share 99.99%-verified high-quality texts-unified-with-images if it were easy for both projects to identify that combination of quality and comprehensive data... and would be glad to share metadata so that a WS editor could quickly check to see if there's a PGDP effort covering an edition of the text she is proofing; and vice-versa.
I want us to get better, faster, less held up by the idea of coordinating with other projects, because there are much larger projects out there worthy of coordinating with. The annotators who work on the Perseus Project come to mind... but that's truly a harder problem than this one.
If the Wikisource projects succeed in demonstrating the wiki way is a viable approach, the result is different people choosing to work in different workflows/projects, and more reliable etexts being produced.
Absolutely.
SJ
On Wed, Jun 30, 2010 at 8:42 PM, Samuel J Klein sj@wikimedia.org wrote:
On Wed, Jun 30, 2010 at 6:13 AM, John Vandenberg jayvdb@gmail.com wrote:
irrespective of whether it is verified, OCR quality, or if it is vandalism. However, wikisource keeps the images and the text unified from day 0 to eternity.
Some works become verified, and reach high OCR quality.
PGDP has a very strict and arduous workflow... The result is quality; however, only the text is sent downstream.
Why not send images and text downstream?
Good question! ;-) Storage is one issue. It would be interesting to estimate the storage requirements of Wikisource if we had produced the PGDP etexts.
Wikisource and PGDP don't interoperate. We *could*, but when I looked at importing a PGDP project into Wikisource, I put it in the too hard basket.
That's what I mean by 'coordinate'. "hard" here seems like a one-time hardship followed by a permanent useful coordination.
They don't have an 'export' function, and I doubt they are going to build one so that they can interoperate with us.
My 'import' function was a scraper; not something that can be used on a large scale without their permission.
In the end, it is simpler to avoid starting WS projects that would duplicate unfinished PGDP projects. There are plenty of works that have not been transcribed yet ;-)
Wikisource is trying to become a credible competitor to PGDP.
Perhaps we have competing interfaces / workflows.
This is like saying that Wikipedia and Britannica have competing interfaces / workflows.
The wikisource workflow is a *symptom* of it being a "wiki", with all that entails. There is a lot more than merely the workflow which distinguishes the two projects.
.. but I expect we would be glad to share 99.99%-verified high-quality texts-unified-with-images if it were easy for both projects to identify that combination of quality and comprehensive data.
Good luck with that.
PGDP publishes etexts via PG.
If PGDP gives images+text to Wikisource for projects that are stuck in their rounds, it becomes published online immediately at whatever stage it is at - it's a wiki. That is at odds with the objective of PGDP, unless they are completely abandoning the project.
It is more likely that PGDP will release images+text at the same time they publish the etext to PG. The best way for PGDP to do this is to produce a djvu with images and verified text, and then upload it to archive.org so everyone benefits.
and would be glad to share metadata so that a WS editor could quickly check to see if there's a PGDP effort covering an edition of the text she is proofing; and vice-versa.
IIRC, obtaining the list of ongoing PGDP projects requires a PGDP account, but anyone can create an account.
The WS project list is in google. ;-)
-- John Vandenberg
On Wed, Jun 30, 2010 at 1:24 PM, John Vandenberg jayvdb@gmail.com wrote:
Good question! ;-) Storage is one issue. It would be interesting to estimate the storage requirements of Wikisource if we had produced the PGDP etexts.
I think it is the main reason; however, a back-of-the-envelope calculation (20,000 books, 300 pages each, 100 kB per page; the first is quite a good estimate, the other two could be a factor of 2 off) tells me that the total storage requirement would be measured in hundreds of gigabytes - which means that one or two state-of-the-art hard disks should be enough to contain it.
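For what it's worth, here is that arithmetic written out as a tiny script (the three input numbers are just the rough guesses above, not measurements):

    # Back-of-the-envelope storage estimate for hosting page scans of all PGDP etexts.
    # All three inputs are rough guesses from the discussion, not measured values.
    books = 20000          # PGDP etexts produced so far (good estimate)
    pages_per_book = 300   # could be a factor of 2 off
    kb_per_page = 100      # compressed page image; could also be a factor of 2 off

    total_kb = books * pages_per_book * kb_per_page
    total_gb = total_kb / (1024.0 * 1024.0)
    print("about %.0f GB" % total_gb)   # roughly 570 GB, i.e. hundreds of gigabytes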
They don't have an 'export' function, and I doubt they are going to build one so that they can interoperate with us.
My 'import' function was a scraper; not something that can be used in a large scale without their permission.
On the other hand, if you _do_ get permission, there might well be a more elegant ftp-based method.
The wikisource workflow is a *symptom* of it being a "wiki", with all that entails. There is a lot more than merely the workflow which distinguishes the two projects.
Certainly. I think the deeper-lying difference is one of attitude, which, as you write, is for WS a symptom of being a wiki. As a wiki, WS uses attitudes/principles like "make it easy for people to contribute", "publish early, publish often", and "let people do what they want, as long as it's a step forward, however small". PGDP, on the other hand, derives its attitudes/principles from the wish to create high-quality end products. As such it uses "check and double-check", "limit the number of projects we work on", "quality control" and "division of tasks".
On Wed, Jun 30, 2010 at 12:42 PM, Samuel J Klein sj@wikimedia.org wrote:
PGDP has a very strict and arduous workflow... The result is quality; however, only the text is sent downstream.
Why not send images and text downstream?
Because PGDP produces for Project Gutenberg, which publishes text and HTML versions, not scans.
Perhaps we have competing interfaces / workflows, but I expect we would be glad to share 99.99%-verified high-quality texts-unified-with-images if it were easy for both projects to identify that combination of quality and comprehensive data... and would be glad to share metadata so that a WS editor could quickly check to see if there's a PGDP effort covering an edition of the text she is proofing; and vice-versa.
For the PGDP side, it's possible to check at PGDP itself (one will need to get a login for that, but it's as free and unencumbered as the same on Wikimedia), but there is also a useful superset at http://www.dprice48.freeserve.co.uk/GutIP.html (warning: I'm talking about a 7-megabyte HTML file here). It contains, sorted by author (books with more than one author are listed multiple times), all books that have a clearance for Project Gutenberg.
For cooperation, one idea could be to get the PGDP material either after the P3 stage or after the F2 stage. As long as a project is still active, it isn't hard at all to get both the text and the scan pages.
Perhaps we have competing interfaces / workflows, but I expect we would be glad to share 99.99%-verified high-quality texts-unified-with-images if it were easy for both projects to identify that combination of quality and comprehensive data... and would be glad to share metadata so that a WS editor could quickly check to see if there's a PGDP effort covering an edition of the text she is proofing; and vice-versa.
As John was saying, right now there's plenty of stuff to be transcribed and proofread, so it is not that easy to end up duplicating work ;-)
The issue of metadata is nonetheless serious, because it's one of the most important flaws of Wikisource: not applying standards (e.g. Dublin Core) and not having proper tools to export, import, and harvest metadata still makes us look like amateurs, at least to "real" digital libraries (which focus mainly on the metadata side, and provide either texts or images - it is really rare to have both).
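To make that concrete, this is roughly what metadata interoperability looks like for "real" digital libraries: Dublin Core records exposed over OAI-PMH, which any harvester can pull with a plain HTTP request. A minimal sketch follows; the base URL and record identifier are made up for illustration, since Wikisource exposes no such endpoint today.

    # Minimal OAI-PMH harvesting sketch: fetch one Dublin Core record over HTTP.
    # The base URL and identifier below are hypothetical placeholders.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_BASE = "https://wikisource.example.org/oai"                # hypothetical endpoint
    IDENTIFIER = "oai:wikisource.example.org:Index:Some_Book.djvu" # hypothetical record id

    url = (OAI_BASE + "?verb=GetRecord&metadataPrefix=oai_dc"
           "&identifier=" + urllib.parse.quote(IDENTIFIER))
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)

    # Dublin Core elements live in this namespace in an oai_dc record.
    DC = "{http://purl.org/dc/elements/1.1/}"
    for field in ("title", "creator", "date", "language", "identifier"):
        for element in tree.iter(DC + field):
            print("%s: %s" % (field, element.text))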
I want us to get better, faster, less held up by the idea of coordinating with other projects, because there are much larger projects out there worthy of coordinating with. The annotators who work on the Perseus Project come to mind... but that's truly a harder problem than this one.
The Perseus Project is an *amazing* project, but I regard them as far ahead of us. The PP is actually a Virtual Research Environment, with tools for scholars and researchers for studying texts (concordances and similar things).
It happens that I have just finished my Master's thesis on collaborative digital libraries for scholars (in the Italian context), and the outcome is quite clear: researchers do want collaborative tools in DLs, but wiki systems are too simple and (right now) too naive to really help scholars in their work (and there are a lot of other issues I'm not going to explain here).
I would love to have PP people involved in collaboration with Wikisource, I just don't know if this is possible.
If the Wikisource projects succeed in demonstrating that the wiki way is a viable approach, the result will be different people choosing to work in different workflows/projects, and more reliable etexts being produced.
It is interesting because a project similar to PGDP (it is Italian and started in 1993, emulating the glorious PG, just with Italian texts) is, right now, moving to a wiki. Although the scale is way smaller, Wikipedia and Wikisource showed them a system which tends to eliminate bottlenecks, and for them this is becoming crucial. Luckily, the relationships with the Italian Wikisource are really good, and they'll probably share an office with Wikimedia Italy, in October. The interesting fact is that the offices will be within a library ;-), so I really expect a collaboration there.
Just one more thing: why hasn't this awesome thread been linked on source-l? That would probably have been the best place to discuss it.
My regards, Aubrey
Hello Aubrey,
On Thu, Jul 15, 2010 at 7:26 PM, Aubrey zanni.andrea84@gmail.com wrote:
The issue of metadata is nonetheless serious, because it's one of the most important flaws of Wikisource: not applying standards (e.g. Dublin Core) and not having proper tools to export, import, and harvest metadata
Both good points. Are there proposals on wikisource to address these two points in a way that's friendly to wikisource contributors?
I want us to get better, faster, less held up by the idea of coordinating with other projects, because there are much larger projects out there worthy of coordinating with. The annotators who work on the Perseus Project come to mind... but that's truly a harder problem than this one.
The Perseus Project is an *amazing* project, but I regard them as far ahead of us. The PP is actually a Virtual Research Environment, with tools for scholars and researchers for studying texts (concordances and similar things).
I would love to have PP people involved in collaboration with Wikisource, I just don't know if this is possible.
Yes, PP is ahead of us in some ways. But in other ways they have run into bottlenecks and multilingual issues that a wiki environment can resolve.
I believe that Prof. Greg Crane of the Perseus Project (cc:ed here) is interested in starting to collaborate with Wikisource, even while pursuing ideas about developing a larger framework for wiki-style annotations and editions.
While it may be hard in the short term, in the long term that's what I think we all want wikisource to become.
It is interesting because a project similar to PGDP (it is Italian and started in 1993, emulating the glorious PG, just with Italian texts) is, right now, moving to a wiki. Although the scale is way smaller, Wikipedia and Wikisource showed them a system which tends to eliminate bottlenecks, and for them this is becoming crucial.
Luckily, the relationships with the Italian Wikisource are really good, and they'll probably share an office with Wikimedia Italy, in October. The interesting fact is that the offices will be within a library ;-), so I really expect a collaboration there.
Wow. This is all great to hear -- can you include a link to the project? I'd like to blog about it.
Warmly, SJ
Samuel J Klein, 16/07/2010 21:49:
On Thu, Jul 15, 2010 at 7:26 PM, Aubrey zanni.andrea84@gmail.com wrote:
Luckily, the relationships with the Italian Wikisource are really good, and they'll probably share an office with Wikimedia Italy, in October. The interesting fact is that the offices will be within a library ;-), so I really expect a collaboration there.
Wow. This is all great to hear -- can you include a link to the project? I'd like to blog about it.
There is some info in the newly published Wikimedia News no. 29 (WMI report January-July 2010): http://www.wikimedia.it/index.php/Wikimedia_news/numero_29/en#Wikimedia_Roma
Nemo
On 23 June 2010 21:31, Mariano Cecowski marianocecowski@yahoo.com.ar wrote:
--- On Wed, 23 Jun 2010, Michael Peel email@mikepeel.net wrote:
I always think that not using reCaptcha is a shame, as it's a nice way to get people to proofread text in a reasonably efficient way. It would be really nice if someone could create something similar that proofreads OCR'd text from Wikisource... <hint, hint>.
And how do you decide that what was entered is wrong or right?
It turns out that having several randomly-selected people check a given recaptcha is very accurate indeed.
http://recaptcha.net/learnmore.html
"But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
Your question is similar to "But if anyone can edit Wikipedia, how do you know what's entered will be accurate?"
- d.
On Wednesday 23 June 2010 16:34:26, Magnus Manske wrote:
On Wed, Jun 23, 2010 at 3:17 PM, Tisza Gergo gtisza@gmail.com wrote:
Again, I would suggest using Google (or an alternative with open data, if one exists) instead of trying to reinvent the wheel:
http://translate.google.com/#auto%7Cen%7CPferd%20Schach http://code.google.com/apis/ajaxlanguage/documentation/#Detect
It might support fewer languages than we have Wikipedias for, but I'm pretty sure it would give better results for the major ones.
Well, that's what I suggested a few mails ago in this very thread. However, people didn't seem to want it.
This tool of mine does use Google Translate, so it could probably be done fully in JavaScript, if someone knows how.
Like I said before, if I can get some template support on Commons, I've got a translation tool that uses one of Google's APIs for translating. I just need some assistance figuring out how best to integrate it into Commons. But I do have an on-demand mass translation tool.
John
On Wed, Jun 23, 2010 at 10:17 AM, Tisza Gergo gtisza@gmail.com wrote:
Magnus Manske <magnusmanske@...> writes:
Basically, this will (on the search page only!) look at the last query run (the one currently in the edit box), check several language editions of Wikipedia for articles matching the individual words (in this case, "Pferd" and "Schach"), count how many exist, pick the language with the most hits (in this case, German), and put a link to Nikola's tool under the search box. The link pre-fills the source language and query in the tool, which automatically opens the appropriate search page.
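(A rough sketch of that heuristic, using the standard MediaWiki API to test whether each word exists as an article title in a handful of Wikipedias; the language list and scoring here are simplifications, not the gadget's real code.)

    # Guess a query's language by counting which Wikipedia has articles for its words.
    # Simplified sketch of the heuristic described above, not the actual gadget.
    import json
    import urllib.parse
    import urllib.request

    LANGS = ["en", "de", "fr", "es", "it", "nl"]  # arbitrary subset for illustration

    def existing_titles(lang, words):
        """Count how many of the words exist as article titles on lang.wikipedia.org."""
        url = ("https://%s.wikipedia.org/w/api.php?action=query&format=json&titles=%s"
               % (lang, urllib.parse.quote("|".join(words))))
        request = urllib.request.Request(url, headers={"User-Agent": "lang-guess-sketch/0.1"})
        with urllib.request.urlopen(request) as response:
            pages = json.load(response)["query"]["pages"]
        return sum(1 for page in pages.values() if "missing" not in page)

    def guess_language(query):
        words = query.split()
        scores = dict((lang, existing_titles(lang, words)) for lang in LANGS)
        return max(scores, key=scores.get)

    print(guess_language("Pferd Schach"))  # expected to print "de"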
Again, I would suggest using Google (or an alternative with open data, if one exists) instead of trying to reinvent the wheel:
http://translate.google.com/#auto%7Cen%7CPferd%20Schach http://code.google.com/apis/ajaxlanguage/documentation/#Detect
It might support fewer languages than we have Wikipedias for, but I'm pretty sure it would give better results for the major ones.
Oh, this function is very interesting. If it were coupled with a function to get synonyms and metonyms (e.g. equidae, mount) as a proposal to enlarge or explore a concept, then a semantic map would be created to navigate Commons in all languages. Maybe context-related or frequently-associated keywords would be useful too.
On 23/06/2010 05:51, John Doe wrote:
the basic translation matrix is in place, here is how you say horse in as many languages as you can: http://commons.wikimedia.org/w/index.php?title=User:%CE%94/Sandbox&oldid...
John
On Tue, Jun 22, 2010 at 7:56 PM, John Doe phoenixoverride@gmail.com wrote:
Since I'm a fairly active programmer, I have some code sitting around. If I can get some support on Commons with regards to templates (something that gives me nightmares) I could probably get a translation matrix program up and running within 24-48 hours. I would just need to figure out a good method for tracking what needs to be translated, what has been machine-translated and needs review, and what has already been translated.
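(One simple way to model those three states is sketched below; the category names are made up for illustration and would need to be agreed on with the Commons folks first.)

    # Sketch of tracking translation state with three buckets (category names are invented).
    from enum import Enum

    class TranslationState(Enum):
        NEEDS_TRANSLATION = "Category:Descriptions needing translation"
        MACHINE_TRANSLATED = "Category:Machine-translated descriptions needing review"
        HUMAN_REVIEWED = "Category:Translated descriptions"

    def next_state(state):
        """Advance a description one step through the workflow, stopping at the last state."""
        order = list(TranslationState)
        index = order.index(state)
        return order[min(index + 1, len(order) - 1)]

    state = TranslationState.NEEDS_TRANSLATION
    state = next_state(state)   # after the bot machine-translates the description
    state = next_state(state)   # after a human reviews and fixes the translation
    print(state.value)          # Category:Translated descriptions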
John
On Tue, Jun 22, 2010 at 5:26 PM, Mark Williamson node.ue@gmail.com wrote:
If we consider that current English native speakers mostly already have internet and those without internet are likelier than not to be non-English speakers, I would be careful to advocate the unilateral use of English.
As would I, though I don't think you mean what you said.
Why not? To me, it means that we're widening the digital divide by making it so that people who don't have the internet would have little use for it anyways if it's all written in a language they don't understand.
m.
On Tue, Jun 22, 2010 at 9:33 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 22, 2010 at 1:47 PM, Bence Damokos bdamokos@gmail.com wrote:
On Tue, Jun 22, 2010 at 6:32 PM, Anthony wikimail@inbox.org wrote:
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia ( http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
If I read the data in the article correctly, most means 35%.
Since "most" means more than 50%, I don't think you read it correctly. The 35% figure seems to be only native English speakers.
According to the Merriam-Webster dictionary, 'majority' is only one of the meanings of 'most' (the primary being 'greatest in quantity, extent or degree'); if you look at the second table, which seems to account for non-native-speaker internet users as well, English still gets about a 30% share of total users.
Although, the linked Wikipedia article could use some improvement...
Best regards, Bence
I know 'horse', but yesterday it took me five minutes to remember that 'sparrow' was the name of the bird I wanted to mention.
It would help this discussion, to some extent, if native English speakers kept in mind that it is sometimes not as easy for foreign learners as you natives expect. It's no sarcasm at all. Really.
On Tue, Jun 22, 2010 at 6:32 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 22, 2010 at 9:21 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English.
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia (http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
In addition, I have a feeling that article overstates the English abilities of the average non-native internet user. Yes, lots of people have a very (very!) basic command of English, but that is not the same as functional bilingualism. A user may happen to know the name for a horse, but what are the chances a casual user from Peru knows the name for an anteater, a giraffe or a jellyfish?
On Tue, Jun 22, 2010 at 10:58 AM, Aphaia aphaia@gmail.com wrote:
I know 'horse', but yesterday it took me five minutes to remember that 'sparrow' was the name of the bird I wanted to mention.
It would help this discussion, to some extent, if native English speakers kept in mind that it is sometimes not as easy for foreign learners as you natives expect. It's no sarcasm at all. Really.
On Tue, Jun 22, 2010 at 6:32 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 22, 2010 at 9:21 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
When you think that Commons is bad in supporting other languages, try to find pictures of a horse on the internet in other languages like Estonian, Nepalese ... It is not the same at all as when you are looking for images in English.
Don't most Internet users know enough English to be able to search for "pictures of a horse" in English?
(According to Wikipedia (http://en.wikipedia.org/wiki/Global_Internet_usage), yes... "Most Internet users speak the English language as a native or secondary language.")
-- KIZU Naoko http://d.hatena.ne.jp/Britty (in Japanese) Quote of the Day (English): http://en.wikiquote.org/wiki/WQ:QOTD
Mark Williamson wrote:
In addition, I have a feeling that article overstates the English abilities of the average non-native internet user. Yes, lots of people have a very (very!) basic command of English, but that is not the same as functional bilingualism. A user may happen to know the name for a horse, but what are the chances a casual user from Peru knows the name for an anteater, a giraffe or a jellyfish?
There is a greater chance that the average Peruvian will know the English name than the Latin one. They'll probably know the local common name, so one should ensure that they can at least find a picture of the critter by that name.
Mark Williamson wrote:
In addition, I have a feeling that article overstates the English abilities of the average non-native internet user. Yes, lots of people have a very (very!) basic command of English, but that is not the same as functional bilingualism. A user may happen to know the name for a horse, but what are the chances a casual user from Peru knows the name for an anteater, a giraffe or a jellyfish?
Amusingly enough, a former student of Martin Luther, by the name of Michael Agricola, faced this problem when translating the Bible into Finnish in the 16th century.
Yes, Virginia, the Finnish language really didn't exist as a written language until the 16th century.
Michael's solution to the knotty problem of how to describe animals the common folk had not really had any experience of was to rely on the most conspicuous visual feature, which often ended up mildly humorous to later readers. An ostrich he dubbed what would be literally "stork-camel" ("kamelikurki"). Lion, in a more amusing coinage, became to Michael "a noble deer" ("jalopeura"), going with the color of the pelt despite the fact that lions are hardly ruminants.
Yours,
Jussi-Ville Heiskanen
On Wed, Jun 23, 2010 at 3:01 AM, Jussi-Ville Heiskanen <cimonavaro@gmail.com> wrote:
Lion in a more amusing coinage was to Michael "a noble deer" ("jalopeura"), going with the color of the pelt despite the fact that lions are hardly ruminants.
Yours,
Jussi-Ville Heiskanen
Not to mention the cats not cattle thing. A pride versus a herd is a world of difference in the realm of collective connotation.
Oh, I misread that. Disregard.
On Wed, Jun 23, 2010 at 3:13 AM, Keegan Peterzell keegan.wiki@gmail.com wrote:
On Wed, Jun 23, 2010 at 3:01 AM, Jussi-Ville Heiskanen < cimonavaro@gmail.com> wrote:
Lion in a more amusing coinage was to Michael "a noble deer" ("jalopeura"), going with the color of the pelt despite the fact that lions are hardly ruminants.
Yours,
Jussi-Ville Heiskanen
Not to mention the cats not cattle thing. A pride versus a herd is a world of difference in the realm of collective connotation. -- ~Keegan