2012/1/6 Platonides platonides@gmail.com:
Integrating this into the ConfirmEdit extension shouldn't be hard. It's the extra features that make this tricky. This system is interesting for gathering transcriptions, but doesn't work for verifying that the answer is right. How would you verify that? The approach that comes to my mind is to show the current captcha plus another, optional, captcha, with a note about how filling in that second captcha helps Wikisource, and that the answer will be logged with their username/IP.
ReCAPTCHA already works in a similar way. Two words are presented, but only one is known and actually serves to filter access. Answers for both words are collected, and if the test on the first is passed (which indicates a human) then the answer for the second is recorded. When a certain number of people agree on the transcription of a previously unknown word, that transcription is taken as good and used in the future as a filter word. We could likewise accept a transcription as "valid" after N people agree on a given word, put only the validated words back on Wikisource, and use them as filter words too. This seems both reliable and easy to implement to me.
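Just to make the counting logic concrete, here is a rough sketch in Python (every name here is invented for illustration; nothing like this exists in ConfirmEdit yet):

from collections import Counter

AGREEMENT_THRESHOLD = 3  # assumption: how many matching answers make a word "valid"

# answers[word_id] collects every transcription users have typed for that word
answers = {}

def record_answer(word_id, known_word_ok, transcription):
    # Store a transcription for the unknown word, but only if the user
    # also solved the known (filter) word, which suggests they are human.
    if not known_word_ok:
        return
    answers.setdefault(word_id, Counter())[transcription.strip().lower()] += 1

def validated_transcription(word_id):
    # Return the agreed transcription once N people have given it, else None.
    counts = answers.get(word_id)
    if not counts:
        return None
    best, n = counts.most_common(1)[0]
    return best if n >= AGREEMENT_THRESHOLD else None

The same counters would also tell us when a word is hopeless (many answers, no agreement) and should simply be skipped.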
Anyway, at the beginning we could use the system you describe using the current captcha and words from books.
I believe the trickiest part is creating a system to put results back in Wikisource in a semi-automated way, but having "captcha reviewers" may help.
We could also decorate our captcha with "this captcha helps transcribing <BOOK TITLE> + link".
And this leads me to what I think is the real point: once we have a basically-working system we can think about whatever useful features we like and implement them; in principle we can have a modular system which can be refined /ad libitum/.
Cristian
On 11 January 2012 17:19, Cristian Consonni kikkocristian@gmail.com wrote:
We could also decorate our captcha with "this captcha helps transcribing <BOOK TITLE> + link".
Hah, use it for editor recruitment!
- d.
2012/1/11 David Gerard dgerard@gmail.com:
On 11 January 2012 17:19, Cristian Consonni kikkocristian@gmail.com wrote:
We could also decorate our captcha with "this captcha helps transcribing <BOOK TITLE> + link".
Hah, use it for editor recruitment!
That was the point, indeed.
Cristian
I spoke to some people at the Internet Archive about the ReCaptcha situation, and learned something interesting.
Apparently, although IA provided a large dataset to ReCaptcha, they never got any data back, and then after the Google acquisition, they got shut out completely.
I highly recommend we get IA involved if at all possible - it sounds like they have a data set they could provide us (identical to the one they provided ReCaptcha), or at least know exactly how to generate one. We could, you know, ACTUALLY provide them with the results and be good open content citizens.
- Trevor
On Wed, Jan 11, 2012 at 9:27 AM, Cristian Consonni kikkocristian@gmail.com wrote:
2012/1/11 David Gerard dgerard@gmail.com:
On 11 January 2012 17:19, Cristian Consonni kikkocristian@gmail.com wrote:
We could also decorate our captcha with "this captcha helps transcribing <BOOK TITLE> + link".
Hah, use it for editor recruitment!
That was the point, indeed.
Cristian
On 11 January 2012 19:03, Trevor Parscal tparscal@wikimedia.org wrote:
Apparently, although IA provided a large dataset to ReCaptcha, they never got any data back, and then after the Google acquisition, they got shut out completely. I highly recommend we get IA involved if at all possible - it sounds like they have a data set they could provide us (identical to the one they provided ReCaptcha), or at least know exactly how to generate one. We could, you know, ACTUALLY provide them with the results and be good open content citizens.
I wonder if Google will try bringing patent claims against a reimplementation of reCaptcha.
- d.
We have a lawyer who can help determine that. It's not obvious to me (or to you, apparently), so I guess we should get one involved.
- Trevor
On Wed, Jan 11, 2012 at 11:07 AM, David Gerard dgerard@gmail.com wrote:
On 11 January 2012 19:03, Trevor Parscal tparscal@wikimedia.org wrote:
Apparently, although IA provided a large dataset to ReCaptcha, they never got any data back, and then after the Google acquisition, they got shut out completely. I highly recommend we get IA involved if at all possible - it sounds like they have a data set they could provide us (identical to the one they provided ReCaptcha), or at least know exactly how to generate one. We could, you know, ACTUALLY provide them with the results and be good open content citizens.
I wonder if Google will try bringing patent claims against a reimplementation of reCaptcha.
- d.
My amateur inquiry ( http://www.uspto.gov/patents/process/search/index.jsp ) found this: http://1.usa.gov/xCrBvq
I imagine Geoff will have a much clearer idea of whether this applies and how Google treats them. :)
-greg aka varnent
On Jan 11, 2012, at 2:08 PM, Trevor Parscal wrote:
We have a lawyer who can help determine that. It's not obvious to me (or to you, apparently), so I guess we should get one involved.
- Trevor
On Wed, Jan 11, 2012 at 11:07 AM, David Gerard dgerard@gmail.com wrote:
On 11 January 2012 19:03, Trevor Parscal tparscal@wikimedia.org wrote:
Apparently, although IA provided a large dataset to ReCaptcha, they never got any data back, and then after the Google acquisition, they got shut out completely. I highly recommend we get IA involved if at all possible - it sounds like they have a data set they could provide us (identical to the one they provided ReCaptcha), or at least know exactly how to generate one. We could, you know, ACTUALLY provide them with the results and be good open content citizens.
I wonder if Google will try bringing patent claims against a reimplementation of reCaptcha.
- d.
Trevor Parscal tparscal@wikimedia.org writes:
Apparently, although IA provided a large dataset to ReCaptcha, they never got any data back, and then after the Google acquisition, they got shut out completely.
This is why I registered FreeCaptcha.net. I read how people's effort was disappearing into a black hole and wanted to do something.
The only thing I did was register the domain, but maybe I'll be able to put it to good use now.
Mark.
On 11/01/12 18:19, Cristian Consonni wrote:
ReCAPTCHA already works in a similar way. Two words are presented, but only one is known and actually serves to filter access. Answers for both words are collected, and if the test on the first is passed (which indicates a human) then the answer for the second is recorded. When a certain number of people agree on the transcription of a previously unknown word, that transcription is taken as good and used in the future as a filter word.
I know. My concern was that, with the two parts being so different, people could cheat by deliberately giving a bad answer for the 'learning' word. Originally reCAPTCHA showed two apparently identical images, although recently it seems to have changed, and it is now clear which one is the test and which is the unknown word. We could even plainly make the second one optional, letting users choose whether they want to help or not. We would get less training data, but of higher quality.
I believe the trickiest part is creating a system to put results back in Wikisource in a semi-automated way, but having "captcha reviewers" may help.
I was just 'deferring' that :)
On Wednesday 11 January 2012 18:19:14 Cristian Consonni wrote:
I believe the trickiest part is creating a system to put results back in Wikisource in a semi-automated way, but having "captcha reviewers" may help.
OCRs generally work by finding lines of text on a page, splitting the lines into letters, then recognizing each letter separately. So, an OCR would know, for each letter of the recognized text, what its bounding box on the page is.
However, to my knowledge there is not a single OCR that exports this data, nor is there a standard format for it. If an open source OCR could be modified to do this, then it would be easy to inject data retrieved from captchas back into OCR-ed text. And it could be used for so much more :)
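Something like this is what I have in mind, purely as an illustration (the structures are made up; no real OCR engine exports exactly this today):

ocr_words = [
    # (page number, (x0, y0, x1, y1) bounding box, text as the OCR read it)
    (12, (104, 530, 187, 561), "seph@rate"),
    (12, (195, 530, 240, 561), "words"),
]

captcha_corrections = {
    # keyed by page and bounding box of the word image shown in the captcha
    (12, (104, 530, 187, 561)): "separate",
}

# merging the human answers back into the OCR output is then a simple lookup
corrected = [
    (page, bbox, captcha_corrections.get((page, bbox), text))
    for page, bbox, text in ocr_words
]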
Hello,
I have a wiki where bots are uploading new versions of local files once in a while, slowly making the file archive bigger and bigger. Has anyone seen a maintenance script (or is there a simple way) to remove all old versions of files? The old versions are of no use to the end users, and there are no licenses that require me to keep them (normal users use files from Commons), so they are really just a waste of space. (Here is what it might look like: http://xn--ssongsmat-v2a.nu/ssm/Fil:Export,_Pumpkins,_squash_and_gourds,_200... )
Regards,
Leo Wallentin
Leonard Wallentin
leo_wallentin@hotmail.com
+46 (0)735-933 543
http://s%C3%A4songsmat.nu
http://nairobikoll.se/
http://twitter.com/leo_wallentin
Skype: leo_wallentin
On 15.01.2012 11:04, Leonard Wallentin wrote:
Hello, I have a wiki where bots are uploading new versions of local files once in a while, slowly making the file archive bigger and bigger. Has anyone seen a maintenance script (or is there a simple way) to remove all old versions of files? The old versions are of no use to the end users, and there are no licenses that require me to keep them (normal users use files from Commons), so they are really just a waste of space.
I successfully used https://www.mediawiki.org/wiki/Extension:SpecialDeleteOldRevisions2 for many years with MediaWiki 1.15.1.
It certainly does not work with MediaWiki trunk, though.
Please check https://www.mediawiki.org/wiki/Extension_talk:SpecialDeleteOldRevisions2 for bug reports.
Tom
On 15.01.2012 11:04, Leonard Wallentin wrote:
Hello, I have a wiki where bots are uploading new versions of local files once in a while, slowly making the file archive bigger and bigger. Has anyone seen a maintenance script (or is there a simple way) to remove all old versions of files?
I successfully used https://www.mediawiki.org/wiki/Extension:SpecialDeleteOldRevisions2 for many years with MediaWiki 1.15.1.
Thank you, but as far as I can understand, SpecialDeleteOldRevisions2 does *not* delete archived files, but old page revisions? (I have used it for that purpose myself once or twice.) Removing old revisions of a file page still leaves the old file versions in $IP/images/archive. /Leo
RE: https://www.mediawiki.org/wiki/Extension_status
Hello all,
regarding those extensions which are currently marked as "stable" for one reason or another, but which are only "stable" for some versions (and do not work with 1.18 or trunk):
Example: https://www.mediawiki.org/wiki/Extension:SpecialDeleteOldRevisions2 is marked as stable, but does not work with trunk.
I suggest introducing a new status meaning "stable for certain versions but not working with trunk", or something else meaningful.
This can be discussed on https://www.mediawiki.org/w/index.php/Talk:Extension_status
On Sun, Jan 15, 2012 at 5:04 AM, Leonard Wallentin leo_wallentin@hotmail.com wrote:
Has anyone seen a maintenance script (or is there a simple way) to remove all old versions of files?
Yep, you're looking for the "deleteArchivedFiles" and "deleteArchivedRevisions" maintenance scripts (both of which can be found in the 'maintenance' directory).
No need to use the outdated extension Tom suggested :)
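If you want to script it, an (untested) wrapper could look roughly like this - note the --delete switch is an assumption on my part, so check each script's --help output for your MediaWiki version first:

import subprocess

WIKI_ROOT = "/var/www/wiki"  # adjust to your installation

for script in ("deleteArchivedRevisions.php", "deleteArchivedFiles.php"):
    # run each maintenance script from the wiki root
    subprocess.check_call(
        ["php", "maintenance/" + script, "--delete"],
        cwd=WIKI_ROOT,
    )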
-Chad
On Sun, Jan 15, 2012 at 5:04 AM, Leonard Wallentin leo_wallentin@hotmail.com wrote:
Has anyone seen a maintenance script (or is there a simple way) to remove all old versions of files?
Yep, you're looking for the "deleteArchivedFiles" and "deleteArchivedRevisions" maintenance scripts (both of which can be found in the 'maintenance' directory).
-Chad
Beautiful, thank you a lot! /Leo
2012/1/15 Nikola Smolenski smolensk@eunet.rs:
However, to my knowledge there is not a single OCR that exports this data, nor is there a standard format for it. If an open source OCR could be modified to do this, then it would be easy to inject data retrieved from captchas back into OCR-ed text. And it could be used for so much more :)
I know of (though I am not proficient in their use) at least these open source OCR programs:
* OCRopus[1a][1b], by the German Research Center for Artificial Intelligence, sponsored by Google [note: as far as I know OCRopus uses Tesseract as its OCR engine]
* Tesseract[2a][2b], started by HP back in 1995, now Google-sponsored (yeah, this one too!)
* GOCR/JOCR[3]
I think much can be done.
Cristian
[1a]http://code.google.com/p/ocropus/ [1b]http://en.wikipedia.org/wiki/OCRopus [2a]http://code.google.com/p/tesseract-ocr/ [2b]http://en.wikipedia.org/wiki/Tesseract_%28software%29 [3]http://jocr.sourceforge.net/
On 19 January 2012 11:19, Cristian Consonni kikkocristian@gmail.com wrote:
2012/1/15 Nikola Smolenski smolensk@eunet.rs:
However, to my knowledge there is not a single OCR that exports this data, nor is there a standard format for it. If an open source OCR could be modified to do this, then it would be easy to inject data retrieved from captchas back into OCR-ed text. And it could be used for so much more :)
I know of (though I am not proficient in their use) at least these open source OCR programs:
- OCRopus[1a][1b], by the German Research Center for Artificial Intelligence, sponsored by Google [note: as far as I know OCRopus uses Tesseract as its OCR engine]
- Tesseract[2a][2b], started by HP back in 1995, now Google-sponsored (yeah, this one too!)
- GOCR/JOCR[3]
I think much can be done.
Cristian
More related tools: the DocumentCloud project.
Raw Engine => Tools http://documentcloud.github.com/docsplit/
Tools => Human Documents https://github.com/documentcloud/document-viewer
Human Documents => Beautiful viewers http://www.pbs.org/newshour/rundown/documents/mark-twain-concerning-the-inte... http://www.commercialappeal.com/withers-exposed/pages-from-foia-reveal-withe...
Using tesseract alone is "too much work". Tesseract wants TIFF files in a particular format and DPI. Humans want stuff in an easy-to-use format: perhaps click on an image and get the text directly under the mouse pointer, so it can be copied and pasted.
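For example, the minimal dance to get hOCR out of tesseract looks roughly like this (a sketch, assuming Pillow/PIL and a tesseract 3.x binary with the "hocr" config installed; all file names are made up):

import subprocess
from PIL import Image

# tesseract is picky: give it a clean greyscale TIFF with a known DPI
img = Image.open("page.png").convert("L")
img.save("page.tif", dpi=(300, 300))

# the trailing "hocr" config asks for hOCR output instead of plain text;
# the result lands in page.html with per-word bounding boxes
subprocess.check_call(["tesseract", "page.tif", "page", "hocr"])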
We sort of use IA's data already, because many Wikisource texts are OCR'ed on IA. If we manage to feed OCR improvements back into the DjVu files, it shouldn't be too difficult to re-upload those DjVus to their items, and then they could do what they want with them.
OCRs generally work by finding lines of text on a page, splitting the lines into letters, then recognizing each letter separately. So, an OCR would know, for each letter of the recognized text, what its bounding box on the page is.
However, to my knowledge there is not a single OCR that exports this data, nor is there a standard format for it. If an open source OCR could be modified to do this, then it would be easy to inject data retrieved from captchas back into OCR-ed text. And it could be used for so much more :)
I don't understand: what data are you talking about? DjVu is an open format and can store character mappings, which is what the wikicaptcha proof of concept is based on. There's also https://en.wikipedia.org/wiki/HOCR and IA uses some proprietary ABBYY XML format which AFAIK can somehow be read and converted to hOCR.
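(Pulling word boxes out of hOCR is trivial, since it's just HTML with "ocrx_word" spans carrying a "bbox x0 y0 x1 y1" title. A throwaway sketch, with an invented file name:)

import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.words = []   # list of (bounding box, recognized text) pairs
        self._box = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title") or "")
            self._box = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        if self._box and data.strip():
            self.words.append((self._box, data.strip()))
            self._box = None

parser = HocrWords()
with open("page.html", encoding="utf-8") as f:
    parser.feed(f.read())
print(parser.words[:5])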
The real problem is the character training data that could be reused for subsequent OCR runs. I doubt we can do much here, because everyone uses ABBYY, and even Tesseract users don't seem to share such data in any way.
Nemo
On 25/01/12 10:23, Federico Leva (Nemo) wrote:
OCRs generally work by finding lines of text on a page, splitting the lines into letters, then recognizing each letter separately. So, an OCR would know, for each letter of the recognized text, what its bounding box on the page is.
However, to my knowledge there is not a single OCR that exports this data, nor is there a standard format for it. If an open source OCR could be modified to do this, then it would be easy to inject data retrieved from captchas back into OCR-ed text. And it could be used for so much more :)
I don't understand: what data are you talking about?
If you know the bounding box of the image of the word you are sending to the captcha, how the OCR read that word, and how the users have corrected it via the captcha, it should be easy to move the corrected word back into the OCR output.
DjVu is an open format and can store character mappings, which is what the wikicaptcha proof of concept is based on. There's also https://en.wikipedia.org/wiki/HOCR and IA uses some proprietary ABBYY XML format which AFAIK can somehow be read and converted to hOCR.
I have to say I didn't know about these developments :) (I knew about ABBYY's format, but as you said, it's proprietary.)
The real problem is the character training data that could be reused for subsequent OCR runs. I doubt we can do much here, because everyone uses ABBYY, and even Tesseract users don't seem to share such data in any way.
It is a pity, because as I said, many things could be done with this data. For example, it would be possible to read the same text with multiple different OCRs and quickly find errors (if a passage is read the same way by all of them, it is likely correct; if it is read differently, at least one of them is certainly wrong). It would be possible to use this data to retrain the OCR, to develop new OCRs, and so on.
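As a toy illustration of the cross-checking idea (the two strings stand in for real OCR output; in practice the word streams would first need aligning, e.g. with difflib):

ocr_a = "It was the hest of times it was the worst of times".split()
ocr_b = "It was the best of times it was the worst of tirnes".split()

for i, (a, b) in enumerate(zip(ocr_a, ocr_b)):
    if a != b:
        print("word %d disputed: %r vs %r" % (i, a, b))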