Hi.
Speaking of Wikisource software, do we already have any instance set up for OCR as a service? (I'm thinking of OpenOCR[1] hosted in a Docker[2] container somewhere, perhaps on Labs.)
If yes, where is it and who maintains it, and can I use it as a client? If not, I am prepared to set this up.
A.
[1] http://www.openocr.net/ [2] https://www.docker.com/
Very, very interesting... I can't help you, as my skills are very limited, but I'm very interested in this, and I hope my interest will be widely shared.
Alex
Hi,
I'm not a techie, so I'm not sure what OCR-as-a-service means, but you should ask Tpt and Phe, who have OCR tools on Tool Labs (they would know what is behind tools like http://tools.wmflabs.org/phetools/ocr.php ).
Cdlt, ~nicolas
Uh, that sounds very interesting. Right now we mainly use the OCR from the DjVu files from the Internet Archive (that means ABBYY FineReader, which is very nice).
But ideally we could think of a "customizable" OCR engine that gets trained language by language: that would be extremely useful for the Wikisources.
(I can also imagine dividing, within each language, by century, because languages change over time too ;-)
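To make the idea concrete: Tesseract already works like this at the model level, with one traineddata file per language chosen at recognition time. A hypothetical Python sketch via pytesseract (assuming the relevant language packs are installed; "ita_old" is the old-Italian model shipped with Tesseract, and a community-trained, century-specific model would be selected the same way):

    from PIL import Image
    import pytesseract

    page = Image.open("page_scan.png")  # any scanned page image

    # Recognize with the modern Italian model...
    modern_text = pytesseract.image_to_string(page, lang="ita")

    # ...versus the old-Italian model; a custom, century-specific
    # traineddata file would be picked the same way, by its name.
    old_text = pytesseract.image_to_string(page, lang="ita_old")

    print(old_text)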
Aubrey
On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni zanni.andrea84@gmail.com wrote:
> Uh, that sounds very interesting. Right now we mainly use the OCR from the DjVu files from the Internet Archive (that means ABBYY FineReader, which is very nice).
Yes, the output is generally good. But as far as I can tell, the archive's Open Library API does not offer a way to retrieve the OCR output programmatically, and certainly not for an arbitrary page rather than the whole item. What I'm working on requires the ability to OCR a single page on demand.
> But ideally we could think of a "customizable" OCR engine that gets trained language by language: that would be extremely useful for the Wikisources.
> (I can also imagine dividing, within each language, by century, because languages change over time too ;-)
Indeed.
A.
On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov abartov@wikimedia.org wrote:
> Yes, the output is generally good. But as far as I can tell, the Archive's Open Library API does not offer a way to retrieve the OCR output programmatically, and certainly not for an arbitrary page rather than the whole item. What I'm working on requires the ability to OCR a single page on demand.
True.
I've recently met Giovanni, a new (Italian) guy who's now working with the Internet Archive and Open Library. We discussed a number of possible partnerships/projects; this is definitely one to bring up.
But if we manage to do it directly in the Wikimedia world, even better.
Aubrey
I explored the abbyy.gz files, the full XML output from the ABBYY OCR engine running at the Internet Archive, and I was astonished by the amount of data they contain: it is stored at the XCA_Extended detail level (as documented at http://www.abbyy-developers.com/en:tech:features:xml ).
Something the best Wikisource developers should explore; comparing that data with the little data in the mapped text layer of DjVu files is impressive, and should be inspiring.
But it is static data coming from a standard setting... nothing like a service with simple, shared deep-learning features for difficult and ancient texts. I tried the "ancient Italian" Tesseract dictionary, with very poor results.
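To give an idea of the digging, here is a rough Python sketch of reading one of those files. It is only a sketch under my assumptions: the file is the _abbyy.gz download of an Internet Archive item, and it follows the FineReader XML layout where each charParams element holds one character with its bounding box and a "suspicious" flag; check the namespace declared in your own file, since it varies between FineReader versions.

    import gzip
    import xml.etree.ElementTree as ET

    # Namespace used by many IA abbyy files; confirm against the root
    # element of your file, as FineReader versions declare different ones.
    NS = "{http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml}"

    with gzip.open("item_abbyy.gz") as f:
        tree = ET.parse(f)

    total = suspicious = 0
    for char in tree.iter(NS + "charParams"):
        total += 1
        if char.get("suspicious") == "true":
            suspicious += 1
        # Per-character geometry is also here, e.g.:
        # left, top = char.get("l"), char.get("t")

    print("characters: %d, flagged suspicious: %d" % (total, suspicious))

Even this trivial count shows how much richer the XML is than the flat DjVu text layer.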
So Asaf, I can't wait for good news from you. :-)
Alex
Hello again.
So, I've set up an OpenOCR instance on Labs that's available for use as a service. Just call it and point to an image. Example:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://openocr.wmflabs.org/ocr
should yield:
"You can create local variables for the pipelines within the template by prefixing the variable name with a “$" sign. Variable names have to be composed of alphanumeric characters and the underscore. In the example below I have used a few variations that work for variable names."
If we see evidence of abuse, we might have to protect it with API keys, but for now, let's AGF. :)
I'm working on something that would be a client of this service, but don't have a demo yet. Stay tuned! :)
A.
Nice! I will wait for the client though, thanks. Where will the source images be stored: Labs or Commons? It would be nice if you could somehow make a client that builds a DjVu file locally with the page image and the OCR text, which you could clean up before putting it into the DjVu file. Right now there seem to be so many hurdles to Wikisource that it's quicker to post pages to Commons and add the text in a template there.
On Sat, Jul 11, 2015 at 8:44 AM, Nicolas VIGNERON <vigneron.nicolas@gmail.com> wrote:
> Hi,
> I'm not a techie, so I'm not sure what OCR-as-a-service means, but you should ask Tpt and Phe, who have OCR tools on Tool Labs (they would know what is behind tools like http://tools.wmflabs.org/phetools/ocr.php ).
Thanks for the pointer! I don't see any documentation on how to feed images to it, though, and no pointer to the source code to figure it out on my own. Help?
A.
OCR is available via a JavaScript script. A number of Wikisources have it enabled as a gadget, though I cannot speak for all the wikis; I presume it depends on the languages available in the OCR.
The script is noted at https://wikisource.org/wiki/Wikisource:Shared_Scripts
Regards, Billinghurst