Hello again.

So, I've set up an OpenOCR instance on Labs that's available for use as a service. Just call it and point it at an image. For example:

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://openocr.wmflabs.org/ocr

should yield:

"You can create local variables for the pipelines within the template by
prefixing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names."
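
For anyone who'd rather call it from code than from curl, here's a rough Python sketch of the same request (untested as pasted; it assumes the requests library and the same endpoint and JSON payload as the curl example above):

# Minimal client sketch for the OpenOCR service on Labs.
# Same endpoint and payload as the curl example above.
import requests

OCR_ENDPOINT = "http://openocr.wmflabs.org/ocr"

def ocr_image(img_url, engine="tesseract"):
    """POST an image URL to the service and return the recognized text."""
    response = requests.post(
        OCR_ENDPOINT,
        json={"img_url": img_url, "engine": engine},
        timeout=60,  # OCR can take a while on large images
    )
    response.raise_for_status()
    # The service replies with the plain OCR text, as in the sample output above.
    return response.text

if __name__ == "__main__":
    print(ocr_image("http://bit.ly/ocrimage"))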

If we see evidence of abuse, we might have to protect it with API keys, but for now, let's AGF. :)

I'm working on something that would be a client of this service, but don't have a demo yet.  Stay tuned! :)

   A.

On Sun, Jul 12, 2015 at 3:27 PM, Alex Brollo <alex.brollo@gmail.com> wrote:
I explored the ABBYY .gz files, the full XML output from the ABBYY OCR engine running at the Internet Archive, and I was astonished by the amount of data they contain - they are stored at XCA_Extended detail (as documented at http://www.abbyy-developers.com/en:tech:features:xml ).
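
For instance, a quick sketch of how to peek at that per-character data (untested; it assumes the usual FineReader XML layout, where every character is a charParams element with l/t/r/b bounding-box and charConfidence attributes, and the file name is just an example):

# Rough look at the per-character data in an Internet Archive *_abbyy.gz file.
import gzip
import xml.etree.ElementTree as ET

with gzip.open("book_abbyy.gz") as f:
    tree = ET.parse(f)

# Match on the local tag name so the schema-version namespace doesn't matter.
chars = [el for el in tree.iter() if el.tag.endswith("charParams")]
print(len(chars), "characters, each with coordinates and a confidence value")

for el in chars[:10]:
    print(el.text, el.get("l"), el.get("t"), el.get("r"), el.get("b"),
          el.get("charConfidence"))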

This is something Wikisource's best developers should explore; comparing those data with the little that ends up in the mapped text layer of DjVu files is impressive and should be inspiring.

But those are static data produced with a standard configuration... nothing like a service with simple, shared deep-learning features for difficult and ancient texts. I tried the "ancient Italian" Tesseract dictionary, with very poor results.

So Asaf, I can't wait for good news from you. :-) 

Alex

2015-07-12 12:50 GMT+02:00 Andrea Zanni <zanni.andrea84@gmail.com>:


On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov <abartov@wikimedia.org> wrote:
On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <zanni.andrea84@gmail.com> wrote:
Uh, that sounds very interesting.
Right now, we mainly use the OCR from the DjVu files from the Internet Archive (which means ABBYY FineReader, which is very good).

Yes, the output is generally good.  But as far as I can tell, the archive's Open Library API does not offer a way to retrieve the OCR output programmatically, and certainly not for an arbitrary page rather than the whole item.  What I'm working on requires the ability to OCR a single page on demand.

True.
I've recently met Giovanni, a new (Italian) guy who's now working with the Internet Archive and Open Library.
We discussed a number of possible partnerships/projects, and this is definitely one to bring up.

But if we manage to do it directly in the Wikimedia world, that's even better.

Aubrey
 

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l

--
    Asaf Bartov
    Wikimedia Foundation

Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us make it a reality!
https://donate.wikimedia.org