Re: [Wikisource-l] OCR as a service?

29 Jul 2015

Nice! I will wait for the client though, thx. Where will the source images
be stored? Labs or Commons? It would be nice if you could somehow make a
client that builds a DjVu file locally from the page image and the OCR text,
which you could clean up before putting it into the DjVu file. Right now
there seem to be so many hurdles on Wikisource that it's quicker to post
pages to Commons and add the text in the template there.

On Wed, Jul 29, 2015 at 8:23 AM, Asaf Bartov <abartov(a)wikimedia.org> wrote:

  Hello again.

 So, I've set up an OpenOCR instance on Labs that's available for use as a
 service.  Just call it and point to an image.  Example:

 curl -X POST -H "Content-Type: application/json" \
   -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' \
   http://openocr.wmflabs.org/ocr

 should yield:

 "You can create local variables for the pipelines within the template by
 preﬁxing the variable name with a “$" sign. Variable names have to be
 composed of alphanumeric characters and the underscore. In the example
 below I have used a few variations that work for variable names."
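
 For anyone who would rather call the service from a script than from curl,
 here is a minimal Python sketch of the same request. The endpoint and the
 JSON fields (img_url, engine) are taken from the curl example above; the
 function names, the timeout, and the assumption that the response body is
 plain UTF-8 text are illustrative guesses, not documented behaviour of the
 service.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this thread.
OCR_ENDPOINT = "http://openocr.wmflabs.org/ocr"


def build_ocr_request(img_url, engine="tesseract", endpoint=OCR_ENDPOINT):
    """Build the POST request that the curl example sends (hypothetical helper)."""
    payload = json.dumps({"img_url": img_url, "engine": engine}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_image(img_url, engine="tesseract", timeout=60):
    """Send the request and return the response body.

    Assumption: the service answers with the recognized text as UTF-8.
    """
    req = build_ocr_request(img_url, engine)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Same sample image as in the curl example.
    print(ocr_image("http://bit.ly/ocrimage"))
```

 Keeping the request construction separate from the network call makes it
 easy to inspect or log what will be sent before the service is contacted.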

 If we see evidence of abuse, we might have to protect it with API keys,
 but for now, let's AGF. :)

 I'm working on something that would be a client of this service, but don't
 have a demo yet.  Stay tuned! :)

    A.

 On Sun, Jul 12, 2015 at 3:27 PM, Alex Brollo <alex.brollo(a)gmail.com> wrote:

  I explored ABBYY GX files, the full XML output from the ABBYY OCR engine
  running at Internet Archive, and I've been astonished by the amount of data
  they contain - they are stored at XCA_Extended detail (as documented at
  http://www.abbyy-developers.com/en:tech:features:xml ).

 This is something that Wikisource's best developers should explore;
 comparing those data with the little bit of data in the mapped text layer
 of DjVu files is impressive and should be inspiring.

 But they are static data coming from a standard setting... nothing
 similar to a service with simple, shared, deep learning features for
 difficult and ancient texts. I tried the "ancient Italian" Tesseract
 dictionary, with very poor results.

 So Asaf, I can't wait for good news from you. :-)

 Alex

 2015-07-12 12:50 GMT+02:00 Andrea Zanni <zanni.andrea84(a)gmail.com>:

 On Sun, Jul 12, 2015 at 11:25 AM, Asaf Bartov <abartov(a)wikimedia.org> wrote:

  On Sat, Jul 11, 2015 at 9:59 AM, Andrea Zanni <zanni.andrea84(a)gmail.com> wrote:

> uh, that sounds very interesting.
> Right now, we mainly use OCR from djvu from Internet Archive (that
> means ABBYY Finereader, which is very nice).
>

 Yes, the output is generally good.  But as far as I can tell, the
 archive's Open Library API does not offer a way to retrieve the OCR output
 programmatically, and certainly not for an arbitrary page rather than the
 whole item.  What I'm working on requires the ability to OCR a single page
 on demand.

 True.  I've recently met Giovanni, a new (Italian) guy who's now working
 with Internet Archive and Open Library.
 We discussed a number of possible partnerships/projects; this is
 definitely one to bring up.

 But if we manage to do it directly in the Wikimedia world it's even
 better.

 Aubrey

 _______________________________________________
 Wikisource-l mailing list
 Wikisource-l(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikisource-l

 --
     Asaf Bartov
     Wikimedia Foundation <http://www.wikimediafoundation.org>

 Imagine a world in which every single human being can freely share in the
 sum of all knowledge. Help us make it a reality!
 https://donate.wikimedia.org
