Dear all,
for a long time I have been thinking about a project that could help Wikisource (and some GLAMs too): the idea is simply to install ABBYY Finereader 11 on the toolserver,
as a tool for all Wikisource users.

For those who don't know it, ABBYY Finereader is OCR software: it is proprietary and fairly expensive,
but it is accurate and works really, really well. Plus, version 11 can save files in DjVu.

Now, in my mind, having such software on the toolserver would let us:
- restore our beloved OCR button, with much more accurate OCR
- use Finereader to convert PDF/TIFF/JPG files from Commons directly into OCRed DjVus
- do other things I haven't thought of yet

There are many issues too:
- Cost: I don't know how much this would cost. Many WM chapters give money to the toolserver, and the status of the thing is a bit fuzzy at the moment, but, for example,
Wikimedia Italy has set aside 5000 euros for the toolserver, and maybe we can use that money for the license (I'm on the WMI Board, and I've asked; they say it's OK);
- Technical: afaik, the toolserver runs Solaris, and apparently Finereader is Windows-only (I think we can solve this easily if we want, though);
- Ethical: this is proprietary software, and I don't know if we *want* to use it on Wikimedia projects...
- Resources: I think this is probably the main issue: we need skilled people to set this up technically, and at least one toolserver operator (Phe, maybe?)

Below is the mail I sent to ABBYY Europe, to see if the thing was feasible.
They simply replied that they want a phone call. Of course, if the thing turns out to be too expensive, the project collapses immediately,
but I think it's worth discussing. If nobody wants it, I can drop it right now.

Please forward this mail to everyone possibly interested;
I don't think it's a good idea to scatter the discussion across every ws Village Pump.

Cheers

Aubrey


 

Sent: Friday, November 25, 2011 10:20 AM

Subject: Questions about server licenses

 

Dear ABBYY Europe,
my name is Andrea Zanni, and I'm a Board member of Wikimedia Italy,
the Italian chapter of the Wikimedia movement.
We are a non-profit association which promotes and supports Wikimedia projects,
such as the online encyclopedia Wikipedia.

I'm writing to you because I'm interested in learning
about "server licences" for your new Finereader 11.

As far as I know, your product saves files in DjVu, and this is an interesting feature that
could help some of our projects.
Maybe you know Wikisource, a multilingual digital library in which the community uploads, transcribes and proofreads books.
This is the English version (http://en.wikisource.org/wiki/Main_Page).
On each page of each book (books are uploaded in DjVu), we have a little "OCR" button
which used to call a Tesseract bot and OCR the page.
Right now, the bot doesn't work for lack of maintenance.

My idea would be to replace Tesseract with Finereader, and also to be able to
use other features, such as taking a PDF/JPEG file and saving it as an OCRed DjVu, or choosing the OCR language from project to project.

Now, I do not have an estimate of how much this engine would be used (I understand this is a crucial factor for the price of a server license).
I would expect a few hundred pages OCRed per day (maybe more, if this thing works), and a few dozen file conversions (any format to DjVu) per day.

So, my questions are:
- do you have a rough idea of how much this license would cost?
- do you know if it is possible to run FR11 on an OS other than Windows (we actually run Solaris)?
- do you know if it is possible to access all these features via an API or something similar?

Thank you for your time,
regards

Andrea Zanni

--
Wikimedia Italia Board

 

Support culture, donate to Wikimedia Italia.
http://sostienilacultura.it