Wow, thank you all for the quick responses. I'll try to reply in-line.
2011/11/28 Mathias Schindler mathias.schindler@gmail.com
I recommend sticking with and supporting open source technology that has been made available by third parties, such as http://code.google.com/p/ocropus/ / http://code.google.com/p/tesseract-ocr/
This is true, and it would be the optimal way, but apparently it failed. I don't know why the OCR button is not running anymore. It seems to me that when ThomasV left, things were not kept up to date or something like that.
From my experience (I have used both programs for professional projects), the quality and the usability of the software are very different. Of course, having Tesseract is better than having nothing.
2011/11/28 Lars Aronsson mathias.schindler@gmail.com
I think this is what the Internet Archive uses, as well as several European libraries. We could look into establishing a cooperation with the Internet Archive or perhaps with Europeana in this area. Maybe the Internet Archive can open up an API for OCR-ing a single page at a time?
This would be awesome :-) I don't have a clue about the technicalities here. If you want to ask them, be my guest :-)
@Tomasz I think that Federico has a point in the approach he suggested. In fact, I'm wondering how IA got its license; we should ask them. Do we have any contact with the Internet Archive?
I know we could use IA directly for uploading PDFs (we already do it to get DjVus), but it's still not the most usable way to work with institutions or ordinary users...
Aubrey