Wow, thank you all for the quick responses.
I'll try to reply in-line.
2011/11/28 Mathias Schindler <mathias.schindler(a)gmail.com>
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
>
This is true, and this would be the optimal way, but apparently it failed.
I don't know why the OCR button is not running anymore; it seems to me that
when ThomasV
left, things were not updated or something like that.
From my experience (I have used these programs for professional projects)
the quality and the usability
of the programs are very different. Of course, having Tesseract is better
than having nothing.
2011/11/28 Lars Aronsson <lars(a)aronsson.se>
> I think this is what the Internet Archive uses, as well as
> several European libraries. We could look into establishing
> a cooperation with the Internet Archive or perhaps with
> Europeana in this area. Maybe the Internet Archive can
> open up an API for OCR-ing a single page at a time?
>
This would be awesome :-)
I don't have a clue about the technicalities here; if you want to ask them, be
my guest :-)
@Tomasz
I think that Federico has a point in the approach he suggested:
in fact I'm wondering how IA got its license; we should ask them.
Do we have any contact with the Internet Archive?
I know we could use IA directly for uploading PDFs (we already do it for
getting DjVus),
but it's still not the most usable way to deal with institutions or
ordinary users...
Aubrey
On 11/28/2011 01:59 PM, Mathias Schindler wrote:
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
Do you recommend this based on experience, or based on free software
ideology? Apparently the Internet Archive tried and gave up, because
Finereader was far better. Are there any good examples where free
software has been used for good OCR quality?
Wikisource does provide feedback on quality: After OCR, when a page
has been proofread, the OCR software could learn from the diff.
But is there any OCR software that can take this kind of input?
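A minimal sketch of what "learning from the diff" could start from, assuming the raw OCR text and the proofread text of a page are both available as plain strings. It only extracts the word-level corrections with Python's standard difflib module; the function name correction_pairs is hypothetical, and whether any OCR engine can consume such pairs is exactly the open question.

```python
import difflib


def correction_pairs(ocr_text, proofread_text):
    """Return (ocr_fragment, corrected_fragment) pairs where the texts differ.

    Hypothetical helper: compares the two texts word by word and keeps
    only the spans that a proofreader changed, inserted, or deleted.
    """
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on either side marks a pure insertion or deletion.
            pairs.append((" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2])))
    return pairs
```

For example, calling correction_pairs("tbe quick brown fox", "the quick brown fox") yields a single pair mapping "tbe" to "the". Collected over many proofread pages, such pairs would at least show which misreadings are most frequent.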
When running OCR as an engine/server/API, what do we do when it
misinterprets columns in a page, and reads long lines across the
page? Is there a way to manually indicate where columns are, and
resubmit the page for new OCR?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
2011/11/28 Gerard Meijssen <gerard.meijssen(a)gmail.com>
> Hoi,
> What scripts does it support ?
> Thanks,
> GerardM
>
Hi Gerard,
could you be more explicit?
I don't understand the question.
Aubrey
Dear all,
for a long time I've been thinking about a project that could help Wikisource (and
some GLAMs too), and the idea is simply to install ABBYY Finereader 11 on
toolserver,
as a tool for all Wikisource users.
For those who don't know, ABBYY Finereader is OCR software: it is
proprietary and fairly expensive,
but it is accurate and works really, really well. Plus, version 11 can
save files in DjVu.
Now, in my mind having such software on toolserver could let us:
- restore our beloved OCR button, with a much more accurate OCR
- use Finereader to convert PDF/TIFF/JPG files from Commons directly into
OCRed DjVus
- other things I haven't thought of yet
There are many issues too:
- Cost: I don't know how much this could cost. Many WM chapters do give
money to toolserver, and the status of the thing is a bit fuzzy at the
moment, but, for example,
Wikimedia Italy has set aside 5000 euros for the toolserver, and maybe we can
use that money for the license (I'm on the WMI Board, and I've asked, they say
it's OK);
- Technical: afaik, toolserver runs Solaris, and apparently Finereader is
Windows only (I think we can solve this easily if we want, though)
- Ethics: this is proprietary software, and I don't know if we *want* to use
it on Wikimedia projects...
- Resources: I think this is probably the main issue: we need skilled
people to set this up technically, and at least one toolserver operator
(Phe, maybe?)
Below is the mail I sent to ABBYY Europe, to see if the thing was feasible.
They simply replied that they want a phone call. Of course, if it turns out
to be too expensive the project collapses immediately,
but I think it's worth discussing. If nobody wants it, I can drop it right
now.
Please forward this mail to everyone possibly interested;
I don't think it's a good idea to scatter discussions across every ws Village
Pump.
Cheers
Aubrey
*From:* Andrea Zanni <andrea.zanni(a)wikimedia.it>
*Sent:* Friday, November 25, 2011 10:20 AM
*To:* support_eu(a)abbyy.com
*Subject:* Questions about server licenses
Dear ABBYY Europe,
my name is Andrea Zanni, and I'm a Board member of Wikimedia Italy,
the Italian chapter of the Wikimedia movement.
We are a non-profit association which promotes and sustains Wikimedia
projects,
such as the online encyclopedia Wikipedia.
I'm writing to you because I'm interested in knowing
about "server licences" of your new Finereader 11.
As far as I know, your product saves files in DjVu, and this is an
interesting feature that
could help some of our projects.
Maybe you know Wikisource, a multilingual digital library in which the
community uploads, transcribes and proofreads books.
This is the English version (http://en.wikisource.org/wiki/Main_Page).
On each page of each book (which are uploaded in DjVu), we have a little
"OCR" button
which used to call a Tesseract bot and OCR the page.
Right now, the bot doesn't work for lack of maintenance.
My idea would be to replace Tesseract with Finereader, and also to have
the possibility to
use other features, such as taking a PDF/JPEG file and saving it as an OCRed
DjVu, or choosing the language of the OCR from project to project.
Now, I do not have an estimate of how much this engine could be used (I
understand this is a crucial factor for the price of a server license).
I would estimate a few hundred pages OCRed per day (maybe more, if this
thing works), and a few dozen file conversions (any format to DjVu) per day.
So, my questions are:
- do you have a rough idea how much this license would cost?
- do you know if it is possible to run FR11 on an OS other than Windows (we
actually run Solaris)?
- do you know if it is possible to have all these features via an API or
something similar?
Thank you for your time,
regards
Andrea Zanni
--
Wikimedia Italia Board
Support culture, donate to Wikimedia Italia.
http://sostienilacultura.it
Dear all,
I'm pleased to announce that the Italian Wikisource has started a
collaboration with AlmaDL, the digital library of University of Bologna (
http://amshistorica.cib.unibo.it).
The project, called Wikiproject Scientia (
http://it.wikisource.org/wiki/Progetto:Scientia), aims to bring to the
Wikisource(s) around 40 issues of "Scientia", a scientific journal
published at the beginning of the twentieth century. The journal was published
in 4 different languages, and includes (original) articles by scientists
from all around the world, such as G. Peano, Enrico Fermi, Bertrand Russell, E.
Rutherford, H. Lorentz, Sigmund Freud, Henri Poincaré, Ernst Mach, Albert
Einstein, Werner Heisenberg, Rudolf Carnap, Otto Neurath, and many, many
more.
The whole journal has been published here:
http://amshistorica.cib.unibo.it/6
http://amshistorica.cib.unibo.it/7
but we are still uploading the bundled djvus on Commons (and doing the OCR
with ABBYY).
We have completed and formatted one issue (
http://it.wikisource.org/wiki/Indice:Rivista_di_Scienza_-_Vol._I.djvu), to
get an idea of the complexity of the work, and it is definitely complex.
You can take a look here:
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._VII.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._VIII.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._IX.djvu
http://it.wikisource.org/wiki/Indice:Scientia_-_Vol._X.djvu
The biggest challenge is to set up the transclusion of the articles in all
the interested Wikisources, because every issue contains articles in at
least 3 languages (sometimes 4: French is the most used language of the
journal).
So, this mail is to inform all the Wikimedians potentially interested in
participating and in helping us to set up books and indexes in the respective
Wikisources.
Thank you,
Aubrey
NB: I work for AlmaDL, so I "persuaded" my boss to try releasing some djvus
and see what the community of Wikisource could do. This does not mean I am
a full time "Wikisourcian in residence", but I certainly use some of my work
time for this and I can definitely help with original scans, metadata and
even the OCR with ABBYY Finereader. I soon discovered that the project is
much bigger than I expected, especially because of the multilingual issue.
[sorry for cross-posting]
Hi all,
I just wanted to announce that the wikiguides produced by Wikimedia Italy
are now
on YouTube with subtitles in different languages. If you are interested in
translating them, just contact me.
The videos have proven to be a good introduction to the wiki world
(a fourth video on Wikiquote is in production),
and (at least on Wikisource) we have seen a significant improvement in
access and use of the website.
Hope you enjoy them too.
*Wikisource*
http://www.youtube.com/watch?v=cR0g5ACaC-g
*Wikipedia*
http://www.youtube.com/watch?v=qoLBZ7_vY-k
*Commons*
http://www.youtube.com/watch?v=AOTlhuokVDs
Aubrey
We fixed a djvu file by adding some missing pages. Now, something strange
appears. Take a look at
http://it.wikisource.org/wiki/Pagina:Delle_strade_ferrate_e_della_loro_futu….
Then enter edit mode. The thumb shown in view mode comes from the old,
wrong page; the thumb shown in edit mode is the right one.
Can we force a purge of such an old thumb/image, presumably saved somewhere
from the old djvu file?
Alex
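One thing that may be worth trying is MediaWiki's action=purge request on the file's description page. Below is a minimal sketch that only builds such a URL. The file title File:Example.djvu and the helper name purge_url are hypothetical, and it is an assumption that the file lives on Commons and that purging its page also refreshes the cached thumbnail.

```python
from urllib.parse import urlencode


def purge_url(index_php, title):
    """Build an index.php URL asking MediaWiki to purge the cache for a page.

    index_php: full URL of the wiki's index.php script.
    title: page title to purge (hypothetical example: "File:Example.djvu").
    """
    query = urlencode({"title": title, "action": "purge"})
    return index_php + "?" + query


# Hypothetical usage; the real file title has to be filled in by hand.
print(purge_url("https://commons.wikimedia.org/w/index.php", "File:Example.djvu"))
```

Opening the resulting URL in a browser usually shows a confirmation prompt before the purge is carried out.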