[X-posting also on Wikitech-l]
Dear all,
I hope your season's holidays are going well.
I'm writing to inform you that I have released on GitHub a first (0.1)
version of wikicaptcha[1], a ReCAPTCHA-like program for Wiki*. The
idea was born from an initial observation by Alex brollo[2], which we
have discussed at length both on the WMI mailing list and on
Wikisource-l[3].
For starters, the code there is a rewrite of Alex's scripts and
nothing more. Here is what the program does for now: 1) gets a
DjVu file, 2) extracts the text layer, 3) identifies unrecognized words
and 4) produces a TIFF image of them.
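The four steps above can be sketched roughly as follows. This is only an illustration, not the actual wikicaptcha code: it assumes the text layer has been dumped with DjVuLibre's `djvutxt --detail=word`, that "unrecognized" means "not found in a word list" (the real criterion may differ), and that the crop is done with `ddjvu -segment`. The function names are hypothetical, the regex only covers simple word entries, and the coordinate convention of `-segment` should be checked against the ddjvu documentation before relying on it.

```python
import re

# Matches simple entries of the djvutxt --detail=word dump, e.g.
#   (word 100 3000 300 3050 "Hello")
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\)'
)


def parse_words(sexpr):
    """Return a list of (text, (xmin, ymin, xmax, ymax)) for each word entry."""
    words = []
    for m in WORD_RE.finditer(sexpr):
        xmin, ymin, xmax, ymax = (int(g) for g in m.groups()[:4])
        words.append((m.group(5), (xmin, ymin, xmax, ymax)))
    return words


def unrecognized(words, known):
    """Keep the words whose lowercase, letters-only form is not in `known`.

    Assumption: a plain word-list lookup stands in for whatever test the
    real program uses.
    """
    result = []
    for text, box in words:
        key = "".join(ch for ch in text.lower() if ch.isalpha())
        if key and key not in known:
            result.append((text, box))
    return result


def crop_command(djvu_path, page, box, out_path):
    """Build (but do not run) a ddjvu command that renders one word box as TIFF."""
    xmin, ymin, xmax, ymax = box
    geometry = f"{xmax - xmin}x{ymax - ymin}+{xmin}+{ymin}"
    return [
        "ddjvu", "-format=tiff", f"-page={page}",
        f"-segment={geometry}", djvu_path, out_path,
    ]
```

The resulting command list could be passed to `subprocess.run` once the geometry has been verified on a real page.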
But I would like to write a "proof of concept" of the whole process,
from getting OCR-ed DjVus from Commons to producing challenges,
serving them, collecting answers and then using those answers in some
useful way. That said, you can see there is still a long way to go.
Obviously in the long run we could use this system as a backup for the
current one, which has demonstrated some limitations[4], but I'm sure
there are many aspects of the problem which go beyond my knowledge
(I'm a physicist, not a computer scientist, you know) so any help
and/or advice is welcome.
Thanks for your time.
Cristian
[1]https://github.com/CristianCantoro/wikicaptcha
[2]http://it.wikisource.org/wiki/Utente:Alex_brollo
[3]http://lists.wikimedia.org/pipermail/wikisource-l/2011-February/000939.ht…
[4]http://lists.wikimedia.org/pipermail/wikitech-l/2011-November/056078.html
Wow, thank you all for the quick responses.
I'll try to reply in-line.
2011/11/28 Mathias Schindler <mathias.schindler(a)gmail.com>
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
>
This is true, and it would be the optimal way, but apparently it failed.
I don't know why the OCR button is not running anymore; it seems to me that
when ThomasV
left, things were not updated or something like that.
From my experience (I have used these programs for professional projects),
the quality and the usability
of the software vary a lot. Of course, having Tesseract is better
than having nothing.
2011/11/28 Lars Aronsson <lars(a)aronsson.se>
> I think this is what the Internet Archive uses, as well as
> several European libraries. We could look into establishing
> a cooperation with the Internet Archive or perhaps with
> Europeana in this area. Maybe the Internet Archive can
> open up an API for OCR-ing a single page at a time?
>
This would be awesome :-)
I don't have a clue about the technicalities here; if you want to ask them, be
my guest :-)
@Tomasz
I think that Federico has a point in the approach he suggested.
In fact, I'm wondering how IA got its license; we should ask them.
Do we have any contact at the Internet Archive?
I know we could use IA directly for uploading PDFs (we already do it for
getting DjVus),
but it is still not the most usable way to deal with institutions or
ordinary users...
Aubrey
On 11/30/2011 09:55 PM, Eugene Zelenko wrote:
> ABBYY has own online OCR service http://finereader.abbyyonline.com
This is very interesting, OCR as a cloud service. I didn't know they
were doing this. They charge EUR 7 per 200 pages, or US$ 0.05
per page, which I guess can be (almost) reasonable for the
Wikimedia Foundation to pay. I sometimes feel bad because I have
OCRed so many tens of thousands of pages with a single EUR 129
license of Finereader. Here, EUR 129 would buy us 3700 pages.
All languages of Wikisource together are proofreading slightly
less than 900 pages/day, for which OCR would cost EUR 32/day
or US$ 43/day. With good OCR, proofreading is more fun, and
these numbers may increase. But then again, we wouldn't need
the service for all pages, as some books already have OCR.
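The arithmetic behind these figures can be checked with a few lines. This is a small sketch using only the numbers quoted above (EUR 7 per 200 pages, a EUR 129 license, 900 pages/day); the function names are just for illustration.

```python
PACKAGE_PRICE_EUR = 7      # price of one package, as quoted above
PACKAGE_PAGES = 200        # pages included in one package


def pages_for(budget_eur):
    """How many pages a given budget buys at the package rate."""
    return budget_eur * PACKAGE_PAGES / PACKAGE_PRICE_EUR


def daily_cost_eur(pages_per_day):
    """Daily cost in EUR for a given number of OCRed pages per day."""
    return pages_per_day * PACKAGE_PRICE_EUR / PACKAGE_PAGES


print(round(pages_for(129)))   # pages bought for the price of one license
print(daily_cost_eur(900))     # EUR per day at 900 pages/day
```

At EUR 0.035 per page, EUR 129 buys about 3686 pages (rounded to 3700 above) and 900 pages/day cost EUR 31.50 (rounded to EUR 32). The US$ 43 figure looks like a currency conversion of the EUR amount; 900 pages at the quoted US$ 0.05 per page would be US$ 45.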
The most interesting feature of a cloud-based OCR service is
whether they can accumulate improvements in font training (?) and
dictionaries from a large number of users over time. With
Wikisource, they can of course get direct access to the page
after proofreading.
So, is the service any good? They even promise to do Fraktur
(blackletter). Does it work well?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
On 11/28/2011 01:59 PM, Mathias Schindler wrote:
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
Do you recommend this based on experience, or based on free software
ideology? Apparently the Internet Archive tried and gave up, because
Finereader was far better. Are there any good examples where free
software has been used for good OCR quality?
Wikisource does provide feedback on quality: After OCR, when a page
has been proofread, the OCR software could learn from the diff.
But is there any OCR software that can take this kind of input?
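To make "learning from the diff" concrete, here is a minimal sketch of how word-level corrections could be extracted from a page before and after proofreading, using Python's standard `difflib`. It only shows how the feedback could be collected; whether any OCR engine can consume such pairs is exactly the open question. The function name is hypothetical, and only one-to-one word replacements are counted.

```python
import difflib
from collections import Counter


def corrections(ocr_text, proofread_text):
    """Count (ocr_word, corrected_word) pairs between two versions of a page."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only keep replacements of equal length, where each OCR word maps
        # to exactly one proofread word; insertions, deletions and merged
        # or split words are ignored in this sketch.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for wrong, right in zip(ocr_words[i1:i2], good_words[j1:j2]):
                pairs[(wrong, right)] += 1
    return pairs
```

Accumulated over many pages, the most frequent pairs would give a list of typical misreadings for a given font or language.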
When running OCR as an engine/server/API, what do we do when it
misinterprets columns in a page, and reads long lines across the
page? Is there a way to manually indicate where columns are, and
resubmit the page for new OCR?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
2011/11/28 Gerard Meijssen <gerard.meijssen(a)gmail.com>
> Hoi,
> What scripts does it support ?
> Thanks,
> GerardM
>
Hi Gerard,
could you be more explicit?
I don't understand the question.
Aubrey
Dear all,
for a long time I've been thinking about a project that could help Wikisource (and
some GLAMs too). The idea is simply to install ABBYY Finereader 11 on
toolserver,
as a tool for all Wikisource users.
For those who don't know, ABBYY Finereader is OCR software: it is
proprietary and fairly expensive,
but it is accurate and works really, really well. Plus, version 11 can
save files in DjVu.
Now, in my mind, having such software on toolserver would let us:
- restore our beloved OCR button, with a much more accurate OCR
- use Finereader to transform PDF/TIFF/JPG files from Commons directly into
OCRed DjVus
- do other things I haven't thought of yet
There are many issues too:
- Cost: I don't know how much this could cost. Many WM chapters give
money to toolserver, and the status of the thing is a bit fuzzy at the
moment, but, for example,
Wikimedia Italy has set aside 5000 euros for the toolserver, and maybe we can
use that money for the license (I'm on the WMI Board, and I've asked; they say
it's OK);
- Technical: afaik, toolserver runs Solaris, and apparently Finereader is
Windows-only (I think we can easily solve this if we want, though)
- Ethical: this is proprietary software, and I don't know if we *want* to use
it on Wikimedia projects...
- Resources: I think this is probably the main issue: we need skilled
people to set this up technically, and at least one toolserver operator
(Phe, maybe?)
Below is the mail I sent to ABBYY Europe, to see if the thing was feasible.
They simply replied that they want a phone call. Of course, if the thing turns
out to be too expensive, the project collapses immediately,
but I think it's worth discussing. If nobody wants it, I can drop it right
now.
Please forward this mail to anyone who might be interested;
I don't think it's a good idea to scatter discussions across every ws Village
Pump.
Cheers
Aubrey
*From:* Andrea Zanni <andrea.zanni(a)wikimedia.it>
*Sent:* Friday, November 25, 2011 10:20 AM
*To:* support_eu(a)abbyy.com
*Subject:* Questions about server licenses
Dear ABBYY Europe,
my name is Andrea Zanni, and I'm a Board member of Wikimedia Italy,
the Italian chapter of Wikimedia movement.
We are a non-profit association which promotes and supports Wikimedia
projects,
such as the online encyclopedia Wikipedia.
I'm writing to you because I'm interested in knowing
about "server licences" for your new Finereader 11.
As far as I know, your product saves files in DjVu, and this is an
interesting feature that
could help some of our projects.
Maybe you know Wikisource, a multilingual digital library in which the
community uploads, transcribes and proofreads books.
This is the English version (http://en.wikisource.org/wiki/Main_Page).
On each page of each book (which are uploaded in DjVu), we have a little
"OCR" button
which used to call a Tesseract bot and OCR the page.
Right now, the bot doesn't work for lack of maintenance.
My idea would be to replace Tesseract with Finereader, and also to have
the possibility to
use other features, such as taking a PDF/JPEG file and saving it as an OCRed
DjVu, or choosing the language of the OCR from project to project.
Now, I do not have an estimate of how much this engine could be used (I
understand this is a crucial factor for the price of a server license).
I would count a few hundred pages OCRed per day (maybe more, if this
thing works), and a few dozen file conversions (any to DjVu) per day.
So, my questions are:
- do you have a rough idea how much this license would cost?
- do you know if it is possible to run FR11 on an OS other than Windows (we
actually run Solaris)?
- do you know if it is possible to have all these features via an API or
something?
Thank you for your time,
regards
Andrea Zanni
--
Wikimedia Italia Board
Sostieni la cultura, dona a Wikimedia Italia.
http://sostienilacultura.it
[sorry for cross-posting]
Hi all,
I just wanted to announce that the wikiguides produced by Wikimedia Italy
are now
on YouTube with subtitles in different languages. If you are interested in
translating them, just contact me.
The videos have proven to be a good introduction to the wiki world
(a fourth video on Wikiquote is in production),
and (at least on Wikisource) we have seen a significant improvement in
access and use of the website.
Hope you will enjoy them too.
*Wikisource*
http://www.youtube.com/watch?v=cR0g5ACaC-g
*Wikipedia*
http://www.youtube.com/watch?v=qoLBZ7_vY-k
*Commons*
http://www.youtube.com/watch?v=AOTlhuokVDs
Aubrey