Scripto is an alternative to the ProofreadPage extension used
by Wikisource. It is based on MediaWiki but also on OpenLayers,
the software used to zoom and pan in OpenStreetMap.
The only website I have seen that uses Scripto is the U.S.
War Department papers, and in many ways it is clumsier
than ProofreadPage. But there might be a few ideas that could
be worth picking up. Take a look.
The software is described at http://scripto.org/
As for reference installations, they mention
http://wardepartmentpapers.org/transcribe.php
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se
Ron Unz, a long-time Wikimedia supporter, alerted me to this personal
project that he's been working on for a long time:
http://www.unz.org/
It's an archive of periodicals, books, and videos, some of which are
hosted there, some externally.
Examples:
http://www.unz.org/Publication/SaturdayRev
http://www.unz.org/Publication/Century
Timeslice from the outbreak of WWI:
http://www.unz.org/Publication/AllArticles?Period=1914aug
According to Ron, the system contains almost 400,000 authors and their
writings. A couple of examples of author pages:
http://www.unz.org/Author/MenckenHL
http://www.unz.org/Author/WhartonEdith
Ron believes that the copyright situation is clear -- that the material
is either PD due to age or lack of copyright renewal, or that he has
permission in some cases via licensing agreements. In any case,
there's quite a bit of unambiguously public domain stuff there that I
haven't seen digitized elsewhere, and it should be useful as a
research library for Wikipedians as well.
Cheers,
Erik
--
Erik Möller
VP of Engineering and Product Development, Wikimedia Foundation
Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
Wow, thank you all for the quick responses.
I'll try to reply in-line.
2011/11/28 Mathias Schindler <mathias.schindler(a)gmail.com>
> I recommend sticking and supporting open source technology that has
> been made available by third parties, such as
> http://code.google.com/p/ocropus/ /
> http://code.google.com/p/tesseract-ocr/
>
This is true, and this would be the optimal way, but apparently it failed.
I don't know why the OCR button is not running anymore; it seems to me that
when ThomasV left, things were not updated or something like that.
From my experience (I have used this software for professional projects),
the quality and the usability
of the software is very different. Of course, having Tesseract is better
than having nothing.
2011/11/28 Lars Aronsson <lars(a)aronsson.se>
> I think this is what the Internet Archive uses, as well as
> several European libraries. We could look into establishing
> a cooperation with the Internet Archive or perhaps with
> Europeana in this area. Maybe the Internet Archive can
> open up an API for OCR-ing a single page at a time?
>
This would be awesome :-)
I don't have a clue about the technicalities here; if you want to ask them, be
my guest :-)
@Tomasz
I think that Federico has a point in the approach he suggested:
in fact, I'm wondering how IA got its license; we should ask them.
Do we have any contact with the Internet Archive?
I know we could use IA directly for uploading PDFs (we already do it for
getting DjVus),
but it's still not the most usable way to deal with institutions or
ordinary users...
Aubrey
On 11/30/2011 09:55 PM, Eugene Zelenko wrote:
> ABBYY has its own online OCR service http://finereader.abbyyonline.com
This is very interesting, OCR as a cloud service. I didn't know they
were doing this. They charge EUR 7 per 200 pages, or US$ 0.05
per page, which I guess can be (almost) reasonable for the
Wikimedia Foundation to pay. I sometimes feel bad because I have
OCRed so many tens of thousands of pages with a single EUR 129
license of Finereader. Here, EUR 129 would buy us 3700 pages.
All languages of Wikisource together are proofreading slightly
less than 900 pages/day, for which OCR would cost EUR 32/day
or US$ 43/day. With good OCR, proofreading is more fun, and
these numbers may increase. But then again, we wouldn't need
the service for all pages, as some books already have OCR.
The most interesting feature of a cloud-based OCR service would be
if they can accumulate improvements in font training (?) and
dictionaries from a large number of users over time. With
Wikisource, they can of course get direct access to the page
after proofreading.
So, is the service any good? They even promise to do Fraktur
(blackletter). Does it work well?
--
Lars Aronsson (lars(a)aronsson.se)
Aronsson Datateknik - http://aronsson.se