[Commons-l] Digitisation equipment

Lars Aronsson lars at aronsson.se
Sat Aug 29 02:57:14 UTC 2009


Andrew Turvey wrote:

> We had a discussion at a recent Wikimedia UK board meeting about 
> potentially buying some digitisation equipment which could be 
> used to generate content for the Wikimedia projects. This recent 
> email to the EN-WP list sparked my interest.
> 
> Does anyone have any experience with equipment like this, and 
> could you recommend anything? Any idea what the price range and 
> quality typically is?
> 
> Also, is anyone else in the Wikimedia community currently doing 
> this?

I'm on the board of Wikimedia Sverige (Sweden), and also the 
founder (in 1992) of Project Runeberg, the Scandinavian offspring 
of Project Gutenberg.  The Swedish-language Wikisource isn't doing 
much, because Project Runeberg still does a lot of book scanning. 
Its archive now contains images of 550,000 book pages, 
corresponding to nearly 30 linear metres of shelving.

Book digitization is a matter of using the right tools for each 
job.  Much depends on the kind of book and the kind of labour. If 
you use unpaid volunteers, you can afford slower equipment. If you 
need to pay your staff, any equipment that speeds up the work will 
quickly pay its own cost, including those very expensive 
"professional" book scanners.  On the scale of Google Book Search, 
aiming to digitize millions of books, it pays off to let Google 
engineers work on developing even faster equipment, just like 
Google develops its own Linux-based storage architecture.

It's hard to measure the usefulness of a digitized book, since 
neither Wikisource nor Project Runeberg has any income.  (And 
neither has Google Book Search, I believe.) If your success is 
measured in how much money you spend (as some charities have it), 
it is very easy to invest a lot of money, without much result.
The worst you can do is to spend a lot to digitize something that 
is already available for free download.  Look around first.

You should start by thinking about what you want to achieve.  Is 
there some book, or genre of books, that would be really useful 
for Wikipedia to have on Wikisource?  Anything British that all 
those American projects haven't already covered?  For us from 
non-English speaking countries, it's far easier.  Very little has 
been digitized, so there is a lot to do.  The most useful thing is 
to digitize an old encyclopedia, just like the 11th edition of 
the Encyclopaedia Britannica (from 1911).

Now, encyclopedias are common items in used bookstores or online 
auctions.  You can buy 20 volumes for 200 euro, or even cheaper. 
At that price, the best investment is a paper cutter (or ask a 
print shop to help you) and a two-sided (duplex) sheet-feeding 
scanner, such as the Canon DR-2050C or Fujitsu ScanSnap S510.
http://www.youtube.com/watch?v=1oH3mQZLpL8

OCR software might be included with the scanner. Or you can buy 
FineReader (www.finereader.com) for 160 euro.

The total investment would be less than 1000 euro (scanner + OCR 
software + 20 volume encyclopedia + ask a print shop to cut the 
spines).  After this, you only need hours of volunteer work.
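
As a rough illustration of that arithmetic, here is a minimal 
sketch.  Only the 1000 euro ceiling, the 160 euro for OCR software 
and the 200 euro for the encyclopedia come from the figures above; 
the function name is made up for the example, and the scanner and 
print shop prices are deliberately left out because no figures are 
given for them.

```python
# Figures taken from the text above (all in euro).
BUDGET_EUR = 1000
KNOWN_COSTS_EUR = {
    "ocr_software": 160,   # FineReader price quoted above
    "encyclopedia": 200,   # 20 used volumes, "or even cheaper"
}


def remaining_budget(budget, known_costs):
    """Return what is left of the budget after the known costs."""
    return budget - sum(known_costs.values())


if __name__ == "__main__":
    left = remaining_budget(BUDGET_EUR, KNOWN_COSTS_EUR)
    # This is the ceiling for the scanner plus the spine cutting.
    print(f"Left for scanner and cutting: {left} euro")
```

Whatever the scanner and the cutting actually cost, they have to 
fit within that remainder for the total to stay under 1000 euro.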

That's how I digitized the "New Student's Reference Work" (from 
1914, 5 volumes, some 2500 pages) for Wikisource in 2005, just to 
show that Wikisource could be used that way.

Here are some old pictures (with Swedish text, from 2001):
http://runeberg.org/admin/snuff.html

These scanners and that OCR software are not open source products, 
but neither is my digital camera, and I use that to produce free
pictures for Wikimedia Commons.  I know there have been many 
attempts to make free OCR software, but is it any good?

In fact, if you have books where you can't afford to cut the 
spine, maybe some rare thing that you only find in a library, a 10 
megapixel digital camera is very useful. You need to experiment a 
little with tripod stands and good lights.  You only need to open 
the book at 90 degrees, to get a good view of a page, which is 
much friendlier to the book than an old flatbed scanner (and 
faster).  If you have two cameras, you can shoot left pages with 
one, and right pages with the other.  That's in fact how the 
fastest modern "book scanners" work.  Google builds their own, and 
so can you.  Some radical ideas are found on http://bkrpr.org/
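
With two cameras you end up with two separate series of images 
that have to be merged back into reading order.  A minimal sketch 
of that step follows.  It assumes each camera took exactly one 
shot per opening, in order, and that the left-hand page comes 
first in each opening; the function names and the file naming 
scheme are invented for the example and not taken from any 
particular tool.

```python
from itertools import zip_longest


def interleave_pages(left_shots, right_shots):
    """Merge per-camera shot lists into reading order.

    Assumes one shot per opening from each camera, both lists in
    shooting order, and the left-hand page first in each opening.
    """
    ordered = []
    for left, right in zip_longest(left_shots, right_shots):
        # One list may be one shot longer (e.g. a final single page).
        if left is not None:
            ordered.append(left)
        if right is not None:
            ordered.append(right)
    return ordered


def rename_plan(ordered, first_number=1, ext=".jpg"):
    """Pair each source image with a zero-padded sequential name."""
    return [
        (src, f"{number:04d}{ext}")
        for number, src in enumerate(ordered, start=first_number)
    ]


if __name__ == "__main__":
    left = ["camA_001.jpg", "camA_002.jpg"]
    right = ["camB_001.jpg", "camB_002.jpg"]
    for src, dst in rename_plan(interleave_pages(left, right)):
        print(src, "->", dst)
```

The sketch only produces a rename plan and does not touch any 
files, so a missed or doubled shot can be spotted before anything 
is renamed.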

Again, a pair of digital cameras is a total investment of less 
than 1000 euro.  That's a good starting range.  You can achieve a 
lot, and learn even more, in very little time.

What can you do with 2000 euro?  Buy one sheet-feeding set, and 
one pair of digital cameras.  Let two teams compete against each 
other. Write a report for next year's Wikimania. Have great fun!


-- 
  Lars Aronsson (lars at aronsson.se)
  Aronsson Datateknik - http://aronsson.se

  Project Runeberg - free Nordic literature - http://runeberg.org/

  Wikimedia Sverige - stöd fri kunskap - http://wikimedia.se/


