Andrew Turvey wrote:
We had a discussion at a recent Wikimedia UK board meeting about potentially buying some digitisation equipment which could be used to generate content for the Wikimedia projects. This recent email to the EN-WP list sparked my interest.
Does anyone have any experience with equipment like this, and could you recommend anything? Any idea what the price range and quality typically is?
Also, is anyone else in the Wikimedia community currently doing this?
I'm on the board of Wikimedia Sverige (Sweden), and also the founder (in 1992) of Project Runeberg, the Scandinavian offspring of Project Gutenberg. The Swedish-language Wikisource isn't doing much, because Project Runeberg still does a lot of book scanning. Its archive now contains images of 550,000 book pages, corresponding to nearly 30 linear metres of shelving.
Book digitization is a matter of using the right tools for each job. Much depends on the kind of book and the kind of labour. If you use unpaid volunteers, you can afford slower equipment. If you need to pay your staff, any equipment that speeds up the work will quickly pay for itself, including those very expensive "professional" book scanners. On the scale of Google Book Search, which aims to digitize millions of books, it pays off to let Google engineers develop even faster equipment, just as Google develops its own Linux-based storage architecture.
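To make the volunteers-versus-paid-staff point concrete, here is a rough break-even sketch. All the numbers in it (equipment price, hourly wage, pages per hour) are made-up assumptions for illustration, not measurements from any real scanner or project.

```python
import math

def break_even_pages(extra_cost, hourly_wage, slow_pages_per_hour, fast_pages_per_hour):
    """Pages you must scan before faster equipment has paid for itself.

    extra_cost: price difference between the fast and the slow setup (euro).
    hourly_wage: what an hour of operator time costs you (0 for volunteers).
    Returns math.inf when the faster equipment never pays off in money terms.
    """
    if hourly_wage <= 0 or fast_pages_per_hour <= slow_pages_per_hour:
        return math.inf
    # Labour cost saved on every page by working faster.
    saving_per_page = hourly_wage * (1 / slow_pages_per_hour - 1 / fast_pages_per_hour)
    return extra_cost / saving_per_page

# Hypothetical figures: 10000 euro extra, 20 euro/hour, 100 vs 500 pages/hour.
paid = break_even_pages(10_000, 20, 100, 500)
volunteer = break_even_pages(10_000, 0, 100, 500)
print(round(paid), volunteer)
```

With these assumed figures, paid staff recover the extra cost after about 62,500 pages, while for unpaid volunteers the result is infinite: the expensive scanner never pays for itself in money, only in time.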
It's hard to measure the usefulness of a digitized book, since Wikisource and Project Runeberg don't have any income. (Neither does Google Book Search, I believe.) If your success is measured by how much money you spend (as some charities have it), it is very easy to invest a lot of money without much result. The worst you can do is to spend a lot to digitize something that is already available for free download. Look around first.
You should start by thinking about what you want to achieve. Is there some book, or genre of books, that would be really useful for Wikipedia to have on Wikisource? Anything British that all those American projects haven't already covered? For us from non-English speaking countries, it's far easier. Very little has been digitized, so there is a lot to do. The most useful thing is to digitize an old encyclopedia, just like the 11th edition of the Encyclopaedia Britannica (from 1911).
Now, encyclopedias are common items in used bookstores and online auctions. You can buy 20 volumes for 200 euro, or even cheaper. At that price, the best investment is a paper cutter (or ask a print shop to help you) and a two-sided (duplex) sheet-feeding scanner, such as the Canon DR-2050C or Fujitsu ScanSnap S510. http://www.youtube.com/watch?v=1oH3mQZLpL8
OCR software might be included with the scanner. Otherwise you can buy FineReader (www.finereader.com) for 160 euro.
The total investment would be less than 1000 euro (scanner + OCR software + 20 volume encyclopedia + ask a print shop to cut the spines). After this, you only need hours of volunteer work.
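As a quick sanity check of that budget, here is the arithmetic as a small sketch. Only the encyclopedia price, the OCR price and the 1000 euro ceiling come from this message; I give no figure for the scanner or the print shop, so the sketch only shows what is left over for those two items together.

```python
# Prices in euro. These three figures are the ones quoted in this message.
ENCYCLOPEDIA = 200   # 20 used volumes
OCR_SOFTWARE = 160   # FineReader
CEILING = 1000       # "less than 1000 euro"

def left_for_scanner_and_cutting(ceiling=CEILING):
    """Money remaining for the sheet-feeding scanner plus spine cutting."""
    return ceiling - ENCYCLOPEDIA - OCR_SOFTWARE

print(left_for_scanner_and_cutting())  # 640
```

So the scanner and the spine cutting together have to stay under 640 euro for the total to hold.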
That's how I digitized the "New Student's Reference Work" (from 1914, 5 volumes, some 2500 pages) for Wikisource in 2005, just to show that Wikisource could be used that way.
Here are some old pictures (with Swedish text, from 2001): http://runeberg.org/admin/snuff.html
These scanners and that OCR software are not open source products, but neither is my digital camera, and I use that to produce free pictures for Wikimedia Commons. I know there have been many attempts to make free OCR software, but is it any good?
On the other hand, if you have books whose spines you can't afford to cut, maybe some rare thing that you only find in a library, a 10 megapixel digital camera is very useful. You need to experiment a little with tripod stands and good lights. You only need to open the book to 90 degrees to get a good view of a page, which is much friendlier to the book than an old flatbed scanner (and faster). If you have two cameras, you can shoot left pages with one and right pages with the other. That is in fact how the fastest modern "book scanners" work. Google builds their own, and so can you. Some radical ideas are found on http://bkrpr.org/
Again, a pair of digital cameras is a total investment of less than 1000 euro. That's a good starting range. You can achieve a lot, and learn even more, in very little time.
What can you do with 2000 euro? Buy one sheet-feeding set, and one pair of digital cameras. Let two teams compete against each other. Write a report for next year's Wikimania. Have great fun!
wikisource-l@lists.wikimedia.org