We had a discussion at a recent Wikimedia UK board meeting about potentially buying some digitisation equipment which could be used to generate content for the Wikimedia projects. This recent email to the EN-WP list sparked my interest.
Does anyone have any experience with equipment like this, and could you recommend anything? Any idea what the price range and quality typically is?
Also, is anyone else in the Wikimedia community currently doing this?
Thanks,
----- Forwarded Message -----
From: "Steve Bennett" <stevagewp@gmail.com>
To: "English Wikipedia" <wikien-l@lists.wikimedia.org>
Sent: Sunday, 23 August, 2009 10:55:32 GMT +00:00 GMT Britain, Ireland, Portugal
Subject: Re: [WikiEN-l] Wikipedia reaches 3 millionth article
On Wed, Aug 19, 2009 at 11:15 PM, David Gerard <dgerard@gmail.com> wrote:
I believe they have machines to turn pages, and something to figure out the distorted photo of the book and render it how it would look as a flat page.
Yeah, there are videos of these machines. The book sits open, the scanner comes down and scans both open pages at once. As it goes up again, it sucks on one page, causing it to flip over. Then repeat.
Oh, look, here you go: http://www.youtube.com/watch?v=hlOQuuLYavY
And while we're at it: http://en.wikipedia.org/wiki/Book_scanning
Steve
_______________________________________________ WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
On Wed, Aug 26, 2009 at 6:27 PM, Andrew Turvey <andrewrturvey@googlemail.com> wrote:
We had a discussion at a recent Wikimedia UK board meeting about potentially buying some digitisation equipment which could be used to generate content for the Wikimedia projects. This recent email to the EN-WP list sparked my interest.
Does anyone have any experience with equipment like this, and could you recommend anything? Any idea what the price range and quality typically is?
Also, is anyone else in the Wikimedia community currently doing this?
For digitizing what?
Archive.org digitizes books using a pair of Canon 1Ds (? perhaps it was a 5D? In any case the 5DII would be sufficient now) on a custom stand, with a hacked-up copy of gphoto2 to actuate the cameras.
Turn the page, click a button... It avoids the stress on the books that a flatbed scanner would add and is faster to boot.
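As a rough illustration of that turn-the-page, click-a-button workflow, here is a minimal Python sketch that builds stock gphoto2 command lines for a two-camera stand. It is not the Archive's modified gphoto2. The USB port strings, the file-name pattern and the function names are all assumptions made up for the example.

```python
import subprocess

# Hypothetical USB ports; `gphoto2 --auto-detect` lists the real ones.
CAMERA_PORTS = {"a": "usb:001,004", "b": "usb:001,005"}


def capture_command(port, spread, side):
    """Build the gphoto2 command that shoots one page of an open spread."""
    filename = f"spread{spread:04d}_{side}.jpg"  # made-up naming scheme
    return [
        "gphoto2",
        "--port", port,
        "--capture-image-and-download",
        "--filename", filename,
    ]


def shoot_spread(spread, run=subprocess.run):
    """Trigger both cameras for one spread and return the commands issued."""
    commands = [capture_command(port, spread, side)
                for side, port in sorted(CAMERA_PORTS.items())]
    for command in commands:
        run(command, check=True)
    return commands
```

Calling shoot_spread(0), shoot_spread(1), and so on after each page turn would leave one image per camera per spread. The `run` parameter exists only so the function can be tried without cameras attached.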
I'm not sure how they're dealing with curvature (I think they just may lay a glass plate on the pages), but it would be easy enough to solve using a laser pointer with a pattern generating holographic grating and a second exposure to capture the page distortion and some fairly simple software processing after the fact.
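A much-simplified sketch of what the software processing step might look like follows. It assumes the second exposure has already been reduced to one number per image column: the row at which a projected straight laser line shows up. The image representation (a list of pixel rows) and all names are illustrative assumptions, and real page dewarping would also have to correct horizontal foreshortening.

```python
def flatten_columns(image, line_rows, fill=255):
    """Shift each column vertically so the observed laser line becomes straight.

    image: list of rows, each a list of pixel values.
    line_rows: for each column, the row where the laser line was seen.
    """
    height = len(image)
    width = len(image[0])
    target = min(line_rows)  # straighten the line onto its topmost observed row
    flat = [[fill] * width for _ in range(height)]
    for x in range(width):
        shift = line_rows[x] - target  # how far this column sags below target
        for y in range(height - shift):
            flat[y][x] = image[y + shift][x]
    return flat
```

Pixels that have no source after the shift are filled with `fill` (white by default).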
Gregory Maxwell wrote:
For digitizing what?
Exactly, that's the first question.
Archive.org digitizes books using a pair of Canon 1Ds (? perhaps it was a 5D? In any case the 5DII would be sufficient now) on a custom stand, with a hacked-up copy of gphoto2 to actuate the cameras.
That's Brewster Kahle doing things many years ago (2002? 2003?). Today, a much cheaper low-end digital SLR, or even a compact camera, will give you the 10 or so megapixels you need. But again, if you need to pay your staff, a camera ten times more expensive might easily pay for itself through increased speed or a longer shutter lifespan.
I'm not sure how they're dealing with curvature (I think they just may lay a glass plate on the pages), but it would be easy enough to solve using a laser pointer with a pattern generating holographic grating and a second exposure to capture the page distortion and some fairly simple software processing after the fact.
The Internet Archive apparently uses a fixed glass, and lowers the book cradle to turn pages, http://aipengineering.com/scribe/
Other designs have a fixed book cradle and lift the glass, e.g. the Atiz DIY, http://diy.atiz.com/
I thought the Internet Archive design was very clever, since it keeps a fixed distance from the lens to the book surface (beneath the glass), until I saw the bkrpr.org design, where you just lift everything. That's a design for 2009! I haven't tried to build one myself yet.
----
However, you can capture lots of books (that can be opened fully) with a single camera, laying the book flat on a table with a glass on top. That's just like a flatbed scanner (but much faster) turned upside down.
In January 2008, I used a 10 megapixel Canon EOS 400D (Digital Rebel XTi) with a 50 mm lens to shoot this page, with the book lying flat on a table under a glass, http://runeberg.org/stridfin/0226.html
On that webpage, the image is reduced to 120 dpi (1.2 megapixel), but the original is 300 dpi (7.5 megapixel). The map shown is reused in http://en.wikipedia.org/wiki/Battle_of_Alavus
That's an example of how one specialized book can be very useful for a limited Wikiproject. This book was published in 1909 for the 100th anniversary of the Finnish War (1808-1809), and digitized in 2008 for the 200th anniversary.
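The megapixel figures quoted for that page are consistent with each other, since the pixel count for the same physical page scales with the square of the dpi. A small sketch of that arithmetic (the function name is illustrative):

```python
def megapixels_at(dpi, known_dpi, known_megapixels):
    """Scale a pixel count to another resolution for the same physical page."""
    return known_megapixels * (dpi / known_dpi) ** 2


# 1.2 megapixels at 120 dpi, scaled up to the original 300 dpi
full = megapixels_at(300, known_dpi=120, known_megapixels=1.2)
print(f"{full:.1f} megapixels at 300 dpi")
```

The ratio 300/120 is 2.5, so the pixel count grows by a factor of 6.25, which takes 1.2 megapixels to 7.5.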
I love the fact that you can achieve such high-quality results with relatively cheap equipment. For many archives, I think it is probably easier to motivate people to scan pages manually than it is for us, but chapters and individual Wikimedians could probably be of much help with all the technical aspects: uploading material to Commons / Wikisource, getting the word out to other people, etc.
-- Hay
Commons-l mailing list Commons-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/commons-l
Hi!
Just tried this Sunday...
A 6 MP digital SLR camera plus a tripod is good enough to make images of book pages for OCR. Speed is about 10-15 seconds per page.
However, if you want good-quality images, you should use a regular scanner, adjust the page position, and press the book flat.
Eugene.
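To put 10-15 seconds per page in perspective, here is a small sketch of the throughput arithmetic. The 500-page book is a made-up example, not a figure from this thread.

```python
def pages_per_hour(seconds_per_page):
    """Pages captured in one hour at a steady per-page time."""
    return 3600 / seconds_per_page


def hours_for_book(pages, seconds_per_page):
    """Total capture time in hours for a book of the given length."""
    return pages * seconds_per_page / 3600


for seconds in (10, 15):
    print(f"{seconds} s/page: {pages_per_hour(seconds):.0f} pages/hour, "
          f"{hours_for_book(500, seconds):.1f} hours for a 500-page book")
```

That works out to roughly 240 to 360 pages per hour, before counting any time spent on setup or on checking the images.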
On Thu, Aug 27, 2009 at 8:27 AM, Andrew Turvey <andrewrturvey@googlemail.com> wrote:
We had a discussion at a recent Wikimedia UK board meeting about potentially buying some digitisation equipment which could be used to generate content for the Wikimedia projects. This recent email to the EN-WP list sparked my interest.
Does anyone have any experience with equipment like this, and could you recommend anything? Any idea what the price range and quality typically is?
Also, is anyone else in the Wikimedia community currently doing this?
This came up on the Australian Wikimedia list.
http://lists.wikimedia.org/pipermail/wikimediaau-l/2009-August/002606.html
I think it is terribly inefficient for Wikimedians to start mass scanning projects while we have so few people engaging in transcription projects. Libraries have scanned millions of books, and there are no signs that they are going to stop. Commons and Wikisource should be mining and transcribing the books which are already scanned.
http://lists.wikimedia.org/pipermail/wikimediaau-l/2009-August/002611.html
-- John Vandenberg
Andrew Turvey wrote:
We had a discussion at a recent Wikimedia UK board meeting about potentially buying some digitisation equipment which could be used to generate content for the Wikimedia projects. This recent email to the EN-WP list sparked my interest.
Does anyone have any experience with equipment like this, and could you recommend anything? Any idea what the price range and quality typically is?
Also, is anyone else in the Wikimedia community currently doing this?
I'm on the board of Wikimedia Sverige (Sweden), and also the founder (in 1992) of Project Runeberg, the Scandinavian offspring of Project Gutenberg. The Swedish-language Wikisource isn't doing much, because Project Runeberg still does a lot of book scanning. Its archive now contains images of 550,000 book pages, corresponding to nearly 30 linear metres of shelving.
Book digitization is a matter of using the right tools for each job. Much depends on the kind of book and the kind of labour. If you use unpaid volunteers, you can afford slower equipment. If you need to pay your staff, any equipment that speeds up the work will quickly pay its own cost, including those very expensive "professional" book scanners. On the scale of Google Book Search, aiming to digitize millions of books, it pays off to let Google engineers work on developing even faster equipment, just like Google develops its own Linux-based storage architecture.
It's hard to measure the usefulness of a digitized book, since Wikisource (and Project Runeberg) don't have any income. (And neither has Google Book Search, I believe.) If your success is measured in how much money you spend (as some charities have it), it is very easy to invest a lot of money without much result. The worst you can do is to spend a lot to digitize something that is already available for free download. Look around first.
You should start by thinking about what you want to achieve. Is there some book, or genre of books, that would be really useful for Wikipedia to have on Wikisource? Anything British that all those American projects haven't already covered? For us from non-English speaking countries, it's far easier. Very little has been digitized, so there is a lot to do. The most useful thing is to digitize an old encyclopedia, just like that 11th edition of Encyclopaedia Britannica (from 1911).
Now, encyclopedias are common items in used bookstores or online auctions. You can buy 20 volumes for 200 euro, or even cheaper. At that price, the best investment is a paper cutter (or ask a print shop to help you) and a two-sided (duplex) sheet-feeding scanner, such as the Canon DR-2050C or Fujitsu ScanSnap S510. http://www.youtube.com/watch?v=1oH3mQZLpL8
OCR software might be included with the scanner. Or you can buy www.finereader.com for 160 euro.
The total investment would be less than 1000 euro (scanner + OCR software + 20 volume encyclopedia + ask a print shop to cut the spines). After this, you only need hours of volunteer work.
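The budget can be partly checked with the two prices given in this message (about 200 euro for the encyclopedia and 160 euro for the OCR software). The scanner and print-shop prices are not stated, so this sketch only shows what the 1000 euro ceiling leaves over for them.

```python
BUDGET_EUR = 1000
KNOWN_COSTS_EUR = {
    "20-volume encyclopedia": 200,
    "OCR software": 160,
}


def remaining_budget(budget, known_costs):
    """Money left once the items with stated prices are paid for."""
    return budget - sum(known_costs.values())


left = remaining_budget(BUDGET_EUR, KNOWN_COSTS_EUR)
print(f"{left} euro left for the duplex scanner and spine cutting")
```

Staying under 1000 euro therefore means the scanner and the spine cutting together must cost less than 640 euro.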
That's how I digitized the "New Student's Reference Work" (from 1914, 5 volumes, some 2500 pages) for Wikisource in 2005, only to show that Wikisource could be used that way.
Here are some old pictures (with Swedish text, from 2001), http://runeberg.org/admin/snuff.html
These scanners and that OCR software are not open source products, but neither is my digital camera, and I use that to produce free pictures for Wikimedia Commons. I know there have been many attempts to make free OCR software, but is it any good?
In fact, if you have books where you can't afford to cut the spine, maybe some rare thing that you only find in a library, a 10 megapixel digital camera is very useful. You need to experiment a little with tripod stands and good lights. You only need to open the book at 90 degrees, to get a good view of a page, which is much friendlier to the book than an old flatbed scanner (and faster). If you have two cameras, you can shoot left pages with one, and right pages with the other. That's in fact how the fastest modern "book scanners" work. Google builds their own, and so can you. Some radical ideas are found on http://bkrpr.org/
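With one camera on left pages and one on right pages, the two image sequences have to be merged back into reading order afterwards. A minimal sketch, assuming each camera produced exactly one file per spread, in shooting order (the function and file names are hypothetical):

```python
def interleave_pages(left_images, right_images):
    """Merge per-camera image lists into reading order: L0, R0, L1, R1, ..."""
    if len(left_images) != len(right_images):
        raise ValueError("each spread needs one left and one right image")
    ordered = []
    for left, right in zip(left_images, right_images):
        ordered.extend([left, right])
    return ordered


pages = interleave_pages(["left_0000.jpg", "left_0001.jpg"],
                         ["right_0000.jpg", "right_0001.jpg"])
```

The length check is there because a single missed shot on one camera would otherwise silently shift every later page.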
Again, a pair of digital cameras is a total investment of less than 1000 euro. That's a good starting range. You can achieve a lot, and learn even more, in very little time.
What can you do with 2000 euro? Buy one sheet-feeding set, and one pair of digital cameras. Let two teams compete against each other. Write a report for next year's Wikimania. Have great fun!