Some of you know that I (LA2) am the founder of Project Runeberg, a website (runeberg.org) where we scan old books from Sweden, Denmark, Finland, Norway, and Iceland, including several works that are recycled for Wikipedia. In promoting the way we work, my biggest obstacle is that the world at large is so unwilling to learn Swedish. You don't know what you're missing!
So I found and bought "The New Student's Reference Work", a little encyclopedia in five volumes, published in Chicago in 1914. As it was published before 1923, it is now in the public domain. Since this non-Scandinavian work doesn't fit in Project Runeberg, I put it in Wikisource.
First I scanned images (300 dpi JPEG) of all 2791 pages and uploaded them to Wikimedia Commons, where you will find them in the category http://commons.wikimedia.org/wiki/Category:LA2-NSRW
Then, for each book page, I created a wiki page on Wikisource displaying the scanned image and containing the raw OCR text. If you want to help in proofreading, use two separate browser windows to open the enlarged image and edit the wiki text.
Finally, I made a front page with a short preface and a rough table of contents, which is your starting point:
http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work
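The per-page step described above (one wiki page per book page, showing the scan next to the raw OCR text) could be sketched roughly as follows. This is only an illustration, not the script actually used: the function names, the thumbnail width, and the page layout are assumptions, and the file-name pattern is merely inferred from the one example name LA2-NSRW-3-0473.jpg, assuming the last number is a zero-padded page number within the volume.

```python
def image_name(volume: int, page: int) -> str:
    """Build a Commons file name such as LA2-NSRW-3-0473.jpg.

    Assumption: the pattern is LA2-NSRW-<volume>-<4-digit page>.jpg,
    inferred from a single example file name.
    """
    return f"LA2-NSRW-{volume}-{page:04d}.jpg"


def page_wikitext(volume: int, page: int, ocr_text: str, thumb_width: int = 300) -> str:
    """Return wikitext for one book page: a thumbnail of the scan, then raw OCR text.

    The layout (right-aligned thumbnail, OCR text below) is a hypothetical
    choice for illustration only.
    """
    image = image_name(volume, page)
    # Standard MediaWiki image link; clicking the thumbnail opens the full scan.
    image_link = f"[[Image:{image}|thumb|right|{thumb_width}px|Scanned page]]"
    return f"{image_link}\n\n{ocr_text.strip()}\n"


if __name__ == "__main__":
    # Placeholder OCR text, not taken from the encyclopedia.
    print(page_wikitext(3, 473, "Raw OCR text of the page goes here."))
```

A proofreader would then open the full-size image in one browser window and edit the text part of the page in the other.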
Some trivia:
It turns out that this work has a historic relationship to Encyclopædia Britannica. The main editor, Chandler B. Beach, started out as a salesman for Britannica, but in 1892 set out to create his own, smaller and more easily sold "Youth's Cyclopedia". His companion F.E. Compton took over the firm in 1907 and later produced "Compton's Encyclopedia". In 1941 this competitor was acquired by Encyclopædia Britannica, Inc.
Someone has already noticed that the illustration plate for Peanuts, http://commons.wikimedia.org/wiki/Image:LA2-NSRW-3-0473.jpg , is identical to the one found in Koehler's Medicinal-Plants (1887), http://commons.wikimedia.org/wiki/Image:Koeh-163.jpg
So reuse of encyclopedic material is indeed nothing new.
Lars Aronsson wrote:
> So I found and bought "The New Student's Reference Work", a little encyclopedia in five volumes, published in Chicago in 1914. As it was published before 1923, it is now in the public domain. Since this non-Scandinavian work doesn't fit in Project Runeberg, I put it in Wikisource.
> First I scanned images (300 dpi JPEG) of all 2791 pages and uploaded them to Wikimedia Commons, where you will find them in the category http://commons.wikimedia.org/wiki/Category:LA2-NSRW
> Then, for each book page, I created a wiki page on Wikisource displaying the scanned image and containing the raw OCR text. If you want to help in proofreading, use two separate browser windows to open the enlarged image and edit the wiki text.
I don't mean this to sound particularly harsh, but I'm wondering why we're doing this on Wikisource, when Distributed Proofreaders for Project Gutenberg already has a well-debugged workflow for taking texts from images to OCR to proofread to a final version. Is there an advantage to starting our own project that does the same thing they already do pretty well?
-Mark
Delirium wrote:
> I don't mean this to sound particularly harsh, but I'm wondering why we're doing this on Wikisource, when Distributed Proofreaders for Project Gutenberg already has a well-debugged workflow for taking texts from images to OCR to proofread to a final version. Is there an advantage to starting our own project that does the same thing they already do pretty well?
This is a fascinating topic with many facets.
My starting point was not PGDP vs Wikisource, but Project Runeberg's existing software and workflow vs MediaWiki.
I hope that the text and illustrations of this old encyclopedia are useful on their own. Having the scanned images on Wikimedia Commons makes them easily accessible. Having the text on Wikisource, where it can be proofread with wiki markup, makes it easy to reuse in Wikipedia and other projects.
But this is also a demonstration of a new and different principle for digitization and proofreading of old books. This started out as an informal discussion during Wikimania, as you can read on http://meta.wikimedia.org/wiki/User:LA2#Digitizing_books_with_MediaWiki
One argument against PGDP's current solution is that it is only a workflow, not a wiki. Once they finish proofing a book and ship the e-text to Project Gutenberg, there is no way to go back and correct errors (the wiki way). They also don't publish the scanned images, so if you suspect an error (rightly or wrongly) in a Project Gutenberg e-text, you cannot go back and look at the scanned image. These drawbacks are overcome by publishing the scanned images and using a wiki approach to proofreading (never-ending, non-linear). This is what Project Runeberg does, and what this new MediaWiki/Wikisource demo does.
People are asking me why I don't publish Project Runeberg's software to allow other projects to be started with a similar structure. My answer is that the source code is ugly, not developed to be distributed. Instead of cleaning up that code, it would make more sense to improve MediaWiki to support digitization and proofreading. In fact, MediaWiki can already be used for this. How can I demonstrate that? By starting my own MediaWiki site, of course. But since I'm lazy I decided instead to use Wikisource for my little demo.
I think it is reasonable to question why Wikisource exists at all, since we already have all these other digitization projects. My point was never to redefine Wikisource. I just happened to use it for this encyclopedia. Some people in the English Wikisource "Scriptorium" (community discussion) have welcomed my initiative.
Delirium wrote:
> Lars Aronsson wrote:
>> So I found and bought "The New Student's Reference Work", a little encyclopedia in five volumes, published in Chicago in 1914. As it was published before 1923, it is now in the public domain. Since this non-Scandinavian work doesn't fit in Project Runeberg, I put it in Wikisource.
>> First I scanned images (300 dpi JPEG) of all 2791 pages and uploaded them to Wikimedia Commons, where you will find them in the category http://commons.wikimedia.org/wiki/Category:LA2-NSRW
>> Then, for each book page, I created a wiki page on Wikisource displaying the scanned image and containing the raw OCR text. If you want to help in proofreading, use two separate browser windows to open the enlarged image and edit the wiki text.
> I don't mean this to sound particularly harsh, but I'm wondering why we're doing this on Wikisource, when Distributed Proofreaders for Project Gutenberg already has a well-debugged workflow for taking texts from images to OCR to proofread to a final version. Is there an advantage to starting our own project that does the same thing they already do pretty well?
I view this event as what Wikisource is all about. Project Runeberg has been doing this sort of thing all along: a stable JPEG image of the original page to ensure historical accuracy, and an editable, OCR-initiated product that can easily be wikified, footnoted or translated as circumstances require. This goes well beyond the capabilities of Project Gutenberg.
Ec
wikipedia-l@lists.wikimedia.org