On Sun, Jun 21, 2009 at 10:07 PM, Anthony wikimail@inbox.org wrote:
On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jayvdb@gmail.com wrote:
Whether Google is good or evil is off-topic, and irrelevant to boot.
Whether or not they have a right to exclude bots isn't.
Actually, it is. This mailing list is about the Wikimedia Foundation and its projects, and this thread is about Wikisource. Anyone who has done significant amounts of Wikisource work will tell you that they don't consider the Google Books click-through license to be a problem that needs discussing at this level.
Do you think that 750,000 Google Books were manually converted to DJVU, and copied over to Internet Archive?
Is there a book that you seek that isn't available at Internet Archive?
I wrote a Greasemonkey user script to scrape the text from Google Books. It is now broken and unmaintained because I no longer need to take text from Google Books: the vast majority of the texts I want are now on Internet Archive, and that is a more productive workflow.
Also worth noting: Project Gutenberg has digitised fewer than 30,000 books since 1971. Distributed Proofreaders has done 15,000 of those since 2000, so throughput is picking up. But there are more than enough to keep everyone busy for a very long time.
The interesting thing is, even if you don't use a bot, it's still faster to copy/paste from Google manually than it is to get the book and scan it in yourself (assuming you don't want to destroy the original, anyway).
No, it is quicker to download the DJVU file from Internet Archive, upload it to Wikisource, set up a transcription project, fix the OCR text there, and then copy and paste it wherever you like.
It takes about 10 minutes unless there is some copyright concern.
If you're going to make a project out of OCRing books that Google has already OCRed, I don't see any point in reinventing the scanning or first-pass OCRing part.
I suggest you take a look at a few of the DJVU files provided by Internet Archive. Then you can point out real faults that you see.
-- John Vandenberg