Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

21 Jun 2009


On Sun, Jun 21, 2009 at 10:07 PM, Anthony wikimail@inbox.org wrote:
...
On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jayvdb@gmail.com wrote:
...
Whether Google is good or evil is off-topic, and irrelevant to boot.
Whether or not they have a right to exclude bots isn't.
Actually, it is.  This mailing list is about the Wikimedia Foundation
and its project, and this thread is about Wikisource.  Anyone who has
done significant amounts of Wikisource work will tell you that they
don't consider the Google Books click-through license to be a problem that
needs discussing at this level.
Do you think that 750,000 Google Books were manually converted to
DJVU, and copied over to Internet Archive?
Is there a book that you seek that isn't available at Internet Archive?
I wrote a GreaseMonkey user script to scrape the text from Google
Books; it is now broken and unmaintained because I no longer need to
take text from Google Books, as the vast majority of the texts I want
are now on Internet Archive, and that is a more productive workflow.
...
Also worth noting, Project Gutenberg has digitised less than 30,000
books since 1971.  Distributed Proofreaders has done 15,000 of those
since 2000, so throughput is picking up.  But, there are more than
enough to keep everyone busy for a very long time.
The interesting thing is, even if you don't use a bot, it's still faster to
copy/paste from Google manually than it is to get the book and scan it in
yourself (assuming you don't want to destroy the original, anyway).
No, it is quicker to download the DJVU file from Internet Archive,
upload it to Wikisource, set up a transcription project, and fix the
OCR text there, and copy and paste it wherever you like.
It takes about 10 minutes unless there is some copyright concern.
...
If you're going to make a project out of OCRing books that Google has already
OCRed, I don't see any point in reinventing the scanning or first pass
OCRing part.
I suggest you take a look at a few of the DJVU files provided by
Internet Archive.  Then you can point out real faults that you see.
--
John Vandenberg
