[Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

John Vandenberg jayvdb at gmail.com
Sun Jun 21 12:35:31 UTC 2009

On Sun, Jun 21, 2009 at 10:07 PM, Anthony <wikimail at inbox.org> wrote:
> On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg <jayvdb at gmail.com> wrote:
> > Whether Google is good or evil is off-topic, and irrelevant to boot.
> >
> Whether or not they have a right to exclude bots isn't.

Actually, it is.  This mailing list is about the Wikimedia Foundation
and its projects, and this thread is about Wikisource.  Anyone who has
done a significant amount of Wikisource work will tell you that they
don't consider the Google Books click-through license to be a problem
that needs discussing at this level.

Do you think that 750,000 Google Books were manually converted to
DJVU, and copied over to Internet Archive?

Is there a book that you seek that isn't available at Internet Archive?

I wrote a Greasemonkey user script to scrape the text from Google
Books; it is now broken and unmaintained because I no longer need to
take text from Google Books.  The vast majority of the texts I want
are now on Internet Archive, and working from there is a more
productive workflow.

> > Also worth noting, Project Gutenberg has digitised fewer than 30,000
> > books since 1971.  Distributed Proofreaders has done 15,000 of those
> > since 2000, so throughput is picking up.  But, there are more than
> > enough to keep everyone busy for a very long time.
> >
> The interesting thing is, even if you don't use a bot, it's still faster to
> copy/paste from Google manually than it is to get the book and scan it in
> yourself (assuming you don't want to destroy the original, anyway).

No, it is quicker to download the DJVU file from Internet Archive,
upload it to Wikisource, set up a transcription project, fix the
OCR text there, and then copy and paste it wherever you like.

It takes about 10 minutes unless there is some copyright concern.

> If you're going to make a project out of OCRing books that Google has already
> OCRed, I don't see any point in reinventing the scanning or first pass
> OCRing part.

I suggest you take a look at a few of the DJVU files provided by
Internet Archive.  Then you can point out real faults that you see.

John Vandenberg
