Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship


Parker Higgins

20 Jun 2009

7:27 p.m.

Except Google isn't asserting any kind of copyright control over these books, they're just not making it convenient to download them in your preferred format. Maybe not The Right Thing, but not as boneheaded as suing a party who reprints public domain material, as was the case in Feist v. Rural (the Supreme Court case you mention.)

Sent from my portable e-mail unit

On Jun 20, 2009 3:23 PM, "Geoffrey Plourde" geo.plrd@yahoo.com wrote:

For some reason, I am reminded of a Supreme Court case about the information in telephone directories. Maybe because of the insanity of trying to put public domain material under copyright.

________________________________
From: Brian Brian.Mingus@colorado.edu
To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org
Sent: Saturday, June 20, 2009 11:47:28 AM
Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship

That is against the law. It violates Google's ToS. I'm mostly complaining that Google is being Ver...


Stephen Bain

21 Jun 2009

1:55 a.m.


On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhiggins@gmail.com wrote:

...

Except Google isn't asserting any kind of copyright control over these books, they're just not making it convenient to download them in your preferred format. Maybe not The Right Thing, but not as boneheaded as suing a party who reprints public domain material, as was the case in Feist v. Rural (the Supreme Court case you mention.)

They want people to use their service. Fair enough, given that the scanning and OCRing happened on their dime.

-- Stephen Bain stephen.bain@gmail.com

Ray Saintonge

5:51 a.m.


Stephen Bain wrote:

...

On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhiggins@gmail.com wrote:

...
Except Google isn't asserting any kind of copyright control over these books, they're just not making it convenient to download them in your preferred format. Maybe not The Right Thing, but not as boneheaded as suing a party who reprints public domain material, as was the case in Feist v. Rural (the Supreme Court case you mention.)

They want people to use their service. Fair enough, given that the scanning and OCRing happened on their dime.

How does that give them any special rights? There are no database protection laws in the US, and sweat-of-the-brow has been rejected as a basis for new copyrights.

Anthony

11:17 a.m.


On Sun, Jun 21, 2009 at 1:51 AM, Ray Saintonge saintonge@telus.net wrote:

...

Stephen Bain wrote:

...
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhiggins@gmail.com wrote:

...
Except Google isn't asserting any kind of copyright control over these books, they're just not making it convenient to download them in your preferred format. Maybe not The Right Thing, but not as boneheaded as suing a party who reprints public domain material, as was the case in Feist v. Rural (the Supreme Court case you mention.)

They want people to use their service. Fair enough, given that the scanning and OCRing happened on their dime.

How does that give them any special rights? There are no database protection laws in the US, and sweat-of-the-brow has been rejected as a basis for new copyrights.

You're right, it doesn't give them any *special* rights. They have the same rights as any other computer owner. Specifically, they have the right to choose who uses their computers, and how they use them. Whether or not a terms of service is legally binding is really not the issue. (*) The issue is whether or not they have a duty to make it *convenient* for you to download the data. Of course they don't. Why should they be required to help you put them out of business? That kind of twisted logic might make sense in the non-profit world (although I still haven't seen the WMF step up to the plate and make it easy for people to make a full history fork, or even to download all the images), but Google is not a non-profit organization. Google would be Evil if it *didn't* protect itself against this, as it'd be breaking a promise to its shareholders.

(*) Personally, I'm of the opinion that merely accessing a website is not sufficient to bind a websurfer to a TOS, and that at most a TOS which you do not have to even click "agree" to is a unilateral contract which can only impose promises upon the offeror, though this is not a legal opinion but merely my opinion of what the law should be.

Anthony

11:33 a.m.


On Sun, Jun 21, 2009 at 7:17 AM, Anthony wikimail@inbox.org wrote:

...

(*) Personally, I'm of the opinion that merely accessing a website is not sufficient to bind a websurfer to a TOS, and that at most a TOS which you do not have to even click "agree" to is a unilateral contract which can only impose promises upon the offeror, though this is not a legal opinion but merely my opinion of what the law should be.

You know what, after further thought I'm going to withdraw that. First of all, I think Google does require you to click agree before you can access the service we're talking about. But more importantly, I'm going to cast doubt on my previously held opinion of whether or not a TOS should be able to bind someone who didn't click on anything. If I leave a bunch of apples on the table at work and put next to them a sign that says "Apples: $.25 each"... I don't know, I'll have to think about it.

John Vandenberg

11:54 a.m.


On Sun, Jun 21, 2009 at 9:17 PM, Anthony wikimail@inbox.org wrote:

...

On Sun, Jun 21, 2009 at 1:51 AM, Ray Saintonge saintonge@telus.net wrote:

...
Stephen Bain wrote:

...
On Sun, Jun 21, 2009 at 5:27 AM, Parker Higgins parkerhiggins@gmail.com wrote:

...
Except Google isn't asserting any kind of copyright control over these books, they're just not making it convenient to download them in your preferred format. Maybe not The Right Thing, but not as boneheaded as suing a party who reprints public domain material, as was the case in Feist v. Rural (the Supreme Court case you mention.)

They want people to use their service. Fair enough, given that the scanning and OCRing happened on their dime.

How does that give them any special rights? There are no database protection laws in the US, and sweat-of-the-brow has been rejected as a basis for new copyrights.

You're right, it doesn't give them any *special* rights. They have the same rights as any other computer owner. Specifically, they have the right to choose who uses their computers, and how they use them. Whether or not a terms of service is legally binding is really not the issue. (*) The issue is whether or not they have a duty to make it *convenient* for you to download the data. Of course they don't. Why should they be required to help you put them out of business? That kind of twisted logic might make sense in the non-profit world (although I still haven't seen the WMF step up to the plate and make it easy for people to make a full history fork, or even to download all the images), but Google is not a non-profit organization. Google would be Evil if it *didn't* protect itself against this, as it'd be breaking a promise to its shareholders.

(*) Personally, I'm of the opinion that merely accessing a website is not sufficient to bind a websurfer to a TOS, and that at most a TOS which you do not have to even click "agree" to is a unilateral contract which can only impose promises upon the offeror, though this is not a legal opinion but merely my opinion of what the law should be.

Whether Google is good or evil is off-topic, and irrelevant to boot.

There are nearly _750,000_ books from Google that are available on archive.org, available in DJVU format with OCR.

http://www.archive.org/details/googlebooks

Microsoft donated many texts directly to IA, but that approach only netted 440,000 books.

http://www.archive.org/details/msn_books

See here for more of the collections: http://www.archive.org/details/texts

Also worth noting, Project Gutenberg has digitised fewer than 30,000 books since 1971. Distributed Proofreaders has done 15,000 of those since 2000, so throughput is picking up. But there are more than enough to keep everyone busy for a very long time.

-- John Vandenberg

Anthony

12:07 p.m.


On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jayvdb@gmail.com wrote:

...

Whether Google is good or evil is off-topic, and irrelevant to boot.

Whether or not they have a right to exclude bots isn't.

Also worth noting, Project Gutenberg has digitised fewer than 30,000 books since 1971. Distributed Proofreaders has done 15,000 of those since 2000, so throughput is picking up. But there are more than enough to keep everyone busy for a very long time.

The interesting thing is, even if you don't use a bot, it's still faster to copy/paste from Google manually than it is to get the book and scan it in yourself (assuming you don't want to destroy the original, anyway).

If you're going to make a project out of OCRing books that Google has already OCRed, I don't see any point in reinventing the scanning or first-pass OCRing part.

John Vandenberg

12:35 p.m.


On Sun, Jun 21, 2009 at 10:07 PM, Anthony wikimail@inbox.org wrote:

...

On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jayvdb@gmail.com wrote:

...
Whether Google is good or evil is off-topic, and irrelevant to boot.

Whether or not they have a right to exclude bots isn't.

Actually, it is. This mailing list is about the Wikimedia Foundation and its projects, and this thread is about Wikisource. Anyone who has done significant amounts of Wikisource work will tell you that they don't consider the Google Books click-through license to be a problem that needs discussing at this level.

Do you think that 750,000 Google Books were manually converted to DJVU, and copied over to Internet Archive?

Is there a book that you seek that isn't available at Internet Archive?

I wrote a Greasemonkey user script to scrape the text from Google Books; it is now broken and unmaintained because I no longer need to take text from Google Books. The vast majority of the texts I want are now on Internet Archive, and that is a more productive workflow.

...

Also worth noting, Project Gutenberg has digitised fewer than 30,000 books since 1971. Distributed Proofreaders has done 15,000 of those since 2000, so throughput is picking up. But there are more than enough to keep everyone busy for a very long time.

The interesting thing is, even if you don't use a bot, it's still faster to copy/paste from Google manually than it is to get the book and scan it in yourself (assuming you don't want to destroy the original, anyway).

No, it is quicker to download the DJVU file from Internet Archive, upload it to Wikisource, set up a transcription project, and fix the OCR text there, and copy and paste it wherever you like.

It takes about 10 minutes unless there is some copyright concern.

...

If you're going to make a project out of OCRing books that Google has already OCRed, I don't see any point in reinventing the scanning or first-pass OCRing part.

I suggest you take a look at a few of the DJVU files provided by Internet Archive. Then you can point out real faults that you see.

-- John Vandenberg

Anthony

2:23 p.m.


On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jayvdb@gmail.com wrote:

...

I suggest you take a look at a few of the DJVU files provided by Internet Archive. Then you can point out real faults that you see.

I will. My apologies for misunderstanding your email.

Anthony

2:55 p.m.


On Sun, Jun 21, 2009 at 10:23 AM, Anthony wikimail@inbox.org wrote:

...

On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jayvdb@gmail.com wrote:

...
I suggest you take a look at a few of the DJVU files provided by Internet Archive. Then you can point out real faults that you see.

I will. My apologies for misunderstanding your email.

Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to be the first book I randomly picked from Google Book Search. There's no text version.

And the text version I find of other editions seems to be much much worse than the google OCR results.

Anthony

3:20 p.m.


On Sun, Jun 21, 2009 at 10:55 AM, Anthony wikimail@inbox.org wrote:

...

On Sun, Jun 21, 2009 at 10:23 AM, Anthony wikimail@inbox.org wrote:

...
On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg jayvdb@gmail.com wrote:

...
I suggest you take a look at a few of the DJVU files provided by Internet Archive. Then you can point out real faults that you see.

I will. My apologies for misunderstanding your email.

Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to be the first book I randomly picked from Google Book Search. There's no text version.

And the text version I find of other editions seems to be much much worse than the google OCR results.

http://books.google.com/books?id=TZ0UAAAAYAAJ strike two, not even there.
http://books.google.com/books?id=PYAaAAAAYAAJ strike three
http://www.archive.org/details/happinessessays00hiltgoog finally...let's compare the OCR:

"Great numbers of thoughtful people are just now much perplexed to know what to make of the faffs of life, and are looking about them for some reasonable interpretation of the modern world. They cannot abandon the work of the world, but they are conscious that they have not learned the art of work."

"Greaf numbers of thoughtful people are just now much perplexed to know what to make of thefaSls of life^ and are looking about them for some reasonable interpretation of the modem world. They cannot abandon the work of the worlds but they are conscious that they have not learned the art of work."

---

"Few people, however, really know how to work, and even in an age when oftener perhaps than ever before we hear of "work" and "workers" one cannot observe that the art of work makes much positive progress. On the contrary, the general inclination seems to be to work as little as possible, or to work for a short time in order to pass the remainder of one's life in rest."

"Few people, however, really know how to work, and even in an age when oftener perhaps than ever before we hear of" work " and " workers " one cannotobserve that the art of work makes much positive progress. On the contrary, the general inclination seems to be to work as little as possible, or to work for a short time in order to pass the remainder of one's life in rest. "

---

I guess that's acceptable. The Catholic encyclopedia results were much worse, though. Maybe it was a font thing, but I'm not quite interested enough to bother doing a more in depth study right now.
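For anyone who does want a slightly more systematic comparison than eyeballing two passages, here is a minimal sketch using Python's standard `difflib`. The function names and the `sample_a`/`sample_b` labels are made up for illustration (the thread does not say which passage came from which source), and a similarity ratio is only a rough proxy for OCR quality, not a proper error-rate measurement.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def ocr_similarity(a: str, b: str) -> float:
    """Return a 0..1 character-level similarity ratio between two OCR texts."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def differing_words(a: str, b: str) -> List[Tuple[str, str]]:
    """Return (words_in_a, words_in_b) pairs for each stretch where the texts disagree."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Opening words of the two passages quoted above (labels are arbitrary).
sample_a = ("Great numbers of thoughtful people are just now much perplexed "
            "to know what to make of the faffs of life,")
sample_b = ("Greaf numbers of thoughtful people are just now much perplexed "
            "to know what to make of thefaSls of life^")

if __name__ == "__main__":
    print(f"similarity: {ocr_similarity(sample_a, sample_b):.3f}")
    for left, right in differing_words(sample_a, sample_b):
        print(f"{left!r} vs {right!r}")
```

Running the same two functions over a whole page from each source would give a crude way to rank them before any human proofreading.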

Ray Saintonge

22 Jun 2009

4:33 a.m.


Anthony wrote:

...

On Sun, Jun 21, 2009 at 10:55 AM, Anthony wrote:

...
Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to be the first book I randomly picked from Google Book Search. There's no text version.

And the text version I find of other editions seems to be much much worse than the google OCR results.

http://books.google.com/books?id=TZ0UAAAAYAAJ strike two, not even there. http://books.google.com/books?id=PYAaAAAAYAAJ strike three http://www.archive.org/details/happinessessays00hiltgoog finally...let's compare the OCR:

"Great numbers of thoughtful people are just now much perplexed to know what to make of the faffs of life, and are looking about them for some reasonable interpretation of the modern world. They cannot abandon the work of the world, but they are conscious that they have not learned the art of work."

"Greaf numbers of thoughtful people are just now much perplexed to know what to make of thefaSls of life^ and are looking about them for some reasonable interpretation of the modem world. They cannot abandon the work of the worlds but they are conscious that they have not learned the art of work."

"Few people, however, really know how to work, and even in an age when oftener perhaps than ever before we hear of "work" and "workers" one cannot observe that the art of work makes much positive progress. On the contrary, the general inclination seems to be to work as little as possible, or to work for a short time in order to pass the remainder of one's life in rest."

"Few people, however, really know how to work, and even in an age when oftener perhaps than ever before we hear of" work " and " workers " one cannotobserve that the art of work makes much positive progress. On the contrary, the general inclination seems to be to work as little as possible, or to work for a short time in order to pass the remainder of one's life in rest. "

I guess that's acceptable. The Catholic encyclopedia results were much worse, though. Maybe it was a font thing, but I'm not quite interested enough to bother doing a more in depth study right now.

Who is expecting OCR to be perfect anywhere? In the absence of real human proofreading I assume any OCR material to be fraught with errors. Wikisource aims to accurately reproduce what was published, including original errors. Scans alone provide the needed accuracy, but they are not suitable for the added value of wikification.

Platonides

11:23 p.m.


Anthony wrote:

...

On Sun, Jun 21, 2009 at 7:54 AM, John Vandenberg jayvdb@gmail.com wrote:

...
Whether Google is good or evil is off-topic, and irrelevant to boot.

Whether or not they have a right to exclude bots isn't.

Also worth noting, Project Gutenberg has digitised fewer than 30,000 books since 1971. Distributed Proofreaders has done 15,000 of those since 2000, so throughput is picking up. But there are more than enough to keep everyone busy for a very long time.

The interesting thing is, even if you don't use a bot, it's still faster to copy/paste from Google manually than it is to get the book and scan it in yourself (assuming you don't want to destroy the original, anyway).

If you're going to make a project out of OCRing books that Google has already OCRed, I don't see any point in reinventing the scanning or first-pass OCRing part.

IMHO the interesting bit would be to make a Google Books browser that prefills the wiki editor.

Platonides

23 Jun 2009

1:15 a.m.


Anthony wrote:

...

(although I still haven't seen the WMF step up to the plate and make it easy for people to make a full history fork, or even to download all the images)

You'll find full history dumps of almost all wikis at http://download.wikimedia.org/

Although not trivial, downloading all images is in fact quite easy. You can find ready-made scripts to do that. You can also ask Brion to rsync3 them. But do you have enough space to dedicate? How many wikis do you want to mirror? Just Commons is more than 3 TB...

That's the reason so few people were interested in the images when the image dump was available.

Peter Gervai

9:26 a.m.


On Tue, Jun 23, 2009 at 03:15, Platonides Platonides@gmail.com wrote:

...

Although not trivial, downloading all images is in fact quite easy. You can find scripts to do that already made. You can also ask Brion to rsync3 them. But do you have enough space to dedicate? How many wikis do you want to mirror? Just commons is more than 3 TB...

Well, disks are cheap nowadays. If it's really just a question of asking, I may be interested, for example.

The more complex question is the conditions of such usage, meaning what I can do with the images after I've got them. This is the main reason for not publishing them in the first place: the images themselves don't indicate any particular license.

Now that I've written this, it would be possible (not sure if feasible, though) to publish CC-BY-SA pictures with author info in the comment field of the image file itself. Most image formats support sizeable comment blocks, and standardised templates make it possible to select media by license and to get the author/copyright info to put into the file.
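As a rough illustration of the comment-block idea for one format, here is a sketch that writes `Author` and `Copyright` entries into a PNG as `tEXt` chunks, using only the Python standard library. It is an assumption-laden example, not anything Wikimedia does: the helper names, the 1x1 test image and the "Example Uploader" value are all made up, and a real tool would pull the author and license from the file description page.

```python
import struct
import zlib
from typing import Dict

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type+data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def add_text_chunks(png: bytes, fields: Dict[str, str]) -> bytes:
    """Return a copy of the PNG with one tEXt chunk per field, placed right after IHDR."""
    if not png.startswith(PNG_SIGNATURE) or png[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    (ihdr_len,) = struct.unpack(">I", png[8:12])
    ihdr_end = 8 + 4 + 4 + ihdr_len + 4  # signature + length + type + data + CRC
    text = b"".join(
        # tEXt payload is: Latin-1 keyword, NUL separator, Latin-1 text.
        make_chunk(b"tEXt", key.encode("latin-1") + b"\x00" + value.encode("latin-1"))
        for key, value in fields.items()
    )
    return png[:ihdr_end] + text + png[ihdr_end:]


def minimal_png() -> bytes:
    """A 1x1 8-bit greyscale PNG, used here only as demo input."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    tagged = add_text_chunks(
        minimal_png(),
        {"Author": "Example Uploader", "Copyright": "CC-BY-SA"},  # hypothetical values
    )
    print(len(tagged), "bytes, tEXt present:", b"tEXt" in tagged)
```

JPEG and other formats have their own comment or metadata segments, so each format would need its own writer; the license-selection part would still have to come from the templates on the wiki.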

...

That's the reason so few people were interested in the images when the image dump was available.

People are interested, generally, but not in mirroring the whole shebang. :-)

grin

Anthony

12:59 p.m.


On Mon, Jun 22, 2009 at 9:15 PM, Platonides Platonides@gmail.com wrote:

...

Anthony wrote:

...
(although I still haven't seen the WMF step up to the plate and make it easy for people to make a full history fork, or even to download all the images)

You'll find full history dumps of almost all wikis at http://download.wikimedia.org/

Key word being "almost".

Although not trivial, downloading all images is in fact quite easy.

Yep. All I need is permission.

...

But do you have enough space to dedicate?

Not at the moment. No sense in buying the drives when I don't have permission to fill them up.

...

How many wikis do you want to mirror? Just commons is more than 3 TB...

Commons and En.wikipedia would probably be good for starters.

The main thing I want is permission to scrape en.wikipedia, though. (Not really scraping, as I'd probably use the API and Special:Export. Basically I just would like someone official to tell me how *fast* I'm allowed to use the API and Special:Export. Special:Export especially, because I could easily overwhelm the servers using that, due to a bug in the script.)
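To make the "how fast" question concrete, this is roughly the kind of client-side throttle I have in mind, sketched in Python. The one-second interval is a placeholder I picked, not a limit anyone official has given me, and `Throttle` and `export_url` are names invented for this example. The sketch only builds the Special:Export URL and spaces out the calls; it does not fetch anything.

```python
import time
from urllib.parse import quote


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested without real delays
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def export_url(title, base="https://en.wikipedia.org/wiki/Special:Export/"):
    """Build the Special:Export URL for a single page title."""
    return base + quote(title.replace(" ", "_"))


if __name__ == "__main__":
    throttle = Throttle(min_interval=1.0)  # placeholder rate, NOT an official limit
    for title in ["Main Page", "Wikisource"]:
        throttle.wait()
        print(export_url(title))  # a real client would fetch this URL here
```

Whatever number someone official comes back with would simply replace `min_interval`.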

That's the reason so few people were interested in the images when the image dump was available.

I downloaded it. It was well under 1 TB at the time.


wikimedia-l@lists.wikimedia.org

15 comments

7 participants


participants (7)

Anthony
John Vandenberg
Parker Higgins
Peter Gervai
Platonides
Ray Saintonge
Stephen Bain