[Foundation-l] Hosting scans of the 1911 Britannica on Wikimedia

Brian

9 Nov 2005

2:11 a.m.

For those who don't know, the 1911 Encyclopaedia Britannica is a famous public domain encyclopedia, advertised as the "sum of all human knowledge" in 1911.

I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, organized by letter and page number. I've already talked with avar, TimStarling, and brion on IRC, and TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

brian0918@gmail.com

Tim Starling

9 Nov

2:37 a.m.

New subject: [Foundation-l] Re: Hosting scans of the 1911 Britannica on Wikimedia

Brian wrote:

...

For those who don't know, the 1911 Encyclopaedia Britannica is a famous public domain encyclopedia, advertised as the "sum of all human knowledge" in 1911.

I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, organized by letter and page number. I've already talked with avar, TimStarling, and brion on IRC, and TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

A few notes on this: firstly it seems that the guy who made the scans has no intention of claiming any rights to them. He seems to be interested in disseminating the material widely, for religious reasons. His webpage is here:

http://freierscientologe.netfirms.com/booksbritannica.htm

The CD/DVD sets are apparently quite rare, Brian was lucky to get his hands on one at a fairly cheap price.

There's the trademark issue -- Britannica may attempt to scare us with legal threats over this. A disclaimer on every HTML page declaring non-affiliation with Britannica would probably put us on sound legal footing, although I'd be willing to hear advice about this from people who are more knowledgeable. If the "LoveToKnow Free Online Encyclopedia" (1911encyclopedia.org) can host this content, then we should be able to find a way too. And we can do it without the abominable license restrictions and "copyright traps" scattered throughout the work to enforce them.

Wikipedia owes a lot to the 1911 edition -- we've copied many of its articles. A public, canonical copy will be a valuable tool to deal with LoveToKnow's frequent OCR errors, its incompleteness, and its specious legal threats against us based on our use of unspecified copyright material hidden in their doctored online copy. Hopefully the availability of page images will spur development of a complete and accurate OCR copy.

The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

-- Tim Starling

Brian

2:45 a.m.

Let's not forget the thousands of illustrations that we will now have access to as a result of this.

Tim Starling wrote:

...

Brian wrote:

...
For those who don't know, the 1911 Encyclopaedia Britannica is a famous public domain encyclopedia, advertised as the "sum of all human knowledge" in 1911.

I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, organized by letter and page number. I've already talked with avar, TimStarling, and brion on IRC, and TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

A few notes on this: firstly it seems that the guy who made the scans has no intention of claiming any rights to them. He seems to be interested in disseminating the material widely, for religious reasons. His webpage is here:

http://freierscientologe.netfirms.com/booksbritannica.htm

The CD/DVD sets are apparently quite rare, Brian was lucky to get his hands on one at a fairly cheap price.

There's the trademark issue -- Britannica may attempt to scare us with legal threats over this. A disclaimer on every HTML page declaring non-affiliation with Britannica would probably put us on sound legal footing, although I'd be willing to hear advice about this from people who are more knowledgeable. If the "LoveToKnow Free Online Encyclopedia" (1911encyclopedia.org) can host this content, then we should be able to find a way too. And we can do it without the abominable license restrictions and "copyright traps" scattered throughout the work to enforce them.

Wikipedia owes a lot to the 1911 edition -- we've copied many of its articles. A public, canonical copy will be a valuable tool to deal with LoveToKnow's frequent OCR errors, its incompleteness, and its specious legal threats against us based on our use of unspecified copyright material hidden in their doctored online copy. Hopefully the availability of page images will spur development of a complete and accurate OCR copy.

The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

-- Tim Starling

foundation-l mailing list foundation-l@wikimedia.org http://mail.wikipedia.org/mailman/listinfo/foundation-l

Angela

2:56 a.m.

On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...

The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

Whether at Wikipedia or Wikisource, I'd prefer "1911" to "eb1911". If there are trademark issues with calling it Encyclopedia Britannica, then calling it EB just seems a sneaky way around that, since it's obvious what the EB stands for. It annoys me when people use "WP" on sites where we'd not approve the use of "Wikipedia", since it still implies some official connection, especially considering how often the abbreviation is used to refer to Wikipedia.

Angela.

Tim Starling

3:20 a.m.

Angela wrote:

...

On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...
The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

Whether at Wikipedia or Wikisource, I'd prefer "1911" to "eb1911". If there are trademark issues with calling it Encyclopedia Britannica, then calling it EB just seems a sneaky way around that, since it's obvious what the EB stands for. It annoys me when people use "WP" on sites where we'd not approve the use of "Wikipedia", since it still implies some official connection, especially considering how often the abbreviation is used to refer to Wikipedia.

It's dishonest to call it anything other than the 1911 Encyclopaedia Britannica. Academics and editors of modern encyclopedias should know what they are citing. I suggested EB as an abbreviation, not as a way to avoid trademark issues.

I would appreciate legal advice on the best way to avoid infringing Britannica's trademark while maintaining academic honesty. Intuitively, a disclaimer of non-affiliation seemed like a good way to do that.

I didn't suggest 1911.wikisource.org because it violates the following RFC 1035 recommendation by not starting with a letter:

: The DNS specifications attempt to be as general as possible in the rules
: for constructing domain names. The idea is that the name of any
: existing object can be expressed as a domain name with minimal changes.
:
: However, when assigning a domain name for an object, the prudent user
: will select a name which satisfies both the rules of the domain system
: and any existing rules for the object, whether these rules are published
: or implied by existing programs.

[...]

: The labels must follow the rules for ARPANET host names. They must
: start with a letter, end with a letter or digit, and have as interior
: characters only letters, digits, and hyphen. There are also some
: restrictions on the length. Labels must be 63 characters or less.
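
The label rule quoted above can be checked mechanically. The following is only a minimal sketch in Python, not anything used on the Wikimedia servers; the function names `is_rfc1035_label` and `first_label` are made up for illustration. Note that RFC 1123 later relaxed the first-character restriction to allow a digit, so this check reflects the stricter RFC 1035 wording only.

```python
import re

# RFC 1035 preferred label syntax: start with a letter, end with a letter
# or digit, and use only letters, digits, and hyphens in between.
_LABEL_RE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
MAX_LABEL_LENGTH = 63  # "Labels must be 63 characters or less."


def is_rfc1035_label(label: str) -> bool:
    """Return True if `label` follows the RFC 1035 preferred label syntax."""
    return len(label) <= MAX_LABEL_LENGTH and _LABEL_RE.fullmatch(label) is not None


def first_label(hostname: str) -> str:
    """Return the leftmost label of a dotted host name (hypothetical helper)."""
    return hostname.split(".", 1)[0]
```

Under this check, `is_rfc1035_label(first_label("eb1911.wikisource.org"))` passes, while the same call for `1911.wikisource.org` fails because the label starts with a digit.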

-- Tim Starling

Angela

3:55 a.m.

On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...

I suggested EB as an abbreviation, not as a way to avoid trademark issues.

The potential trademark issue was brought up on this list last July: http://mail.wikimedia.org/pipermail/foundation-l/2005-July/003645.html. Since then, Wikisource has been mostly using the EB abbreviation, but there are still many mentions of the full name, so perhaps it was decided it wasn't an issue.

...

I didn't suggest 1911.wikisource.org because it violates the following RFC 1035 recommendation by not starting with a letter:

I didn't know that, and I've got two Wikicities which violate it.

Since Wikisource are already working on this, after being moved away from Wikibooks, the Wikisource domain might be the best place for it.

Angela.

Tim Starling

4:44 a.m.

Angela wrote:

...

On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...
I suggested EB as an abbreviation, not as a way to avoid trademark issues.

The potential trademark issue was brought up on this list last July: http://mail.wikimedia.org/pipermail/foundation-l/2005-July/003645.html. Since then, Wikisource has been mostly using the EB abbreviation, but there are still many mentions of the full name, so perhaps it was decided it wasn't an issue.

We can remove references to the name if we have to, but don't expect me to be happy about it. In any case I'd like to hear a qualified opinion.

-- Tim Starling

Ray Saintonge

10 Nov

10:12 p.m.

Angela wrote:

...

On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...
I suggested EB as an abbreviation, not as a way to avoid trademark issues.

The potential trademark issue was brought up on this list last July: http://mail.wikimedia.org/pipermail/foundation-l/2005-July/003645.html. Since then, Wikisource has been mostly using the EB abbreviation, but there are still many mentions of the full name, so perhaps it was decided it wasn't an issue.

IIRC, someone had written to PG on this and they didn't seem to be worried about it.

Robert Scott Horning

11 Nov

1:42 a.m.

Ray Saintonge wrote:

...

Angela wrote:

...
On 11/9/05, Tim Starling t.starling@physics.unimelb.edu.au wrote:

...
I suggested EB as an abbreviation, not as a way to avoid trademark issues.

The potential trademark issue was brought up on this list last July: http://mail.wikimedia.org/pipermail/foundation-l/2005-July/003645.html.

Since then, Wikisource has been mostly using the EB abbreviation, but there are still many mentions of the full name, so perhaps it was decided it wasn't an issue.

IIRC, someone had written to PG on this and they didn't seem to be worried about it.

Ec

I wrote to Michael Hart and didn't get any reply at all. If I had, I would have posted it. I poked around Distributed Proofreaders, and generally speaking it appears that the rest of the volumes for the Encyclopaedia Britannica are going to use the original trademark to describe the contents of the e-book. In that regard PG doesn't seem too worried about the trademark issues. If Encyclopaedia Britannica, Inc. wants to go after the Wikimedia Foundation and myself as an editor who helped with the creation of the project on Wikisource, I say bring it on! Dropping a note on Slashdot and a few other tech news sites is going to make Britannica, Inc. wish they had gone to Mars instead, given the negative publicity it would generate. Besides, they haven't asked or demanded anything either, and this project has been going for several months now, with widespread links from other projects like Wikipedia and Wiktionary. We also have a formal disclaimer that acknowledges the source of the trademark, and even a hyperlink to the company website if anybody is really interested. That, with a higher Google ranking alone, ought to be worth something to Britannica, Inc.

Wikisource is using the EB abbreviation mainly because typing out Encyclopaedia Britannica as a namespace is a pain to use all of the time and consumes server space and bandwidth. It is easier to type EB1911 than 1911 Encyclopaedia Britannica. (The ae ligature is especially nasty to type in from an English language keyboard.) I'm sorry if I didn't make that clear somewhere on Wikisource earlier. It was certainly not done to avoid trademark usage, although that was another motive of minor concern, not the primary issue. So many templates, Wikisource project organization pages, categories, and more were added that some very simple namespace had to be created to coordinate all of those pages and to keep from making ambiguous references to elsewhere on Wikisource.

-- Robert Scott Horning

Ray Saintonge

2:09 a.m.

Robert Scott Horning wrote:

...

Ray Saintonge wrote:

...
IIRC, someone had written to PG on this and they didn't seem to be worried about it.

I wrote to Michael Hart and didn't get any reply at all. If I had, I would have posted it.

Thanks for clarifying this. I believe that he commented on this somewhere, even if not to us.

...

I poked around Distributed Proofreaders, and generally speaking it appears that the rest of the volumes for the Encyclopaedia Britannica are going to use the original trademark to describe the contents of the e-book. In that regard PG doesn't seem too worried about the trademark issues. If Encyclopaedia Britannica, Inc. wants to go after the Wikimedia Foundation and myself as an editor who helped with the creation of the project on Wikisource, I say bring it on! Dropping a note on Slashdot and a few other tech news sites is going to make Britannica, Inc. wish they had gone to Mars instead, given the negative publicity it would generate. Besides, they haven't asked or demanded anything either, and this project has been going for several months now, with widespread links from other projects like Wikipedia and Wiktionary. We also have a formal disclaimer that acknowledges the source of the trademark, and even a hyperlink to the company website if anybody is really interested. That, with a higher Google ranking alone, ought to be worth something to Britannica, Inc.

The fact that EB has not commented on this at all is the important one. Until they do, we can only speculate about what they think. They have apparently not taken offense at the others who have referenced their name, so it is also conceivable that the doctrine of laches may apply as well if they do in the future.

We do have some people who like to guess what the law is, and then proceed to interpret it to their disadvantage.

...

Wikisource is using the EB abbreviation mainly because typing out Encyclopaedia Britannica as a namespace is a pain to use all of the time and consumes server space and bandwidth. It is easier to type EB1911 than 1911 Encyclopaedia Britannica. (The ae ligature is especially nasty to type in from an English language keyboard.) I'm sorry if I didn't make that clear somewhere on Wikisource earlier. It was certainly not done to avoid trademark usage, although that was another motive of minor concern, not the primary issue. So many templates, Wikisource project organization pages, categories, and more were added that some very simple namespace had to be created to coordinate all of those pages and to keep from making ambiguous references to elsewhere on Wikisource.

I would have shortened it to simply "EB11" since it is also the 11th edition. Since the reference needs to be in all relevant titles to distinguish them from other articles on the same topic it is better that it be as short as possible.

Daniel Mayer

9 Nov

12:50 p.m.

--- Tim Starling t.starling@physics.unimelb.edu.au wrote:

...

... The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

It absolutely should *not* be on a Wikipedia subdomain. Wikisource is the place for this.

-- mav

Daniel Mayer

12:59 p.m.

-- Daniel Mayer maveric149@yahoo.com wrote:

...

--- Tim Starling t.starling@physics.unimelb.edu.au wrote:

...
... The only question in my mind is the domain: should this be under eb1911.wikipedia.org? We could make it visually distinct, to avoid confusion with Wikipedia itself. Or would eb1911.wikimedia.org be better? Or eb1911.wikisource.org?

It absolutely should *not* be on a Wikipedia subdomain. Wikisource is the place for this.

If a subdomain at Wikisource is needed, then it should not be a wiki. Ideally we would have book support in Wikisource that would allow this to be put directly in Wikisource. But I can accept having this on a separate subdomain until that feature is added, and only if the interim subdomain hack will not host a wiki.

-- mav

Ray Saintonge

10 Nov

10:05 p.m.

Tim Starling wrote:

...

The CD/DVD sets are apparently quite rare, Brian was lucky to get his hands on one at a fairly cheap price.

Not really.

...

There's the trademark issue -- Britannica may attempt to scare us with legal threats over this. A disclaimer on every HTML page declaring non-affiliation with Britannica would probably put us on sound legal footing, although I'd be willing to hear advice about this from people who are more knowledgeable. If the "LoveToKnow Free Online Encyclopedia" (1911encyclopedia.org) can host this content, then we should be able to find a way too. And we can do it without the abominable license restrictions and "copyright traps" scattered throughout the work to enforce them.

It may turn out that their copyright traps are really their lack of OCR proofreading. :-)

...

Wikipedia owes a lot to the 1911 edition -- we've copied many of its articles. A public, canonical copy will be a valuable tool to deal with LoveToKnow's frequent OCR errors, its incompleteness, and its specious legal threats against us based on our use of unspecified copyright material hidden in their doctored online copy. Hopefully the availability of page images will spur development of a complete and accurate OCR copy.

Someone with a legitimate copyright claim does not need to hide traps in the text. Doing so cannot create new copyrights in the text, although they still have a copyright in the surrounding framework. I would go so far as to say that changing text for the sake of hidden copyrights may be a violation of the original author's moral rights.

Anthony DiPierro

9 Nov

3:07 a.m.

Now just implement a captcha for account creation and anonymous editing which requires the user to convert a sentence or two to text.

J/K, it probably wouldn't work, but it would be neat...

On 11/8/05, Brian brian0918@gmail.com wrote:

...

For those who don't know, the 1911 Encyclopaedia Britannica is a famous public domain encyclopedia, advertised as the "sum of all human knowledge" in 1911.

I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, organized by letter and page number. I've already talked with avar, TimStarling, and brion on IRC, and TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

brian0918@gmail.com

Lars Aronsson

9:23 p.m.

Brian wrote:

...

I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, [...] TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

What is suddenly wrong with using Wikimedia Commons and Wikisource? Why do you need a new domain and server just for this book? Didn't you see http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work

...

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

MediaWiki's image upload already does rescaling. All you need to do is to hack the TIFF-to-PNG conversion into MediaWiki and upload the original TIFFs to Wikimedia Commons.

-- Lars Aronsson (lars@aronsson.se) Project Runeberg - free Nordic literature - http://runeberg.org/

Brian

9:35 p.m.

Lars Aronsson wrote:

...

Brian wrote:

...
I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, [...] TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

What is suddenly wrong with using Wikimedia Commons and Wikisource? Why do you need a new domain and server just for this book? Didn't you see http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work

I think they want this to be on a separate domain because not only is 1911EB a famous encyclopedia, Wikipedia has benefitted greatly from it, and several projects still want its contents. So, they would rather prop it up and give it a separate space specifically designed for making 1911EB's contents easily accessible.

Ray Saintonge

10 Nov

8:26 p.m.

Brian wrote:

...

Lars Aronsson wrote:

...
Brian wrote:

...
I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, [...] TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

What is suddenly wrong with using Wikimedia Commons and Wikisource? Why do you need a new domain and server just for this book? Didn't you see http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work

I think they want this to be on a separate domain because not only is 1911EB a famous encyclopedia, Wikipedia has benefitted greatly from it, and several projects still want its contents. So, they would rather prop it up and give it a separate space specifically designed for making 1911EB's contents easily accessible.

Like Lars, I don't see the point of a separate domain for this. I agree that Wikipedia has benefitted greatly from it, but it has benefitted from many other works as well. With the scanned pages on Commons they will still be easily available to the several projects that you have in mind. Easily editable and Wikified texts will continue to belong on Wikisource.

Tim Starling

11:10 p.m.

Ray Saintonge wrote:

...

Like Lars, I don't see the point of a separate domain for this. I agree that Wikipedia has benefitted greatly from it, but it has benefitted from many other works as well. With the scanned pages on Commons they will still be easily available to the several projects that you have in mind. Easily editable and Wikified texts will continue to belong on Wikisource.

Why are you still talking about a separate domain? That's ancient history. I suggested a compromise, why aren't you talking about that?

-- Tim Starling

Ray Saintonge

11 Nov

1:51 a.m.

Tim Starling wrote:

...

Ray Saintonge wrote:

...
Like Lars, I don't see the point of a separate domain for this. I agree that Wikipedia has benefitted greatly from it, but it has benefitted from many other works as well. With the scanned pages on Commons they will still be easily available to the several projects that you have in mind. Easily editable and Wikified texts will continue to belong on Wikisource.

Why are you still talking about a separate domain? That's ancient history. I suggested a compromise, why aren't you talking about that?

The message that I was responding to was less than 24 hours old. How can that possibly be "ancient"? If your "compromise" needs commenting, I'll do that when I get to it.

Tim Starling

9 Nov

10:06 p.m.

Lars Aronsson wrote:

...

Brian wrote:

...
I recently (today) acquired a DVD containing scans of every page of the 1911 Britannica, along with index files for it all, [...] TimStarling specifically asked me to tell you all that he is "confident that the server requirements will be minimal." They would set up a domain name, generate some web pages automatically using the index files, and host the entire set of 29,700 files totaling about 4 GB.

What is suddenly wrong with using Wikimedia Commons and Wikisource? Why do you need a new domain and server just for this book? Didn't you see http://en.wikisource.org/wiki/The_New_Student%27s_Reference_Work

When Brian came on to IRC and asked us "What is the best way to upload 30,000 images requiring 6 GB to commons?" the reaction from Brion and I was a groan. The hardware requirements for commons are rapidly increasing, and uploading and storing such content in MediaWiki is inefficient and non-portable. If we had them in a separate directory on a separate domain, we could copy them from server to server, make tarballs, run batch conversion jobs -- all with a minimal amount of programming and system administration work. And it wouldn't require writing a bot to create 30,000 index pages, we could just write a hundred lines of PHP to index the whole lot. The collection will be easier to use and more reliable, and it will be easy to maintain and update the index pages.

All of the navigation text, the headers and footers, could be editable in wiki fashion. You could let anyone change the header that will be displayed on 30,000 pages, with no server strain whatsoever. This is in stark contrast to the system requirements of templates which are used on large numbers of wiki pages.
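
To give a rough idea of how little code the indexing step needs, here is a minimal sketch. It is written in Python rather than the PHP mentioned above, purely for illustration, and it assumes a hypothetical layout in which each scan lives at `<letter>/<page>.tif` (the real layout of the DVD is not described in this thread). The `header` and `footer` arguments stand in for the wiki-editable navigation text; the function names are made up.

```python
from collections import defaultdict
from html import escape
from pathlib import PurePosixPath


def group_by_letter(paths):
    """Group scan paths by parent directory, e.g. 'A/0001.tif' goes under 'A'."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        groups[path.parent.name].append(path)
    # Sort letters, and sort the pages within each letter by file name.
    return {
        letter: sorted(pages, key=lambda p: p.name)
        for letter, pages in sorted(groups.items())
    }


def render_index(letter, pages, header="", footer=""):
    """Render one static HTML index page listing every scan for a letter."""
    items = "\n".join(
        f'<li><a href="{escape(str(p), quote=True)}">{escape(p.stem)}</a></li>'
        for p in pages
    )
    return (
        f"<html><body>\n{header}\n<h1>{escape(letter)}</h1>\n"
        f"<ul>\n{items}\n</ul>\n{footer}\n</body></html>\n"
    )
```

Because the header and footer are passed in at render time, changing them means regenerating or re-serving static pages rather than re-parsing thousands of wiki pages, which is the point made above about server strain.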

Wikisource has suffered so far due to a lack of specialised software. This kind of initiative could see it become more usable generally.

-- Tim Starling

Tim Starling

10:26 p.m.

I wrote:

...

When Brian came on to IRC and asked us "What is the best way to upload 30,000 images requiring 6 GB to commons?" the reaction from Brion and I was a groan. The hardware requirements for commons are rapidly increasing, and uploading and storing such content in MediaWiki is inefficient and non-portable. If we had them in a separate directory on a separate domain, we could copy them from server to server, make tarballs, run batch conversion jobs -- all with a minimal amount of programming and system administration work. And it wouldn't require writing a bot to create 30,000 index pages, we could just write a hundred lines of PHP to index the whole lot. The collection will be easier to use and more reliable, and it will be easy to maintain and update the index pages.

All of the navigation text, the headers and footers, could be editable in wiki fashion. You could let anyone change the header that will be displayed on 30,000 pages, with no server strain whatsoever. This is in stark contrast to the system requirements of templates which are used on large numbers of wiki pages.

Wikisource has suffered so far due to a lack of specialised software. This kind of initiative could see it become more usable generally.

Come to think of it, I could probably do it as a MediaWiki extension, and embed this content in en.wikisource.org. You'd get all of the same features, but it would also appear to be integrated with the wiki. You wouldn't be able to edit the page images, but I don't think that's a desirable property anyway. It would be easy for someone to download the whole collection, run a processing script (say, automated correction of the scanning quality), and then upload the whole new collection and incorporate it into the wiki. Easy as in no bots, no screen scrapers, no server strain, just a tarball download and a tarball upload.
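
The tarball round trip described above could look something like the following sketch. It uses Python's standard `tarfile` module; the function names and the idea of a single directory holding the whole collection are assumptions for illustration, not a description of any existing MediaWiki extension.

```python
import tarfile
from pathlib import Path


def pack_collection(source_dir, tarball_path):
    """Write everything under source_dir into a gzip-compressed tarball."""
    source = Path(source_dir)
    with tarfile.open(str(tarball_path), "w:gz") as archive:
        # arcname keeps paths in the archive relative to the collection root.
        archive.add(str(source), arcname=source.name)
    return Path(tarball_path)


def list_collection(tarball_path):
    """Return the sorted names of the regular files stored in the tarball."""
    with tarfile.open(str(tarball_path), "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())
```

Someone reprocessing the scans would unpack the archive, run their script over the files, and call `pack_collection` again on the result before uploading it.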

-- Tim Starling

Brian

11 Nov

1:46 a.m.

Tim Starling wrote:

...

That sounds like a good alternative to a separate domain or sticking it on Commons, as long as it doesn't require the tech crew to put in too many extra hours.

Lars Aronsson

9 Nov

11:34 p.m.

Tim Starling wrote:

...

When Brian came on to IRC and asked us "What is the best way to upload 30,000 images requiring 6 GB to commons?" the reaction from Brion and I was a groan. The hardware requirements for commons are rapidly increasing, and uploading and storing such content in MediaWiki is inefficient and non-portable.

While I can understand your reaction, I think we should fix our systems so they can handle these volumes without having to create new exceptions. The existence of Wikisource could be questioned, since there are already other projects (such as Project Gutenberg and Distributed Proofreaders) that do this kind of work. But if Wikisource is to exist, it should be capable of handling large volumes (terabytes) of digitized text (and scanned images). It cannot be that every new book requires a new project because Wikisource is unable to handle its size. Encyclopaedia Britannica might be bigger than anything that is currently in Wikisource, but just wait until someone suggests we digitize the Spanish "Enciclopedia Universal Ilustrada" (70 fat volumes, 1908-1930), which makes EB look tiny.

Andreas Grosz' scans of EB1911 have been available on DVD for more than two years, so I see no immediate hurry for us to host them. As far as I know, PGDP is doing good work proofreading it, and we could benefit from waiting for them to finish more of the work.

...

If we had them in a separate directory on a separate domain,

Or if MediaWiki could handle separate directories on the same domain...

The recent donation (and import) of 10,000 art images from Directmedia GmbH to Wikimedia Commons put the system to its limits. What if the next donation consists of a million images? Or a million audio recordings? Dump them in a directory, supply an index description in XML, and let MediaWiki use the data where it is, instead of trying to stuff it into the MySQL database through the wiki upload form.
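As a rough illustration of the "directory plus XML index" idea, the sketch below reads an index description and maps each (volume, page) pair to the path of its file on disk, without touching a database. The XML element and attribute names (`collection`, `item`, `base`, `file`, `volume`, `page`) are hypothetical and do not correspond to any existing MediaWiki format.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Hypothetical index description shipped alongside the image directory.
SAMPLE_INDEX = """\
<collection name="EB1911" base="scans/eb1911">
  <item file="A/0001.tif" volume="1" page="1"/>
  <item file="A/0002.tif" volume="1" page="2"/>
  <item file="B/0001.tif" volume="3" page="1"/>
</collection>
"""


def load_index(xml_text):
    """Map (volume, page) to the file's path under the collection's base directory."""
    root = ET.fromstring(xml_text)
    base = PurePosixPath(root.get("base", ""))
    index = {}
    for item in root.iter("item"):
        key = (int(item.get("volume")), int(item.get("page")))
        index[key] = str(base / item.get("file"))
    return index


index = load_index(SAMPLE_INDEX)
print(index[(1, 2)])  # scans/eb1911/A/0002.tif
```

The files stay where they were dumped; the software only needs the index to locate and serve them.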

...

Wikisource has suffered so far due to a lack of specialised software. This kind of initiative could see it become more usable generally.

Or the specialization could be added to MediaWiki, so anybody could benefit from it, not just Wikisource.

-- Lars Aronsson (lars@aronsson.se) Aronsson Datateknik - http://aronsson.se

Brian

10 Nov

1:06 a.m.

I think most of your concerns should be directed to wikitech-l, rather than being injected into other discussions.

As for the availability of the DVD, it has been available _for a price_, and not easily accessible. This effectively prevents any group projects that require the use of the images from moving forward.

Also, several people are excited by this and want to know when it will be online, believing that it will finally allow projects such as Wikisource's 1911EB to take off, so it's not accurate to say that there is no hurry. It seems that most people, myself included, didn't even know that these were available anywhere. In any case, they were _never_ easily available, or freely available.


Ray Saintonge

9:50 p.m.

Brian wrote:

...

As for the availability of the DVD, it has been available _for a price_, and not easily accessible. This effectively prevents any group projects that require the use of the images from moving forward.

I just checked on eBay. Someone in Wichita is currently selling them for $18.99.

Ray Saintonge

8:32 p.m.

Brian wrote:

...

One more thing, these are black and white TIFs, and there is discussion about whether they should be mass converted to PNGs to be easily viewable.

How does it handle those pages that were originally in colour? Will they need to be rescanned?

wikimedia-l@lists.wikimedia.org

participants (8)

Angela
Anthony DiPierro
Brian
Daniel Mayer
Lars Aronsson
Ray Saintonge
Robert Scott Horning
Tim Starling