My own thoughts on this, which I also expressed on the meta page:
1. There is plenty of material out there that is already public domain. Part of the problem is that it can take forever and a day to digitize it all. In the case of books and magazines, digitization often involves destroying the hard copies in the process. There are, however, specialized scanners that can do the work without ruining the books themselves. These are expensive (about US $30,000 a machine). Ten machines, strategically located around the world, along with student staff to operate them around the clock, could help to preserve these texts and store them for posterity. Additional people (paid and volunteer) will be needed to OCR, proof, and hyperlink the material to ensure that it doesn't get lost in a glut of material (I have visions of the final scene of Raiders of the Lost Ark, when the Ark was finally stored in some crate in an army warehouse).
2. While OCR capacities exist for some languages, they do not exist for other languages, where the material is much more likely to get lost. Manuscripts in Tibetan monasteries, for example, can be scanned but not OCRed easily. To make this information available, developers should be paid to create adequate OCR tools for these languages. Rough cost: $5 million.
3. Music has been recorded around the world for well over a century, yet many of the early recordings are being lost, especially those on wax cylinders and porcelain records. Preservation includes locating, identifying, and remastering. People must be trained to do this. Rough cost: $35 million over two years.
4. This is true of old films as well. Celluloid copies are extremely rare and extremely flammable. Restoration is exceedingly costly. For example, [[Theda Bara]] is a well-known vamp of early Hollywood (the word "vamp" was first used to describe her), yet none of her films survive, and they were made less than a hundred years ago. Films are international; they include important historic documents such as newsreels, and they are being lost every day. Today, most preservation work is being done by major studios, since it is so costly. In other words, the studios are taking important works now in the public domain, restoring them, and contending that the restoration is an original work, which means another hundred years at least until some Vigo or Charlie Chaplin films enter the public domain ... and little attention is being paid to newsreels of events like the Russian Revolution, World War I, etc. As with music, people should be offered scholarships to learn the art of film restoration and work on these projects. Until this happens it can be outsourced. Rough cost: $50 million.
5. To ensure all of this remains accessible, we will need a LOT of servers and bandwidth: Initial outlay: $10 million.
Total: $100 million, spent over 5 years. Costs include staffing, identifying prospective targets, transportation, overhead, etc. Just coordinating a project of this scope will take a lot of effort.
And there is competition too. As an example, http://historical.library.cornell.edu/IWP/ is a collection of International Women's Journals, some of which are very important historically. They are already scanned, but they are inaccessible because a private company has (rightfully or wrongfully) copyrighted the scans.
Lots to be done. You will see how quickly $100 million can be spent.
Danny
In a message dated 10/15/2006 11:27:57 AM Eastern Daylight Time, jwales@wikia.com writes:
I would like to gather from the community some examples of works you would like to see made free, works that we are not doing a good job of generating free replacements for, works that could in theory be purchased and freed.
Dream big. Imagine there existed a budget of $100 million to purchase copyrights to be made available under a free license. What would you like to see purchased and released under a free license?
Photo libraries? Textbooks? Newspaper archives? Be bold, be specific, be general, brainstorm, have fun with it.
I was recently asked this question by someone who is potentially in a position to make this happen, and he wanted to know what we need, what we dream of, that we can't accomplish on our own, or that we would expect to take a long time to accomplish on our own.
--Jimbo
_______________________________________________
Commons-l mailing list
Commons-l@wikimedia.org
http://mail.wikipedia.org/mailman/listinfo/commons-l
Hi,
Danny tells it right. I have little to add to his mail.
daniwo59@aol.com wrote:
My own thoughts on this, which I also expressed on the meta page:
- There is plenty of material out there that is already public domain. Part of the problem is that it can take forever and a day to digitize it all. In the case of books and magazines, digitization often involves destroying the hard copies in the process. There are, however, specialized scanners that can do the work without ruining the books themselves. These are expensive (about US $30,000 a machine). Ten machines, strategically located around the world, along with student staff to operate them around the clock, could help to preserve these texts and store them for posterity. Additional people (paid and volunteer) will be needed to OCR, proof, and hyperlink the material to ensure that it doesn't get lost in a glut of material (I have visions of the final scene of Raiders of the Lost Ark, when the Ark was finally stored in some crate in an army warehouse).
- While OCR capacities exist for some languages, they do not exist for other languages, where the material is much more likely to get lost. Manuscripts in Tibetan monasteries, for example, can be scanned but not OCRed easily. To make this information available, developers should be paid to create adequate OCR tools for these languages. Rough cost: $5 million.
Much of what limits Wikisource now is the capacity to scan and OCR documents. There is no good free OCR software apart from the software recently released as free software by Google, and that works only for English and still has limitations. So developing good, free, multilingual OCR software would be my priority. AFAIK there is no good OCR software (free or not) for any Indian language, including Sanskrit. I have never seen any for Tibetan either.
But having the software is not enough. A few OCR servers managed by the Foundation, where anyone can send an automated OCR request, would be very useful. There are already proprietary OCR packages that can do that.
- Music has been recorded around the world for well over a century, yet many of the early recordings are being lost, especially those on wax cylinders and porcelain records. Preservation includes locating, identifying, and remastering. People must be trained to do this. Rough cost: $35 million over two years.
- This is true of old films as well. Celluloid copies are extremely rare and extremely flammable. Restoration is exceedingly costly. For example, [[Theda Bara]] is a well-known vamp of early Hollywood (the word "vamp" was first used to describe her), yet none of her films survive, and they were made less than a hundred years ago. Films are international; they include important historic documents such as newsreels, and they are being lost every day. Today, most preservation work is being done by major studios, since it is so costly. In other words, the studios are taking important works now in the public domain, restoring them, and contending that the restoration is an original work, which means another hundred years at least until some Vigo or Charlie Chaplin films enter the public domain ... and little attention is being paid to newsreels of events like the Russian Revolution, World War I, etc. As with music, people should be offered scholarships to learn the art of film restoration and work on these projects. Until this happens it can be outsourced. Rough cost: $50 million.
I would add a special request for some of Cartier-Bresson's photographs of Gandhi's funeral. I would have said a copy of the Encyclopedia of the Enlightenment (1750, by Diderot and d'Alembert), but we already have it. ;o)
- To ensure all of this remains accessible, we will need a LOT of servers and bandwidth: Initial outlay: $10 million.
Yes, it's important not to forget that point.
Total: $100 million, spent over 5 years. Costs include staffing, identifying prospective targets, transportation, overhead, etc. Just coordinating a project of this scope will take a lot of effort.
Yes, I would generally put more money into people's work than into documents.
And there is competition too. As an example, http://historical.library.cornell.edu/IWP/ is a collection of International Women's Journals, some of which are very important historically. They are already scanned, but they are inaccessible because a private company has (rightfully or wrongfully) copyrighted the scans.
Lots to be done. You will see how quickly $100 million can be spent.
Danny
In a message dated 10/15/2006 11:27:57 AM Eastern Daylight Time, jwales@wikia.com writes:
I would like to gather from the community some examples of works you would like to see made free, works that we are not doing a good job of generating free replacements for, works that could in theory be purchased and freed. Dream big. Imagine there existed a budget of $100 million to purchase copyrights to be made available under a free license. What would you like to see purchased and released under a free license? Photo libraries? Textbooks? Newspaper archives? Be bold, be specific, be general, brainstorm, have fun with it. I was recently asked this question by someone who is potentially in a position to make this happen, and he wanted to know what we need, what we dream of, that we can't accomplish on our own, or that we would expect to take a long time to accomplish on our own.
Yes, the fun has just started.
--Jimbo
Regards,
Yann
daniwo59@aol.com wrote:
My own thoughts on this, which I also expressed on the meta page:
- There is plenty of material out there that is already public domain. Part of the problem is that it can take forever and a day to digitize it all. In the case of books and magazines, digitization often involves destroying the hard copies in the process. There are, however, specialized scanners that can do the work without ruining the books themselves. These are expensive (about US $30,000 a machine). Ten machines, strategically located around the world, along with student staff to operate them around the clock, could help to preserve these texts and store them for posterity. Additional people (paid and volunteer) will be needed to OCR, proof, and hyperlink the material to ensure that it doesn't get lost in a glut of material
All that scanning and OCR work could be quite tedious, and people might even need to be paid for this. As long as these workers don't develop an addiction to the money we provide, it could work. These machines could go into key small institutions with significant archives that would appreciate having the machine. Perhaps they could even keep the machine once our work there is done. Students could be paid on a per-semester contract basis, with renewal available when the previous year's targets are met.
- To ensure all of this remains accessible, we will need a LOT of servers and bandwidth: Initial outlay: $10 million.
Total: $100 million, spent over 5 years. Costs include staffing, identifying prospective targets, transportation, overhead, etc. Just coordinating a project of this scope will take a lot of effort.
A long-term hardware optimization strategy would make interesting reading.
And there is competition too. As an example, http://historical.library.cornell.edu/IWP/ is a collection of International Women's Journals, some of which are very important historically. They are already scanned, but they are inaccessible because a private company has (rightfully or wrongfully) copyrighted the scans.
This is where we need to decide where we should and where we shouldn't co-operate. Our bottom line must remain to make everything accessible to everybody. If they insist on the proprietary nature of this material, or try to invoke database protection laws, it might be necessary to scan our own copies of everything that they have.
Ec
wikisource-l@lists.wikimedia.org