http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
- d.
There is a wealth of work done all the time by primary source researchers and publishers, which could be improved on by having wikisource entries, translations, &c.
Related question: how well would large numbers of public domain texts, with page scans and the best available OCR [and translations of same], fit with what Wikisource does now? This is clearly a wiki project that needs to happen: OCR even at its best misses rare meaning-bearing words. If not Wikisource, where should this work take place?
SJ
On Sat, Jun 20, 2009 at 11:41 AM, David Gerard dgerard@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
- d.
This has reminded me to complain about Google Books. Google has the world's best OCR (by virtue of having the largest OCR'able dataset) and also has a mission to scan in all the public domain books they can get their hands on. They recently updated their interface to, as they put it, "make it easier to find our plain text versions of public domain books. If a book is available in full view, you can click the 'Plain text' button in the toolbar." Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
On Sat, Jun 20, 2009 at 10:10 AM, Samuel Klein meta.sj@gmail.com wrote:
There is a wealth of work done all the time by primary source researchers and publishers, which could be improved on by having wikisource entries, translations, &c.
Related question : how appropriate would large numbers of public domain texts, with page scans and the best available OCR [and translations of same], fit with what Wikisource does now? This is clearly a wiki project that needs to happen : OCR even at its best misses rare meaning-bearing words. If not Wikisource, where should this work take place?
SJ
On Sat, Jun 20, 2009 at 11:41 AM, David Gerard dgerard@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
- d.
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
Easier than scanning, though :)
On Sat, Jun 20, 2009 at 2:04 PM, Brian Brian.Mingus@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
So the bot just has to run at human speeds so it does not get banned; it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
On Sat, Jun 20, 2009 at 11:04 AM, Brian Brian.Mingus@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
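Falcorian's suggestion above boils down to request throttling: space the requests out and add some jitter so the client behaves like a patient human reader. A minimal sketch of that pacing idea in Python, against purely hypothetical page URLs (nothing here is specific to, or sanctioned by, any particular site):

import random
import time
import urllib.request

def polite_fetch(urls, min_delay=8.0, max_delay=20.0):
    """Fetch URLs one at a time, pausing a human-ish interval between requests."""
    pages = []
    for url in urls:
        req = urllib.request.Request(url, headers={"User-Agent": "example-text-fetcher/0.1"})
        with urllib.request.urlopen(req) as resp:
            pages.append(resp.read().decode("utf-8", errors="replace"))
        # Throttle: wait roughly as long as a reader flipping a page would.
        time.sleep(random.uniform(min_delay, max_delay))
    return pages

# Hypothetical usage; parallelising "from different IPs" just means giving
# each worker its own slice of the URL list.
# texts = polite_fetch(["https://example.org/book/page1", "https://example.org/book/page2"])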
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
On Sat, Jun 20, 2009 at 12:34 PM, Falcorian alex.public.account+WikimediaMailingList@gmail.com wrote:
So the bot just has to run at human speeds so it does not get banned, it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
On Sat, Jun 20, 2009 at 11:04 AM, Brian Brian.Mingus@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
For some reason, I am reminded of a Supreme Court case about the information in telephone directories. Maybe because of the insanity of trying to put public domain material under copyright.
From: Brian Brian.Mingus@colorado.edu To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org Sent: Saturday, June 20, 2009 11:47:28 AM Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
On Sat, Jun 20, 2009 at 12:34 PM, Falcorian alex.public.account+WikimediaMailingList@gmail.com wrote:
So the bot just has to run at human speeds so it does not get banned, it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
On Sat, Jun 20, 2009 at 11:04 AM, Brian Brian.Mingus@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
Where does it forbid them? The most related part is section 5. I understand that doing queries at bot rate may be against #5.3, but I don't see anything against this. Unlike searches, the book OCR result will be cached, so this shouldn't inconvenience them (and they don't place ads there!).
I'd wikify the html instead of just moving to plain text, though.
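Wikifying the HTML rather than flattening it to plain text is mostly a matter of mapping a few inline tags to wiki markup. A minimal sketch of that idea in Python; the tag list is illustrative, not a complete converter:

import re

def wikify(html):
    """Convert a handful of common inline HTML tags to wiki markup."""
    text = re.sub(r"</?(i|em)>", "''", html)       # italics -> ''
    text = re.sub(r"</?(b|strong)>", "'''", text)  # bold -> '''
    text = re.sub(r"<br\s*/?>", "\n", text)        # line breaks
    text = re.sub(r"</?p>", "\n\n", text)          # paragraphs
    return text

# Hypothetical usage:
# print(wikify("A <i>public domain</i> book with <b>bold</b> headings."))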
On Sat, Jun 20, 2009 at 1:29 PM, Platonides Platonides@gmail.com wrote:
Where does it forbid them?
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
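Whatever one makes of clause 5.3, the robots.txt part of it is at least easy to honour automatically. A small example using Python's standard urllib.robotparser, with example.org standing in as a hypothetical host and path:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Only proceed if robots.txt allows this user agent to fetch the path.
allowed = rp.can_fetch("ExampleBot/0.1", "https://example.org/books/some-title/text")
print("allowed" if allowed else "disallowed")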
Brian wrote:
Where does it forbid them?
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
Uh? That's not the TOS I am reading:
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google.
5.4 You agree that you will not engage in any activity that interferes with or disrupts the Services (or the servers and networks which are connected to the Services).
The second part is missing. It seems the US has different terms than the rest of us.
Wow, what's Wikipedia's policy about using a bot to scrape everything?
On Sat, Jun 20, 2009 at 2:47 PM, Brian Brian.Mingus@colorado.edu wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
On Sat, Jun 20, 2009 at 12:34 PM, Falcorian alex.public.account+WikimediaMailingList@gmail.com wrote:
So the bot just has to run at human speeds so it does not get banned, it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
On Sat, Jun 20, 2009 at 11:04 AM, Brian Brian.Mingus@colorado.edu wrote:
Not likely. I've been banned from Google's regular search at least a dozen times during semi-frenetic search sprees in which I was identified as a bot. There is no doubt that if you try to automate it you will be quickly shot down.
On Sat, Jun 20, 2009 at 12:02 PM, Platonides Platonides@gmail.com wrote:
Brian wrote:
Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard. There are roughly 2-3 million public domain books in Google Books.
That's easy to fix :)
Anthony wrote:
Wow, what's Wikipedia's policy about using a bot to scrape everything?
I don't know about any policy, but I think it should still be discouraged. For me this has less to do with predation on other sites than with our inability to keep up with the volume of data that would be produced. Proofreading and wikifying are labour-intensive processes. It is very easy for the technically minded to bring the scan and OCR of a 500-page book under our roof, but without the manpower to bring the added value these processes are scarcely better than data dumps.
Ec
On Sat, Jun 20, 2009 at 2:47 PM, Brian Brian.Mingus@colorado.edu wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
On Sat, Jun 20, 2009 at 12:34 PM, Falcorian wrote:
So the bot just has to run at human speeds so it does not get banned, it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
Evil I tell you. Evil!
On Sat, Jun 20, 2009 at 7:56 PM, Ray Saintonge saintonge@telus.net wrote:
Anthony wrote:
Wow, what's Wikipedia's policy about using a bot to scrape everything?
I don't know about any policy, but I think it should still be discouraged. For me this has less to do with predation on other sites than with our inability to keep up with the volume of data that would be produced. Proofreading and wikifying are labour-intensive processes. It is very easy for the technically minded to bring the scan and OCR of a 500-page book under our roof, but without the manpower to bring the added value these processes are scarcely better than data dumps.
Ec
On Sat, Jun 20, 2009 at 2:47 PM, Brian Brian.Mingus@colorado.edu wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
On Sat, Jun 20, 2009 at 12:34 PM, Falcorian wrote:
So the bot just has to run at human speeds so it does not get banned, it still won't get tired or make unpredictable mistakes. And you can run it from different IPs to parallelize.
--Falcorian
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law? Sites put all sorts of meaningless garbage into these documents, and users mostly ignore them.
Of course Google's evil; it's about time that people noticed that. They use their deep pockets as a way to bully other sites ... with a smile. Fortunately the U.S. does not have database protection laws like the E.U. Ideally, every PD item they host should also be hosted on an alternative site, but that's a massive undertaking, ... and they know it. Nothing requires them to be nice to the competition, such as by making it easy to copy their material.
Ec
If a bot has a meaningful effect on server load (i.e. page requests), it falls under the category of malicious software, which is highly illegal.
From: Ray Saintonge saintonge@telus.net To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org Sent: Saturday, June 20, 2009 2:35:52 PM Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law? Sites put all sorts of meaningless garbage into these documents, and users mostly ignore them.
Of course Google's evil; it's about time that people noticed that. They use their deep pockets as a way to bully other sites ... with a smile. Fortunately the U.S. does not have database protection laws like the E.U. Ideally, every PD item they host should also be hosted on an alternative site, but that's a massive undertaking, ... and they know it. Nothing requires them to be nice to the competition, such as by making it easy to copy their material.
Ec
Geoffrey Plourde wrote:
If a bot has a meaningful effect on server load (i.e. page requests), it falls under the category of malicious software, which is highly illegal.
Malicious software or overloading servers goes well beyond ignoring a ToS. Why should downloading whole books from Google have any greater effect on server load than downloading a whole book of similar length from Internet Archive?
Ec
From: Ray Saintonge
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law? Sites put all sorts of meaningless garbage into these documents, and users mostly ignore them.
Of course Google's evil; it's about time that people noticed that. They use their deep pockets as a way to bully other sites ... with a smile. Fortunately the U.S. does not have database protection laws like the E.U. Ideally, every PD item they host should also be hosted on an alternative site, but that's a massive undertaking, ... and they know it. Nothing requires them to be nice to the competition, such as by making it easy to copy their material.
Ec
A bot or bots calling up massive amounts of data at high speed can have a negative effect on a server. While I doubt the bot we use would have the power to take down a Google server, the speed of the requests and the constant number of requests will definitely be noticeable, possibly leading to unpleasant consequences.
From: Ray Saintonge saintonge@telus.net To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org Sent: Saturday, June 20, 2009 5:07:44 PM Subject: Re: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
Geoffrey Plourde wrote:
If a bot has a meaningful effect on server load (i.e. page requests), it falls under the category of malicious software, which is highly illegal.
Malicious software or overloading servers goes well beyond ignoring a ToS. Why should downloading whole books from Google have any greater effect on server load than downloading a whole book of similar length from Internet Archive?
Ec
From: Ray Saintonge
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law? Sites put all sorts of meaningless garbage into these documents, and users mostly ignore them.
Of course Google's evil; it's about time that people noticed that. They use their deep pockets as a way to bully other sites ... with a smile. Fortunately the U.S. does not have database protection laws like the E.U. Ideally, every PD item they host should also be hosted on an alternative site, but that's a massive undertaking, ... and they know it. Nothing requires them to be nice to the competition, such as by making it easy to copy their material.
Ec
Geoffrey Plourde wrote:
A bot or bots calling up massive amounts of data at high speed can have a negative effect on a server. While I doubt the bot we use would have the power to take down a Google server, the speed of the requests and the constant number of requests will definitely be noticeable, possibly leading to unpleasant consequences.
And data accumulation at such a high speed would also be more than could be properly handled at the Wikisource end as well. We regularly get whole works from Internet Archive and other sources, without any such problems arising. I would not reasonably expect a greater accumulation rate from Google.
Ec
From: Ray Saintonge saintonge@telus.net
Geoffrey Plourde wrote:
If a bot has a meaningful effect on server load (i.e. page requests), it falls under the category of malicious software, which is highly illegal.
Malicious software or overloading servers goes well beyond ignoring a ToS. Why should downloading whole books from Google have any greater effect on server load than downloading a whole book of similar length from Internet Archive?
Ec
From: Ray Saintonge
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law? Sites put all sorts of meaningless garbage into these documents, and users mostly ignore them.
Of course Google's evil; it's about time that people noticed that. They use their deep pockets as a way to bully other sites ... with a smile. Fortunately the U.S. does not have database protection laws like the E.U. Ideally, every PD item they host should also be hosted on an alternative site, but that's a massive undertaking, ... and they know it. Nothing requires them to be nice to the competition, such as by making it easy to copy their material.
Ec
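Ec's comparison with the Internet Archive is apt on the technical side: the Archive serves whole items through plain download URLs, so pulling the OCR text of a public domain book is a single request. A minimal sketch in Python, assuming the common (but not universal) convention that an item's text layer is published as <identifier>_djvu.txt:

import urllib.request

def fetch_ia_text(identifier):
    """Download the OCR text layer of an Internet Archive item.

    Assumes the usual <identifier>_djvu.txt naming; not every item
    exposes its text layer under that name.
    """
    url = "https://archive.org/download/{0}/{0}_djvu.txt".format(identifier)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Hypothetical usage:
# text = fetch_ia_text("someoldbook1850")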
On Sat, Jun 20, 2009 at 14:35, Ray Saintonge saintonge@telus.net wrote:
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law?
The verdict in _United States v. Lori Drew_ appears to set a precedent that violating a site's Terms of Service is a violation of the Computer Fraud and Abuse Act. It's not a very strong precedent, but it's still there.
The statute supports that as well, providing a private right of action and civil remedy. It's not entirely that cut and dried (there are certain restrictions that must be met), but yeah, it appears that in some cases TOS violations can be illegal.
-Dan
On Jun 22, 2009, at 7:49 PM, Mark Wagner wrote:
On Sat, Jun 20, 2009 at 14:35, Ray Saintonge saintonge@telus.net wrote:
Brian wrote:
That is against the law. It violates Google's ToS.
I'm mostly complaining that Google is being Very Evil. There is nothing we can do about it except complain to them. Which I don't know how to do - they apparently believe that the plain text versions of their books are akin to their intellectual property and are unwilling to give them away.
How is violating Google's ToS against the law?
The verdict in _United States v. Lori Drew_ appears to set a precedent that violating a site's Terms of Service is a violation of the Computer Fraud and Abuse Act. It's not a very strong precedent, but it's still there.
-- Mark [[en:User:Carnildo]]
On Saturday 20 June 2009 at 18:29:24, Brian wrote:
This has reminded me to complain about Google Books. Google has the world's best OCR (in virtue of having the largest OCR'able dataset) and also has a mission to scan in all the public domain books they can get their hand on. They recently updated their interface to, as they put it, "make it easier to find our plain text versions of public domain books. If a book is available in full view, you can click the 'Plain text' button in the toolbar." Unfortunately the only way I've found to download the full text of a public domain book from Google is to flip through the book a page at a time, copying the text to your clipboard.
Often, these books are available in the Million Books Project too.
Yes, but my understanding is that while Google provided part of the MBP data and scans, its continued updates to OCR since then are not being shared. I would be glad to learn this was not the case...
samuel klein. sj@laptop.org. +1 617 529 4266
On Jun 21, 2009 3:14 AM, "Nikola Smolenski" smolensk@eunet.yu wrote:
On Saturday 20 June 2009 at 18:29:24, Brian wrote:
This has reminded me to complain about Google Books. Google has the world's best OCR (in virtue ...
Often, these books are available in the Million Books Project too.
2009/6/23 Samuel Klein meta.sj@gmail.com
Yes, but my understanding is that while google provided part of the mbp data and scans, its continued updates to ocr since then are not being shared. I would be glad to learn this was not the case...
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
Brian wrote:
2009/6/23 Samuel Klein meta.sj@gmail.com
Yes, but my understanding is that while google provided part of the mbp data and scans, its continued updates to ocr since then are not being shared. I would be glad to learn this was not the case...
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
Well, when your shorthand uses their stock ticker symbol, your argument has already been coopted.
--Michael Snow
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wikipedia@verizon.net wrote:
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
Well, when your shorthand uses their stock ticker symbol, your argument has already been coopted.
--Michael Snow
I get the joke but um, I used it on purpose, and which one of my arguments has been "coopted"?
Brian wrote:
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wikipedia@verizon.net wrote:
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
Well, when your shorthand uses their stock ticker symbol, your argument has already been coopted.
--Michael Snow
I get the joke but um, I used it on purpose and which one of my arguments been "coopted" ??
Coopting is not like rebutting; it does not bite chunks out of specific pieces, it swallows whole. Symbols are powerful things, perhaps even more so outside the mathematical logic of argument. They do not serve only your purposes, even if you use them purposefully. My observations may be wry, but they are not entirely in jest.
--Michael Snow
OK, Shakespeare. But in plain English you appear to be saying that corporations are inherently greedy and have a tendency to be evil. Sure, but we expect more out of GOOG. This is not MSFT we are talking about.
On Tue, Jun 23, 2009 at 12:13 PM, Michael Snow wikipedia@verizon.net wrote:
Brian wrote:
On Tue, Jun 23, 2009 at 11:44 AM, Michael Snow wikipedia@verizon.net wrote:
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
Well, when your shorthand uses their stock ticker symbol, your argument has already been coopted.
--Michael Snow
I get the joke but um, I used it on purpose and which one of my arguments been "coopted" ??
Coopting is not like rebutting; it does not bite chunks out of specific pieces, it swallows whole. Symbols are powerful things, perhaps even more so outside the mathematical logic of argument. They do not serve only your purposes, even if you use them purposefully. My observations may be wry, but they are not entirely in jest.
--Michael Snow
On Tue, Jun 23, 2009 at 2:24 PM, Brian Brian.Mingus@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that corporations are inherently greedy and have a tendency to be evil. Sure, but we expect more out of GOOG. This is not MSFT we are talking about.
Of course they're inherently greedy. That's the whole purpose of a for-profit corporation - to make as much money as possible for its shareholders. As for "tendency to be evil", I think that rests on your definition of "evil".
On Tue, Jun 23, 2009 at 3:58 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 23, 2009 at 2:24 PM, Brian Brian.Mingus@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that corporations are inherently greedy and have a tendency to be evil. Sure, but we expect more out of GOOG. This is not MSFT we are talking about.
Of course they're inherently greedy. That's the whole purpose of a for-profit corporation - to make as much money as possible for its shareholders.
I guess even a non-profit is inherently greedy, it's just greedy for something other than money. The WMF is greedy for the spread of free knowledge.
But this is off-topic. Let's take it to another list or something.
On Wed, Jun 24, 2009 at 6:10 AM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 23, 2009 at 3:58 PM, Anthony wikimail@inbox.org wrote:
On Tue, Jun 23, 2009 at 2:24 PM, Brian Brian.Mingus@colorado.edu wrote:
Ok Shakespeare. But in plain english you appear to be saying that corporations are inherently greedy and have a tendency to be evil. Sure, but we expect more out of GOOG. This is not MSFT we are talking about.
Of course they're inherently greedy. That's the whole purpose of a for-profit corporation - to make as much money as possible for its shareholders.
I guess even a non-profit is inherently greedy, it's just greedy for something other than money. The WMF is greedy for the spread of free knowledge.
But this is off-topic. Let's take it to another list or something.
off-topic?? ... surely you jest!!
I think about _three_ of the 50+ emails in this thread have been on the topic of open access journal articles on Wikisource.
-- John Vandenberg
On Tue, Jun 23, 2009 at 1:09 PM, Brian Brian.Mingus@colorado.edu wrote:
2009/6/23 Samuel Klein meta.sj@gmail.com
Yes, but my understanding is that while google provided part of the mbp data and scans, its continued updates to ocr since then are not being shared. I would be glad to learn this was not the case...
The dataset you need to train an OCR system to be as good as theirs is the raw images and the plain text. They aren't making it easy to get either of those things :( They have presumably improved the software in other ways as well..
WTF GOOG?
It's almost like they're trying to run a business or something.
Samuel Klein wrote:
There is a wealth of work done all the time by primary source researchers and publishers, which could be improved on by having wikisource entries, translations, &c.
Related question : how appropriate would large numbers of public domain texts, with page scans and the best available OCR [and translations of same], fit with what Wikisource does now? This is clearly a wiki project that needs to happen : OCR even at its best misses rare meaning-bearing words. If not Wikisource, where should this work take place?
From my perspective it fits perfectly with the vision that I had of Wikisource on the first day of its existence. Tim Armstrong [[User:Tarmstro99]] has already done a considerable amount of valuable work relating to law on Wikisource. That has been mostly a one-man project to deal with a massive amount of material. Some have even proposed deleting all the US Code material on the grounds that we don't have the ability to keep it up to date. That has prompted some very interesting questions and ideas about how this kind of stuff might be handled, but taking those questions to the next level requires lots of work. Most regular Wikisourcerors already have long personal to-do lists to keep them busy. So the question is not really about whether Wikisource should host these goods, it's about recruiting volunteers to do the hard work.
Ec
On Sat, Jun 20, 2009 at 11:41 AM, David Gerard dgerard@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
- d.
On Sun, Jun 21, 2009 at 1:41 AM, David Gerard dgerard@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
Tim Armstrong is a sysop on Wikisource ... :-) more below..
On Sun, Jun 21, 2009 at 4:17 PM, Ray Saintonge saintonge@telus.net wrote:
Samuel Klein wrote:
There is a wealth of work done all the time by primary source researchers and publishers, which could be improved on by having wikisource entries, translations, &c.
Related question : how appropriate would large numbers of public domain texts, with page scans and the best available OCR [and translations of same], fit with what Wikisource does now? This is clearly a wiki project that needs to happen : OCR even at its best misses rare meaning-bearing words. If not Wikisource, where should this work take place?
If it was published, Wikisource accepts it. Notability is not a consideration.
The only other "open" project of comparable size is [[Distributed Proofreaders]]. Here are our statistics:
http://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
Most of the Wikisource projects accept free translations.
http://wikisource.org/wiki/WS:COORD
The two English Wikisource featured translations are:
http://en.wikisource.org/wiki/Balade_to_Rosemounde http://en.wikisource.org/wiki/J%27accuse (also translated into Dutch)
The two biggest translation projects that I know of are:
http://en.wikisource.org/wiki/Romance_of_the_Three_Kingdoms http://en.wikisource.org/wiki/Bible_(Wikisource)
Another good one is
http://en.wikisource.org/wiki/Max_Havelaar_(Wikisource)
We also have translations of laws, usually relating to copyright.
http://en.wikisource.org/wiki/Ordinance_93-027_of_30_March_1993_on_copyright...
From my perspective it fits perfectly with the vision that I had of Wikisource on the first day of its existence. Tim Armstrong [[User:Tarmstro99]] has already done a considerable amount of valuable work relating to law on Wikisource.
Tim has been doing high impact work in this area.
H.R. Rep. No. 94-1476
http://blogs.law.harvard.edu/infolaw/2008/06/17/an-open-access-success-story...
U.S. Statutes at Large
http://blogs.law.harvard.edu/infolaw/2008/06/02/public-records-one-jpeg-at-a...
http://en.wikisource.org/wiki/United_States_Statutes_at_Large
As regards the USC, the majority of it is a mess, but Title 17 is a great example of where we are heading.
http://en.wikisource.org/wiki/United_States_Code/Title_17
We also have transcription projects for the UK 1911 copyright act, which has influenced so many other countries.
http://en.wikisource.org/wiki/Index:The_copyright_act,_1911,_annotated.djvu http://en.wikisource.org/wiki/Index:A_treatise_upon_the_law_of_copyright.djv...
More can be found from our freshly minted Law index:
http://en.wikisource.org/wiki/Wikisource:Law
Our two featured texts are: http://en.wikisource.org/wiki/South_Africa_Act_1909 http://en.wikisource.org/wiki/ACLU_v._NSA_(District_Court_opinion)
Most regular Wikisourcerors already have long personal to-do lists to keep them busy. So the question is not really about whether Wikisource should host these goods, it's about recruiting volunteers to do the hard work.
If people want to help but don't know where to start, my recommendation is that they start proofreading Stat. volume 1, as this is a goldmine of interesting documents and will be an excellent example of crowdsourced transcription.
http://en.wikisource.org/wiki/Index:United_States_Statutes_at_Large/Volume_1
Enjoy, John Vandenberg
For Supreme Court cases, would it be possible to have a bot pull the audio decisions from Oyez, and convert them into text?
From: David Gerard dgerard@gmail.com To: Wikimedia Foundation Mailing List foundation-l@lists.wikimedia.org Sent: Saturday, June 20, 2009 8:41:45 AM Subject: [Foundation-l] Info/Law blog: Using Wikisource as an Alternative Open Access Repository for Legal Scholarship
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
- d.
On Sun, Jun 21, 2009 at 1:41 AM, David Gerard dgerard@gmail.com wrote:
http://blogs.law.harvard.edu/infolaw/2009/06/19/using-wikisource-as-an-alter...
Interesting. How well does this fit with what Wikisource does?
Here are seven articles from PLoS One.
http://en.wikisource.org/wiki/Category:Plosone
We have other published material that has been released under CC licenses:
http://en.wikisource.org/wiki/Unhappy_Thought
And books under various licenses:
http://en.wikisource.org/wiki/Bulgarian_Policies_on_the_Republic_of_Macedoni... http://en.wikisource.org/wiki/A_Short_History_of_Russian_%22Fantastica%22 http://en.wikisource.org/wiki/Free_as_in_Freedom
-- John Vandenberg