I’d like to see the Commons backups available in the AMZN S3 cloud, even if it is only as “requester pays”. Frankly, my experience is that getting data from the Internet Archive is so slow that I wonder if they are on the Moon.
My infovore framework
http://github.com/paulhoule/infovore
is specifically designed to make Hadoop applications easy to run in your own cluster or in a cluster provisioned automatically in Amazon EMR. In particular, an application can be packaged in the S3 cloud and run by somebody with little Hadoop or AWS experience. This makes handling “big data” much more accessible than it ever has been.
AMZN has had a policy of offering free S3 storage for public data sets – I’d like to see them take this program to the next level with data sets of this nature.
From: Gerard Meijssen
Sent: Monday, October 14, 2013 4:38 PM
To: Wikimedia Commons Discussion List
Subject: Re: [Commons-l] [wikiteam-discuss:699] "Tarballs" of all 2004-2012 Commons files now available at archive.org
Hoi,
Geni, sorry, but there is a difference between there being a backup of Commons within the WMF and there being a dataset of Commons at the IA that is not current. People can do all the analysis they want on the old data and it will not make any difference. It will not make the data that is currently in Commons any more accessible.
We have been told repeatedly that the data at the WMF is secure. Beyond that, the data is like knowing the maximum the insurance policy will pay: you know it will not be enough. It is, however, very much a hypothetical question. How to make Commons usable is a here-and-now issue. Thanks, GerardM
On 14 October 2013 22:22, geni geniice@gmail.com wrote:
On 14 October 2013 13:59, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi,
While I do agree that it is good to have the data in many places, the Internet Archive on its own moves it to several places as well. Many of us have seen the IA servers at the Library of Alexandria.
While it is ok to find a use for the data at the IA, I would like us to concentrate first and foremost on how we can make better use of the media that is in Commons itself: how we can open it up to more use and make Commons more accessible.
And you need to stop right there. As in don't express a further opinion until you realise how wrong you are. You can't do any analysis on data that is lost. And non backed up data is just data that doesn't know that it is lost yet.
-- geni
_______________________________________________ Commons-l mailing list Commons-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/commons-l
On Tue, Oct 15, 2013 at 12:35 AM, Paul A. Houle paul@ontology2.com wrote:
I’d like to see the Commons backups available in the AMZN S3 cloud, even if it is only as “requester pays”. Frankly, my experience is that getting data from the Internet Archive is so slow that I wonder if they are on the Moon.
They do requests:
https://portal.aws.amazon.com/gp/aws/html-forms-controller/aws-dataset-inqui...
So maybe someone with a lot of experience with this dataset could shoot in a request?
-- Hay
Paul A. Houle, 15/10/2013 00:35:
I’d like to see the Commons backups available in the AMZN S3 cloud, even if it is only as “requester pays”. Frankly, my experience is that getting data from the Internet Archive is so slow that I wonder if they are on the Moon.
When did you last try? They recently increased their bandwidth.
My infovore framework [...] AMZN has had a policy of offering free S3 storage for public data sets – I’d like to see them take this program to the next level with data sets of this nature.
It seems anyone can request it; anyway, I sent an inquiry. The datasets they have (XML dumps) are very outdated:
https://aws.amazon.com/datasets/Encyclopedic/4182
https://aws.amazon.com/datasets/Encyclopedic/2506
https://aws.amazon.com/datasets/Encyclopedic/2596
Anyone can also ask other mirrors (I already asked GARR); some ideas are at https://sourceforge.net/apps/trac/sourceforge/wiki/Mirrors
Nemo