http://www.nytimes.com/external/readwriteweb/2009/02/25/25readwriteweb-amazo...
According to this, a new Amazon project that makes a terabyte of public data available includes a full dump of Wikipedia. It also includes the complete DBpedia dataset, so there is likely to be a lot of duplicated content between the two. Given what else it says is included (the whole human genome, all other publicly available DNA sequences, census data, etc.), I'm not sure how it all fits in a single terabyte. Interesting concept, though. I also wonder how old the Wikipedia dump is, since the dumps have been unavailable for some time.
Nathan
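As a rough back-of-envelope check on the one-terabyte question, the Python sketch below just totals up guessed compressed sizes for the kinds of datasets mentioned. Every number in it is an illustrative assumption, not a figure from the Amazon announcement; only the genome arithmetic (roughly 3.2 billion bases at 2 bits per base) is a straightforward calculation.

# Rough back-of-envelope check: could the listed datasets plausibly fit in 1 TB?
# Every size below is an illustrative guess (in GB), NOT a figure from Amazon.
approx_sizes_gb = {
    "enwiki pages-articles dump, compressed": 5,             # assumption
    "DBpedia full dataset, compressed": 10,                  # assumption
    "human reference genome (3.2e9 bases, 2 bits/base)": 3.2e9 * 2 / 8 / 1e9,
    "other public DNA sequence archives, compressed": 300,   # assumption
    "census data": 200,                                      # assumption
}

total_gb = sum(approx_sizes_gb.values())
for name, size_gb in approx_sizes_gb.items():
    print(f"  ~{size_gb:7.1f} GB  {name}")
print(f"Estimated total: ~{total_gb:.0f} GB ({total_gb / 1000:.2f} TB)")
print("Plausibly fits in 1 TB" if total_gb <= 1000 else "Would exceed 1 TB")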
2009/2/25 Nathan nawrich@gmail.com:
It probably contains only the latest revision of each page in the main namespace, rather than a full-history dump (I can't see why they would want full history). That's pretty small (a bit larger if they've included images, of course). I think there have been article dumps of enwiki reasonably recently; it's only the full-history dumps that always fail.
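If anyone wanted to verify that a given dump really is latest-revision-only rather than full history, a streaming check along these lines would do it. This is just a sketch against the standard pages-articles XML export format, and the filename is a placeholder.

import bz2
import xml.etree.ElementTree as ET

# Placeholder filename: a pages-articles dump, bzip2-compressed.
DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"

pages = 0
revisions = 0

# Stream the dump so the whole file never has to fit in memory.
with bz2.open(DUMP_PATH, "rb") as stream:
    for _, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()                    # free memory for processed pages

print(f"{pages} pages, {revisions} revisions")
print("Looks like a latest-revision-only dump" if pages == revisions
      else "Contains revision history")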
Thread convergence! It didn't include Wikipedia proper when I looked yesterday, but this was suggested...
On Tue, Feb 24, 2009 at 11:26 PM, Brian Brian.Mingus@colorado.edu wrote:
Why not make the uncompressed dump available as an Amazon Public Dataset? http://aws.amazon.com/publicdatasets/
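For context on how a consumer would use such a dataset: AWS public datasets of this kind are typically published as EBS snapshots, so you create a volume from the snapshot and attach it to a running instance rather than downloading anything. The boto3 sketch below shows the idea; the snapshot ID, availability zone, and instance ID are all hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# All identifiers below are hypothetical placeholders.
SNAPSHOT_ID = "snap-0123456789abcdef0"    # the dataset's published snapshot
AVAILABILITY_ZONE = "us-east-1a"          # must match the instance's zone
INSTANCE_ID = "i-0123456789abcdef0"       # an instance you already run

# Create an EBS volume from the public snapshot.
volume = ec2.create_volume(SnapshotId=SNAPSHOT_ID,
                           AvailabilityZone=AVAILABILITY_ZONE)
volume_id = volume["VolumeId"]

# Wait until the volume is ready, then attach it to the instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
ec2.attach_volume(VolumeId=volume_id, InstanceId=INSTANCE_ID, Device="/dev/sdf")

# Once attached, mount the device inside the instance and read the dump
# files directly; there is no separate download step.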