William Pietri wrote:
On 01/07/2010 01:40 AM, Jamie Morken wrote:
I have a suggestion for Wikipedia! I think the database dumps, including the image files, should be made available through a Wikipedia BitTorrent tracker, so that people could download the Wikipedia backups including the images (which they currently can't do), and so that Wikipedia's bandwidth costs would be reduced. [...]
Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
No, bandwidth is not really the problem here. I think the core issue is to have bulk access to images.
There have been a number of these requests in the past, and after some back and forth it has usually turned out that a smaller subset of the data works just as well.
A good example of this was the Deutsche Fotothek archive made available late last year.
http://download.wikipedia.org/images/Deutsche_Fotothek.tar (11 GB)
This provided an easily retrievable, high-quality subset of our image data that researchers could use.
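For anyone who wants to try it, working with one of these archives is just a plain HTTP fetch plus an untar. A rough Python sketch (the local paths here are only placeholders):

    # Sketch: fetch the Deutsche Fotothek tarball and unpack it locally.
    # URL is from the link above; local filenames are just illustrative.
    import tarfile
    import urllib.request

    URL = "http://download.wikipedia.org/images/Deutsche_Fotothek.tar"
    LOCAL = "Deutsche_Fotothek.tar"

    # Download (~11 GB, so expect this to take a while).
    urllib.request.urlretrieve(URL, LOCAL)

    # Extract the images into ./fotothek/ for local processing.
    with tarfile.open(LOCAL) as tar:
        tar.extractall(path="fotothek")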
Now, if we were to snapshot image data and store it per project, the amount of duplicate image data would become significant, because we re-use a ton of images between projects, and rightfully so.
If instead we package all of Commons into a single tarball, we get roughly 6 TB of image data, which, after numerous conversations, has been a bit more than most people want to process.
So what does everyone think of going down the collections route?
If we provide enough different and up-to-date collections, we could easily give people a large but manageable amount of data to work with.
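One way a collection could be defined is simply "all files in a given Commons category", enumerated through the standard MediaWiki API (list=categorymembers). A rough Python sketch of that idea; the category name is only an example, and the continuation handling follows the current API conventions:

    # Sketch: list all files in one Commons category as a candidate "collection".
    import json
    import urllib.parse
    import urllib.request

    API = "https://commons.wikimedia.org/w/api.php"

    def category_files(category, limit=500):
        """Yield the file titles belonging to one Commons category."""
        params = {
            "action": "query",
            "list": "categorymembers",
            "cmtitle": category,
            "cmtype": "file",
            "cmlimit": str(limit),
            "format": "json",
        }
        cont = {}
        while True:
            query = urllib.parse.urlencode({**params, **cont})
            with urllib.request.urlopen(f"{API}?{query}") as resp:
                data = json.load(resp)
            for member in data["query"]["categorymembers"]:
                yield member["title"]
            if "continue" not in data:
                break
            cont = data["continue"]

    # Example category name only; any Commons category would work.
    for title in category_files("Category:Images from the Deutsche Fotothek"):
        print(title)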
If there is already a page for this, please feel free to point me to it; otherwise I'll create one.
--tomasz