William Pietri wrote:
> On 01/07/2010 01:40 AM, Jamie Morken wrote:
>> I have a suggestion for Wikipedia!! I think that the database dumps,
>> including the image files, should be made available via a Wikipedia
>> BitTorrent tracker, so that people would be able to download the
>> Wikipedia backups including the images (which currently they can't do),
>> and also so that Wikipedia's bandwidth costs would be reduced. [...]
> Is the bandwidth used really a big problem? Bandwidth is pretty cheap
> these days, and given Wikipedia's total draw, I suspect the occasional
> dump download isn't much of a problem.
No, bandwidth is not really the problem here. I think the core issue is
to have bulk access to images.
There have been a number of such requests in the past, and after some
back and forth it has usually turned out that a smaller subset of the
data works just as well.
A good example of this was the Deutsche Fotothek archive made available
late last year:
http://download.wikipedia.org/images/Deutsche_Fotothek.tar (11 GB)
This provided an easily retrievable, high-quality subset of our image
data that researchers could use.
Now, if we were to snapshot the image data for each particular project,
the amount of duplicated image data would become significant. That's
because we re-use a ton of image data between projects, and rightly so.
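To get a rough sense of how much duplication per-project snapshots would
introduce, one could hash file contents and count every copy beyond the
first across the snapshot trees. A minimal sketch (the directory layout
and function names here are hypothetical, not anything we actually run):

```python
import hashlib
import os
from collections import defaultdict

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks to avoid loading it whole."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def duplicate_bytes(snapshot_dirs):
    """Total bytes stored more than once if each snapshot directory
    were packaged as an independent tarball."""
    seen = defaultdict(list)  # content hash -> list of file sizes
    for root_dir in snapshot_dirs:
        for dirpath, _, filenames in os.walk(root_dir):
            for name in filenames:
                path = os.path.join(dirpath, name)
                seen[sha1_of_file(path)].append(os.path.getsize(path))
    # Every copy of a given content hash beyond the first is waste.
    return sum((len(sizes) - 1) * sizes[0] for sizes in seen.values())
```

Running something like this over a couple of per-project image trees
would put a concrete number on the overhead of the snapshot approach.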
If instead we package all of Commons into a single tarball, we get
roughly 6 TB of image data, which, after numerous conversations, has
proven to be a bit more than most people want to process.
So what does everyone think of going down the collections route?
If we provide enough different and up-to-date collections, we could
easily give people a large but manageable amount of data to work with.
If there is already a page for this, please feel free to point me to it;
otherwise I'll create one.
--tomasz