On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmorken@shaw.ca> wrote:
I am not sure about the cost of the bandwidth, but the wikipedia image dumps are no longer available from the wikipedia dump site anyway. I am guessing they were removed partly because of the bandwidth cost, or perhaps because of image licensing issues.
I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a spare terabyte (or is it more?) of storage who actually want to download a Wikipedia image dump will be vanishingly small compared to the number of normal users. Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but probably no one would worry about it.)
Bittorrent is simply a more efficient method of distributing files, especially if the much larger wikipedia image files were made available again. The last dump from english wikipedia that included images is over 200GB, but it is understandably not available for download. Even if there are only 10 people per month who download these large files, bittorrent should be able to reduce the bandwidth cost to wikipedia significantly.
Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall. One gigabit per second adds up to about 10.8 terabytes per day, so say 300 terabytes per month. I'm pretty sure the average figure is more like five or ten Gbps than one, so let's say a petabyte a month at least. Ten people per month downloading an extra terabyte is not a big issue. And I really doubt we'd see that many people downloading a full image dump every month.
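For anyone who wants to sanity-check that arithmetic, here's a quick back-of-the-envelope calculation in Python (the 5 Gbps average and the ~1 TB dump size are my guesses from above, not official Wikimedia figures):

    # Rough bandwidth arithmetic; the 5 Gbps average is an assumption,
    # not an official Wikimedia number.
    avg_gbps = 5
    bytes_per_day = avg_gbps * 1e9 / 8 * 86_400   # bits/s -> bytes/day
    tb_per_day = bytes_per_day / 1e12             # ~54 TB/day at 5 Gbps
    tb_per_month = tb_per_day * 30                # ~1,620 TB (~1.6 PB) per month
    extra_tb = 10 * 1.0                           # ten hypothetical ~1 TB dump downloads
    print(f"{tb_per_day:.0f} TB/day, {tb_per_month:.0f} TB/month")
    print(f"ten 1 TB dumps = {100 * extra_tb / tb_per_month:.1f}% of monthly traffic")

which works out to well under one percent of the monthly total.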
The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that. Then you could get an old copy of the files from anywhere (including Bittorrent, if you like) and only have to download the changes. Plus, you could get up-to-the-minute copies, although some throttling should probably be put in place to stop dozens of people from all running rsync in a loop to make sure they have the absolute latest version. I believe rsync 2 doesn't handle such huge numbers of files acceptably, but I've heard rsync 3 is supposed to be much better. That sounds like a better direction to look in than Bittorrent -- nobody's going to want to redownload the same files constantly to get an up-to-date set.
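To make that concrete, here's a minimal sketch of what the server side and a mirror operator's side might look like (the module name, host, paths, and limits are all made up for illustration, not an actual Wikimedia setup):

    # /etc/rsyncd.conf on an image server -- illustrative only
    [commons-images]
        path = /srv/images/commons
        comment = Commons image files, read-only
        read only = yes
        max connections = 10     # crude throttle on simultaneous mirrors

    # What a mirror operator might then run, say from cron once a day:
    rsync -av --delete --bwlimit=10000 \
        rsync://images.example.org/commons-images/ /data/commons-mirror/

With something like that in place, someone could seed their initial copy from a torrent of an old dump and then let rsync pull only the files that have changed since.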
Unless there are legal reasons for not allowing images to be downloaded, I think the wikipedia image files should be made available for efficient download again.
I'm pretty sure the reason there's no image dump is purely because not enough resources have been devoted to getting it working acceptably.