On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor <Simetrical+wikilist@gmail.com> wrote:
> On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmorken@shaw.ca> wrote:
>> I am not sure about the cost of the bandwidth, but the wikipedia image dumps are no longer available on the wikipedia dump anyway. I am guessing they were removed partly because of the bandwidth cost, or else image licensing issues perhaps.
> I think we just don't have infrastructure set up to dump images. I'm very sure bandwidth is not an issue -- the number of people with a
Correct. The space wasn't available for the required intermediate cop(y|ies).
> terabyte (or is it more?) handy that they want to download a Wikipedia image dump to will be vanishingly small compared to normal users.
s/terabyte/several terabytes/. My copy is not up to date, but it's not smaller than 4 terabytes.
> Licensing wouldn't be an issue for Commons, at least, as long as it's easy to link the images up to their license pages. (I imagine it would technically violate some licenses, but no one would probably worry about it.)
We also dump the licensing information. If we can lawfully put the images on the website, then we can also distribute them in dump form. There is, and can be, no licensing problem.
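(For anyone who wants to tie a downloaded image back to its description and license page programmatically, a minimal sketch follows. Querying the live imageinfo/extmetadata API and the "Example.jpg" filename are illustrative assumptions on my part, not how the dumps themselves are produced; the dumps carry the same description and licensing text in bulk.)

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

def license_info(filename):
    # Ask the imageinfo API for the file's description page URL and
    # machine-readable license metadata (illustrative only).
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": "File:" + filename,
        "prop": "imageinfo",
        "iiprop": "url|extmetadata",
        "format": "json",
    })
    req = urllib.request.Request(
        API + "?" + params,
        headers={"User-Agent": "license-lookup-sketch/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    info = page["imageinfo"][0]
    meta = info.get("extmetadata", {})
    return {
        "description_page": info["descriptionurl"],
        "license": meta.get("LicenseShortName", {}).get("value"),
        "author": meta.get("Artist", {}).get("value"),
    }

if __name__ == "__main__":
    # Hypothetical filename, purely for illustration.
    print(license_info("Example.jpg"))
```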
> Wikipedia uses an average of multiple gigabits per second of bandwidth, as I recall.
http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png
Though only this part is paid for: http://www.nedworks.org/~mark/reqstats/transitstats-daily.png
The rest is peering, etc., which is only paid for in the form of equipment, port fees, and operational costs.
> The sensible bandwidth-saving way to do it would be to set up an rsync daemon on the image servers, and let people use that.
This was how I maintained a running mirror for a considerable time.
Unfortunately the process broke when WMF ran out of space and needed to switch servers.
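(For what it's worth, the client side of such an arrangement is trivial; here's a rough sketch of the sort of job a mirror operator would run from cron. The host, module name, and local path are invented placeholders, since no real endpoint exists at the moment. The point is that re-running it only transfers new and changed files.)

```python
#!/usr/bin/env python3
# Sketch of the client side of an rsync-based image mirror.
# The host and module below are invented examples, not real WMF endpoints.
import subprocess
import sys

SOURCE = "rsync://rsync.example.org/wikimedia-images/"
DEST = "/srv/mirror/wikimedia-images/"

def sync():
    cmd = [
        "rsync",
        "--archive",        # preserve timestamps/perms so unchanged files are skipped next run
        "--delete",         # drop files that were deleted or renamed upstream
        "--partial",        # keep partial transfers so an interrupted run can resume
        "--bwlimit=20000",  # be polite to the server: cap at roughly 20 MB/s
        "--timeout=600",    # give up on a stalled connection after ten minutes
        SOURCE,
        DEST,
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(sync())
```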
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken <jmorken@shaw.ca> wrote:
> Bittorrent is simply a more efficient method to distribute files,
No. In a very real absolute sense bittorrent is considerably less efficient than other means.
Bittorrent moves more of the outbound traffic to the edges of the network, where the real cost per gbit/sec is much greater than at major datacenters: a megabit on a low-speed link is more costly than a megabit on a high-speed link, and a megabit on a mile of fiber is more expensive than a megabit on ten feet of fiber.
Moreover, bittorrent is topology-unaware, so the path length tends to approach the internet's mean path length. Datacenters tend to be more centrally located topology-wise, and topology-aware distribution is easily applied to centralized stores. (E.g. the WMF satisfies requests from Europe in Europe, though not for the dump downloads, as there simply isn't enough traffic to justify it.)
Bittorrent is also a more complicated, higher-overhead service, which requires more memory and more disk IO than traditional transfer mechanisms.
There are certainly cases where bittorrent is valuable, such as the flash mob case of a new OS release. This really isn't one of those cases.
On Thu, Jan 7, 2010 at 11:52 AM, William Pietri <william@scissor.com> wrote:
> On 01/07/2010 01:40 AM, Jamie Morken wrote:
>> I have a suggestion for wikipedia!! I think that the database dumps including the image files should be made available by a wikipedia bittorrent tracker so that people would be able to download the wikipedia backups including the images (which currently they can't do) and also so that wikipedia's bandwidth costs would be reduced. [...]
> Is the bandwidth used really a big problem? Bandwidth is pretty cheap these days, and given Wikipedia's total draw, I suspect the occasional dump download isn't much of a problem.
> Bittorrent's real strength is when a lot of people want to download the same thing at once. E.g., when a new Ubuntu release comes out. Since Bittorrent requires all downloaders to be uploaders, it turns the flood of users into a benefit. But unless somebody has stats otherwise, I'd guess that isn't the problem here.
We tried BT for the Commons POTY (Picture of the Year) archive once while I was watching, and we never had a downloader stay connected long enough to help another downloader... and that was only 500 MB, much easier to seed.
BT also makes the server-side costs a lot higher: it has more CPU/memory overhead and creates a lot of random disk IO. For low-volume large files it's often not much of a win.
I haven't seen the numbers for a long time, but when I last looked download.wikimedia.org was producing fairly little traffic... and much of what it was producing was outside of the peak busy hour for the sites. Since transit is paid for on the 95th percentile and the WMF still has a decent day/night swing, out-of-peak traffic is effectively free. The bandwidth is nothing to worry about.
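(If the 95th-percentile part isn't familiar: the carrier samples the traffic rate every five minutes, throws away the top 5% of samples for the billing period, and bills on the highest sample left, so traffic added during the quiet hours never touches the bill. A toy illustration with invented numbers, using one day of samples rather than a whole month, but the arithmetic is the same:)

```python
import math

def billable_mbps(samples):
    # 95th-percentile billing: sort the 5-minute samples, discard the top 5%,
    # and charge for the highest remaining sample.
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

# Invented numbers: 288 five-minute samples, swinging between a busy half
# at ~1000 Mbit/s and a quiet half at ~400 Mbit/s.
baseline = [1000] * 144 + [400] * 144
print(billable_mbps(baseline))      # 1000 -- the busy hours set the bill

# Add 300 Mbit/s of dump traffic entirely during the quiet half:
with_dumps = [1000] * 144 + [700] * 144
print(billable_mbps(with_dumps))    # still 1000 -- the extra traffic is free

# The same 300 Mbit/s added during the busy half would raise the bill:
peak_dumps = [1300] * 144 + [400] * 144
print(billable_mbps(peak_dumps))    # 1300
```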