Todd Allen wrote:
Robert Horning wrote:
With regard to "live" mirrors that are constantly sucking bandwidth off of the Wikimedia server farm, I would have to agree that this is a major problem and something that should be dealt with, both on a legal front and through technical means. I would be curious to see a comparison of the bandwidth needs of a *very* active Wikimedia user/administrator who is on-line nearly 24/7 vs. one of these mirror sites. I think an active editor/user could easily pull at least 1 GB of data a day, though probably not much beyond that order of magnitude. It would be an interesting test to see how much it actually comes out to in practice.
Robert Horning
But even so, the Foundation has every right to say "It is totally acceptable for a very active administrator (or even a very voracious reader) to use 1 GB a day if they want to, but it is not acceptable for a live mirror to do so." Legitimate (even if heavy) users of a site are one thing, bandwidth leeches are quite another.
I guess in part here I'm trying to propose a technical solution of sorts. Perhaps bandwidth for a particular IP address could be throttled in some way that would allow a very heavy but legitimate user to access the 50-100 pages or so a day that they actually read in some depth (just to give a figure), but not allow a mirror to suck up every change unless they have made some sort of financial arrangement with the WMF to pay for this extra bandwidth. The WMF will certainly not "make a profit" doing this, and even if it became a problem with the IRS, solutions could still be found to help get these mirrors to pay for the resources they are taking up.
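As a rough illustration of the per-IP throttling idea above, here is a minimal token-bucket sketch in Python. Everything in it is an assumption made for illustration, not anything the WMF actually runs: the class name PerIpThrottle, the default budget of 100 pages a day (taken from the 50-100 figure above), and the exempt set standing in for mirrors that have made a payment arrangement.

```python
import time


class PerIpThrottle:
    """Hypothetical per-IP token bucket (illustrative only).

    Each IP starts with `burst` page tokens and regains them at
    `pages_per_day` tokens per 24 hours. One page request costs one token.
    """

    def __init__(self, burst=100, pages_per_day=100, exempt=(), clock=time.monotonic):
        self.burst = float(burst)
        self.rate = pages_per_day / 86400.0  # tokens regained per second
        self.exempt = set(exempt)            # e.g. mirrors with a paid arrangement
        self.clock = clock                   # injectable so the logic can be tested
        self.buckets = {}                    # ip -> (tokens_left, time_last_seen)

    def allow(self, ip):
        """Return True if this request from `ip` should be served."""
        if ip in self.exempt:
            return True
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill for the time elapsed since the last request, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

A heavy reader stays well inside the budget, while a client that tries to fetch every change runs the bucket dry and starts getting refused. Keying on the IP address is the weak point: many legitimate users can share one address behind a proxy or NAT, and a determined mirror can spread its requests over many addresses.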
The trick is that some of these mirrors can and do use some very sneaky methods to keep their sites up to date with current data. While you can cull out some of these sites and segregate them from ordinary users, this is a technical arms race between those trying to block the live mirrors and the hackers behind those sites, who can fake requests that look like ordinary user requests in order to keep their pages up to date. That I can come up with an algorithm right now that would make a mirror's requests look like random user page views is enough to make me think this could get to the point where mirrors are nearly impossible to detect, except by seeing pages on the mirror show up with very recent changes. Most mirrors aren't that clever, so this may not apply in practice.