On Nov 24, 2004, at 8:24 PM, Brion Vibber wrote:
On Nov 24, 2004, at 8:02 PM, Rich Holton wrote:
The site 2BuyGood.com/InfoPedia (http://www.2buygood.com/wiki/) is grabbing live content from the English Wikipedia, but gives no link
[snip]
While we work on GFDL issues, could a developer block the live content grabs?
Done.
I should note that not only were they forwarding every request to our servers, stripping off all navigation links and identification text, and stuffing it full of advertising and JavaScript popups, but their requests to the web server used false referrer and user-agent fields to hide their tracks. Here are a couple of hits:
66.152.98.14 - - [25/Nov/2004:04:14:19 +0000] "GET http://www.wikipedia.org/wiki/ HTTP/1.1" 301 564 "en.wikipedia.org" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; DigExt)"
66.152.98.18 - - [25/Nov/2004:04:14:50 +0000] "GET http://www.wikipedia.org/wiki/Wikipedia:Community_Portal HTTP/1.1" 301 616 "en.wikipedia.org" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; DigExt)"
Instead of sending their own site as the referrer URL, their digger sends our hostname ("en.wikipedia.org"). That's not even a valid referrer, since it should be a full URL! And they're falsely claiming to be Internet Explorer, so the hits look like they're coming from a human's browser.
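For anyone who wants to look for the same pattern in their own logs, here is a minimal sketch in Python. It assumes the lines are in the combined log format shown in the two hits above. The names LOG_RE, parse_hit, referer_is_url and suspicious are made up for illustration; this is not the tooling used on the Wikipedia servers.

```python
import re
from urllib.parse import urlsplit

# Combined log format, as in the hits quoted above:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_hit(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def referer_is_url(referer):
    """True if the value is an absolute URL (has both a scheme and a host)."""
    parts = urlsplit(referer)
    return bool(parts.scheme and parts.netloc)

def suspicious(hit):
    """Flag hits that send a referer which is not a URL. "-" means none was sent."""
    ref = hit["referer"]
    return ref not in ("", "-") and not referer_is_url(ref)

# One of the hits quoted above, used as sample input.
SAMPLE = (
    '66.152.98.14 - - [25/Nov/2004:04:14:19 +0000] '
    '"GET http://www.wikipedia.org/wiki/ HTTP/1.1" 301 564 '
    '"en.wikipedia.org" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98; DigExt)"'
)

hit = parse_hit(SAMPLE)
print(hit["ip"], hit["referer"], suspicious(hit))
```

A bare hostname such as "en.wikipedia.org" has no scheme, so it is flagged, while a value like "http://en.wikipedia.org/wiki/Main_Page" passes. The check says nothing about the user-agent string, which cannot be validated this way.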
2buygood.com's front-end address resolves to 66.152.98.201. The hits to our servers come from several IPs on the same /24 network; I've seen .12 through .20 in the log extracts I looked at. I've blocked the whole subnet at our squid servers, so they're now receiving 403 (Forbidden) errors.
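To make the subnet arithmetic concrete, here is a small Python sketch using the standard ipaddress module. It assumes the /24 in question is 66.152.98.0/24, which follows from the addresses above; BLOCKED and is_blocked are illustrative names, and the actual block was done in the squid configuration, not with code like this.

```python
import ipaddress

# Assumed from the addresses above: the /24 containing 66.152.98.x.
BLOCKED = ipaddress.ip_network("66.152.98.0/24")

def is_blocked(addr):
    """True if addr falls inside the blocked /24."""
    return ipaddress.ip_address(addr) in BLOCKED

# The two source IPs from the log hits, the site's front-end address,
# and one address just outside the range.
for addr in ("66.152.98.14", "66.152.98.18", "66.152.98.201", "66.152.99.1"):
    print(addr, is_blocked(addr))
```

Note that the front-end address (.201) sits in the same /24 as the addresses making the requests, so a subnet-wide block covers it as well.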
-- brion vibber (brion @ pobox.com)