On 10/26/07, Anthony <wikimail@inbox.org> wrote:
> Have Google and Yahoo been informed of this policy?
No, since they're our number-one referrers.
BTW, that talks about articles, not images. And it contradicts robots.txt, especially:

    ## we're disabling this experimentally 11-09-2006
    #Crawl-delay: 1
It seems to stem from something said on the Village Pump back in 2003. I for one am going to go with robots.txt, not something someone said on some Wikipedia page.
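
For what it's worth, you don't have to trust what any wiki page says; you can ask robots.txt directly. Here's a minimal sketch using Python's standard urllib.robotparser (the "ExampleBot" user-agent and the article URL are just placeholders I made up):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    agent = "ExampleBot"  # hypothetical crawler name, not a real registered bot
    print(rp.can_fetch(agent, "https://en.wikipedia.org/wiki/Paris"))
    # crawl_delay() returns None when no Crawl-delay directive applies --
    # and a commented-out "#Crawl-delay: 1" line is exactly that: no delay set.
    print(rp.crawl_delay(agent))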
I believe a more accurate story would be as follows:
1) Live mirrors of the site, however big or small, are discouraged without prior agreement; you're supposed to use the database dumps for that. If you want to provide some kind of useful value-added "gateway" or framing, one that for instance marks up the pages in some useful way, *and* you very clearly acknowledge the source and link back to it, *and* you don't run ads or similar, *and* you don't use too much bandwidth, that's probably fine (although it's best to ask first). If you don't meet those conditions, you may be asked to pay a fee for the mirroring service, or face blocking.
2) Anything that uses enough server resources to noticeably slow down the site will probably be blocked or killed once it's noticed. In the old days this was a real concern; nowadays it probably isn't. (For a rough illustration of what "not using too much bandwidth" might look like, see the sketch right after this list.)
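
Just to illustrate what I mean by not hammering the servers, here's a rough sketch in Python. The one-second delay, the bot name, and the contact address are made-up examples, not any published requirement:

    import time
    import urllib.request

    # Identify yourself so the sysadmins can contact you instead of just
    # blocking you. The name and address here are placeholders.
    HEADERS = {"User-Agent": "ExampleBot/0.1 (contact: bot-owner@example.org)"}
    DELAY_SECONDS = 1.0  # arbitrary courtesy delay between requests

    def fetch_politely(urls):
        for url in urls:
            req = urllib.request.Request(url, headers=HEADERS)
            with urllib.request.urlopen(req) as resp:
                yield url, resp.read()
            time.sleep(DELAY_SECONDS)  # space out requests

    for url, body in fetch_politely(["https://en.wikipedia.org/wiki/Paris"]):
        print(url, len(body))

The point of the descriptive User-Agent is that if your bot ever does cause a problem, someone can email you about it rather than silently blocking you.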
There was a page I once saw stating that bots should only request pages once every ten seconds or so. When I looked in the history, I saw that Brion had added it back in 2003 or so, along with a description of the hardware Wikipedia was running on at the time: a single server with one Pentium CPU. Someone later removed the grossly outdated server description from that edit, but neglected to remove the by-then-ludicrous blanket restriction on crawlers.
Anyway, it comes down to this: it's always courteous to ask, but if you don't cause any actual damage, probably nobody will notice or care. Don't take that as any official party line (I'm not a sysadmin), but that seems to hold as far as I can tell.