I started a page [0] to track the problem and some of the emerging solutions. I'm still transferring information over from a private wiki, but it would be great to get others to document what they've been using. I'll start expanding on the tools I know about to give more information about the tradeoffs of using them.
Thanks for all the great info so far! I know this has been consuming a lot of web admins' time over the last few months.
[0]: https://www.mediawiki.org/wiki/Handling_web_crawlers
On Thu, Apr 24, 2025 at 2:39 PM Bryan Davis <bd808@wikimedia.org> wrote:
> On Thu, Apr 24, 2025 at 3:16 PM MusikAnimal <musikanimal@gmail.com> wrote:
>> Note that this exercise of IP range whack-a-mole is nothing new to VPS tools. I maintain two VPS projects (XTools, WS Export) that constantly suffer from aggressive web crawlers and disruptive automation. We've been doing the manual IP block thing for years :(
> An interesting aspect of both of those Cloud VPS projects is that they are directly linked to from a number of content wikis. I think this greatly extends their exposure to crawler traffic in general.
>> I suggest the IP denylist be applied to all of WMCS <https://phabricator.wikimedia.org/T226688>. We're able to get by for XTools and WS Export because XFF headers were specially enabled for this counter-abuse purpose. However, most VPS tools and all of Toolforge don't have that luxury. If there are bots pounding away, there's no means to stop them currently (unless they are good bots with an identifiable UA). Even if we could detect them, it seems better to reduce the repetitive effort and give all of WMCS the same treatment.
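To make that concrete for the wiki page, here is a rough sketch of the kind of manual X-Forwarded-For / User-Agent filtering being described, written as a small Python WSGI middleware. The CIDR ranges and agent strings are placeholders, and the whole thing assumes the front proxy passes a trustworthy X-Forwarded-For header (as it does for XTools and WS Export); it's one way to express a denylist, not what those tools actually run.

# Toy denylist middleware: reject requests whose forwarded client IP or
# User-Agent matches a blocked entry. Ranges and agent strings are examples.
import ipaddress

BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "198.51.100.0/24")]
BLOCKED_UA_SUBSTRINGS = ("ExampleBadBot", "GenericCrawler")

class CrawlerFilter:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        xff = environ.get("HTTP_X_FORWARDED_FOR", "")
        # With a trusted front proxy, the left-most XFF entry is the client.
        client = xff.split(",")[0].strip() if xff else environ.get("REMOTE_ADDR", "")

        blocked = any(bad in ua for bad in BLOCKED_UA_SUBSTRINGS)
        if not blocked and client:
            try:
                addr = ipaddress.ip_address(client)
                blocked = any(addr in net for net in BLOCKED_NETS)
            except ValueError:
                pass  # malformed address; let the application handle it
        if blocked:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Blocked: automated traffic\n"]
        return self.app(environ, start_response)

Wrapping an existing WSGI app is then just app = CrawlerFilter(app).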
> You are talking about three completely separate HTTP edges at this point. They all live on the same core Cloud VPS infrastructure, but there is no common HTTPS connection between the *.toolforge.org proxy, the *.wmcloud.org proxy, and the Beta Cluster CDN. The first two share some nginx stack configuration, but in practice are very different deployments with independent public IP addresses. The third is fundamentally a partial clone of the production wiki's CDN edge, although scaled down and missing some newer components that nobody has yet done the work to introduce.
>> I'll also note that some farms of web crawlers can't feasibly be blocked whack-a-mole style. This is the situation we're currently dealing with over at https://phabricator.wikimedia.org/T384711#10759017.
> Truly distributed attack patterns (botnet traffic) are really hard to defend against with just an Apache2 instance. This is actually a place where someone could try experimenting with a filtering proxy like Anubis [0], go-away [1], or openappsec [2]. Having some experience with these tools could then lead us into better discussions about deploying them more widely or making them easier to use in targeted projects.
> Bryan
> Bryan Davis
> Wikimedia Foundation
> Principal Software Engineer
> Boise, ID USA
> [[m:User:BDavis_(WMF)]]
> irc: bd808
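On the filtering proxy idea above: before trying Anubis, go-away, or openappsec on a project, it may be worth spelling out on the wiki page what the simplest form of request filtering even looks like. Below is a toy per-network rate limiter of the kind a filtering layer might apply in front of a backend; the window and threshold values are made up, and real tools add challenges, fingerprinting, and shared state that this sketch does not attempt.

# Toy sliding-window rate limiter keyed on the client's /24 (IPv4) or /64
# (IPv6) network. Window size and threshold below are illustrative only.
import ipaddress
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_NET = 300  # placeholder threshold per network per window

_hits = defaultdict(deque)  # network -> timestamps of recent requests

def client_network(ip_str):
    """Collapse a client address to its /24 (IPv4) or /64 (IPv6) network."""
    addr = ipaddress.ip_address(ip_str)
    prefix = 24 if addr.version == 4 else 64
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)

def allow_request(ip_str, now=None):
    """Return True if this request is within the per-network rate limit."""
    now = time.monotonic() if now is None else now
    hits = _hits[client_network(ip_str)]
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    if len(hits) >= MAX_REQUESTS_PER_NET:
        return False
    hits.append(now)
    return True

Even this much only helps against traffic that clusters into a manageable number of networks; truly distributed botnets are exactly where the heavier tools Bryan mentions come in, so documenting experiences with them on the wiki page would be especially useful.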