> > The new policy isn’t more restrictive than the older one for general
> > crawling of the site or the API; on the contrary we allow higher limits
> > than previously stated.
>
> I find this hard to believe, considering this new sentence for
> upload.wikimedia.org: «Always keep a total concurrency of at most 2, and
> limit your total download speed to 25 Mbps (as measured over 10 second
> intervals).»
>
> This is a ridiculously low limit. It's a speed which is easy to breach
> in casual browsing of Wikimedia Commons categories, let alone with any
> kind of media-related bots.

First of all, each of the limits explicitly excludes web browsers and human activity in general. This limit (which we can discuss; see below) is intended to ensure that a single unidentified agent cannot use a significant slice of our available resources.
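
To make the numbers concrete, here is a rough sketch of what staying under that cap could look like on the client side, in Python; the HTTP library, the URL and the chunk size are placeholders of my own, not something the policy prescribes beyond the two figures quoted above:

    import threading
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests  # placeholder HTTP client; any library works

    MAX_BYTES_PER_SEC = 25_000_000 / 8   # 25 Mbps, converted to bytes per second
    WINDOW = 10.0                        # the limit is measured over 10 second intervals

    _lock = threading.Lock()
    _window_start = time.monotonic()
    _window_bytes = 0

    def throttle(nbytes):
        """Account for nbytes just downloaded; sleep if the 10 s average would exceed the cap."""
        global _window_start, _window_bytes
        with _lock:
            now = time.monotonic()
            if now - _window_start >= WINDOW:
                _window_start, _window_bytes = now, 0
            _window_bytes += nbytes
            sleep_for = max(0.0, _window_bytes / MAX_BYTES_PER_SEC - (now - _window_start))
        if sleep_for:
            time.sleep(sleep_for)

    def fetch(url, path):
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=64 * 1024):
                    throttle(len(chunk))
                    f.write(chunk)

    # "a total concurrency of at most 2": never more than two requests in flight
    downloads = [("https://upload.wikimedia.org/wikipedia/commons/...", "local-copy")]  # placeholder
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(lambda args: fetch(*args), downloads))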

Second, IIRC there was no stated limit on downloads of media files in the old policy, because it was written in 2009, when media downloads weren't as big an issue; that is why the quote you report explicitly says "the site or the API". Any limit imposed on media downloads is indeed, by default, more restrictive.

It was never my goal, in updating the policy, to limit what can be done, but rather to eventually get to a point where we can safely identify whether some traffic is coming from a user, from a high-volume bot we've identified, or is just random traffic from the internet. That will help us both reduce the constant stream of incidents related to predatory downloading of our images and reduce the impact on legitimate users[1].

Simply put, I want to be able to know who's doing what, and to be able to put general limits on unidentified actors that we can determine clearly aren't a user-run browser.
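
To make "identified" concrete: one of the most useful things a high-volume client can do is to send a User-Agent that names the tool and gives a way to reach its operator, so its traffic can be attributed to a known bot rather than treated as random internet noise. A hypothetical example (the bot name, URL and address are invented, and the exact format is my illustration, not something spelled out in this email):

    import requests  # placeholder HTTP client

    # An identifiable client: tool name, version, and operator contact.
    HEADERS = {
        "User-Agent": "CommonsMirrorBot/1.2 (https://example.org/mirrorbot; ops@example.org)"
    }

    resp = requests.get(
        "https://commons.wikimedia.org/w/api.php",
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["query"]["general"]["sitename"])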

As you can imagine, I have a personal interest in this: moving from the game of whack-a-mole SRE plays nowadays to systematic enforcement of limits on unidentified clients will improve my own quality of life.

I have no interest in, nor any intention of, preventing people from archiving Wikipedia, and I guess the community doesn't either; I hope it could eventually grant tiers of usage to individual bots, leaving me/us only the role of defining said tiers. It was never my intention, in writing the limits, to impede any activity, but rather to put ourselves in a position where we're more aware of who is doing what.

> I appreciate that some exceptions for Wikimedia Cloud bots were added
> after the discussion at
> https://phabricator.wikimedia.org/T391020#10716478 , but the fact
> remains that this comes off as a big change.

Actually, the exception for WMCS, which has been around for years, has been a pillar of the policy since I wrote the first draft. Protecting community use while also protecting the infrastructure (and, honestly, my weekends :) ) has always been my main goal.

Having said all of the above, I do see how the 25 Mbps limit seems stringent; to help evaluate it, let me explain how I got to that number:

* Because of the nature of media downloads, it will be extremely hard for us to enforce limits that are not per-IP. I don't want to get into more details on that, but let me just say that fairly rate-limiting usage of our media-serving infrastructure isn't simple, especially if you're trying very hard not to interfere with human consumption.

* I calculated what sustained bandwidth we can support in a single datacenter without saturating more than 80% of our outgoing links, assuming a crawler uses a pool of IP addresses as large as the largest we've seen from one of these abusive actors (a sketch of that kind of calculation follows below).
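
To show the shape of that calculation, here is a sketch; every figure in it is invented for illustration, since the real link capacities and IP counts aren't something I'm quoting here:

    # All figures below are made up for illustration only.
    egress_capacity_bps = 500e9   # assumed total outgoing capacity of one datacenter
    max_utilisation = 0.80        # don't saturate the links beyond 80%
    organic_peak_bps = 280e9      # assumed peak traffic from everyone else
    worst_case_ips = 4800         # assumed largest pool of IPs seen from one abusive crawler

    available_bps = egress_capacity_bps * max_utilisation - organic_peak_bps
    per_ip_cap_bps = available_bps / worst_case_ips
    print(per_ip_cap_bps / 1e6)   # -> 25.0 (Mbps) with these made-up inputs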

So yes, the number is probably a bit defensive, and we can discuss whether it's enough for non-archival bot usage. I'd even argue I'd be happy for an archival tool to use more resources if it needs them; I would also like to be able to not worry about it, and/or to block it in an emergency.

Again, the reason I've asked for feedback is that I'm open to changing things, in particular the numbers I've settled on, which of course come from the perspective of someone trying to preserve resources for consumption. If you have a suggestion for what you think would be a more reasonable default limit, considering the above, please post it on the talk page. If you have suggestions to make the intention of the policy clearer, those are of course also welcome.

Cheers,

[1] To give an example of a screwup of mine: two weekends ago, a predatory scraper masquerading as Google Chrome and coming from all over the internet brought down our media-serving backend twice. I and others intervened and saved the situation, but the ban I created cast a little too large a net, and I forgot to remove it, which eventually ended up causing issues for users; see