Hi!
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Domas
On Mon, Feb 15, 2010 at 8:54 PM, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Domas
In that case, should we tweak the MediaWiki user agent to serve something more unique than "MediaWiki/version"?
-Chad
Domas wrote:
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
But why?
(This just broke one of my bots.)
Are the details of this policy discussed anywhere?
Is it permissible to send
User-Agent: x
thus providing precisely the same amount of information as if not supplying the header at all?
Hi Steve,
But why?
Because we need to identify malicious behavior.
(This just broke one of my bots.) Are the details of this policy discussed anywhere?
I don't know. Probably. We always told people to specify a User-Agent; it's just that the check was broken.
Is it permissible to send
User-Agent: x
thus providing precisely the same amount of information as if not supplying the header at all?
No, you miss a very simple idea: with such a user-agent you clearly identify yourself as malicious, whereas when you don't specify one, you're either malicious or ignorant.
Do note, we're good at detecting spoofed user-agents too, so if your bots disguise themselves as MSIE or Firefox or any other regular browser, your behavior is seen as malicious.
We do not like malicious behavior.
Domas
On 02/15/2010 06:50 PM, Steve Summit wrote:
You're trying to detect / guard against malicious behavior using *User-Agent*?? Good grief. Have fun with the whack-a-mole game, then.
Yes, a simple restriction like this tends to create smarter villains rather than less villainy. Filtering on an obvious, easy-to-change characteristic also destroys a useful source of information on who the bad people are, making future abuse prevention efforts harder.
William
William,
Yes, a simple restriction like this tends to create smarter villains rather than less villainy. Filtering on an obvious, easy-to-change characteristic also destroys a useful source of information on who the bad people are, making future abuse prevention efforts harder.
Thanks for insights. But no.
We don't use UA as the first step of analysis; it was a helpful tertiary tool that put these people into the "ignorant or malicious" category. If they'd spoofed their UAs, we'd block the IPs and inform upstreams, treating it as fully malicious behavior. If they had a nice UA, we might have attempted to contact them or have isolated their workload until the issue is fixed ;-)
Domas
On 02/15/2010 07:55 PM, Domas Mituzas wrote:
Yes, a simple restriction like this tends to create smarter villains rather than less villainy. Filtering on an obvious, easy-to-change characteristic also destroys a useful source of information on who the bad people are, making future abuse prevention efforts harder.
Thanks for insights. But no.
We don't use UA as the first step of analysis; it was a helpful tertiary tool that put these people into the "ignorant or malicious" category. If they'd spoofed their UAs, we'd block the IPs and inform upstreams, treating it as fully malicious behavior. If they had a nice UA, we might have attempted to contact them or have isolated their workload until the issue is fixed ;-)
I am saying that going forward you have eliminated WMF's ability to use a tertiary tool that you agree was helpful.
Having spent a lot of time dealing with abuse early in the Web's history, I wouldn't have done it that way. But it's not really my problem and you don't appear to be looking for input, so godspeed.
William
William,
I am saying that going forward you have eliminated WMF's ability to use a tertiary tool that you agree was helpful.
I can't say that we entirely eliminated it - we transformed it a bit, I guess.
Having spent a lot of time dealing with abuse early in the Web's history, I wouldn't have done it that way. But it's not really my problem and you don't appear to be looking for input, so godspeed.
Oh, I'm observing all the input.
The decision made wasn't entirely "oh we must do it", and of course, there could be other courses of action taken, like cherry-picking IPs to ban, or combining subnet-wide bans with URL-based restrictions.
All of that needs work, and if WMF is willing to spend resources on implementing such restrictions, it can sure work on it - none of my choices are binding; all I usually do is keep the site up in good shape, without wasting too much money ;-)
Domas
Domas wrote:
We don't use UA as the first step of analysis; it was a helpful tertiary tool...
But it's now being claimed (one might assume, in defense of the new policy) that disallowing missing User-Agent strings is cutting 20-50% of the (presumably undesirable) load. Which sounds pretty primary. So which is it?
Presumably some percentage of that 20-50% will come back as the spammers realize they have to supply the string. Presumably we then start playing whack-a-mole.
Presumably there's a plan for what to do when the spammers begin supplying a new, random string every time.
(I do worry about where this is going, though.)
On 02/16/2010 06:57 PM, Steve Summit wrote:
Presumably some percentage of that 20-50% will come back as the spammers realize they have to supply the string. Presumably we then start playing whack-a-mole.
If you assume every problem is caused by actively malicious intelligent agents, then there are no complete solutions (or very, very few). Luckily the combination of actively malicious and intelligent isn't so common, so you can get away with making the problem just slightly harder than people are willing to deal with to get whatever kicks they get.
Given the lack of any evidence, I assert that most of the percentage of people who a) notice a problem, b) care, and c) know how to fix it probably deserve to be using the resources anyway. Besides, anyone who doesn't deserve to but still fixes the problem will likely be able to, and want to, circumvent other measures.
Conrad
Conrad wrote:
Given the lack of any evidence, I assert that most of the percentage of people who a) notice a problem, b) care, and c) know how to fix it probably deserve to be using the resources anyway. Besides, anyone who doesn't deserve to but still fixes the problem will likely be able to, and want to, circumvent other measures.
It's the last point that's the kicker. I don't have any evidence, either, nor do I know precisely what problem we're attempting to solve here. "Spamvertisers" have been mentioned. The impression I get is that when it comes to spamming, the vast majority of the damage is caused by a small minority of operators who are extremely motivated and have the resources to hire arbitrarily talented programmers.
Therefore, an approach like this might block a large number of the nasties, but a small percentage of the total damage.
So, in the end, if the spam problem ends up being more or less exactly as bad as it was before, then all of this is actually a net loss. Not only are the spammers unimpeded, but the collateral damage is still exacted: the unknown numbers of innocent bystanders (who, for whatever reason, don't have User-Agent supplied for them and aren't in a position to complain about or fix it) remain excluded. Furthermore, once we've taught/forced the canny spammers to undetectably spoof the User-Agent string, that string becomes that much more useless, not only to us, but to everyone else on the net, too.
Hello,
But it's now being claimed (one might assume, in defense of the new policy) that disallowing missing User-Agent strings is cutting 20-50% of the (presumably undesirable) load. Which sounds pretty primary. So which is it?
Check the CPU drop on Monday: http://ganglia.wikimedia.org/pmtpa/graph.php?g=cpu_report&z=medium&c...
Network drop on API: http://ganglia.wikimedia.org/pmtpa/graph.php?g=network_report&z=medium&a...
etc.
You can sure assume that we need to come up with something to "defend a new policy".
Presumably some percentage of that 20-50% will come back as the spammers realize they have to supply the string. Presumably we then start playing whack-a-mole.
Yes, we will ban all IPs participating in this.
Presumably there's a plan for what to do when the spammers begin supplying a new, random string every time.
Random strings are easy to identify, fixed strings are easy to verify.
(I do worry about where this is going, though.)
Going where it always goes: proper operation of the website. Been there, done that.
Domas
On Tue, Feb 16, 2010 at 2:31 PM, Domas Mituzas midom.lists@gmail.comwrote:
Presumably some percentage of that 20-50% will come back as the spammers realize they have to supply the string. Presumably we then start playing whack-a-mole.
Yes, we will ban all IPs participating in this.
Guess it's just a matter of time until *reading* Wikipedia is unavailable to large portions of the world.
Presumably there's a plan for what to do when the spammers begin
supplying a new, random string every time.
Random strings are easy to identify, fixed strings are easy to verify.
And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT http://whatsmyuseragent.com/CommonUserAgents.asp#5.1)", is pretty much useless, unless you've already identified the spammer through some other process.
(I do worry about where this is going, though.)
Going where it always goes: proper operation of the website. Been there, done that.
Do any of the other major websites completely block traffic when they see blank user agents?
Anthony,
Yes, we will ban all IPs participating in this.
Guess it's just a matter of time until *reading* Wikipedia is unavailable to large portions of the world.
Your insight is entirely bogus here.
And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT http://whatsmyuseragent.com/CommonUserAgents.asp#5.1)", is pretty much useless, unless you've already identified the spammer through some other process.
It isn't useless. It clearly shows that the user is acting maliciously by running automated software that disguises itself under a common user agent.
Do any of the other major websites completely block traffic when they see blank user agents?
I don't know about UA policies but... Various websites have various techniques to deal with such problems. On the other hand, no other major website has such scarcity of hardware and/or human resources as Wikipedia, given the exposure and API complexity it provides.
BR, Domas
And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT http://whatsmyuseragent.com/CommonUserAgents.asp#5.1)", is pretty much useless, unless you've already identified the spammer through some other process.
It isn't useless. It clearly shows that the user is acting maliciously by running automated software that disguises itself under a common user agent.
1) Only if you've already identified the spammer through some other process (otherwise, you don't even know if they're using automated software).
2) It doesn't really show that the user is acting maliciously even if you can determine that they're using automated software. They might be using software written by someone else. Or they might have read the error message which says "please supply a user agent" and followed it by supplying a user agent. It might be malicious, or it might be an error in judgment.
Regardless, what are you going to do about it? Block the IP? For how long? Even if it's dynamic? Even if it's shared by many others?
Hi!
- Only if you've already identified the spammer through some other process
(otherwise, you don't even know if they're using automated software).
You probably don't get the scale of Wikipedia or the scale of the behavior we had to deal with, if you think that it isn't possible to notice its behavior patterns :-)
- It doesn't really show that the user is acting maliciously even if you can
Acts like a duck, quacks like a duck :)
Regardless, what are you going to do about it? Block the IP?
Perhaps.
For how long?
Depends. Probably indefinitely.
Even if it's dynamic?
Dunno, probably not.
Even if it's shared by many others?
Would avoid it.
Anyway, you probably are missing one important point. We're trying to make Wikipedia's service better.
It doesn't always have definite answers, and one has to balance and search for decent solutions. Probably everything looks easier from your armchair. I'd love to have that view! :)
Domas
Anthony, I'm only going to say this once: either shut the fuck up or put your money where your mouth is and develop and propose a valid replacement, and stop whining when others take actions that are deemed necessary for the betterment of Wikimedia-related projects. User-Agents are not a big deal. Domas is doing a good job at addressing problems; when you have a massive drain on the servers for no apparent reason and no method to even attempt to identify the source (except IPs), and a majority of these are probably wannabe live mirrors full of spam or broken automated algorithms, some action is better than nothing, especially because most of those idiots won't know how to fix their broken programs.
John
On Tue, Feb 16, 2010 at 9:00 PM, Anthony wikimail@inbox.org wrote:
Anyway, you probably are missing one important point. We're trying to make Wikipedia's service better.
I'm sure you are. But that doesn't mean I agree with your methods.
Probably everything looks easier from your armchair. I'd love to have that
view! :)
Then stop volunteering.
On Wed, Feb 17, 2010 at 1:00 PM, Anthony wikimail@inbox.org wrote:
On Wed, Feb 17, 2010 at 11:57 AM, Domas Mituzas midom.lists@gmail.com wrote:
Probably everything looks easier from your armchair. I'd love to have that view! :)
Then stop volunteering.
Did you miss the point?
The graphs provided in this thread clearly show that the solution had a positive & desired effect.
A few negative side-effects have been put forward, such as preventing browsing without a UA, but Domas has also indicated that other tech team members can overturn the change if they don't like it.
-- John Vandenberg
On Tue, Feb 16, 2010 at 9:47 PM, John Vandenberg jayvdb@gmail.com wrote:
On Wed, Feb 17, 2010 at 1:00 PM, Anthony wikimail@inbox.org wrote:
On Wed, Feb 17, 2010 at 11:57 AM, Domas Mituzas midom.lists@gmail.com
wrote:
Probably everything looks easier from your armchair. I'd love to have
that
view! :)
Then stop volunteering.
Did you miss the point?
I don't think so. I believe his point was to complain about the position that he is in. My response was that he was in that position by choice, and that if he'd love to be in my position, it's a really easy thing for him to accomplish.
The graphs provided in this thread clearly show that the solution had
a positive & desired effect.
It showed that there was quite a bit of bathwater thrown out. And at least one very large baby (Google translation), which was temporarily resurrected. We still don't know how many other, smaller, babies were thrown out, and likely never will.
In any case, I don't see how your comment follows from mine.
It showed that there was quite a bit of bathwater thrown out. And at least one very large baby (Google translation), which was temporarily resurrected. We still don't know how many other, smaller, babies were thrown out, and likely never will.
I'm pretty sure that at least 99.9% of the drop in those graphs was exactly the activity I was going after.
Domas
On Wed, Feb 17, 2010 at 6:54 AM, Domas Mituzas midom.lists@gmail.comwrote:
It showed that there was quite a bit of bathwater thrown out. And at
least
one very large baby (Google translation), which was temporarily resurrected. We still don't know how many other, smaller, babies were thrown out, and likely never will.
I'm pretty sure that at least 99.9% of the drop in those graphs was exactly the activity I was going after.
I know you are. That's why I started out by trying to figure out who your boss is.
On Wed, Feb 17, 2010 at 8:51 AM, Anthony wikimail@inbox.org wrote:
On Wed, Feb 17, 2010 at 6:54 AM, Domas Mituzas midom.lists@gmail.comwrote:
It showed that there was quite a bit of bathwater thrown out. And at
least
one very large baby (Google translation), which was temporarily resurrected. We still don't know how many other, smaller, babies were thrown out, and likely never will.
I'm pretty sure that at least 99.9% of the drop in those graphs was exactly the activity I was going after.
I know you are. That's why I started out by trying to figure out who your boss is.
However, with that said, I'm sure you have the best of intentions, Domas, and I assume this is an isolated misjudgment in a sea of positive and useful contributions. I'm sorry if I underestimated your valuable contributions to Wikimedia. The fact is, sitting where I'm sitting (which is due to my difficulty in getting along with others rather than my lack of interest in helping), I don't get to see all the work you do behind the scenes. So please don't take offense at my lack of praise for it. Sorry.
Anthony
Hi,
On Tue, Feb 16, 2010 at 8:31 PM, Domas Mituzas midom.lists@gmail.com wrote:
You can sure assume that we need to come up with something to "defend a new policy".
Yeah, ban no/broken-UA clients for these things that do cause CPU load, but leave article reading unharmed. Normal readers with Privoxy or other privacy filters (you know, people DO still use them, even if their percentage is small!) can at least READ, then.
Presumably some percentage of that 20-50% will come back as the spammers realize they have to supply the string. Presumably we then start playing whack-a-mole.
Yes, we will ban all IPs participating in this.
Good luck fighting a dynamic bot herder (though I do wonder, with the spam blacklist and the captchas for URLs, what the hell a botnet master can achieve by hitting Wikipedia?!).
Presumably there's a plan for what to do when the spammers begin supplying a new, random string every time.
Random strings are easy to identify, fixed strings are easy to verify.
The point is, what should bot writers do?
1) No UA at all - the typical newbie mistake, just supplying GET /w/index.php?action=edit, which works with his localhost wiki and every other wiki.
2) Default UA of the programming language (PHP's thingy, cURL, Python; some bots may even use wget and bash scripting, it's not THAT difficult to write a Wikibot in bash script!)
3) Own UA (stuff like "HDBot v1.1 (http://xyz.tld)", which I couldn't use a while ago)
4) Spoof a browser UA (bad, as the site can't differentiate between bot and browser)
To avoid the ban, only 3 and 4 are possible, as the default UAs are blocked in most cases. But as 3 doesn't really work, or at least is hard to troubleshoot, that leaves only 4, which you do not want.
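For what it's worth, option 3 is a one-line change in most HTTP libraries. A minimal sketch in Python (the bot name, URL and contact address are made-up placeholders; the point is just that the UA names your tool and gives a way to contact you):

    import urllib.request

    # A descriptive, per-bot User-Agent: tool name/version plus contact info.
    # "ExampleBot" and the addresses below are placeholders - substitute your own.
    USER_AGENT = "ExampleBot/0.1 (http://example.org/examplebot; examplebot@example.org)"

    req = urllib.request.Request(
        "http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json",
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.read(200))

The same idea applies to cURL (CURLOPT_USERAGENT), wget (--user-agent), and so on.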
Please write some doc that answers this once and for all.
Marco
PS: Oh, and please, please make the 403 msg something from which people can figure out what's wrong; it takes AGES if you are a newbie to scripting.
Domas wrote:
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Oh, my. And not just to be a bot, or to edit the site manually, but even to view it. You can't even fetch a single, simple page now without supplying that header.
If this has been discussed to death elsewhere and represents some bizarrely-informed consensus, I'll try to spare this list my belated rantings, but this is a terrible, terrible idea. Relying on User-Agent represents the very antithesis of [[Postel's Law]], a rock-solid principle on which the Internet (used to be) based.
Steve,
If this has been discussed to death elsewhere and represents some bizarrely-informed consensus, I'll try to spare this list my belated rantings, but this is a terrible, terrible idea. Relying on User-Agent represents the very antithesis of [[Postel's Law]], a rock-solid principle on which the Internet (used to be) based.
RFC2616: 14.43 User-Agent
The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests. The field can contain multiple product tokens (section 3.8) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.
User-Agent = "User-Agent" ":" 1*( product | comment )
Example:
User-Agent: CERN-LineMode/2.15 libwww/2.17b3
RFC2119: 3. SHOULD This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.
I guess you just found one more implication to carefully weigh before not specifying U-A.
Domas
Relying on User-Agent represents the very antithesis of [[Postel's Law]], a rock-solid principle on which the Internet (used to be) based.
RFC2616: 14.43 User-Agent The User-Agent request-header field... is for... automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations.
Yes, that's precisely the violation of Postel's Law I was thinking of.
Yes, that's precisely the violation of Postel's Law I was thinking of.
Steve, someone is sending us this User-Agent, is that you?:))
User-Agent: Mozilla 5.0 (Compatible; with Safari, with Opera, Chrome, Netscape, MSIE etc. You get the idea. It's compatible with everything!) Let me tell you a story. Once upon a time, there was a browser named SeaMonkey. It was of noble inheritance, as it was a direct descendant of the famous Netscape of old. Oh, it could have been so proud, this browser, it could have stood up, radiant and tall and strong. But no, that was not to be. Many websites closed their doors for this browser, saying, We don't know who you are, go away! And poor SeaMonkey would have no other choice than to go in disguise, to make the websites believe it was some other browser. And you who read this, you are one of those websites that are putting SeaMonkey to shame. Listen. All browsers support HTML. If you just send the same HTML, all browsers will accept it. Only in some borderline cases will the page you send to the client need to be tailored to a specific client. In the majority of cases, the standard HTML you send to, say, Opera, can be sent to every other browser as well; K-Meleon, Galeon, etc. There is no need to scan the user agent string for keywords like Firefox, Konqueror, Midori or whatever. Just send the standard HTML, OK? But even if you do scan the user agent string, even if you do insist on sending different stuff to different browsers, you should look for distinguishing signs of the rendering engines: Gecko, Webkit, KHTML, Trident, Presto, and so on, not for different browser names. SeaMonkey works the same way as Firefox, Netscape and Flock; they all have Gecko/yyyymmdd in the user agent string. Similarly, Google Chrome works the same way as Safari, Midori and SRWare Iron; they can be identified by the word AppleWebKit in the string. And so on. Distinguishing browsers by name is not only overkill, but it even can backfire in cases such as this, when a misrepresentation is made. Anyway, even more important than that, above all, what you never ever should to is send fatal errors! The real error, incidentally, is not in the content part of the page you send, but in the header: putting info in the header that identifies the page as XHTML, while sending the content of the page as HTML, which will trigger the fatal error message. Errors like that will make you look silly, or worse, they make you look like you're doing it on purpose; sending bad stuff to some browsers, while sending perfectly OK looking pages to others. You're not REALLY doing it on purpose, are you? Trying to make people think that their browsers aren't good enough, that for instance MS Internet Explorer is a better browser than the ones they're using now, because MSIE can display the site while their own browsers can't? No, let's just give you the benefit of the doubt; you're not doing it on purpose. Blame your content management system if you want. Still, it's nothing I can help from here; you will have to make the change to your site to make sure to not saddle some browsers like SeaMonkey or Kazehakaze with your errors. So please, would you consider having a look? Thanks in advance!
Cheers, Domas
Yes, that's precisely the violation of Postel's Law I was thinking of.
Steve, someone is sending us this User-Agent, is that you?:))
No. :-|
Let me tell you a story. Once upon a time, there was a browser named SeaMonkey...
I have no idea what point you were trying to make there (I had considerable difficulty reading it at all, crammed onto one line as it was), but never mind.
On 02/15/2010 05:54 PM, Domas Mituzas wrote:
Hi!
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Two questions:
Was there some urgent production impact that required doing this with no notice?
Was any impact analysis done on this? Given Wikipedia's mission, we can't be as casual about rejecting traffic as a commercial site would be. If a commercial site accidentally gets rid of some third-world traffic running behind a shoddy ISP, it's no loss; nobody wants to advertise to them anyhow. But for us, those are the people who gain the most from being able to reach us.
William
Hi!
Was there some urgent production impact that required doing this with no notice?
Actually we had a User-Agent header requirement for ages; it just failed to do what it had to do for a while. Consider this to be a bugfix.
Was any impact analysis done on this?
Yup!
Given Wikipedia's mission, we can't be as casual about rejecting traffic as a commercial site would be. If a commercial site accidentally gets rid of some third-world traffic running behind a shoddy ISP, it's no loss; nobody wants to advertise to them anyhow. But for us, those are the people who gain the most from being able to reach us.
Actually, at the moment this mostly affects crap sites that hot-load data from us to display spamvertisements on hacked sites on the internet. I don't know where your 'shoddy ISP' speculation fits in.
Domas
On 02/15/2010 07:25 PM, Domas Mituzas wrote:
Was there some urgent production impact that required doing this with no notice?
Actually we had a User-Agent header requirement for ages; it just failed to do what it had to do for a while. Consider this to be a bugfix.
Ok. I'm going to take that as "no". In the future, I think it would be better to let people know in advance about non-urgent changes that may break things for them.
Was any impact analysis done on this?
Yup!
Would you care to share the results with us?
In the future, I'd suggest giving basic info like that as part of an announcement.
Actually, at the moment this mostly affects crap sites that hot-load data from us to display spamvertisements on hacked sites on the internet.
That's another good thing to share as part of a change announcement: motivation for the change.
I don't know where your 'shoddy ISP' speculation fits in.
Last I looked, there were a lot of poorly maintained proxies out there, some of which mangle headers. It seemed reasonable to me that some of those are on low-rent ISPs in poor countries. If you have already done the work to prove that no legitimate users anywhere in the world are impacted by this change, then perhaps you could save us further discussion and just explain that?
Thanks,
William
"William Pietri" william@scissor.com wrote in message news:4B7A141E.9000808@scissor.com...
On 02/15/2010 07:25 PM, Domas Mituzas wrote:
Was there some urgent production impact that required doing this with no notice?
Ok. I'm going to take that as "no".
As best I understand the discussion in #wikimedia-tech last night, ~20% of search server load was being taken by aforementioned spamvertisers. That sounds like an "urgent production impact" to me.
--HM
As best I understand the discussion in #wikimedia-tech last night, ~20% of search server load was being taken by aforementioned spamvertisers. That sounds like an "urgent production impact" to me.
50% of the load, which at that time was using ~20% of search server CPU. It also cut our API node traffic in half (some CPU too), and got our average response times for the API way nicer: http://www.nedworks.org/~mark/reqstats/svctimestats-daily.png ;-) Also it removed some pressure on the API squids, which were misbehaving yesterday and caused an API outage.
Also, currently 20% of cluster CPU is being spent on generating atom feeds for people who never really subscribed to them. We don't know why, yet, though :)
Domas
Hello, on Tuesday, 16 February 2010, 04:15:57, William Pietri wrote:
some third-world traffic
Why should browsers in the third world not send user-agents like our browsers (I doubt that they use different ones than we do)? The change by Domas just blocks two kinds of requests: 1) broken bots and crawlers, and 2) paranoid users who removed the user-agent in their browsers. The >99% of normal users (with a normal browser) will not notice a difference.
Sincerely, DaB.
On Mon, Feb 15, 2010 at 8:54 PM, Domas Mituzas midom.lists@gmail.comwrote:
Hi!
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Domas
Hi,
Whose decision was this? Were Erik, Sue, or Danese involved?
On Tue, Feb 16, 2010 at 10:31 AM, Domas Mituzas midom.lists@gmail.comwrote:
Hi!
Whose decision was this?
Mine.
Were Erik, Sue, or Danese involved?
No.
Cool. Who's your boss, and who's your boss's boss? Sorry, I couldn't find you in the org chart or I'd just have looked that up myself.
On Tue, Feb 16, 2010 at 10:39 AM, Domas Mituzas midom.lists@gmail.comwrote:
Cool. Who's your boss, and who's your boss's boss? Sorry, I couldn't
find
you in the org chart or I'd just have looked that up myself.
Nobody?
Really? Were you doing this work as a contractor, or as a volunteer? Someone's gotta be in charge of the contractors and/or the volunteers, no?
Been like that for ages, hasn't it?
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
On Tue, Feb 16, 2010 at 4:44 PM, Anthony wikimail@inbox.org wrote:
On Tue, Feb 16, 2010 at 10:39 AM, Domas Mituzas midom.lists@gmail.comwrote:
Been like that for ages, hasn't it?
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
Correct me if I'm wrong, but AFAIR Domas has been the MySQL admin guy since pretty much the beginning, and I think the fact that he's a sysop won't change anyway, no matter what happens. For the record, the step of banning everything without a UA sucks totally. Sure, the API and other abuse-prone stuff can be blocked, but ordinary article reading should ALWAYS be possible, no matter what fucked-up UA you use.
Marco
Hi!
Really? Were you doing this work as a contractor, or as a volunteer?
Volunteer.
Someone's gotta be in charge of the contractors and/or the volunteers, no?
Dunno, Cary maybe? :) On the other hand, even if they are in charge, it doesn't mean that they are my bosses :-)
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
Kind of. Isn't that a good enough motivation? :-)
Though of course, I tend to consult with tech team members, and they're free to overturn anything I change, especially if they come up with better solutions (and they usually do!). And indeed, I guess WMF owns the ultimate power of terminating my access :)
Cheers, Domas
On Tue, Feb 16, 2010 at 11:04 AM, Domas Mituzas midom.lists@gmail.comwrote:
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
Kind of. Isn't that a good enough motivation? :-)
Though of course, I tend to consult with tech team members, and they're free to overturn anything I change, especially if they come up with better solutions (and they usually do!). And indeed, I guess WMF owns the ultimate power of terminating my access :)
In all honesty, I find that fascinating. If someone manages to write a book about how that system works, I'll probably buy it.
On the other hand, I guess it's off topic. So enough about that.
On 16 February 2010 16:13, Anthony wikimail@inbox.org wrote:
On Tue, Feb 16, 2010 at 11:04 AM, Domas Mituzas midom.lists@gmail.comwrote:
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
Kind of. Isn't that a good enough motivation? :-) Though of course, I tend to consult with tech team members, and they're free to overturn anything I change, especially if they come up with better solutions (and they usually do!). And indeed, I guess WMF owns the ultimate power of terminating my access :)
In all honesty, I find that fascinating. If someone manages to write a book about how that system works, I'll probably buy it.
That should be the next book about Wiki[mp]edia. I had *never* thought that volunteer sysadmins would be feasible until I saw it happening here.
Wikimedia: Things Are Different Here(tm).
- d.
In fact some WMF paid employees (including me) were in the channel at that time and agreed with the decision. It seemed then and still seems to me a reasonable course of action given the circumstances. I understand it's aggravating to people who didn't get notice; let's look forward. Please just add the UA header and your tools / bots / etc. will be back to working. Thanks.
Ariel Glenn ariel@wikimedia.org
On Tue, 16-02-2010, at 16:21 +0000, David Gerard wrote:
On 16 February 2010 16:13, Anthony wikimail@inbox.org wrote:
On Tue, Feb 16, 2010 at 11:04 AM, Domas Mituzas midom.lists@gmail.comwrote:
No idea. For ages you've been able to just go onto the Wikimedia servers and change whatever you feel like, and answer to nobody? You must be misunderstanding my question or something.
Kind of. Isn't that a good enough motivation? :-) Though of course, I tend to consult with tech team members, and they're free to overturn anything I change, especially if they come up with better solutions (and they usually do!). And indeed, I guess WMF owns the ultimate power of terminating my access :)
In all honesty, I find that fascinating. If someone manages to write a book about how that system works, I'll probably buy it.
That should be the next book about Wiki[mp]edia. I had *never* thought that volunteer sysadmins would be feasible until I saw it happening here.
Wikimedia: Things Are Different Here(tm).
- d.
Ariel Glenn wrote:
I understand it's aggravating to people who didn't get notice; let's look forward. Please just add the UA header and your tools / bots / etc. will be back to working. Thanks.
Well, sorry, no, it's not quite like that. A few of us -- though I fear an inconsequential minority -- are concerned that this is a destabilizing change, being made in a hurry, by a top-10 website, with consequences that aren't easy to predict and (apparently) haven't even been thought about. The more of us who "just" go along with it, the less the consequences will be thought and talked about, and the more they'll be further hidden, and made inevitable.
If you're going to do such blocking, can we PLEASE finally find a way to set up a more informative error message for blocked user agents?
I long ago lost track of how many people come to WP:VPT and other places complaining that they are trying to write a bot / script / etc., and it isn't working because they are using a blocked user agent (such as the default Python agent) and they don't understand what is wrong.
The current English error message text that I see from Python reads:
"Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes.
You may be able to get further information in the #wikipedia channel on the Freenode IRC network.
The Wikimedia Foundation is a non-profit organisation which hosts some of the most popular sites on the Internet, including Wikipedia. It has a constant need to purchase new hardware. If you would like to help, please donate.
If you report this error to the Wikimedia System Administrators, please include the details below.
Request: GET http://en.wikipedia.org/wiki/Cat, from 99.60.6.239 via sq77.wikimedia.org (squid/2.7.STABLE7) to () Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 16 Feb 2010 22:25:20 GMT"
Everything except the very last line of that is either irrelevant or wrong. And ERR_ACCESS_DENIED, though vaguely informative, provides no detail about what happened or how to do things properly.
This is bad enough for bot operators, who are likely to be fairly intelligent people, but if we are going to give this to everyone with a missing user agent string too (which includes people behind poorly behaved proxies and people who use certain anonymizing software out of an intense desire for "privacy"), then this kind of response really starts to send the wrong message.
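In the meantime, about the best a script author can do is surface whatever the server actually returns, rather than a bare traceback. A rough sketch in Python (urllib; the URL is only an example):

    import urllib.request
    import urllib.error

    url = "http://en.wikipedia.org/wiki/Cat"  # example URL only

    # No explicit User-Agent set here, so urllib sends its default
    # ("Python-urllib/x.y"), which is one of the blocked agents mentioned above.
    req = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
    except urllib.error.HTTPError as e:
        # Print the status and whatever body the server sent, so the
        # operator at least sees the server's own explanation.
        print("HTTP error %d: %s" % (e.code, e.reason))
        print(e.read().decode("utf-8", "replace"))

That at least gives people something concrete to search for, instead of the boilerplate above.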
-Robert Rohde
Robert Rohde wrote:
If you're going to do such blocking, can we PLEASE finally find a way to set up a more informative error message for blocked user agents...
When the new code blocks requests with missing User Agent strings (which is, oddly, not all of the time), it is with a 403 Forbidden response and the very simple message
Please provide a User-Agent header
(No <html> tags, no nothing.)
What made you guys think it was a good idea to block these or the Apple RSS reader without prior notice on the mailing lists (perhaps 24 hours' worth)? I'm aware of server bots that have broken because of this...
-Peachey
Robert,
The current English error message text that I see from Python reads:
Our error message system became overkill, with all those nice designs and multiple languages. In certain cases, serving that message to certain requests caused gigabits of bandwidth. It is also not practical to update it with policies, because, um, it has all those nice designs and multiple languages.
We may actually end up having better error messages at some point in the future.
Everything except the very last line of that is either irrelevant or wrong. And ERR_ACCESS_DENIED, though vaguely informative, provides no detail about what happened or how to do things properly.
I agree.
This is bad enough for bot operators, who are likely to be fairly intelligent people, but if we are going to give this to everyone with a missing user agent string too (which includes people behind poorly behaved proxies and people who use certain anonymizing software out of an intense desire for "privacy"), then this kind of response really starts to send the wrong message.
We're not sending this response to missing UAs, as this response is being sent by Squid ACLs, and the UA check is done at the MW side.
Domas
Combined replies to various posts below.
Steve Summit wrote:
A few of us -- though I fear an inconsequential minority -- are concerned that this is a destabilizing change, being made in a hurry, by a top-10 website, with consequences that aren't easy to predict and (apparently) haven't even been thought about.
Not entirely an inconsequential minority. Google complained by email that it broke a Google Translate feature; they got an IP-based exemption while they develop and deploy a fix.
In another post:
Domas wrote:
Hi Steve,
But why?
Because we need to identify malicious behavior.
You're trying to detect / guard against malicious behavior using *User-Agent*?? Good grief. Have fun with the whack-a-mole game, then.
Well yeah. We've had malicious traffic in the past that hasn't been easily filterable by request headers. The response was to create a list of the IP addresses causing the most traffic and to block them at Squid. Squid is reasonably well-optimised for this: it stores blocked IPs and ranges in a tree, giving you lookup in O(log N) time in the number of blocked IPs.
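For illustration only (this is not Squid's implementation), the same O(log N) idea can be sketched in Python as a binary search over a sorted list of non-overlapping ranges:

    import bisect
    import ipaddress

    # Example blocked ranges - placeholders only, assumed non-overlapping.
    BLOCKED = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]

    # Precompute (start, end) integer pairs, sorted by range start.
    _ranges = sorted(
        (int(n.network_address), int(n.broadcast_address))
        for n in map(ipaddress.ip_network, BLOCKED)
    )
    _starts = [start for start, _ in _ranges]

    def is_blocked(ip):
        """True if ip falls inside any blocked range; one bisect per lookup."""
        addr = int(ipaddress.ip_address(ip))
        i = bisect.bisect_right(_starts, addr) - 1
        return i >= 0 and addr <= _ranges[i][1]

    print(is_blocked("198.51.100.42"))  # True
    print(is_blocked("8.8.8.8"))        # False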
That would have been more work, and I appreciate that the sysadmin team is small and needs to allocate their time carefully. It's not my job to tell them how to do that and I wasn't offering to help.
But note that the action taken wasn't to block all list=search API queries that have a blank user agent header. The overly broad response should give you a hint that there was another motive at work.
I think they want to make their work easier in the future. Although it doesn't help much with malicious traffic, requiring a User-Agent header does help to distinguish different sources of non-malicious but excessively expensive traffic.
In another post:
When the new code blocks requests with missing User Agent strings (which is, oddly, not all of the time), it is with a 403 Forbidden response and the very simple message
Please provide a User-Agent header
(No <html> tags, no nothing.)
Be glad it doesn't just say "sigh".
if( $wgDBname == 'kuwiki' && preg_match( '/[[Image:Flag_of_Turkey.svg]]/', @$_REQUEST['wpTextbox1'] ) ) {
    die("Sigh.\n");
}
Seriously.
http://ku.wikipedia.org/w/index.php?wpTextbox1=%5B%5BImage:Flag_of_Turkey.sv...]]
Domas Mituzas wrote:
Actually we had a User-Agent header requirement for ages; it just failed to do what it had to do for a while. Consider this to be a bugfix.
For the record, I didn't like the idea the first time around either.
-- Tim Starling
On Tue, Feb 16, 2010 at 11:32 AM, Ariel T. Glenn ariel@wikimedia.orgwrote:
In fact some WMF paid employees (including me) were in the channel at that time and agreed with the decision. It seemed then and still seems to me a reasonable course of action given the circumstances. I understand it's aggravating to people who didn't get notice; let's look forward. Please just add the UA header and your tools / bots / etc. will be back to working. Thanks.
It's not a big deal for anyone aware of the problem. Adding "User-Agent: Janna" to my scripts is no big deal, nor is adding a randomized UA from a list of common UAs in privoxy (I'm pretty sure there's a plugin for that).
I do wonder how many people are going to wind up getting strange errors that they don't know how to fix due to this, though. Is it at all feasible to throttle such traffic rather than blocking it completely?
On 16 February 2010 02:54, Domas Mituzas midom.lists@gmail.com wrote:
Hi!
from now on specific per-bot/per-software/per-client User-Agent header is mandatory for contacting Wikimedia sites.
Domas
Looks OK to me. But this is the type of decision that often breaks existing stuff somewhere on the internet. With user-agent, I can imagine some overzealous firewall or anonymization service removing it. But you can always use the X's strategy: break something and wait 6 years; if no one is angry, it seems no one was using the feature.
tei@localhost:~$ telnet en.wikipedia.org 80
Trying 91.198.174.2...
Connected to rr.esams.wikimedia.org.
Escape character is '^]'.
GET /
Please provide a User-Agent header
Connection closed by foreign host.
I have put some basic info about requiring the User-Agent header at https://wiki.toolserver.org/view/User-Agent_policy. This way, there's a place where we can point people for more info. Please add any useful info to that page.
-- daniel
Oops, put the page on the wrong wiki :) Here's the correct URL: http://meta.wikimedia.org/wiki/User-Agent_policy.
Daniel Kinzler wrote:
I have put some basic info about requiring the User-Agent header at https://wiki.toolserver.org/view/User-Agent_policy. This way, there's a place where we can point people for more info. Please add any useful info to that page.
-- daniel
daniel wrote:
I have put some basic info about requiring the User-Agent header at... This way, there's a place where we can point people for more info.
Thanks, but FWIW, the very first sentence:
Wikimedia sites require a HTTP User-Agent header for all requests.
is false. (As near as I can tell, the header is required only for those requests that include an "action=" modifier.)