Through a message on another list, I found that when one tries to reach wikipedia (or at least wikipedia-en) specifying the User Agent as "Python-urllib/1.17", the server gives a "403 Forbidden" response, together with the content of the page.
Two questions:

1. Why is this User Agent getting this response? If I remember correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
2. If this User Agent is really to be blocked, why do we still provide the content of the page that is forbidden?
Andre Engels schrieb:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
The default UA strings of many popular libraries (python, perl, java, php...) are blocked from accessing wikipedia.
The idea is to force people to provide a descriptive UA string for their particular tool, so it can be blocked selectively when it breaks. Ideally, the UA string should give some way of contacting the operator, or at least the author.
Good netizenship dictates: don't use default UA strings, use something unique and descriptive. Always, not only when accessing wikipedia.
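The advice above can be sketched with Python's urllib; the tool name and contact address below are placeholders, not anything real:

```python
import urllib.request

# A descriptive User-Agent names the tool, its version, and a way to
# contact the operator. The name and address here are placeholders.
UA = "MyWikiTool/1.0 (operator@example.org)"

req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Foo",
    headers={"User-Agent": UA},
)
# The request would then be sent with:
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```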
As to why the content is served anyway: I don't know. It may even be a bug, or it's intentional. Would be interesting to hear about this.
-- daniel
On 1/23/09 2:36 AM, Andre Engels wrote:
Two questions:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
This has nothing to do with pywikipediabot.
We too frequently encountered poorly-written bots and site-scrapers which slammed the servers too hard and caused problems. Blocking default UAs of common libraries cut these incidents down dramatically, and helps encourage thoughtful bot writers to put specific information into their user-agent string, making it possible to track them down more easily if they are problematic.
- If this User Agent is really to be blocked, why do we still provide
the content of the page that is forbidden?
We don't; you get a big fat Wikimedia-customized error page with a generic multilingual message, and this bit somewhere in the middle:
<!-- Technical details of the error; shows all the time, with any language -->
<div class="TechnicalStuff">
<bdo dir="ltr">
Request: GET http://en.wikipedia.org/wiki/Foo, from 69.17.48.227 via sq24.wikimedia.org (squid/2.6.STABLE21) to ()<br/>
Error: ERR_ACCESS_DENIED, errno [No Error] at Fri, 23 Jan 2009 17:59:46 GMT
</bdo>
<div id="AdditionalTechnicalStuff"></div>
</div>
-- brion
On Fri, Jan 23, 2009 at 7:03 PM, Brion Vibber brion@wikimedia.org wrote:
On 1/23/09 2:36 AM, Andre Engels wrote:
Two questions:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
This has nothing to do with pywikipediabot.
We too frequently encountered poorly-written bots and site-scrapers which slammed the servers too hard and caused problems. Blocking default UAs of common libraries cut these incidents down dramatically, and helps encourage thoughtful bot writers to put specific information into their user-agent string, making it possible to track them down more easily if they are problematic.
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Perhaps they were blocking *your* bot? Faking your user agent to match a browser makes sysadmins assume bad faith...
On Sat, Jan 24, 2009 at 3:48 PM, Platonides Platonides@gmail.com wrote:
Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Perhaps they were blocking *your* bot? Faking your user agent to match a browser makes sysadmins assume bad faith...
No, as the bot was not active before (and I'm pretty sure the UA wasn't either).
Marco
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Just change it to something like "YourBotName, run by Marco Schuster your@e-mail.address". That will certainly avoid any filters, and provide the desired info.
I don't know why the error page doesn't give this info already. The current message only confuses people and -- if they can figure out it's UA-based -- tempts them to mimic browser UA strings. That stands a good chance of getting your IP address blocked if it's noticed (and it's pretty easy to tell when a script is pretending to be a browser, if you look at the whole HTTP request).
The error message is in SVN, but it's the same message provided for all errors. I don't know what sort of config would need to be done to get a custom message for this error.
On Sun, Jan 25, 2009 at 1:11 AM, Aryeh Gregor wrote:
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Just change it to something like "YourBotName, run by Marco Schuster ". That will certainly avoid any filters, and provide the desired info.
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
I don't know why the error page doesn't give this info already. The current message only confuses people and -- if they can figure out it's UA-based -- tempts them to mimic browser UA strings.
Anyone skilled enough to write a bot is skilled enough to find that out, IMO. Anyway, the error message should also say which part of the UA is forbidden.
Marco
Simetrical wrote:
Just change it to something like "YourBotName, run by Marco Schuster your@e-mail.address". That will certainly avoid any filters, and provide the desired info.
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
Marco Schuster wrote:
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
Perhaps the mention of "php", although I'm not being blocked when using that UA, so I can't test.
On Sun, Jan 25, 2009 at 8:50 AM, Platonides Platonides@gmail.com wrote:
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
A From: header? In HTTP? What standard specifies that header's existence and semantics? It's not at [[List of HTTP headers]].
Aryeh Gregor wrote:
On Sun, Jan 25, 2009 at 8:50 AM, Platonides Platonides@gmail.com wrote:
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
A From: header? In HTTP? What standard specifies that header's existence and semantics? It's not at [[List of HTTP headers]].
I also thought it was a mistake when I first saw it in the HTTP article on wikipedia.
RFC 2616 (HTTP/1.1) section 14.22
The From request-header field, if given, SHOULD contain an Internet e-mail address for the human user who controls the requesting user agent. The address SHOULD be machine-usable, as defined by "mailbox" in RFC 822 [9] as updated by RFC 1123 [8]:
From = "From" ":" mailbox
An example is:
From: webmaster@w3.org
This header field MAY be used for logging purposes and as a means for identifying the source of invalid or unwanted requests. It SHOULD NOT be used as an insecure form of access protection. The interpretation of this field is that the request is being performed on behalf of the person given, who accepts responsibility for the method performed. In particular, robot agents SHOULD include this header so that the person responsible for running the robot can be contacted if problems occur on the receiving end.
The Internet e-mail address in this field MAY be separate from the Internet host which issued the request. For example, when a request is passed through a proxy the original issuer's address SHOULD be used.
The client SHOULD NOT send the From header field without the user's approval, as it might conflict with the user's privacy interests or their site's security policy. It is strongly recommended that the user be able to disable, enable, and modify the value of this field at any time prior to a request.
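For illustration, the From header can be set with Python's urllib like any other request header; the address below is a placeholder:

```python
import urllib.request

# Per RFC 2616 section 14.22, a robot SHOULD identify its operator via
# the From header. The address and tool name here are placeholders.
req = urllib.request.Request("https://en.wikipedia.org/wiki/Foo")
req.add_header("User-Agent", "MyWikiTool/1.0")
req.add_header("From", "operator@example.org")
```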
On Sun, Jan 25, 2009 at 4:42 PM, Platonides Platonides@gmail.com wrote:
I also thought it was a mistake when I first saw it in the HTTP article on wikipedia.
RFC 2616 (HTTP/1.1) section 14.22
The From request-header field, if given, SHOULD contain an Internet e-mail address for the human user who controls the requesting user agent. The address SHOULD be machine-usable, as defined by "mailbox" in RFC 822 [9] as updated by RFC 1123 [8]: ...
Well, since I doubt most people have ever heard of that, it's probably not logged.
On Sun, Jan 25, 2009 at 2:50 PM, Platonides wrote:
Marco Schuster wrote:
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
Perhaps the mention to "php", although I'm not being blocked when using that UA, so can't test.
Yeah, I'm also not blocked anymore... good to hear. But again, it'd be nice if the error message said which part of the UA triggered the filter and why that part is blocked. Brion, do you have a list of blocked UA (parts)?
Marco
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Brion, do you have a list of blocked UA (parts)?
Squid configuration files are available at http://noc.wikimedia.org/conf. It should be in there.
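For reference, Squid can match on the User-Agent header with a `browser` ACL. A hypothetical fragment (the ACL name and patterns here are illustrative, not Wikimedia's actual config) might look like:

```
# Match common library default UA strings (case-insensitive regexes);
# the ACL name and patterns are made up for illustration.
acl default_lib_ua browser -i ^Python-urllib
acl default_lib_ua browser -i ^libwww-perl
acl default_lib_ua browser -i ^Java/
http_access deny default_lib_ua
```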
Andrew Garrett wrote:
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Brion, do you have a list of blocked UA (parts)?
Squid configuration files are available at http://noc.wikimedia.org/conf. It should be in there.
Which of them are for the squids? I think the server config there is just for the apaches.