Through a message on another list, I found that when one tries to reach wikipedia (or at least wikipedia-en) specifying the User Agent as "Python-urllib/1.17", the server gives a "403 Forbidden" response, together with the content of the page.
Two questions:

1. Why is this User Agent getting this response? If I remember correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
2. If this User Agent is really to be blocked, why do we still provide the content of the page that is forbidden?
Andre Engels schrieb:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
The default UA strings of many popular libraries (python, perl, java, php...) are blocked from accessing wikipedia.
The idea is to force people to provide a descriptive UA string for their particular tool, so it can be blocked selectively when it breaks. Ideally, the UA string should give some way of contacting the operator, or at least the author.
Good netizenship dictates: don't use default UA strings, use something unique and descriptive. Always, not only when accessing wikipedia.
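The advice above can be sketched with Python's urllib; the tool name and contact address below are placeholders, not anything real:

```python
import urllib.request

# A descriptive User-Agent names the tool, its version, and a way to
# contact the operator. The name and address here are placeholders.
UA = "MyWikiTool/1.0 (operator@example.org)"

req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Foo",
    headers={"User-Agent": UA},
)
# The request would then be sent with:
# with urllib.request.urlopen(req) as resp:
#     html = resp.read()
```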
As to why the content is served anyway: I don't know. It may even be a bug, or it's intentional. Would be interesting to hear about this.
-- daniel
On 1/23/09 2:36 AM, Andre Engels wrote:
Two questions:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
This has nothing to do with pywikipediabot.
We too frequently encountered poorly-written bots and site-scrapers which slammed the servers too hard and caused problems. Blocking default UAs of common libraries cut these incidents down dramatically, and helps encourage thoughtful bot writers to put specific information into their user-agent string, making it possible to track them down more easily if they are problematic.
- If this User Agent is really to be blocked, why do we still provide
the content of the page that is forbidden?
We don't; you get a big fat Wikimedia-customized error page with a generic multilingual message, and this bit somewhere in the middle:
<!-- Technical details of the error; shows all the time, with any language -->
<div class="TechnicalStuff">
<bdo dir="ltr">
Request: GET http://en.wikipedia.org/wiki/Foo, from 69.17.48.227 via sq24.wikimedia.org (squid/2.6.STABLE21) to ()<br/>
Error: ERR_ACCESS_DENIED, errno [No Error] at Fri, 23 Jan 2009 17:59:46 GMT
</bdo>
<div id="AdditionalTechnicalStuff"></div>
</div>
-- brion
On Fri, Jan 23, 2009 at 7:03 PM, Brion Vibber brion@wikimedia.org wrote:
On 1/23/09 2:36 AM, Andre Engels wrote:
Two questions:
- Why is this User Agent getting this response? If I remember
correctly, this was installed in the early days of the pywikipediabot, when Brion wanted to block it because it had a programming error causing it to fetch each page twice (sometimes even more?). If that is the actual reason, I see no reason why it should still be active years afterward...
This has nothing to do with pywikipediabot.
We too frequently encountered poorly-written bots and site-scrapers which slammed the servers too hard and caused problems. Blocking default UAs of common libraries cut these incidents down dramatically, and helps encourage thoughtful bot writers to put specific information into their user-agent string, making it possible to track them down more easily if they are problematic.
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Perhaps they were blocking *your* bot? Faking your user agent to match a browser makes sysadmins assume bad faith...
On Sat, Jan 24, 2009 at 3:48 PM, Platonides Platonides@gmail.com wrote:
Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Marco
Perhaps they were blocking *your* bot? Faking your user agent to match a browser makes sysadmins assume bad faith...
No, as the bot was not active before (and I'm pretty sure the UA wasn't either).
Marco
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Just change it to something like "YourBotName, run by Marco Schuster your@e-mail.address". That will certainly avoid any filters, and provide the desired info.
I don't know why the error page doesn't give this info already. The current message only confuses people and -- if they can figure out it's UA-based -- tempts them to mimic browser UA strings. That stands a good chance of getting your IP address blocked if it's noticed (and it's pretty easy to tell when a script is pretending to be a browser, if you look at the whole HTTP request).
The error message is in SVN, but it's the same message provided for all errors. I don't know what sort of config would need to be done to get a custom message for this error.
On Sun, Jan 25, 2009 at 1:11 AM, Aryeh Gregor wrote:
On Sat, Jan 24, 2009 at 4:05 AM, Marco Schuster wrote:
Is there any list of those UAs or UA parts available? I had this problem some time ago with my bot, which used a custom UA string and got access denied, so I changed its UA to Firefox's, as I didn't have the patience to track down WHICH part of the UA triggered the filter.
Just change it to something like "YourBotName, run by Marco Schuster ". That will certainly avoid any filters, and provide the desired info.
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
I don't know why the error page doesn't give this info already. The current message only confuses people and -- if they can figure out it's UA-based -- tempts them to mimic browser UA strings.
Anyone skilled enough to write a bot is skilled enough to find that out, IMO. Anyway, the error message should also say which part of the UA is forbidden.
Marco
Simetrical wrote:
Just change it to something like "YourBotName, run by Marco Schuster your@e-mail.address". That will certainly avoid any filters, and provide the desired info.
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
Marco Schuster wrote:
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
Perhaps the mention of "php", although I'm not being blocked when using that UA, so I can't test.
On Sun, Jan 25, 2009 at 8:50 AM, Platonides Platonides@gmail.com wrote:
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
A From: header? In HTTP? What standard specifies that header's existence and semantics? It's not at [[List of HTTP headers]].
Aryeh Gregor wrote:
On Sun, Jan 25, 2009 at 8:50 AM, Platonides Platonides@gmail.com wrote:
The email should be in a From: header. Although I don't know if it's logged or not. In general, anyone responsible enough to set a From: header (with their valid email) shouldn't get automatically blocked.
A From: header? In HTTP? What standard specifies that header's existence and semantics? It's not at [[List of HTTP headers]].
I also thought it was a mistake when I first saw it in the HTTP article on wikipedia.
RFC 2616 (HTTP/1.1) section 14.22
The From request-header field, if given, SHOULD contain an Internet e-mail address for the human user who controls the requesting user agent. The address SHOULD be machine-usable, as defined by "mailbox" in RFC 822 [9] as updated by RFC 1123 [8]:
From = "From" ":" mailbox
An example is:
From: webmaster@w3.org
This header field MAY be used for logging purposes and as a means for identifying the source of invalid or unwanted requests. It SHOULD NOT be used as an insecure form of access protection. The interpretation of this field is that the request is being performed on behalf of the person given, who accepts responsibility for the method performed. In particular, robot agents SHOULD include this header so that the person responsible for running the robot can be contacted if problems occur on the receiving end.
The Internet e-mail address in this field MAY be separate from the Internet host which issued the request. For example, when a request is passed through a proxy the original issuer's address SHOULD be used.
The client SHOULD NOT send the From header field without the user's approval, as it might conflict with the user's privacy interests or their site's security policy. It is strongly recommended that the user be able to disable, enable, and modify the value of this field at any time prior to a request.
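For illustration, the From header can be set with Python's urllib like any other request header; the address below is a placeholder:

```python
import urllib.request

# Per RFC 2616 section 14.22, a robot SHOULD identify its operator via
# the From header. The address and tool name here are placeholders.
req = urllib.request.Request("https://en.wikipedia.org/wiki/Foo")
req.add_header("User-Agent", "MyWikiTool/1.0")
req.add_header("From", "operator@example.org")
```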
On Sun, Jan 25, 2009 at 4:42 PM, Platonides Platonides@gmail.com wrote:
I also thought it was a mistake when I first saw it in the HTTP article on wikipedia.
RFC 2616 (HTTP/1.1) section 14.22
The From request-header field, if given, SHOULD contain an Internet e-mail address for the human user who controls the requesting user agent. The address SHOULD be machine-usable, as defined by "mailbox" in RFC 822 [9] as updated by RFC 1123 [8]: ...
Well, since I doubt most people have ever heard of that, it's probably not logged.
On Sun, Jan 25, 2009 at 2:50 PM, Platonides wrote:
Marco Schuster wrote:
I used "HDBot API x.y (PHP $phpversion)" as UA. No idea what triggered the filters.
Perhaps the mention to "php", although I'm not being blocked when using that UA, so can't test.
Yeah, I'm also not blocked anymore... good to hear. But again, it'd be nice if the error message said which part of the UA triggered the filter and why that part is blocked. Brion, do you have a list of blocked UA (parts)?
Marco
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Brion, do you have a list of blocked UA (parts)?
Squid configuration files are available at http://noc.wikimedia.org/conf. It should be in there.
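For reference, Squid can match on the User-Agent header with a `browser` ACL. A hypothetical fragment (the ACL name and patterns here are illustrative, not Wikimedia's actual config) might look like:

```
# Match common library default UA strings (case-insensitive regexes);
# the ACL name and patterns are made up for illustration.
acl default_lib_ua browser -i ^Python-urllib
acl default_lib_ua browser -i ^libwww-perl
acl default_lib_ua browser -i ^Java/
http_access deny default_lib_ua
```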
Andrew Garrett wrote:
On Sun, Jan 25, 2009 at 8:29 AM, Marco Schuster marco@harddisk.is-a-geek.org wrote:
Brion, do you have a list of blocked UA (parts)?
Squid configuration files are available at http://noc.wikimedia.org/conf. It should be in there.
Which of them are for the squids? I think the server config there is just for the apaches.