[Labs-l] Someone using a Python framework is relentlessly hammering Harvard sites, resulting in an IP range ban.

Martin Urbanec martin.urbanec at wikimedia.cz
Sun Dec 4 17:29:48 UTC 2016


Hi all,
I was running weblinkchecker.py for the whole cswiki (the job was submitted
to the grid on Sun, 20 Nov 2016 16:54:24 GMT) because I wanted a list of
dead links. This matches the reported UA (the script is named
weblinkchecker.py). I trusted that the script wouldn't do anything wrong,
because it was, and still is, part of the standard core package. I am using
the 3.0-dev version of pywikibot with Python 2.7.6.

However, that job has already completed, so if those GET requests haven't
stopped, I'm not the cause. (Unless I have somehow lost access to the job:
qstat for all my tools (urbanecmbot, missingpages) and my personal account
(urbanecm) is empty or shows only the webserver.)

If I was the cause, I'm very sorry. As I said, I didn't know the script
doesn't throttle its GET requests enough.
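
For reference, client-side throttling can be done with a few lines of code. The sketch below is a hypothetical illustration, not pywikibot's actual throttling API; weblinkchecker also runs multiple threads, so in practice the delay would have to be shared between them:

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive external requests."""

    def __init__(self, min_delay):
        self.min_delay = min_delay  # seconds between requests
        self._last = None

    def wait(self):
        # Sleep just long enough so calls are at least min_delay apart.
        now = time.monotonic()
        if self._last is not None and now - self._last < self.min_delay:
            time.sleep(self.min_delay - (now - self._last))
        self._last = time.monotonic()

# At the reported 168 requests/minute (one every ~0.36 s), a 5-second
# delay between requests would cap the traffic at 12 per minute.
throttle = Throttle(min_delay=5.0)
```

Calling `throttle.wait()` before each GET would be enough for a single-threaded checker.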

Also, minorplanetcenter.net appears in only 22 articles (as
https://cs.wikipedia.org/w/index.php?search=insource%3Aminorplanetcenter.net&title=Speci%C3%A1ln%C3%AD:Hled%C3%A1n%C3%AD&go=J%C3%ADt+na&searchToken=507gqzzqk3eyplk5s6gsii2bv
shows), so the traffic shouldn't be as massive as reported.

My .bash_history says the following. 1479660864 is a Unix timestamp; in
human-readable form that is Sun, 20 Nov 2016 16:54:24 GMT.

#1479660864
jsub -l release=trusty python ~/pwb/scripts/weblinkchecker.py -start:!
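
(The timestamp conversion can be checked with a couple of lines of Python:

```python
from datetime import datetime, timezone

# bash writes "#<epoch>" comment lines to .bash_history when history
# timestamps are enabled; convert the epoch seconds back to UTC.
ts = 1479660864
stamp = datetime.fromtimestamp(ts, tz=timezone.utc)
print(stamp.strftime("%a, %d %b %Y %H:%M:%S GMT"))
# → Sun, 20 Nov 2016 16:54:24 GMT
```

)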

My user-config.py (with the OAuth credentials removed) is at
http://pastebin.com/cUAwQuWt. The complete user-config.py is at
/home/urbanecm/.pywikibot/user-config.py, and only roots can see it.

Again, if I was the cause, I'm sorry. I only used standard scripts and
trusted them to work correctly.

Martin Urbanec alias Urbanecm
https://cs.wikipedia.org/wiki/Wikipedista:Martin_Urbanec
https://meta.wikimedia.org/wiki/User:Martin_Urbanec
https://wikitech.wikimedia.org/wiki/User:Urbanecm

On Sun, 4 Dec 2016 at 18:03, Maximilian Doerr <
maximilian.doerr at gmail.com> wrote:

> https://phabricator.wikimedia.org/F4978348 Done.
>
> Cyberpower678
> English Wikipedia Account Creation Team
> ACC Mailing List Moderator
> Global User Renamer
>
> On Dec 4, 2016, at 11:49, Merlijn van Deen (valhallasw) <
> valhallasw at arctus.nl> wrote:
>
> Hi Maximilian,
>
> https://phabricator.wikimedia.org/file/upload/ allows you to specify
> 'Visible to'. You can select 'Custom policy' and select the relevant users,
> i.e.
> <image.png>
>
> In the meanwhile, I'll try to figure out if I can get some information
> from netstat.
>
> Cheers,
> Merlijn
>
> On 4 December 2016 at 17:36, Maximilian Doerr <maximilian.doerr at gmail.com>
> wrote:
>
> Sure, how would I be able to restrict its visibility? Harvard is kind
> enough to unblock us once the culprit is stopped.
>
>
>
> As for the exact URLs, it's all of the domains owned by Harvard; the
> access log can provide specifics. According to their IT department, the
> Python script is attempting to fetch all 140,000 pieces of data about
> minor planets from www.minorplanetcenter.net, and doing it this way would
> severely tie up their servers for quite a while, which they cannot afford.
>
>
>
> Cyberpower678
>
> English Wikipedia Account Creation Team
>
> Mailing List Moderator
>
> Global User Renamer
>
>
>
> *From:* Merlijn van Deen (valhallasw) [mailto:valhallasw at arctus.nl]
> *Sent:* Sunday, December 4, 2016 10:59
> *To:* maximilian.doerr at gmail.com
> *Subject:* Re: [Labs-l] Someone using a Python framework is relentlessly
> hammering Harvard sites, resulting in an IP range ban.
>
>
>
> Hi Maximilian,
>
>
>
> On 4 December 2016 at 05:51, Maximilian Doerr <maximilian.doerr at gmail.com>
> wrote:
>
> Would the user who is querying the Harvard sites for planet data, that is
> carrying the UA “weblinkchecker Pywikibot/3.0-dev (g7171) requests/2.2.1
> Python/2.7.6.final.0”, please stop, or severely throttle the GET requests.
> It’s making 168 requests to that site a minute, and consequently they
> banned labs from accessing it, according to the IT department there, who
> kindly shared with me the access log.
>
>
>
>
>
> Would you be able to share the access log with the Tools admins (say, via
> Phabricator, only shared to Yuvi, Bryan Davis, Andrew Bogott, Chase, scfc
> and me)? From the combination of external IP and timestamp we may be able
> to pinpoint which tool was causing this.
>
>
>
> Can you also clarify which exact URLs we are talking about?
>
>
>
> Cheers,
>
> Merlijn
>
>
>
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
>

