[Labs-l] Someone using a Python framework is relentlessly hammering Harvard sites, resulting in an IP range ban.

Maximilian Doerr maximilian.doerr at gmail.com
Sun Dec 4 17:38:11 UTC 2016


I doubt it was you then, if you only scanned 22 pages.  According to IT, this user was attempting to fetch all 140,000 pieces of data about minor planets and was making 160 requests to that site per minute, which severely bogged down their servers when combined with the load they already have.  I think the ban was put into effect on Nov. 2.

Maybe it would be wise to have Labs simply throttle consecutive outgoing connections from Tool Labs, if possible; that is, connections made from scripts to external sites, while maintaining the status quo for the webservices.  This has to have some impact on network I/O bandwidth usage for both the host and client servers.
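To illustrate what such throttling could look like on the client side, here is a minimal sketch; the RateLimiter class, its limits, and the wait() call are all hypothetical, not an existing Labs or pywikibot feature, and a real deployment might instead enforce this centrally:

```python
import time

class RateLimiter:
    """Allow at most `max_calls` requests per sliding window of `period` seconds."""

    def __init__(self, max_calls, period):
        self.max_calls = max_calls
        self.period = period
        self.calls = []  # monotonic timestamps of recent requests

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have fallen out of the sliding window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest request ages out of the window.
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())

# Example: cap outgoing requests at 10 per minute (an arbitrary limit).
limiter = RateLimiter(max_calls=10, period=60)
# A script would call limiter.wait() before each external GET.
```

Even per-script throttling like this, which tool authors can apply today without any infrastructure change, would have kept the request rate well below what Harvard's servers saw.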

Cyberpower678
English Wikipedia Account Creation Team
ACC Mailing List Moderator
Global User Renamer

> On Dec 4, 2016, at 12:29, Martin Urbanec <martin.urbanec at wikimedia.cz> wrote:
> 
> Hi all, 
> I was running weblinkchecker.py over the whole cswiki (the job was submitted to the grid at Sun, 20 Nov 2016 16:54:24 GMT) because I wanted a list of dead links. This may correspond with the UA (because I used the script named weblinkchecker.py). I trusted this script not to do anything wrong because it was, and still is, in the standard core package. I also use the 3.0-dev version of pywikibot and Python 2.7.6. 
> 
> But this job has already completed, so if those GET requests didn't stop, I'm not the cause. Or I have lost access to the job; qstat for all my tools (urbanecmbot, missingpages) and my personal account (urbanecm) is empty/shows only the webserver. 
> 
> If I was the cause, I'm very sorry. As I said, I didn't know the script does not throttle GET requests enough. 
> 
> Also, minorplanetcenter.net is inserted in only 22 articles (as https://cs.wikipedia.org/w/index.php?search=insource%3Aminorplanetcenter.net&title=Speci%C3%A1ln%C3%AD:Hled%C3%A1n%C3%AD&go=J%C3%ADt+na&searchToken=507gqzzqk3eyplk5s6gsii2bv says), so the traffic shouldn't be as massive as described there. 
> 
> My .bash_history says the following. I guess 1479660864 is a Unix timestamp; the human-readable time is Sun, 20 Nov 2016 16:54:24 GMT. 
> 
> #1479660864
> jsub -l release=trusty python ~/pwb/scripts/weblinkchecker.py -start:!
> 
> My user-config.py is at http://pastebin.com/cUAwQuWt, without OAuth. The complete user-config is at /home/urbanecm/.pywikibot/user-config.py and only roots can see it. 
> 
> Again, if I was the cause, I'm sorry. I only used standard scripts and trusted them to work correctly. 
> 
> Martin Urbanec alias Urbanecm
> https://cs.wikipedia.org/wiki/Wikipedista:Martin_Urbanec
> https://meta.wikimedia.org/wiki/User:Martin_Urbanec
> https://wikitech.wikimedia.org/wiki/User:Urbanecm
> 
> On Sun, 4 Dec 2016 at 18:03, Maximilian Doerr <maximilian.doerr at gmail.com> wrote:
> https://phabricator.wikimedia.org/F4978348 Done.
> 
> Cyberpower678
> English Wikipedia Account Creation Team
> ACC Mailing List Moderator
> Global User Renamer
> 
>> On Dec 4, 2016, at 11:49, Merlijn van Deen (valhallasw) <valhallasw at arctus.nl> wrote:
>> 
>> Hi Maximilian,
>> 
>> https://phabricator.wikimedia.org/file/upload/ allows you to specify 'Visible to'. You can select 'Custom policy' and choose the relevant users, i.e.
>> <image.png>
>> 
>> In the meantime, I'll try to figure out whether I can get some information from netstat.
>> 
>> Cheers,
>> Merlijn
>> 
>> On 4 December 2016 at 17:36, Maximilian Doerr <maximilian.doerr at gmail.com> wrote:
>> Sure, how would I be able to restrict its visibility?  Harvard is kind enough to unblock us if the culprit is stopped.
>> 
>> As for the exact URLs, it's all of the domains owned by Harvard, but the access log can provide specifics.  The Python script is attempting to get all 140,000 pieces of data about minor planets from www.minorplanetcenter.net, according to IT, who also say that such an action, done the way it is being done now, would severely tie up their servers for quite a while, which they cannot afford.
>> 
>> Cyberpower678
>> English Wikipedia Account Creation Team
>> Mailing List Moderator
>> Global User Renamer
>> 
>> From: Merlijn van Deen (valhallasw) [mailto:valhallasw at arctus.nl] 
>> Sent: Sunday, December 4, 2016 10:59
>> To: maximilian.doerr at gmail.com
>> Subject: Re: [Labs-l] Someone using a Python framework is relentlessly hammering Harvard sites, resulting in an IP range ban.
>> 
>> Hi Maximilian,
>> 
>> On 4 December 2016 at 05:51, Maximilian Doerr <maximilian.doerr at gmail.com> wrote:
>> 
>> Would the user who is querying the Harvard sites for planet data, carrying the UA “weblinkchecker Pywikibot/3.0-dev (g7171) requests/2.2.1 Python/2.7.6.final.0”, please stop, or severely throttle, the GET requests.  They are making 168 requests to that site per minute, and consequently Harvard has banned Labs from accessing it, according to the IT department there, who kindly shared the access log with me.
>> 
>> Would you be able to share the access log with the Tools admins (say, via Phabricator, shared only with Yuvi, Bryan Davis, Andrew Bogott, Chase, scfc and me)?  From the combination of external IP and timestamp we may be able to pinpoint which tool was causing this.
>> 
>> Can you also clarify which exact URLs we are talking about?
>> 
>> Cheers,
>> 
>> Merlijn
>> 
> 
> _______________________________________________
> Labs-l mailing list
> Labs-l at lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/labs-l
