Bugs item #1878986, was opened at 2008-01-24 10:59
Message generated for change (Comment added) made by russblau
You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1878986...

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: General
Group: None
Status: Open
Resolution: None
Priority: 7
Private: No
Submitted By: Filnik (filnik)
Assigned to: Nobody/Anonymous (nobody)
Summary: getUrl() has a problem. No timeout?
Initial Comment: Hello, I've noticed that some of my scripts that were started about 1-2 weeks ago are still running.
The problem is that the getUrl() function in wikipedia.py never raises an error, no matter how much time passes (or at least I suppose that's the reason; otherwise we have a bot that has been trying to fetch a page for a week for no apparent reason...).
I haven't fixed the bug only because I have no idea how to fix it (I have never dealt with HTTP connections directly in Python), but Bryan said:
<Bryan> yes, but that would require you to modify the socket settings
<Bryan> sock.settimeout(1500)
<Bryan> or you do select.select on the socket
<Bryan> which is very hard in pywiki
Any ideas? :-) By the way, 1500 is only an example number; we should/can make it configurable in config.py. I've given this bug high priority because infinite loops on the toolserver are a really big problem.
Thanks, Filnik
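
Bryan's sock.settimeout suggestion can be sketched roughly as follows. This is only an illustration, not the pywikipedia code: it uses Python 3 and a plain socket, and the names SOCKET_TIMEOUT, read_with_timeout and demo_silent_peer are hypothetical (the thread only proposes keeping the value in config.py). A local listener that never answers stands in for a server that has stopped responding.

```python
import socket

# Hypothetical setting; the thread suggests the real value would live in
# config.py. Kept short here so the demo finishes quickly.
SOCKET_TIMEOUT = 0.5  # seconds


def read_with_timeout(host, port, timeout=SOCKET_TIMEOUT):
    """Connect and wait for data; return None if the peer stays silent."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.settimeout(timeout)  # applies to every later blocking call
        try:
            return sock.recv(1024)
        except socket.timeout:
            # Without a timeout, recv() would block here indefinitely.
            return None
    finally:
        sock.close()


def demo_silent_peer():
    # A listener that completes the TCP handshake but never sends anything.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    try:
        host, port = server.getsockname()
        return read_with_timeout(host, port)
    finally:
        server.close()
```

With the timeout in place, a stalled read surfaces as an exception the bot can handle (retry, log, or abort) instead of a process that hangs for weeks.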
----------------------------------------------------------------------
Comment By: Russell Blau (russblau)
Date: 2008-01-24 17:41
Message: Logged In: YES user_id=855050 Originator: NO
Sorry, that last comment was me, and the revision was r4936
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2008-01-24 17:37
Message: Logged In: NO
Added a 120-second timeout in r4796; seems to work in initial testing.
The problem with the libcurl suggestion is that it would require every user of every bot to download and install one or more third-party packages.
----------------------------------------------------------------------
Comment By: Francesco Cosoleto (cosoleto)
Date: 2008-01-24 12:21
Message: Logged In: YES user_id=181280 Originator: NO
I am not sure PyWikipediaBot causes intensive CPU usage on the Toolserver because of this problem; anyway, as a temporary fix for the missing timeout, there seems to be this easy solution:

    >>> import socket, urllib, urllib2
    >>> socket.setdefaulttimeout(0.1)
    >>> urllib2.urlopen("http://cosoleto.free.fr").read()
    [...]
    urllib2.URLError: <urlopen error timed out>
    >>> urllib.urlopen("http://cosoleto.free.fr").read()
    [...]
    IOError: [Errno socket error] timed out

But I suggest libcurl (http://curl.haxx.se/libcurl/) to easily improve and simplify the network side of the PyWikipedia code. libcurl is a feature-rich (persistent connections, transparent compression support, etc.) and portable URL transfer library written in C. Why not?
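
For readers on Python 3, where urllib2 became urllib.request, the socket.setdefaulttimeout approach from this comment might look like the sketch below. It is an assumption-laden illustration rather than the fix that was committed: fetch_with_default_timeout and demo_default_timeout are hypothetical names, and the demo talks to a local listener that never replies instead of a real site.

```python
import socket
import urllib.request


def fetch_with_default_timeout(url, timeout):
    """Return the response body, or None if the request fails or times out."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(timeout)  # process-wide default for new sockets
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket.timeout are both OSError subclasses.
        return None
    finally:
        socket.setdefaulttimeout(previous)  # restore the earlier default


def demo_default_timeout():
    # A listener that accepts the connection but never sends a response.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    try:
        host, port = server.getsockname()
        return fetch_with_default_timeout("http://%s:%d/" % (host, port), 0.3)
    finally:
        server.close()
```

Because the default timeout is global to the process, the sketch restores the previous value afterwards; a bot that wants the timeout everywhere could simply set it once at startup.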
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2008-01-24 11:06
Message: Logged In: YES user_id=1806226 Originator: NO
Note that it would be much easier to use settimeout if persistent_http were working. Unfortunately, it is not. I disabled it some time ago (http://fisheye.ts.wikimedia.org/browse/pywikipedia/trunk/pywikipedia/wikiped...) saying it needs investigation. Is anybody here willing to do this investigation? Fixing it would not only solve Filnik's bug (via site.conn.sock.settimeout), but would also greatly improve performance for single-threaded bots.
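
To illustrate why a persistent connection makes the timeout easy to apply, here is a minimal Python 3 sketch using the standard http.client module. It is not pywikipedia's persistent_http code; the handler class and demo_persistent are hypothetical, and a throwaway local HTTP server stands in for the wiki. One connection object carries one timeout, and several requests reuse the same socket.

```python
import http.client
import http.server
import threading


class _OkHandler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the client keep the connection open

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo_persistent(timeout=5.0):
    server = http.server.HTTPServer(("127.0.0.1", 0), _OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    # The timeout is set once and applies to the connection's socket.
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        statuses = []
        for path in ("/a", "/b"):
            conn.request("GET", path)
            response = conn.getresponse()
            response.read()  # drain the body so the socket can be reused
            statuses.append(response.status)
        applied = conn.sock.gettimeout()  # timeout on the shared socket
        return statuses, applied
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

Both requests travel over the same socket, so the cost of opening a connection is paid once, which is the performance benefit mentioned above for single-threaded bots.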
----------------------------------------------------------------------