Revision: 4944
Author: btongminh
Date: 2008-01-27 18:14:25 +0000 (Sun, 27 Jan 2008)
Log Message:
-----------
Make socket_timeout a config setting. Also make a note in the documentation that persistent_http has been disabled.
Modified Paths:
--------------
trunk/pywikipedia/config.py
trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/config.py
===================================================================
--- trunk/pywikipedia/config.py 2008-01-27 16:09:57 UTC (rev 4943)
+++ trunk/pywikipedia/config.py 2008-01-27 18:14:25 UTC (rev 4944)
@@ -365,8 +365,13 @@
# Use a persistent http connection. An http connection has to be established
# only once per site object, making stuff a whole lot faster. Do NOT EVER
# use this if you share Site objects across threads without proper locking.
+## DISABLED FUNCTION. Setting this variable will not have any effect.
persistent_http = False
+# Default socket timeout. Set to None to disable timeouts.
+socket_timeout = 120 # set a pretty long timeout just in case...
+
+
############## FURTHER SETTINGS ##############
# The bot can make some additional changes to each page it edits, e.g. fix
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2008-01-27 16:09:57 UTC (rev 4943)
+++ trunk/pywikipedia/wikipedia.py 2008-01-27 18:14:25 UTC (rev 4944)
@@ -113,7 +113,6 @@
import os, sys
import httplib, socket, urllib
-socket.setdefaulttimeout(120) # set a pretty long timeout just in case...
import traceback
import time, threading, Queue
import math
@@ -5270,6 +5269,8 @@
""")
sys.exit(1)
+# Set socket timeout
+socket.setdefaulttimeout(config.socket_timeout)
# Languages to use for comment text after the actual language but before
# en:. For example, if for language 'xx', you want the preference of
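The r4944 change above boils down to reading the process-wide default timeout from config instead of hard-coding it at import time; a minimal sketch (the `config` class below is a stand-in for pywikipedia's config.py, not the real module):

```python
import socket

# Stand-in for pywikipedia's config.py: the timeout is now a setting,
# not a constant buried in wikipedia.py.
class config:
    socket_timeout = 120  # seconds; None disables timeouts entirely

# Applied once at startup, as r4944 does in wikipedia.py.
socket.setdefaulttimeout(config.socket_timeout)
```

Every socket created after this call inherits the 120-second timeout, so a hung HTTP request eventually raises `socket.timeout` instead of blocking forever.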
Bugs item #1880625, was opened at 2008-01-27 12:42
Message generated for change (Comment added) made by rotemliss
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1880625&group_…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: interwiki
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Marc-Etienne Vargenau (vargenau)
Assigned to: Nobody/Anonymous (nobody)
Summary: Interwiki bot removing existing page "0"
Initial Comment:
http://pt.wikipedia.org/w/index.php?title=Zero&diff=prev&oldid=9113649
python interwiki.py zh:0
Checked for running processes. 1 processes currently running, including the current process.
TitleTranslate: 0 was recognized as Number with value 0
Getting 1 pages from wikipedia:zh...
NOTE: [[zh:0]] does not exist
======Post-processing [[zh:0]]======
Not editing [[zh:0]]: page does not exist
WARNING: Page 0 does no longer exist?!
But the page DOES exist.
----------------------------------------------------------------------
>Comment By: Rotem Liss (rotemliss)
Date: 2008-01-27 19:19
Message:
Logged In: YES
user_id=1327030
Originator: NO
This is a bug in the MediaWiki export page. The bug is now fixed there.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1880625&group_…
Revision: 4943
Author: russblau
Date: 2008-01-27 16:09:57 +0000 (Sun, 27 Jan 2008)
Log Message:
-----------
Fix (common) Python syntax error in last rev.
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2008-01-27 14:08:09 UTC (rev 4942)
+++ trunk/pywikipedia/wikipedia.py 2008-01-27 16:09:57 UTC (rev 4943)
@@ -2771,7 +2771,7 @@
line = line.split(' ')
pid = int(line[0])
ptime = int(line[1].split('.')[0])
- except IndexError, ValueError:
+ except (IndexError, ValueError):
# I go a lot of crontab errors because line is not a number.
# Better to prevent that. If you find out the error, feel free
# to fix it better.
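For context on why this small change matters: in Python 2, `except IndexError, ValueError:` catches only IndexError and binds the exception instance to the name ValueError, so an actual ValueError still escapes. A sketch of the fixed pattern in tuple syntax (the helper below is illustrative, not the actual wikipedia.py code):

```python
def parse_ps_line(line):
    """Parse a '<pid> <time>' line as in wikipedia.py's process check;
    return None on malformed input instead of crashing."""
    try:
        fields = line.split(' ')
        pid = int(fields[0])                       # may raise ValueError
        ptime = int(fields[1].split('.')[0])       # may raise IndexError or ValueError
    except (IndexError, ValueError):               # tuple form catches both, as in r4943
        return None
    return pid, ptime
```

With the pre-r4943 spelling, `parse_ps_line("garbage")` would have raised an uncaught ValueError; the tuple form returns None for any malformed line.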
Bugs item #1878986, was opened at 2008-01-24 15:59
Message generated for change (Comment added) made by filnik
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1878986&group_…
Category: General
Group: None
>Status: Closed
Resolution: None
Priority: 7
Private: No
Submitted By: Filnik (filnik)
Assigned to: Nobody/Anonymous (nobody)
Summary: getUrl() has a problem. No timeout?
Initial Comment:
Hello, I've seen that some of my processes include scripts started something like 1-2 weeks ago that are still running.
The problem is that the getUrl() function of wikipedia.py doesn't raise any error after x time (or so I suppose; otherwise we have a bot that has been trying to get a page for a week for no specific reason...).
I haven't fixed the bug only because I have no idea how to fix it (I have never handled HTTP connections directly in Python), but Bryan has said:
<Bryan> yes, but that would require you to modify the socket settings
<Bryan> sock.settimeout(1500)
<Bryan> or you do select.select on the socket
<Bryan> which is very hard in pywiki
Some ideas? :-) The 1500, by the way, is only a number; we should/can set it in config.py. I've given this bug high priority because infinite loops on the toolserver are really a big problem.
Thanks, Filnik
----------------------------------------------------------------------
>Comment By: Filnik (filnik)
Date: 2008-01-27 14:59
Message:
Logged In: YES
user_id=1834469
Originator: YES
It seems that my scripts are also working correctly. Bug closed (thanks to
all :-))
----------------------------------------------------------------------
Comment By: Russell Blau (russblau)
Date: 2008-01-27 12:31
Message:
Logged In: YES
user_id=855050
Originator: NO
OK to close. I ran a lengthy script on my home machine that has had
timeout problems in the past, and it worked fine.
----------------------------------------------------------------------
Comment By: Filnik (filnik)
Date: 2008-01-25 13:54
Message:
Logged In: YES
user_id=1834469
Originator: YES
Ok, thanks russblau; should I close the topic, or aren't you 100% sure
that it has been fixed? :-) Bye, Filnik
----------------------------------------------------------------------
Comment By: Russell Blau (russblau)
Date: 2008-01-24 22:41
Message:
Logged In: YES
user_id=855050
Originator: NO
Sorry, that last comment was me, and the revision was r4936
----------------------------------------------------------------------
Comment By: Nobody/Anonymous (nobody)
Date: 2008-01-24 22:37
Message:
Logged In: NO
Added a 120-second timeout in r4796; seems to work in initial testing.
The problem with libcurl suggestion is that it would require every user of
every bot to download and install one or more third-party packages.
----------------------------------------------------------------------
Comment By: Francesco Cosoleto (cosoleto)
Date: 2008-01-24 17:21
Message:
Logged In: YES
user_id=181280
Originator: NO
I am not sure PyWikipediaBot causes intensive CPU usage on the Toolserver due
to this problem; anyway, to temporarily fix the missing-timeout problem, there
seems to be this easy solution:
import socket
socket.setdefaulttimeout(0.1)
urllib2.urlopen("http://cosoleto.free.fr").read()
[...]
urllib2.URLError: <urlopen error timed out>
urllib.urlopen("http://cosoleto.free.fr").read()
[...]
IOError: [Errno socket error] timed out
But I suggest libcurl (http://curl.haxx.se/libcurl/) to easily improve and
simplify the network side of the PyWikipedia code. libcurl is a feature-rich
(persistent connections, transparent compression support, etc.) and
portable URL transfer library written in C. Why not?
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2008-01-24 16:06
Message:
Logged In: YES
user_id=1806226
Originator: NO
Note that it would be much easier to do settimeout if persistent_http were
working. Unfortunately, it is not. I disabled it some time ago
(http://fisheye.ts.wikimedia.org/browse/pywikipedia/trunk/pywikipedia/wikipe…)
saying it needs investigation. Is anybody here willing to do this
investigation? It would not only solve Filnik's bug
(site.conn.sock.settimeout), but it would also greatly improve performance
for single-threaded bots.
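Bryan's point about `site.conn.sock.settimeout` is that a timeout set on one live socket affects only that connection, whereas `socket.setdefaulttimeout` changes the default for every socket created afterwards; a minimal sketch (a bare socket stands in for the hypothetical `site.conn.sock`):

```python
import socket

# Per-socket timeout: only this one connection is affected, unlike
# socket.setdefaulttimeout(), which applies to all sockets created later.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(1500)  # seconds; 1500 is Bryan's example value from the thread
```

With a working persistent_http, each Site object could set this once on its long-lived connection instead of relying on a process-wide default.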
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1878986&group_…
Revision: 4942
Author: filnik
Date: 2008-01-27 14:08:09 +0000 (Sun, 27 Jan 2008)
Log Message:
-----------
Uhm, seems that this part gives both IndexError and ValueError! Fix
Modified Paths:
--------------
trunk/pywikipedia/wikipedia.py
Modified: trunk/pywikipedia/wikipedia.py
===================================================================
--- trunk/pywikipedia/wikipedia.py 2008-01-26 20:38:46 UTC (rev 4941)
+++ trunk/pywikipedia/wikipedia.py 2008-01-27 14:08:09 UTC (rev 4942)
@@ -2771,7 +2771,7 @@
line = line.split(' ')
pid = int(line[0])
ptime = int(line[1].split('.')[0])
- except IndexError:
+ except IndexError, ValueError:
# I go a lot of crontab errors because line is not a number.
# Better to prevent that. If you find out the error, feel free
# to fix it better.
Feature Requests item #1880563, was opened at 2008-01-27 00:54
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1880563&group_…
Category: None
Group: None
Status: Open
Priority: 5
Private: No
Submitted By: Mike.lifeguard (mike_lifeguard)
Assigned to: Nobody/Anonymous (nobody)
Summary: upgrade parameters in fixing_redirects.py
Initial Comment:
1) add support for -file and -cat at a minimum
2) there may be a bug in -start (I haven't tested this much, so I can't be sure)
3) fixing_redirects.py doesn't tell you that -namespace and -start are options; you have to look in the code
4) some way of logging redirects which have been orphaned would be nice. Depending on how this is done, it might require a new module; User:Mike.lifeguard@enwikibooks would be happy to attempt to figure out the logic required with someone who knows python
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603141&aid=1880563&group_…
Bugs item #1615700, was opened at 2006-12-14 14:03
Message generated for change (Comment added) made by tavernier
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1615700&group_…
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Purodha B Blissenbach (purodha)
Assigned to: Nobody/Anonymous (nobody)
Summary: "category.py move [-inplace]" ignores some instances
Initial Comment:
I observed (very few) instances when the command:
python category.py ... move -inplace -from:... -to:...
did not catch all occurrences of the -from category, so that after the run it was not empty, since some pages belonging to it were not altered at all. I believe replace.py did not find the category tag inside these pages.
In all such cases, the following was true, but I cannot tell which of those (if any) is a trigger for replace.py's erroneous behaviour:
- "move -inplace" was used.
- the category tag was formatted like "[[namespace:pagename|sortkey]]", with a non-empty sort key present.
- the "[[namespace:" was
neither "[[Category:",
nor the default local name for the wiki/language (as present in $namespaceNames),
but another name which can be found only in the $namespaceAliases of the wiki language.
(My *guess* is the latter being the cause)
There may have been cases when category.py did find and correctly replaced the category tag in pages despite the conditions above, but I am not aware of any.
----------------------------------------------------------------------
Comment By: Tavernier (tavernier)
Date: 2008-01-27 05:48
Message:
Logged In: YES
user_id=1705732
Originator: NO
Can you give links? Does the problem still occur?
Sometimes it doesn't catch the category because it's transcluded from a
template.
----------------------------------------------------------------------
Comment By: Russell Blau (russblau)
Date: 2007-11-21 16:13
Message:
Logged In: YES
user_id=855050
Originator: NO
I have made many revisions to the "replaceCategoryInPlace()" function
since this bug was opened. If there is still a problem, please identify
specific articles/sites that don't get moved correctly. If not, this bug
can be closed.
----------------------------------------------------------------------
Comment By: Purodha B Blissenbach (purodha)
Date: 2007-06-01 12:32
Message:
Logged In: YES
user_id=46450
Originator: YES
The sort key seems to be of no importance.
Here you can see that there are three ways to write "Category" in the
Wikipedia of Ripuarian languages:
http://ksh.wikipedia.org/w/index.php?title=Betupper_%28Minsch%29&diff=prev&…
- the list of categories below the page shows only one name, and there is
no wikilink of either at the end of the page.
I ran:
python category.py move -inplace -from:Mynsch -to:Minsch
and it responded:
Checked for running processes. 1 processes currently running, including
the current process.
Getting [[Saachjrupp:Mynsch]]...
Getting 1 pages from wikipedia:ksh...
Getting a page to check if we're logged in on wikipedia:ksh
Sleeping for 5.6 seconds, 2007-06-01 10:05:04
Changing page [[ksh:Betupper (Minsch)]]
There are no subcategories in category Saachjrupp:Mynsch
Dumping to category.dump.bz2, please wait...
Here is the change made by the bot:
http://ksh.wikipedia.org/w/index.php?title=Betupper_%28Minsch%29&diff=next&…
1. It got the generic english name (category:Mynsch altered to
Category:Minsch) and normalized it.
2. It got the default localized name (Saachjrupp:Mynsch) and normalized it
to the generic english one. The default localized form is defined in
$namespaceNames.
3. It did NOT catch the alternate localized name (Kategorie:Mynsch, left
unchanged). BUT as you can see from the box at the bottom of the page, the
page is now in two categories. I.e. Mediawiki understands the alternate
localized form. Several alternate localized names can be defined in
$namespaceAliases, see
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/languages/messages/M…
I believe category.py either does not know about $namespaceAliases, or uses
only one of the alternate names listed there.
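Purodha's hypothesis can be sketched as alias-aware matching: accept any known spelling of the namespace when replacing the tag. The alias list below is assumed from the comment above, not read from MediaWiki's $namespaceAliases:

```python
import re

# Hypothetical sketch: match a category tag under any of the ksh spellings
# purodha lists ("Category", the local default "Saachjrupp", and the
# $namespaceAliases form "Kategorie"), sort key included.
aliases = ['Category', 'Saachjrupp', 'Kategorie']
pattern = re.compile(
    r'\[\[(?:%s)\s*:\s*Mynsch(?:\|[^\]]*)?\]\]' % '|'.join(aliases),
    re.IGNORECASE)

def move_category(text):
    # Replace the old tag (any alias, any sort key) with the new category.
    return pattern.sub('[[Category:Minsch]]', text)
```

A replacement built only from "Category" and $namespaceNames would leave the "Kategorie:" form untouched, which matches the behaviour seen in the diffs above.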
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-04-26 21:30
Message:
Logged In: YES
user_id=1107255
Originator: NO
Please let us know if this bug report is still applicable to the current
code. If no response is given, the bug report will be closed one month from
now. This message was added in an effort to reduce the number of open
issues on this project. Siebrand
----------------------------------------------------------------------
Comment By: Cyde Weys (cydeweys)
Date: 2007-01-28 21:26
Message:
Logged In: YES
user_id=1506848
Originator: NO
Hrmm, I'm confused. How many different ways are there to write
"Category:"? I thought there was just one, whatever the language's wiki
uses. Can you give a specific example of one that wasn't caught, but that
should have been? Thanks.
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1615700&group_…
Bugs item #1504707, was opened at 2006-06-12 13:40
Message generated for change (Comment added) made by tavernier
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1504707&group_…
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Nobody/Anonymous (nobody)
Summary: Replace.py's -fix:HTML breaks articles' formatting
Initial Comment:
While on most articles -fix:HTML simply works, when it comes to italics and
bolding it can cause unintentional bolding, a major problem, because of a
flaw in wikimarkup: it is difficult to distinguish between ''' as bolding
and ''' as italicizing followed by a single apostrophe (a frequently used
construct).
See this diff:
http://en.wikipedia.org/w/index.php?title=Multiplication_table&diff=prev&ol…
Before:
http://en.wikipedia.org/w/index.php?title=Multiplication_table&oldid=572703…
After:
http://en.wikipedia.org/w/index.php?title=Multiplication_table&oldid=579984…
I think this could be fixed by scanning the region defined by the previous
and the next line break, and if there are any other '''s, either using a
<nowiki> construct to protect the apostrophe from replacement, using HTML
<i> tags, or simply not doing any replacements.
~maru
----------------------------------------------------------------------
Comment By: Tavernier (tavernier)
Date: 2008-01-27 04:55
Message:
Logged In: YES
user_id=1705732
Originator: NO
It could be fixed by adding exceptions to the doReplacements method.
I suggest 'comment', 'math', 'nowiki', 'pre' and 'source'.
It would look like:
def doReplacements(self, original_text):
    """
    Returns the text which is generated by applying all replacements
    to the given text.
    """
    new_text = original_text
    exceptions = ['comment', 'math', 'nowiki', 'pre', 'source']
    if self.exceptions.has_key('inside-tags'):
        exceptions += self.exceptions['inside-tags']
    if self.exceptions.has_key('inside'):
        exceptions += self.exceptions['inside']
    for old, new in self.replacements:
        new_text = wikipedia.replaceExcept(new_text, old, new,
                                           exceptions,
                                           allowoverlap=self.allowoverlap)
    return new_text
----------------------------------------------------------------------
Comment By: Rotem Liss (rotemliss)
Date: 2007-11-25 14:05
Message:
Logged In: YES
user_id=1327030
Originator: NO
This is a "bug" in MediaWiki (or in the text) and isn't related to pre or
nowiki tags. The problem was that there was an invalid <i> tag, while the
five apostrophes already made the text italic. Changing it would cause the
same problems in a line that starts with a space and in a regular line.
As for nowiki and pre, these are already scanned and the text inside them is
ignored (a space-indented line is not scanned, though it could be, but such a
scan is not needed, as tags in such a line are parsed, unlike pre or nowiki).
This is not a bug in the framework - I think it wasn't a bug when reported,
and it's definitely not a bug now. The problem lies in the original text or
in MediaWiki.
----------------------------------------------------------------------
Comment By: Russell Blau (russblau)
Date: 2007-11-21 16:12
Message:
Logged In: YES
user_id=855050
Originator: NO
The specific bug identified on "Multiplication table" no longer exists.
Is there still a problem? If not, this bug can be closed.
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-04-26 21:29
Message:
Logged In: YES
user_id=1107255
Originator: NO
Please let us know if this bug report is still applicable to the current
code. If no response is given, the bug report will be closed one month from
now. This message was added in an effort to reduce the number of open
issues on this project. Siebrand
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1504707&group_…