Bugs item #1831698, was opened at 2007-11-14 11:44
Message generated for change (Comment added) made by btongminh
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=183169…
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: General
Group: None
Status: Open
Resolution: Fixed
Priority: 8
Private: No
Submitted By: siebrand (siebrand)
Assigned to: Nobody/Anonymous (nobody)
Summary: Partial pages saved on connection reset
Initial Comment:
I have gotten 2 reports of CommonsDelinker saving incomplete pages:
http://bg.wikipedia.org/w/index.php?title=19_%D0%BD%D0%BE%D0%B5%D0%BC%D0%B2…
and
http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0…
In a discussion on IRC, a connection reset was suggested as the cause. My question
was why the wiki would save a page that was not sent completely. This appears to be
related to sending the 'edit token' before the edit data. If possible, the order should
be reversed so these mistakes cannot happen (the worst case is that the edit is not made).
IRC chat (freenode #pywikipediabot):
[11:57] <siebrand> Bryan: big "oops": <URL at ru.wp>
[11:58] <valhallasw> siebrand: looks like some connection reset
[..]
[11:58] <Bryan> what valhallasw said
[..]
[11:58] <Bryan> maybe we should change wikipedia.py such that it sends wpEditToken
as last item
[11:59] <Bryan> so stuff like this doesn't happen
[11:59] <siebrand> Connection reset sounds logical, although, why would MediaWiki
accept that?
[11:59] <valhallasw> because connection closed is connection closed
[11:59] <siebrand> ah, I see bryan explained that :)
[12:00] <valhallasw> Bryan: sounds like a good idea in any case :)
[12:00] <siebrand> indeed the bot needs to signal in some way that it is actually
"done submitting" before the wiki accepts its changes. If that is the "edit
token", then it may be a good idea to send that as the last item.
[12:00] <valhallasw> well, mediawiki won't accept an edit without edit token
[12:00] <valhallasw> and it has no way to check if the complete request has been
sent
[12:00] <valhallasw> so the only way to prevent saving is sending some required
header last
[12:01] <valhallasw> Bryan: are we not using some content-length header? that should
fix the problem, too
[12:01] <Bryan> no idea
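The ordering idea from the chat above can be sketched as follows. This is a minimal illustration in modern Python, not the actual wikipedia.py code: the function name build_edit_postdata is hypothetical, and the form fields other than wpEditToken (wpSummary, wpTextbox1, wpSave) are assumed here for the sake of the example.

```python
from urllib.parse import urlencode, parse_qsl


def build_edit_postdata(text, summary, token):
    """Encode an edit form with the required token as the final field.

    Hypothetical helper: a list of pairs (rather than a dict) is used so
    that the field order in the encoded body is explicit.
    """
    fields = [
        ("wpSummary", summary),
        ("wpTextbox1", text),
        ("wpSave", "1"),
        # The token goes last: a body that is cut off anywhere before
        # this point contains no token, so the edit cannot be accepted.
        ("wpEditToken", token),
    ]
    return urlencode(fields).encode("ascii")


if __name__ == "__main__":
    body = build_edit_postdata("Some page text", "test edit", "abc+\\")
    truncated = body[: len(body) // 2]
    print([name for name, _ in parse_qsl(truncated.decode("ascii"))])
```

With this ordering, a request body truncated part-way through no longer carries the token, so per the reasoning in the chat the worst case is that the edit is not made.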
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-23 23:36
Message:
Logged In: YES
user_id=1806226
Originator: NO
I looked some more into this and the problem seems to be that you can't rely
on the server to send the Content-Length header. Something else that might
be worthwhile is to have the gzip module raise an error if the content is
incomplete.
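One way to get an error for incomplete gzip content is sketched below. It is an assumption-laden illustration in modern Python rather than the framework's code: gunzip_strict and IncompleteResponse are hypothetical names, and the sketch uses zlib's decompression object, whose eof attribute only becomes true once the end of the compressed stream has been reached.

```python
import gzip
import zlib


class IncompleteResponse(Exception):
    """Raised when a gzip body ends before its end-of-stream marker."""


def gunzip_strict(raw):
    """Decompress a gzip body and refuse truncated input."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decomp.decompress(raw)
    if not decomp.eof:
        # The stream stopped early; returning `data` here would hand a
        # partial page to the caller.
        raise IncompleteResponse("gzip stream ended before its end marker")
    return data


if __name__ == "__main__":
    full = gzip.compress(b"page text " * 1000)
    print(len(gunzip_strict(full)))
    try:
        gunzip_strict(full[: len(full) // 2])
    except IncompleteResponse as exc:
        print("rejected:", exc)
```

This check does not depend on the server sending Content-Length, since the gzip stream carries its own end marker.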
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-19 21:42
Message:
Logged In: YES
user_id=1806226
Originator: NO
Reopened: The fix only works when using persistent_http = True. A similar
solution is needed for persistent_http = False.
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-15 20:18
Message:
Logged In: YES
user_id=1806226
Originator: NO
Fixed in r4560.
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-15 19:12
Message:
Logged In: YES
user_id=1806226
Originator: NO
Since it is also happening to SieBot, I assume that it is an error in the
framework.
I think that it originates from the function that gets the data from the
server. As valhallasw points out, the server would just reject incomplete
postdata, as we do set content-length.
The last character from
http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0…
is '\xef\xbf\xbd' or u'\ufffd', more commonly known as the 'Unicode replacement
character'. This character is inserted when an invalid unicode sequence is
read. I think we should very thoroughly look at the get routines and add
more checks, such as a check for content-length, in order to prevent stuff
like this from happening.
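The kind of check proposed above might look like the sketch below. It is written in modern Python under stated assumptions: decode_page and TruncatedPage are hypothetical names, raw is assumed to be the body bytes exactly as received (no content encoding applied), and content_length is the header value if the server sent one, else None.

```python
class TruncatedPage(Exception):
    """Raised when a fetched page looks incomplete."""


def decode_page(raw, content_length=None, encoding="utf-8"):
    """Decode a fetched page body, refusing data that looks cut off."""
    # Compare against Content-Length only when the header was present.
    if content_length is not None and len(raw) != int(content_length):
        raise TruncatedPage(
            "expected %s bytes, got %d" % (content_length, len(raw))
        )
    try:
        # Strict decoding raises on a multi-byte sequence that was cut
        # short, instead of silently substituting u'\ufffd'.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise TruncatedPage("invalid %s data: %s" % (encoding, exc)) from exc


if __name__ == "__main__":
    raw = "Глазго".encode("utf-8")
    print(decode_page(raw, str(len(raw))))
    # Lenient decoding of a cut-off body ends in the replacement character.
    print(repr(raw[:-1].decode("utf-8", "replace")[-1]))
```

Strict decoding only catches a cut that falls inside a multi-byte character; a body truncated on a character boundary still needs the length check or some other end-of-page test.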
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 13:49
Message:
Logged In: YES
user_id=1107255
Originator: YES
Two more:
http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=1…
http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=1…
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 10:50
Message:
Logged In: YES
user_id=1107255
Originator: YES
A very recent one:
http://it.wikipedia.org/w/index.php?title=Discussione:Socialdemocrazia&…
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 08:51
Message:
Logged In: YES
user_id=1107255
Originator: YES
Additional report:
http://fr.wikipedia.org/w/index.php?title=Guerre_de_Cent_Ans&diff=22942…
----------------------------------------------------------------------