Bugs item #1831698, was opened at 2007-11-14 11:44
Message generated for change (Comment added) made by btongminh
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1831698...

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.
Category: General
Group: None
Status: Open
Resolution: Fixed
Priority: 8
Private: No
Submitted By: siebrand (siebrand)
Assigned to: Nobody/Anonymous (nobody)
Summary: Partial pages saved on connection reset
Initial Comment: I have gotten 2 reports of CommonsDelinker saving incomplete pages: http://bg.wikipedia.org/w/index.php?title=19_%D0%BD%D0%BE%D0%B5%D0%BC%D0%B2%... and http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%...
When this was discussed on IRC, the cause was thought to be a connection reset. My question was why the wiki would save a page that was not sent completely. This appears to be related to the 'edit token' being sent before the edit data. If possible, the order should be reversed so that these mistakes cannot happen (the worst case then is that the edit is not made).
IRC chat (freenode #pywikipediabot):
[11:57] <siebrand> Bryan: big "oops": <URL at ru.wp>
[11:58] <valhallasw> siebrand: looks like some connection reset
[..]
[11:58] <Bryan> what valhallasw said
[..]
[11:58] <Bryan> maybe we should change wikipedia.py such that it sends wpEditToken as last item
[11:59] <Bryan> so stuff like this doesn't happen
[11:59] <siebrand> Connection reset sounds logical, although, why would MediaWiki accept that?
[11:59] <valhallasw> because connection closed is connection closed
[11:59] <siebrand> ah, I see bryan explained that :)
[12:00] <valhallasw> Bryan: sounds like a good idea in any case :)
[12:00] <siebrand> indeed the bot needs to signal in some way that it is actually "done submitting" before the wiki accepts its changes. If that is the "edit token", then it may be a good idea to send that as the last item.
[12:00] <valhallasw> well, mediawiki won't accept an edit without edit token
[12:00] <valhallasw> and it has no way to check if the complete request has been sent
[12:00] <valhallasw> so the only way to prevent saving is sending some required header last
[12:01] <valhallasw> Bryan: are we not using some content-length header? that should fix the problem, too
[12:01] <Bryan> no idea
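
To illustrate the ordering idea from the chat, here is a minimal Python 3 sketch that encodes the POST body with wpEditToken as the final field. It is not the actual wikipedia.py code: the helper name build_edit_body and the field names wpTextbox1 and wpSummary are assumptions for illustration, and only wpEditToken is named in the discussion.

```python
from urllib.parse import urlencode, parse_qsl


def build_edit_body(text, summary, token):
    """Encode an edit form with the edit token as the final field.

    Hypothetical helper: field names other than wpEditToken are assumed.
    """
    # A list of pairs (not a dict) makes the field order explicit.
    fields = [
        ("wpTextbox1", text),
        ("wpSummary", summary),
        ("wpEditToken", token),  # sent last on purpose
    ]
    return urlencode(fields)


body = build_edit_body("page text", "bot edit", "abc123+\\")

# Simulate a connection reset partway through the request body.
truncated = body[:30]
fields_seen = dict(parse_qsl(truncated))
token_arrived = "wpEditToken" in fields_seen
```

With this ordering, a body that is cut off before the token reaches the server carries no token at all, so by valhallasw's reasoning the edit would not be accepted. A cut in the middle of the token leaves a damaged token, which should presumably be rejected as well.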
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-23 23:36
Message: Logged In: YES user_id=1806226 Originator: NO
I looked into this some more, and the problem seems to be that you can't rely on the server to send the Content-Length header. Something else that might be worthwhile is to have the gzip module raise an error if the content is incomplete.
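
As a rough sketch of the gzip idea, the following Python 3 snippet decompresses a gzip body with zlib and raises an error when the stream stops before its end-of-stream marker. The names gunzip_strict and IncompleteGzipError are hypothetical, the sample payload is arbitrary, and the framework's own decompression code may look quite different.

```python
import gzip
import zlib


class IncompleteGzipError(Exception):
    """Raised when a gzip stream ends before its end-of-stream marker."""


def gunzip_strict(data):
    """Decompress gzip data, refusing truncated input (hypothetical helper)."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    result = decompressor.decompress(data)
    # eof only becomes True once the complete stream has been consumed.
    if not decompressor.eof:
        raise IncompleteGzipError("gzip stream ended early")
    return result


payload = ("Глазго " * 200).encode("utf-8")  # arbitrary sample text
compressed = gzip.compress(payload)

# Simulate a connection reset halfway through the compressed body.
cut = compressed[: len(compressed) // 2]
try:
    gunzip_strict(cut)
    truncation_detected = False
except IncompleteGzipError:
    truncation_detected = True
```

A check like this does not depend on the Content-Length header, because the gzip format carries its own end marker.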
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-19 21:42
Message: Logged In: YES user_id=1806226 Originator: NO
Reopened: The fix only works when using persistent_http = True. A similar solution is needed for persistent_http = False.
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-15 20:18
Message: Logged In: YES user_id=1806226 Originator: NO
Fixed in r4560.
----------------------------------------------------------------------
Comment By: Bryan (btongminh)
Date: 2007-11-15 19:12
Message: Logged In: YES user_id=1806226 Originator: NO
Since it is also happening to SieBot, I assume that it is an error in the framework.
I think that it originates from the function that gets the data from the server. As valhallasw points out, the server would just reject incomplete POST data, since we do set Content-Length.
The last character from http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%... is '\xef\xbf\xbd' or u'\ufffd', more commonly known as the Unicode replacement character. This character is inserted when an invalid Unicode sequence is read. I think we should look very thoroughly at the get routines and add more checks, such as a check against Content-Length, in order to prevent stuff like this from happening.
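
The kind of check meant here could look roughly like the Python 3 sketch below. It compares the received byte count with Content-Length when the server sent that header, and decodes strictly so that a cut-off multi-byte character raises an error instead of silently turning into u'\ufffd'. The names decode_page and IncompletePageError are hypothetical, and the sample string is arbitrary.

```python
class IncompletePageError(Exception):
    """Raised when a fetched page looks truncated."""


def decode_page(raw, content_length=None, encoding="utf-8"):
    """Validate and decode a fetched page body (hypothetical helper)."""
    # content_length is the header value if the server sent one, else None.
    # It counts bytes as transferred, so compare before any gunzip step.
    if content_length is not None and len(raw) != int(content_length):
        raise IncompletePageError(
            "got %d bytes, expected %s" % (len(raw), content_length))
    try:
        # Strict decoding: invalid input raises instead of yielding U+FFFD.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        raise IncompletePageError("invalid %s data: %s" % (encoding, exc)) from exc


complete = "Глазго".encode("utf-8")
truncated = complete[:-1]  # cut in the middle of the last character

# Lenient decoding reproduces the symptom: a trailing replacement character.
lenient = truncated.decode("utf-8", "replace")

try:
    decode_page(truncated)
    truncation_detected = False
except IncompletePageError:
    truncation_detected = True
```

Strict decoding only catches a cut that falls inside a multi-byte character, so it complements a length check and does not replace one.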
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 13:49
Message: Logged In: YES user_id=1107255 Originator: YES
Two more: http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=12... http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=12...
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 10:50
Message: Logged In: YES user_id=1107255 Originator: YES
A very recent one: http://it.wikipedia.org/w/index.php?title=Discussione:Socialdemocrazia&d...
----------------------------------------------------------------------
Comment By: siebrand (siebrand)
Date: 2007-11-15 08:51
Message: Logged In: YES user_id=1107255 Originator: YES
Additional report: http://fr.wikipedia.org/w/index.php?title=Guerre_de_Cent_Ans&diff=229427...
----------------------------------------------------------------------