[Pywikipedia-l] [ pywikipediabot-Bugs-1831698 ] Partial pages saved on connection reset

SourceForge.net noreply at sourceforge.net
Mon Dec 10 21:22:52 UTC 2007


Bugs item #1831698, was opened at 2007-11-14 11:44
Message generated for change (Comment added) made by btongminh
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=603138&aid=1831698&group_id=93107

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: General
Group: None
>Status: Closed
Resolution: Fixed
Priority: 8
Private: No
Submitted By: siebrand (siebrand)
>Assigned to: Bryan (btongminh)
Summary: Partial pages saved on connection reset

Initial Comment:
I have gotten 2 reports of CommonsDelinker saving incomplete pages: http://bg.wikipedia.org/w/index.php?title=19_%D0%BD%D0%BE%D0%B5%D0%BC%D0%B2%D1%80%D0%B8&diff=prev&oldid=1294165 and http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%BE&diff=6253594&oldid=6249508

In a discussion on IRC, the cause was thought to be a connection reset. My question was why the wiki would save a page that was not sent completely. This appears to be related to the 'edit token' being sent before the edit data. If possible, the order should be reversed so that these mistakes cannot happen (the worst case would then be that the edit is not made).

IRC chat (freenode #pywikipediabot):
[11:57] <siebrand> Bryan: big "oops": <URL at ru.wp>
[11:58] <valhallasw> siebrand: looks like some connection reset
[..]
[11:58] <Bryan> what valhallasw said
[..]
[11:58] <Bryan> maybe we should change wikipedia.py such that it sends wpEditToken as last item
[11:59] <Bryan> so stuff like this doesn't happen
[11:59] <siebrand> Connection reset sounds logical, although, why would MediaWiki accept that?
[11:59] <valhallasw> because connection closed is connection closed
[11:59] <siebrand> ah, I see bryan explained that :)
[12:00] <valhallasw> Bryan: sounds like a good idea in any case :)
[12:00] <siebrand> indeed the bot needs to signal in some way that it is actually "done submitting" before the wiki accepts its changes. If that is the "edit token", then it may be a good idea to send that as the last item.
[12:00] <valhallasw> well, mediawiki won't accept an edit without edit token
[12:00] <valhallasw> and it has no way to check if the complete request has been sent
[12:00] <valhallasw> so the only way to prevent saving is sending some required header last
[12:01] <valhallasw> Bryan: are we not using some content-length header? that should fix the problem, too
[12:01] <Bryan> no idea
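
The idea floated in the chat (send wpEditToken as the last form field, so that a truncated request loses the token first) can be sketched roughly as below. This is an illustration in modern Python 3, not the code in wikipedia.py; the helper name build_edit_body and the other field names (wpSummary, wpTextbox1) are assumptions made for the example.

```python
from urllib.parse import urlencode, parse_qsl


def build_edit_body(text, summary, token):
    """Encode an edit form with wpEditToken as the final field.

    Hypothetical helper: if the body is cut off in transit, the token is
    the first thing to go missing, and a wiki that requires the token
    would then refuse the edit instead of saving a partial page.
    """
    fields = [
        ("wpSummary", summary),
        ("wpTextbox1", text),
        ("wpEditToken", token),  # deliberately last
    ]
    return urlencode(fields)


body = build_edit_body("Full page text", "bot edit", "abc123+\\")
truncated = body[: len(body) // 2]  # simulate a connection reset mid-request

print("wpEditToken" in dict(parse_qsl(body)))       # complete request
print("wpEditToken" in dict(parse_qsl(truncated)))  # cut-off request
```

Passing a list of pairs to urlencode, rather than a dict, makes the field order explicit.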


----------------------------------------------------------------------

>Comment By: Bryan (btongminh)
Date: 2007-12-10 22:22

Message:
Logged In: YES 
user_id=1806226
Originator: NO

r4692

----------------------------------------------------------------------

Comment By: Bryan (btongminh)
Date: 2007-12-10 22:22

Message:
Logged In: YES 
user_id=1806226
Originator: NO

Fixed in r...

----------------------------------------------------------------------

Comment By: Bryan (btongminh)
Date: 2007-11-23 23:36

Message:
Logged In: YES 
user_id=1806226
Originator: NO

I looked into this some more, and the problem seems to be that you can't
rely on the server to send the Content-Length header. Something else that
might be worthwhile is to have the gzip module raise an error if the
content is incomplete.
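
A rough sketch of the gzip idea, assuming modern Python 3 rather than the interpreter the framework ran on at the time: the standard gzip module raises EOFError when a stream ends before its end-of-stream marker, so a wrapper can turn that into a hard failure instead of passing partial text on. The name decompress_or_fail is hypothetical.

```python
import gzip
import zlib


def decompress_or_fail(raw):
    """Return the decompressed bytes, or raise ValueError if the gzip
    stream is truncated or corrupt (hypothetical helper)."""
    try:
        return gzip.decompress(raw)
    except (EOFError, OSError, zlib.error) as exc:
        raise ValueError("incomplete or corrupt gzip response") from exc


page = ("Some page text. " * 200).encode("utf-8")
compressed = gzip.compress(page)

# A complete stream round-trips unchanged.
assert decompress_or_fail(compressed) == page

# A stream cut in half (as after a connection reset) is rejected.
try:
    decompress_or_fail(compressed[: len(compressed) // 2])
except ValueError as err:
    print("rejected:", err)
```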

----------------------------------------------------------------------

Comment By: Bryan (btongminh)
Date: 2007-11-19 21:42

Message:
Logged In: YES 
user_id=1806226
Originator: NO

Reopened: the fix only works when using persistent_http = True. A similar
solution is needed for persistent_http = False.

----------------------------------------------------------------------

Comment By: Bryan (btongminh)
Date: 2007-11-15 20:18

Message:
Logged In: YES 
user_id=1806226
Originator: NO

Fixed in r4560.

----------------------------------------------------------------------

Comment By: Bryan (btongminh)
Date: 2007-11-15 19:12

Message:
Logged In: YES 
user_id=1806226
Originator: NO

Since it is also happening to SieBot, I assume that it is an error in the
framework.

I think that it originates from the function that gets the data from the
server. As valhallasw points out, the server would just reject incomplete
postdata, as we do set content-length.

The last character from
http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%BE&diff=6253594&oldid=6249508
is '\xef\xbf\xbd' or u'\ufffd', more commonly known as the Unicode
replacement character. This character is inserted when an invalid Unicode
sequence is read. I think we should look very thoroughly at the get
routines and add more checks, such as a check for content-length, in order
to prevent stuff like this from happening.
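
As an illustration of the kind of check meant here (a sketch in modern Python 3, not the framework's actual get routine): compare the number of bytes received with the Content-Length value when the server sent one, and decode strictly so that a cut-off multi-byte sequence raises an error instead of silently turning into u'\ufffd'. The function name decode_body and its parameters are hypothetical.

```python
def decode_body(raw, content_length=None, encoding="utf-8"):
    """Decode a response body, refusing obviously incomplete data.

    Hypothetical helper. `raw` is the body as received (before any
    decompression); `content_length` is the header value as an int,
    or None if the server did not send the header.
    """
    if content_length is not None and len(raw) != content_length:
        raise IOError(
            "expected %d bytes, received %d" % (content_length, len(raw))
        )
    # Strict decoding raises UnicodeDecodeError on a truncated sequence.
    return raw.decode(encoding)


full = "Глазго".encode("utf-8")  # 6 characters, 12 bytes
cut = full[:-1]                  # last character loses its second byte

# Lenient decoding hides the damage behind a replacement character.
lenient = cut.decode("utf-8", errors="replace")
print(lenient.endswith("\ufffd"))
print("\ufffd".encode("utf-8") == b"\xef\xbf\xbd")

# Both checks reject the truncated body.
for kwargs in ({"content_length": len(full)}, {}):
    try:
        decode_body(cut, **kwargs)
    except (IOError, UnicodeDecodeError) as err:
        print("rejected:", type(err).__name__)
```

Note that the strict decode only catches a cut that lands inside a multi-byte character; a cut on a character boundary needs the length check or some other end-of-content marker.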

----------------------------------------------------------------------

Comment By: siebrand (siebrand)
Date: 2007-11-15 13:49

Message:
Logged In: YES 
user_id=1107255
Originator: YES

Two more:
http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=12811540&oldid=12800482
http://es.wikipedia.org/w/index.php?title=Argentina&curid=10&diff=12811713&oldid=12811600

----------------------------------------------------------------------

Comment By: siebrand (siebrand)
Date: 2007-11-15 10:50

Message:
Logged In: YES 
user_id=1107255
Originator: YES

A very recent one:
http://it.wikipedia.org/w/index.php?title=Discussione:Socialdemocrazia&diff=prev&oldid=12281931

----------------------------------------------------------------------

Comment By: siebrand (siebrand)
Date: 2007-11-15 08:51

Message:
Logged In: YES 
user_id=1107255
Originator: YES

Additional report:
http://fr.wikipedia.org/w/index.php?title=Guerre_de_Cent_Ans&diff=22942716&oldid=22930671

----------------------------------------------------------------------
