Hello to both the wikitech and pywikipedia lists -- please keep both informed when replying. Thanks.
A few days ago, we - the pywikipedia developers - received alarming reports of interwiki bots removing content from pages. This does not seem to happen often, and we have not been able to reproduce the conditions in which this happens.
However, the common denominator is that it seems to happen only on the Wikipedias that run MediaWiki 1.18. As such, I think this topic might be relevant for wikitech-l, too. In addition, no-one in the pywikipedia team has a clear idea of why this is happening, so we would appreciate any ideas.
1. What happens?
Essentially, the interwiki bot does its job, retrieves the graph and determines the correct interwiki links. It should then add them to the page, but instead, /only/ the interwiki links are stored. For example:
http://nl.wikipedia.org/w/index.php?title=Blankenbach&diff=next&oldi...
http://eo.wikipedia.org/w/index.php?title=Anton%C3%ADn_Kl%C3%A1%C5%A1tersk%C...
http://simple.wikipedia.org/w/index.php?title=Mettau%2C_Switzerland&acti...
2. Why does this happen?
This is unclear. On the one hand, interwiki.py is somewhat black magic: none of the current developers intimately knows its workings. On the other hand, the bug is not reproducible: running it on the exact same page with the exact same page text does not result in a cleared page. It could very well be something like broken network error handling - but mainly, we have no idea. Did anything change in Special:Export (which is still used in interwiki.py) or the API which might cause something like this? I couldn't find anything in the release notes.
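(Purely to make the 'broken network error handling' hypothesis concrete, here is a contrived sketch -- not actual pywikipedia code, all names invented -- of how a fetch helper that swallows errors could make a normal page look empty to a bot:)

    # Hypothetical illustration only; pywikipedia's real fetch path is more involved.
    import urllib2

    def fetch(url):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except Exception:
            # Swallowing the error and returning '' makes a network hiccup
            # indistinguishable from a genuinely empty page.
            return ''

    text = fetch('http://nl.wikipedia.org/wiki/Special:Export/Blankenbach')
    if not text:
        print 'bot would now believe the page is empty and rewrite it with only interwiki links'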
3. Reasons for relating it to MW 1.18
To find out on which wikis this problem happens, I used a quick-and-dirty heuristic:

    select rc_comment, rc_cur_time, rc_user, rc_namespace, rc_title, rc_old_len, rc_new_len
    from recentchanges
    left join user_groups on ug_user = rc_user
    where rc_new_len < rc_old_len * 0.1
      and ug_group = 'bot'
      and rc_namespace = 0
    limit 10 /* SLOW OK */;
This is a slow query (~30s for nlwiki_p on the toolserver), but it gives some interesting results:
nlwiki: 9 rows, all broken interwiki bots
eowiki: 25 rows, all interwiki bots
simplewiki: 3 rows, of which 2 are interwiki bots
dewiki: 0 rows (using rc_old_len * 0.3: 14 rows, all double redirect fixes)
frwiki: 9 rows, but *none* from interwiki bots (all edits are by the same antivandalism bot)
itwiki: 0 rows
ptwiki: 0 rows
All ideas and hints are very welcome. Hopefully we will be able to solve this before Tuesday...
Best regards, Merlijn van Deen
On Thu, Sep 29, 2011 at 1:08 PM, Merlijn van Deen valhallasw@arctus.nl wrote:
- Why does this happen?
This is unclear. On the one hand, interwiki.py is somewhat black magic: none of the current developers intimately knows its workings. On the other hand, the bug is not reproducible: running it on the exact same page with the exact same page text does not result in a cleared page. It could very well be something like broken network error handling - but mainly, we have no idea. Did anything change in Special:Export (which is still used in interwiki.py) or the API which might cause something like this? I couldn't find anything in the release notes.
The thing I'd recommend is enabling some debug instrumentation in the bots, so that next time one makes a bad edit y'all can review those logs and see what it was doing.
I don't know what logging is already available, but you basically want to see every HTTP request it makes (URL and POST data if any), and the response received.
This should help narrow it down significantly to one of:
* something in MW is outputting wrong data (visibly wrong output from api/export)
* something in pywikipediabot is processing data wrong (all right output from api/export, but input data being sent on edit is already wrong)
* something in MW is processing input data wrong (all right output from api/export, all input being sent looks correct)
Note that there may be legitimate differences in api or export data that the bot is processing incorrectly, so look close. :)
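(A minimal sketch of that kind of instrumentation, assuming the bot is a Python 2 urllib2/httplib script as pywikipedia was at the time; this only traces headers, and is illustrative rather than pywikipedia-specific:)

    # Sketch: turn on verbose HTTP tracing for a urllib2-based bot (Python 2).
    import httplib
    import urllib2

    httplib.HTTPConnection.debuglevel = 1          # print request/response headers
    opener = urllib2.build_opener(
        urllib2.HTTPHandler(debuglevel=1),
        urllib2.HTTPSHandler(debuglevel=1),
    )
    urllib2.install_opener(opener)

    # From here on, urllib2.urlopen() calls trace their headers to stdout.
    # POST bodies and response bodies would still need explicit logging around
    # the bot's own fetch/submit helpers.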
-- brion
Merlijn van Deen wrote:
- What happens?
Essentially, the interwiki bot does its job, retrieves the graph and determines the correct interwiki links.
Does it use the page content to retrieve the interwiki links? Or is it retrieved, e.g., by doing a different query to the API? I.e. would receiving no content (from the bot POV) produce that behavior?
On Fri, Sep 30, 2011 at 1:06 AM, Platonides platonides@gmail.com wrote:
Does it use the page content to retrieve the interwiki links? Or is it retrieved, e.g., by doing a different query to the API?
The interwiki links are retrieved from page content. The page content has been received through a call to Special:Export.
I.e. would receiving no content (from the bot POV) produce that behavior?
Yes, the only reasonable explanation seems to be that the bot interprets what it gets from the server as an empty page.
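(For readers following along, a minimal sketch of that flow -- fetch the wikitext via Special:Export and pull the interwiki links out of it. This is an illustration, not pywikipedia's actual code; the helper names and crude regexes are invented for the example.)

    # Illustrative sketch only; not pywikipedia code (Python 2).
    import re
    import urllib
    import urllib2

    EXPORT_URL = 'http://nl.wikipedia.org/wiki/Special:Export/%s'

    def get_wikitext(title):
        xml = urllib2.urlopen(EXPORT_URL % urllib.quote(title)).read()
        m = re.search(r'<text[^>]*>(.*?)</text>', xml, re.S)
        # If there is no <text> element, Special:Export did not return the page.
        # Treating that as an empty page is exactly the failure mode under discussion.
        # (The captured text is still XML-escaped, which is fine for spotting links.)
        return m.group(1) if m else ''

    def interwiki_links(wikitext):
        # crude: [[xx:Title]] where xx looks like a language code
        return re.findall(r'\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]|]+)\]\]', wikitext)

    print interwiki_links(get_wikitext('Blankenbach'))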
On Fri, Sep 30, 2011 at 12:56 PM, Andre Engels andreengels@gmail.com wrote:
The interwiki links are retrieved from page content. The page content has been received through a call to Special:Export.
I.e. would receiving no content (from the bot POV) produce that behavior?
Yes, the only reasonable explanation seems to be that the bot interprets what it gets from the server as an empty page.
So you screen-scrape? No surprise it breaks. Why? For example, due to protocol-relative URLs. Or some other changes to HTML output. Why not just use API?
On Fri, Sep 30, 2011 at 11:12 AM, Max Semenik maxsem.wiki@gmail.com wrote:
So you screen-scrape? No surprise it breaks. Why? For example, due to protocol-relative URLs. Or some other changes to HTML output. Why not just use API?
Basically, because most of the core functionality comes from before the API came into existence. At least, that would be my explanation.
On Thu, 29-09-2011, at 22:08 +0200, Merlijn van Deen wrote:
- What happens?
Essentially, the interwiki bot does its job, retrieves the graph and determines the correct interwiki links. It should then add it to the page, but instead, /only/ the interwiki links are stored. For example: http://nl.wikipedia.org/w/index.php?title=Blankenbach&diff=next&oldi... http://eo.wikipedia.org/w/index.php?title=Anton%C3%ADn_Kl%C3%A1%C5%A1tersk%C... http://simple.wikipedia.org/w/index.php?title=Mettau%2C_Switzerland&acti...
Out of curiosity... If the new revisions of one of these badly edited pages are deleted, leaving the top revision as the one just before the bad iw bot edit, does a rerun of the bot on the page fail?
Ariel
On Fri, Sep 30, 2011 at 9:39 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Out of curiosity... If the new revisions of one of these badly edited pages are deleted, leaving the top revision as the one just before the bad iw bot edit, does a rerun of the bot on the page fail?
I did a test, and the result was very interesting, which might point to the cause of this bug:
I deleted the page [[nl:Blankenbach]], then restored the 2 versions before the problematic bot edit. When I now look at the page, instead of the page content I get:
(Translated from Dutch:)
No content was found in the database for the page named .
This can happen when you follow an outdated link to the difference between two versions of a page, or request a revision that has been deleted.
If this is not the case, you may have found a bug in the software. Please report it to an administrator (http://nl.wikipedia.org/wiki/Speciaal:Gebruikerslijst/sysop) of Wikipedia, mentioning the URL of this page.
Going to the specific version that after the deletion-and-partial-restore should be the newest (http://nl.wikipedia.org/w/index.php?title=Blankenbach&oldid=10676248), it claims that there is a newer version, but going to the newer version or the newest version, I get the abovementioned message again.
As an extra test, I did the delete-then-restore-some-versions-but-not-the-most-recent action with another page (http://nl.wikipedia.org/wiki/Gebruiker:Andre_Engels/Test), and there I found no such problem. From this I conclude that the bug has not been caused by that process, but that for some reason the page had a wrong (or empty) version number for its 'most recent' version, or something like that.
Curiouser and curiouser... I now see that when I click the edit button on the abovementioned page, I do get to edit the page as it is shown, even though that version is not in the history (the page is a copy of [[MediaWiki:Missing-article]] (http://nl.wikipedia.org/wiki/MediaWiki:Missing-article) with the empty string filled in for $2).
On Fri, Sep 30, 2011 at 9:21 PM, Andre Engels andreengels@gmail.com wrote:
Curiouser and curiouser... I now see that when I click the edit button on the abovementioned page, I do get to edit the page as it is shown, even though that version is not in the history (the page is a copy of [[MediaWiki:Missing-article]] (http://nl.wikipedia.org/wiki/MediaWiki:Missing-article) with the empty string filled in for $2).
page_latest = 0 ... WTF?
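(For anyone who wants to check other affected pages without direct database access: prop=info in the API reports a page's latest revision id, so something like the following sketch -- illustrative only, and it is an open question what exactly it returns for a page whose page_latest is 0 -- could help flag suspect pages.)

    # Sketch: ask the API for the latest revision id of a suspect page (Python 2).
    import json
    import urllib
    import urllib2

    API = 'http://nl.wikipedia.org/w/api.php'

    def last_rev_id(title):
        params = urllib.urlencode({
            'action': 'query', 'prop': 'info',
            'titles': title, 'format': 'json',
        })
        data = json.load(urllib2.urlopen(API + '?' + params))
        page = data['query']['pages'].values()[0]
        # A missing or zero 'lastrevid' here would match the page_latest = 0 observation.
        return page.get('lastrevid')

    print last_rev_id('Blankenbach')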
Hi Ariel and Andre,
On Fri, Sep 30, 2011 at 9:39 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Out of curiosity... If the new revisions of one of these badly edited pages are deleted, leaving the top revision as the one just before the bad iw bot edit, does a rerun of the bot on the page fail?
On Fri, Sep 30, 2011 at 11:13 AM, Andre Engels andreengels@gmail.com wrote:
I deleted the page [[nl:Blankenbach]], then restored the 2 versions before the problematic bot edit. When now I look at the page, instead of the page content I get:
(...)
Using this undeleted version, and running interwiki.py, gives the expected result:

    valhallasw@dorthonion:~/src/pywikipedia/trunk$ python interwiki.py -page:Blankenbach
    NOTE: Number of pages queued is 0, trying to add 60 more.
    Getting 1 pages from wikipedia:nl...
    WARNING: Family file wikipedia contains version number 1.17wmf1, but it should be 1.18wmf1
    NOTE: [[nl:Blankenbach]] does not exist. Skipping.
This also happens when running it from dewiki (python interwiki.py -lang:de -page:Blankenbach%20%28Begriffskl%C3%A4rung%29) or when running as a 'full-auto' bot (python interwiki.py -all -async -cleanup -log -auto -ns:0 -start:Blankenbach).
Special:Export acts like the page just does not exist (http://nl.wikipedia.org/w/index.php?title=Speciaal:Exporteren&useskin=mo... shows page Blanzac but not Blankenbach)
api.php also more or less does the expected thing: http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles... - that is, unless you supply rvlimit=1: http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles...
However, none of them seem to return an empty page - and playing around with pywikipediabot does not allow me to get an empty page either. Depending on settings, the result is either the text from the edit page (page.get(), use_api=False / screen scraping), a pywikibot.exceptions.NoPage exception (PreloadingGenerator / wikipedia.getall, which uses Special:Export), or the correct page text (page.get(), use_api=True).
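(A generic illustration of the api.php comparison two paragraphs up -- not a reconstruction of the truncated URLs -- querying prop=revisions for the same title with and without rvlimit=1; rvprop is an arbitrary choice here.)

    # Sketch: compare prop=revisions output with and without rvlimit=1 (Python 2).
    import json
    import urllib
    import urllib2

    API = 'http://nl.wikipedia.org/w/api.php'

    def revisions(title, **extra):
        params = {'action': 'query', 'prop': 'revisions',
                  'rvprop': 'ids|timestamp', 'titles': title, 'format': 'json'}
        params.update(extra)
        return json.load(urllib2.urlopen(API + '?' + urllib.urlencode(params)))

    # If the page's latest-revision pointer is broken, the two results may disagree.
    print revisions('Blankenbach')
    print revisions('Blankenbach', rvlimit=1)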
Anyway, thanks a huge heap for trying this (and to everyone, for thinking about it). Unfortunately, I won't have much time this weekend to debug -- hopefully some other pwb developer will.
Best regards, and thanks again, Merlijn
P.S. On 30 September 2011 11:12, Max Semenik maxsem.wiki@gmail.com wrote:
So you screen-scrape? No surprise it breaks. Why? For example, due to protocol-relative URLs. Or some other changes to HTML output. Why not just use API?
No, most of pywikipedia has been adapted to the API and/or Special:Export, which, imo, is just an 'old' MediaWiki API. Keep in mind interwiki.py is old (2003!), and pywikipedia initially was an extension of the interwiki bot. Thus, there could very well be some seldom-used code that still uses screen scraping. And actually, in practice, screen scraping worked pretty well.
Merlijn,
Not bothered by any actual knowledge of pywikibot (which makes it far easier to comment!): is it possible that the bot assumes it is fetching a page, but the request actually raises an error instead, and this is not handled, interpreting the lack of a response as an empty string?
Regards,
Martijn
On Fri, Sep 30, 2011 at 11:13 AM, Andre Engels andreengels@gmail.com wrote:
I did a test, and the result was very interesting, which might point to the cause of this bug:
I deleted the page [[nl:Blankenbach]], then restored the 2 versions before the problematic bot edit. When now I look at the page, instead of the page content I get:
Can you try this on another of the problem pages?