Hi Ariel and Andre,
On Fri, Sep 30, 2011 at 9:39 AM, Ariel T. Glenn ariel@wikimedia.org wrote:
Out of curiosity... If the new revisions of one of these badly edited pages are deleted, leaving the top revision as the one just before the bad iw bot edit, does a rerun of the bot on the page fail?
On Fri, Sep 30, 2011 at 11:13 AM, Andre Engels andreengels@gmail.com wrote:
I deleted the page [[nl:Blankenbach]], then restored the two versions before the problematic bot edit. When I now look at the page, instead of the page content I get:
(...)
Using this undeleted version and running interwiki.py gives the expected result:

valhallasw@dorthonion:~/src/pywikipedia/trunk$ python interwiki.py -page:Blankenbach
NOTE: Number of pages queued is 0, trying to add 60 more.
Getting 1 pages from wikipedia:nl...
WARNING: Family file wikipedia contains version number 1.17wmf1, but it should be 1.18wmf1
NOTE: [[nl:Blankenbach]] does not exist. Skipping.
This also happens when running it from dewiki (python interwiki.py -lang:de -page:Blankenbach%20%28Begriffskl%C3%A4rung%29) or when running it as a 'full-auto' bot (python interwiki.py -all -async -cleanup -log -auto -ns:0 -start:Blankenbach).
Special:Export acts as if the page simply does not exist (http://nl.wikipedia.org/w/index.php?title=Speciaal:Exporteren&useskin=mo... shows the page Blanzac but not Blankenbach).
api.php also behaves more or less as expected: http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles... - that is, unless you supply rvlimit=1: http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles...
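For anyone who wants to poke at this themselves, here is a rough sketch (mine, not part of the bot) of the two server-side checks in plain Python 3. The truncated URLs above are left as they are; the parameters below (pages= for Special:Export, the standard query parameters for api.php) are just the usual MediaWiki ones, so treat the exact requests as an approximation:

import json
import urllib.parse
import urllib.request

NL = "http://nl.wikipedia.org/w"

def export_contains(title):
    # Special:Export returns an XML dump; no <page> element means the
    # exporter considers the page nonexistent.
    url = NL + "/index.php?title=Special:Export&pages=" + urllib.parse.quote(title)
    with urllib.request.urlopen(url) as f:
        return b"<page>" in f.read()

def revisions(title, rvlimit=None):
    # The plain query looked normal; adding rvlimit=1 gave a different answer.
    params = {"action": "query", "prop": "revisions",
              "titles": title, "format": "json"}
    if rvlimit is not None:
        params["rvlimit"] = rvlimit
    url = NL + "/api.php?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as f:
        return json.load(f)

print(export_contains("Blankenbach"))
print(revisions("Blankenbach"))
print(revisions("Blankenbach", rvlimit=1))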
However, neither Special:Export nor api.php seems to return an empty page - and playing around with pywikipediabot does not allow me to get an empty page either. Depending on the settings, I get either the contents of the edit page (page.get() with use_api=False, i.e. screen scraping), a pywikibot.exceptions.NoPage exception (PreloadingGenerator / wikipedia.getall, which uses Special:Export), or the correct page text (page.get() with use_api=True).
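For the curious, this is roughly how the three code paths can be triggered. It's a sketch from memory against trunk (the compat framework); the config.use_api toggle and the force= parameter are how I remember the compat interface, so double-check against your checkout:

import wikipedia
import pagegenerators
import config

site = wikipedia.getSite('nl', 'wikipedia')
page = wikipedia.Page(site, 'Blankenbach')

# 1. Screen scraping via the edit page.
config.use_api = False
try:
    print(repr(page.get(force=True)))
except wikipedia.NoPage:
    print('NoPage (screen scraping)')

# 2. The API path; this one returned the correct page text.
config.use_api = True
try:
    print(repr(page.get(force=True)))
except wikipedia.NoPage:
    print('NoPage (API)')

# 3. Special:Export via PreloadingGenerator / wikipedia.getall;
#    this one raised NoPage.
for p in pagegenerators.PreloadingGenerator(iter([page])):
    try:
        print(repr(p.get()))
    except wikipedia.NoPage:
        print('NoPage (Special:Export)')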
Anyway, thanks a huge heap for trying this (and to everyone, for thinking about it). Unfortunately, I won't have much time this weekend to debug -- hopefully some other pwb developer will.
Best regards, and thanks again, Merlijn
P.S. On 30 September 2011 11:12, Max Semenik maxsem.wiki@gmail.com wrote:
So you screen-scrape? No surprise it breaks. Why? For example, due to protocol-relative URLs, or some other change to the HTML output. Why not just use the API?
No, most of pywikipedia has been adapted to the API and/or Special:Export, which, imo, is just an 'old' MediaWiki API. Keep in mind that interwiki.py is old (2003!), and that pywikipedia initially started as an extension of the interwiki bot, so there could very well be some seldom-used code that still uses screen scraping. And in practice, screen scraping actually worked pretty well.
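To illustrate Max's point with a toy example (mine, not actual pywikipedia code): a scraper that assumes absolute http:// links silently comes up empty once the HTML switches to protocol-relative URLs:

import re

old_html = '<a href="http://de.wikipedia.org/wiki/Blankenbach">de</a>'
new_html = '<a href="//de.wikipedia.org/wiki/Blankenbach">de</a>'

link_re = re.compile(r'href="http://([^"/]+)(/[^"]*)"')
print(link_re.findall(old_html))  # [('de.wikipedia.org', '/wiki/Blankenbach')]
print(link_re.findall(new_html))  # [] -- the scrape quietly fails

# A tolerant pattern allows the scheme to be absent:
tolerant_re = re.compile(r'href="(?:https?:)?//([^"/]+)(/[^"]*)"')
print(tolerant_re.findall(new_html))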