[Pywikipedia-l] [Wikitech-l] serious interwiki.py issues on MW 1.18 wikis

Merlijn van Deen valhallasw at arctus.nl
Fri Sep 30 20:37:31 UTC 2011


Hi Ariel and Andre,

On Fri, Sep 30, 2011 at 9:39 AM, Ariel T. Glenn <ariel at wikimedia.org> wrote:
> Out of curiosity... If the new revisions of one of these badly edited
> pages are deleted, leaving the top revision as the one just before the
> bad iw bot edit, does a rerun of the bot on the page fail?

On Fri, Sep 30, 2011 at 11:13 AM, Andre Engels <andreengels at gmail.com> wrote:
> I deleted the page [[nl:Blankenbach]], then restored the 2 versions before
> the problematic bot edit. When now I look at the page, instead of the page
> content I get:
(...)

Using this undeleted version, and running interwiki.py, gives the
expected result:
valhallasw at dorthonion:~/src/pywikipedia/trunk$ python interwiki.py -page:Blankenbach
NOTE: Number of pages queued is 0, trying to add 60 more.
Getting 1 pages from wikipedia:nl...
WARNING: Family file wikipedia contains version number 1.17wmf1, but it should be 1.18wmf1
NOTE: [[nl:Blankenbach]] does not exist. Skipping.

The same happens when running it from dewiki (python interwiki.py
-lang:de -page:Blankenbach%20%28Begriffskl%C3%A4rung%29) or as a
'full-auto' bot (python interwiki.py -all -async -cleanup -log -auto
-ns:0 -start:Blankenbach).

Special:Export acts as if the page does not exist
(http://nl.wikipedia.org/w/index.php?title=Speciaal:Exporteren&useskin=monobook&action=submit&curonly=True&pages=Blankenbach%0D%0ABlanzac
shows page Blanzac but not Blankenbach).
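
For anyone who wants to check this without eyeballing the XML, here is a
minimal sketch that lists which requested titles are absent from a
Special:Export dump. The function names (exported_titles, missing_titles)
and the SAMPLE document are made up for illustration; SAMPLE is a
hand-written stand-in, not the actual output of the URL above, and the
export namespace in it is an assumption.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def exported_titles(export_xml):
    """Return the <title> of every <page> element in an export dump."""
    root = ET.fromstring(export_xml)
    titles = []
    for page in root.iter():
        if _local(page.tag) != "page":
            continue
        for child in page:
            if _local(child.tag) == "title":
                titles.append(child.text)
    return titles


def missing_titles(requested, export_xml):
    """Return the requested titles that the dump does not contain."""
    present = set(exported_titles(export_xml))
    return [title for title in requested if title not in present]


# Hand-written stand-in for an export that contains only one of two pages.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page><title>Blanzac</title><revision><text>...</text></revision></page>
</mediawiki>"""

print(missing_titles(["Blankenbach", "Blanzac"], SAMPLE))
```

Matching on the local tag name rather than a fixed namespace keeps the
sketch independent of the export schema version.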

api.php also more or less does the expected thing:
http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Blankenbach&rvprop=timestamp|user|comment|content
- that is, unless you supply rvlimit=1:
http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Blankenbach&rvprop=timestamp|user|comment|content&rvlimit=1
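
To make the two requests easier to reproduce, here is a minimal sketch
that builds both URLs from the same parameters, so the only difference
is rvlimit. The helper name revisions_url is hypothetical, and the
sketch only constructs the URLs; it does not fetch them.

```python
from urllib.parse import urlencode

API = "http://nl.wikipedia.org/w/api.php"


def revisions_url(title, rvlimit=None):
    """Build the prop=revisions query used above, optionally with rvlimit."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment|content",
    }
    if rvlimit is not None:
        params["rvlimit"] = str(rvlimit)
    # Keep '|' unescaped so the result matches the URLs quoted in this mail.
    return API + "?" + urlencode(params, safe="|")


print(revisions_url("Blankenbach"))
print(revisions_url("Blankenbach", rvlimit=1))
```

Fetching both and diffing the responses should show where the rvlimit=1
case departs from the other one.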

However, none of them seems to return an empty page, and playing
around with pywikipediabot does not let me get an empty page either.
Depending on the settings, the result is the content of the edit page
(page.get(), use_api=False / screen scraping), a
pywikibot.exceptions.NoPage exception (PreloadingGenerator /
wikipedia.getall, which uses Special:Export) or the correct page text
(page.get(), use_api=True).

Anyway, thanks a huge heap for trying this (and for everyone, for
thinking about it). Unfortunately, I won't have much time this weekend
to debug -- hopefully some other pwb developer has.

Best regards, and thanks again,
Merlijn

P.S.
On 30 September 2011 11:12, Max Semenik <maxsem.wiki at gmail.com> wrote:
> So you screen-scrape? No surprise it breaks. Why? For example, due to
> protocol-relative URLs. Or some other changes to HTML output. Why not just
> use API?
No, most of pywikipedia has been adapted to the API and/or
Special:Export, which, imo, is just an 'old' MediaWiki API. Keep in
mind interwiki.py is old (2003!), and pywikipedia initially was an
extension of the interwiki bot. Thus, there could very well be some
seldom-used code that still relies on screen scraping. And
actually, in practice, screen scraping worked pretty well.


