Hi Ariel and Andre,
On Fri, Sep 30, 2011 at 9:39 AM, Ariel T. Glenn <ariel(a)wikimedia.org> wrote:
Out of curiosity... If the new revisions of one of these badly edited
pages are deleted, leaving the top revision as the one just before the
bad iw bot edit, does a rerun of the bot on the page fail?
On Fri, Sep 30, 2011 at 11:13 AM, Andre Engels <andreengels(a)gmail.com> wrote:
I deleted the page [[nl:Blankenbach]], then restored the 2 versions
before the problematic bot edit. When I now look at the page, instead
of the page content I get:
(...)
Using this undeleted version, and running interwiki.py, gives the
expected result:
valhallasw@dorthonion:~/src/pywikipedia/trunk$ python interwiki.py
-page:Blankenbach
NOTE: Number of pages queued is 0, trying to add 60 more.
Getting 1 pages from wikipedia:nl...
WARNING: Family file wikipedia contains version number 1.17wmf1, but
it should be 1.18wmf1
NOTE: [[nl:Blankenbach]] does not exist. Skipping.
This also happens for running it from dewiki (python interwiki.py
-lang:de -page:Blankenbach%20%28Begriffskl%C3%A4rung%29) or running as
'full-auto' bot (python interwiki.py -all -async -cleanup -log -auto
-ns:0 -start:Blankenbach).
Special:Export acts like the page just does not exist
(http://nl.wikipedia.org/w/index.php?title=Speciaal:Exporteren&useskin=m…
shows page Blanzac but not Blankenbach).
api.php also more or less does the expected thing:
http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&title…
- that is, unless you supply rvlimit=1:
http://nl.wikipedia.org/w/api.php?action=query&prop=revisions&title…
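For anyone who wants to reproduce this, the two queries can be rebuilt roughly like so (a minimal sketch; the archive truncated my URLs, so the exact parameter set beyond action/prop/titles/rvlimit is my assumption - note the actual parameter name is "titles", plural, even though the truncated links show "title…"):

```python
from urllib.parse import urlencode

API = "http://nl.wikipedia.org/w/api.php"

def revisions_query(title, rvlimit=None):
    """Build a prop=revisions query URL; rvlimit is optional."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,   # 'titles' is the real parameter name
        "format": "json",
    }
    if rvlimit is not None:
        params["rvlimit"] = rvlimit
    return API + "?" + urlencode(params)

# Without rvlimit the API behaved as expected on this page;
# with rvlimit=1 it did not.
print(revisions_query("Blankenbach"))
print(revisions_query("Blankenbach", rvlimit=1))
```

Comparing the JSON responses of those two URLs side by side is the quickest way to see the difference I describe above.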
However, none of them seem to return an empty page - and playing
around with pywikipediabot does not allow me to get an empty page
either. Depending on settings, the result is either the content of the
edit page (page.get(), use_api=False / screen scraping), a
pywikibot.exceptions.NoPage exception (PreloadingGenerator /
wikipedia.getall, which uses Special:Export), or the correct page text
(page.get(), use_api=True).
Anyway, thanks a huge heap for trying this (and for everyone, for
thinking about it). Unfortunately, I won't have much time this weekend
to debug -- hopefully some other pwb developer has.
Best regards, and thanks again,
Merlijn
P.S.
On 30 September 2011 11:12, Max Semenik <maxsem.wiki(a)gmail.com> wrote:
So you screen-scrape? No surprise it breaks. Why? For example, due to
protocol-relative URLs. Or some other changes to HTML output. Why not
just use API?
No, most of pywikipedia has been adapted to the API and/or
Special:Export, which, imo, is just an 'old' MediaWiki API. Keep in
mind interwiki.py is old (2003!), and pywikipedia initially was an
extension of the interwiki bot. Thus, there could very well be some
seldom-used code that still uses screen scraping. And actually, in
practice, screen scraping worked pretty well.