On Sat, Jun 27, 2009 at 5:40 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim(a)cpan.org> wrote:
> The tools exist. Ample documentation exists. Both programmatic
> interfaces and easy form-based interfaces exist.
>
> Screen scraping still happens not only because of laziness but also
> because the correct way is not promoted. For example, if I access the
> English Wikipedia main page with libwww-perl (a banned UA), the
> response body says (among other things):
>
>     Our servers are currently experiencing a technical problem. This is
>     probably temporary and should be fixed soon. Please _try again_ in a
>     few minutes.
>
> This is of course bullshit of the highest degree. It's certainly a
> permanent problem, and no amount of retrying will get me around the
> ban. The response body should say something like this:
>
>     Screen scraping is forbidden as it causes undue burden on the
>     infrastructure and servers. Use the export feature to parse single
>     pages, use the dump feature to parse a whole wiki.
>     http://mediawiki.org/wiki/Special:Export
>     http://en.wikipedia.org/wiki/WP:Export
>     http://download.wikimedia.org/
>     http://en.wikipedia.org/wiki/WP:Download
>
> Can anyone with access to the appropriate bugtracker file a bug on this,
> please?
The [[Special:Export]] function exports the page in an XML format,
whilst most people want a plain/static HTML dump of the page. And the
download packages cover whole database collections, which is far too
much for the people I'm suggesting this tool for: they just want
several articles, not a multi-gigabyte collection of every article
page.
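For what it's worth, pulling a handful of pages through [[Special:Export]] is not hard once you know the URL shape; here is a minimal Python sketch. The hostname and page titles are placeholders, and the export XML namespace version (0.3 here) varies between MediaWiki releases, so treat it as an assumption rather than a fixed constant:

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace of the export dump; the version suffix differs across
# MediaWiki releases, so check the root element of a real response.
EXPORT_NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def export_url(host, titles):
    """Build a Special:Export URL for one or more page titles."""
    # Multiple titles go in the "pages" parameter, newline-separated.
    query = urllib.parse.urlencode({"pages": "\n".join(titles),
                                    "action": "submit"})
    return "http://%s/wiki/Special:Export?%s" % (host, query)

def page_texts(export_xml):
    """Yield (title, wikitext) pairs from a Special:Export dump."""
    root = ET.fromstring(export_xml)
    for page in root.iter(EXPORT_NS + "page"):
        title = page.find(EXPORT_NS + "title").text
        text = page.find("%srevision/%stext" % (EXPORT_NS, EXPORT_NS)).text
        yield title, text
```

Note this hands back raw wikitext, not the rendered HTML that the people I have in mind actually want, which is exactly the gap: there is currently nothing between "one page of XML-wrapped wikitext" and "the whole database".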