[Toolserver-l] Little Project Idea
Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯
daxim at cpan.org
Sat Jun 27 07:40:43 UTC 2009
The tools exist. Ample documentation exists. Both programmatic
interfaces and easy form-based interfaces exist.
Screen scraping still happens not only because of laziness but also
because the correct way is not promoted. For example, if I access the
English Wikipedia main page with libwww-perl (a banned UA), the
response body says (among other things):
    Our servers are currently experiencing a technical problem. This is
    probably temporary and should be fixed soon. Please _try again_ in a
    few minutes.
This is of course bullshit of the highest degree. It's certainly a
permanent problem, and no amount of retrying will get me around the
ban. The response body should say something like this:
    Screen scraping is forbidden as it causes undue burden on the
    infrastructure and servers. Use the export feature to parse single
    pages, use the dump feature to parse a whole wiki.
    http://mediawiki.org/wiki/Special:Export
    http://en.wikipedia.org/wiki/WP:Export
    http://download.wikimedia.org/
    http://en.wikipedia.org/wiki/WP:Download
Can anyone with access to the appropriate bugtracker file a bug on this,
please?
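
For illustration, the export route recommended above might look roughly
like the following. This is only a minimal sketch in Python, not an
official client: the User-Agent string, the function names and the
example.org contact address are all made up for the example, and it
assumes that Special:Export accepts a page title appended to its URL and
returns XML containing <page>, <title> and <text> elements.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical identifying User-Agent; replace with your own tool name and contact.
USER_AGENT = "ExampleWikiTool/0.1 (contact: you@example.org)"


def export_url(title, base="https://en.wikipedia.org/wiki/Special:Export/"):
    """Build the Special:Export URL for a single page title."""
    return base + urllib.parse.quote(title.replace(" ", "_"))


def local_name(tag):
    """Strip the XML namespace, which differs between export schema versions."""
    return tag.rsplit("}", 1)[-1]


def extract_wikitext(export_xml):
    """Return a {title: wikitext} dict for every <page> in an export document."""
    root = ET.fromstring(export_xml)
    pages = {}
    for page in root.iter():
        if local_name(page.tag) != "page":
            continue
        title, text = None, None
        for node in page.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        if title is not None:
            pages[title] = text
    return pages


def fetch_page(title):
    """Fetch one page through Special:Export (network access required)."""
    request = urllib.request.Request(
        export_url(title), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_wikitext(response.read())
```

Calling fetch_page("Main Page") would request a single page through the
export interface instead of scraping the rendered HTML. For a whole
wiki, the dumps are still the better choice.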