K. Peachey wrote:
On Sat, Jun 27, 2009 at 5:40 PM, Lars Dieckow 迪拉斯 <daxim@cpan.org> wrote:
The tools exist. Ample documentation exists. Both programmatic interfaces and easy form-based interfaces exist.
Screen scraping still happens not only because of laziness but also because the correct way is not promoted. For example, if I access the English Wikipedia main page with libwww-perl (a banned UA), the response body says (among other things):
Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please _try again_ in a few minutes.
This is of course bullshit of the highest degree. It's certainly a permanent problem, and no amount of retrying will get me around the ban. The response body should say something like this:
Screen scraping is forbidden because it places an undue burden on the servers and infrastructure. Use the export feature to parse single pages; use the dump feature to parse a whole wiki.
http://mediawiki.org/wiki/Special:Export http://en.wikipedia.org/wiki/WP:Export
http://download.wikimedia.org/ http://en.wikipedia.org/wiki/WP:Download
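For comparison, fetching a single page the sanctioned way takes only a few lines of LWP. A minimal sketch (the page title and User-Agent string are placeholders; send a UA string that identifies your tool, since the stock libwww-perl one is banned):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Identify the tool; the default libwww-perl UA string is blocked.
    my $ua = LWP::UserAgent->new(
        agent => 'MyWikiTool/0.1 (mailto:you@example.org)',
    );

    # Special:Export/<title> returns the page as XML, wikitext included.
    my $title = 'Perl';
    my $res   = $ua->get("http://en.wikipedia.org/wiki/Special:Export/$title");
    die $res->status_line, "\n" unless $res->is_success;
    print $res->decoded_content;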
Can anyone with access to the appropriate bugtracker file a bug on this, please?
The [[Special:Export]] function exports the page in an XML format, whereas most people want a plain/static HTML dump of the page. And the download packages are whole-database collections, which is far too much for the people I'm suggesting this tool for; they just want several articles, not a multi-gigabyte collection of every article page.
You are looking for action=render then. I don't think there is a form-based interface for it, but creating one would be trivial.
Example: http://en.wikipedia.org/wiki/Tool?action=render
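A minimal sketch of fetching that from a program (page title and User-Agent string are again placeholders, since the default libwww-perl UA is blocked):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $ua  = LWP::UserAgent->new(agent => 'MyWikiTool/0.1 (mailto:you@example.org)');
    my $uri = URI->new('http://en.wikipedia.org/w/index.php');
    $uri->query_form(title => 'Tool', action => 'render');

    my $res = $ua->get($uri);
    die $res->status_line, "\n" unless $res->is_success;
    print $res->decoded_content;  # bare article HTML, no skin or navigation

A form-based front end would just need a text field for the title posting to index.php with action=render fixed.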
-- daniel