On Fri, 12 Mar 2004 19:31:05 +0000, David Rodeback wrote:
>> Download and install the texts. Spider your installation and extract
>> image references. Convert the filenames to those matching the
>> pictures at the WP site. Download the files on this list using wget.
>> Or something like that could work.
> Since our current process already includes all of these steps except
> the last (at that point we link to the file rather than fetching it),
> this is easily done.
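For what it's worth, the extract-and-convert part might look roughly
like the sketch below. The local path, the URL prefix and the hashed
upload directories are all assumptions on my part; check them against
the real site before running anything.

  # Sketch only: paths, URL prefix and directory scheme are guesses.
  # 1. Collect the image filenames referenced by the installed pages.
  grep -rho 'src="[^"]*\.\(jpg\|jpeg\|png\|gif\)"' /var/www/wiki \
      | sed 's/^src="//; s/"$//; s|.*/||' | sort -u > filenames.txt

  # 2. Map each name onto the (assumed) hashed upload path at the WP
  #    site: first one, then two hex chars of the filename's MD5.
  while read -r f; do
      h=$(printf '%s' "$f" | md5sum | cut -c1-2)
      echo "http://en.wikipedia.org/upload/${h%?}/$h/$f"
  done < filenames.txt > image-urls.txt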
> Am I to gather that a reasonably well-behaved spider is preferred to
> linking back to Wikipedia's site, as we have been doing?
> Can someone tell me what the off-peak hours are in which such a
> spider should run?
See http://wikimedia.org/stats/live/org.wikimedia.all.squid.requests-hits.html
for the live request/hit graphs.
> Finally, is there a place at Wikipedia (I know of several elsewhere)
> to register such spiders with a description and contact information,
> in case someone notices the spider at work and wonders about it, or
> in case there is some sort of problem?
Set the user agent to something descriptive, like 'worldhistory', and
don't include the typical spider UA strings. Also throttle the
requests; wget has options for that.
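Something along these lines, say, reusing the URL list from the sketch
above; the agent string and the contact address are placeholders to
replace with your own:

  # Identified and throttled; adjust contact info and rates to taste.
  wget --input-file=image-urls.txt \
       --user-agent='worldhistory image fetcher (admin@worldhistory.example)' \
       --wait=2 --random-wait \
       --limit-rate=25k

--wait and --random-wait space the requests out, and --limit-rate caps
the bandwidth, so the fetch stays gentle even outside the off-peak
window.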