On 18/11/12 12:36, Sumana Harihareswara wrote:
The Internet Archive wants to particularly make sure to archive pages that Wikipedians use as citations. A GSoC project last year got most of the way to that goal but never quite finished making the feed of new links for use by the Archive. Would anyone else like to take this up?
More information:
https://www.mediawiki.org/wiki/User:Kevin_Brown/ArchiveLinks
http://toolserver.org/~nn123645/toolserver-feed/cronscript.php (You could ask Kevin to make his Toolserver project a MMP or you could just write your own script.)
https://www.mediawiki.org/wiki/Extension:ArchiveLinks - would have to be moved into Git from Subversion.
http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(policy)&am...
- there is a real hunger for this!
Hi -- instead of the implementation suggested above, which seems to combine link discovery with its own archiving engine, how about just generating an RSS feed of external links present (or possibly just those newly inserted) in pages edited in the last (say) five minutes, for other entities such as the Internet Archive to consume?
This would require only soft state: the WMF would not need to fetch or store any external web content, with all the problems that web archiving brings (retries, security, copyright, legality...), and would not need to keep track of which resources had been archived; each external archive could do that for itself.
The guts of something like this could be written using only the http://www.mediawiki.org/wiki/API:Recentchanges and http://www.mediawiki.org/wiki/API:Exturlusage APIs.
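For illustration only, here is a rough Python sketch of those guts. It pulls titles from list=recentchanges and then reads each page's current external links via prop=extlinks (the per-page counterpart of Exturlusage, which I've used here because the feed wants links on changed pages rather than pages using a given link). The enwiki endpoint, the limits and the missing continuation handling are placeholder assumptions, not a finished implementation:

import datetime
import requests

API = "https://en.wikipedia.org/w/api.php"

def recent_titles(minutes=5):
    # Titles of mainspace pages edited or created in the last `minutes` minutes.
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=minutes)
    r = requests.get(API, params={
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcnamespace": 0,
        "rcend": cutoff.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rclimit": 500,
        "format": "json",
    })
    return {rc["title"] for rc in r.json()["query"]["recentchanges"]}

def external_links(titles):
    # External links currently present on the given pages.
    links = set()
    for title in titles:
        r = requests.get(API, params={
            "action": "query",
            "prop": "extlinks",
            "titles": title,
            "ellimit": 500,
            "format": "json",
        })
        for page in r.json()["query"]["pages"].values():
            for el in page.get("extlinks", []):
                links.add(el["*"])
    return links

if __name__ == "__main__":
    for url in sorted(external_links(recent_titles())):
        print(url)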
It looks like Kevin's "cronscript" link above already does something just like this -- adapting it to emit RSS, and caching the output to avoid heavy CPU load on repeated calls, would surely be trivial.
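To give a sense of how small that RSS-plus-caching layer could be, here is a hedged sketch in the same vein; the cache path, TTL and feed metadata are made up for illustration and would need to fit whatever host actually serves the feed:

import os
import time
from xml.sax.saxutils import escape

CACHE = "/tmp/recent-extlinks.rss"   # illustrative location
TTL = 300                            # regenerate at most every five minutes

def render_rss(urls):
    # Wrap a set of URLs in a minimal RSS 2.0 document.
    items = "\n".join(
        "    <item><title>%s</title><link>%s</link></item>"
        % (escape(u), escape(u)) for u in sorted(urls))
    return ('<?xml version="1.0"?>\n<rss version="2.0"><channel>\n'
            "    <title>External links recently cited on enwiki</title>\n"
            "    <link>https://en.wikipedia.org/</link>\n"
            "    <description>Links seen in recent edits</description>\n"
            "%s\n</channel></rss>\n" % items)

def cached_feed(make_urls):
    # Serve the cached feed if it is still fresh; otherwise rebuild it once.
    if os.path.exists(CACHE) and time.time() - os.path.getmtime(CACHE) < TTL:
        with open(CACHE) as f:
            return f.read()
    feed = render_rss(make_urls())
    with open(CACHE, "w") as f:
        f.write(feed)
    return feed

Wiring the two sketches together would just be cached_feed(lambda: external_links(recent_titles())), with the Internet Archive (or anyone else) polling the resulting feed at its own pace.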
Neil