On 18/11/12 12:36, Sumana Harihareswara wrote:
> The Internet Archive wants to particularly make sure to archive pages
> that Wikipedians use as citations. A GSoC project last year got most of
> the way to that goal but never quite finished making the feed of new
> links for use by the Archive. Would anyone else like to take this up?
>
> More information:
>
> https://www.mediawiki.org/wiki/User:Kevin_Brown/ArchiveLinks
>
> http://toolserver.org/~nn123645/toolserver-feed/cronscript.php (You
> could ask Kevin to make his Toolserver project a MMP or you could just
> write your own script.)
>
> https://www.mediawiki.org/wiki/Extension:ArchiveLinks - would have to be
> moved into Git from Subversion.
>
> http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(policy)&a…
> - there is a real hunger for this!
Hi -- instead of the implementation suggested above, which seems to
combine link discovery with its own archiving engine, how about just
generating an RSS feed of the external links present in (or possibly
just those newly inserted into) pages edited in the last (say) five
minutes, for other entities such as the Internet Archive to consume?
This would require only soft state; the WMF would not need to fetch or
store any external web content, avoiding all the problems that come
with web archiving (retries, security, copyright, legality...); nor
would it need to keep track of which resources had been archived, since
each external archive could do that for itself.
The guts of something like this could be written using only the
http://www.mediawiki.org/wiki/API:Recentchanges and
http://www.mediawiki.org/wiki/API:Exturlusage APIs.
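A minimal sketch of the discovery half, in Python. The endpoint, window
length, and helper names here are my own illustration; I use
`list=recentchanges` to find recently edited pages and `prop=extlinks`
to enumerate a page's external links (the exact module choice is an
assumption, not a spec):

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for illustration

def recentchanges_url(limit=500):
    """Build an API:Recentchanges query for recently edited pages."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def extlinks_url(titles):
    """Build a prop=extlinks query for the given page titles."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": "|".join(titles),
        "format": "json",
    }
    return API + "?" + urlencode(params)

def external_links(extlinks_response):
    """Pull the external URLs out of a parsed prop=extlinks response."""
    links = []
    for page in extlinks_response.get("query", {}).get("pages", {}).values():
        for link in page.get("extlinks", []):
            links.append(link["*"])  # the URL sits under the "*" key in format=json
    return links

# A canned response in the shape prop=extlinks returns, for illustration:
sample = json.loads('''{"query": {"pages": {"123": {"title": "Example",
  "extlinks": [{"*": "http://example.org/cited-source"}]}}}}''')
print(external_links(sample))  # ['http://example.org/cited-source']
```

A cron job could run the first query every few minutes, feed the
resulting titles to the second, and emit the diff against its last run.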
It looks like Kevin's "cronscript" link above does something just like
this already -- adapting it to emit RSS, and caching the result to
avoid heavy CPU load on repeated calls, would surely be trivial.
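For instance, the serving side could look like the sketch below: turn
(url, timestamp) pairs into a minimal RSS 2.0 feed and serve a cached
copy when it is fresh enough. The feed title, cache path, and max-age
are made-up values for illustration:

```python
import os
import time
import xml.etree.ElementTree as ET

def links_to_rss(links, title="New external links"):
    """Serialize (url, timestamp) pairs as a minimal RSS 2.0 feed."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    for url, stamp in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "pubDate").text = stamp
    return ET.tostring(rss, encoding="unicode")

def cached_feed(build, path="feed.xml", max_age=300):
    """Return a cached copy if it is under max_age seconds old, so
    repeated requests do not redo the expensive API work."""
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        with open(path) as f:
            return f.read()
    xml_text = build()  # only rebuild when the cache has expired
    with open(path, "w") as f:
        f.write(xml_text)
    return xml_text

feed = links_to_rss([("http://example.org/a", "Sun, 18 Nov 2012 12:36:00 GMT")])
```

Only soft state: if the cache file is lost, the next request simply
rebuilds it from the API.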
Neil