Vinay from the Internet Archive asked me, with reference to
Is there someone I can contact regarding parsing out the URLs from the stream of recent
changes? The idea being to grab the text of the recent change and extract out anything
that looks like a URL and feed it into a queue at IA's end for archiving.
Looking at the Recent Changes feed, it looks like I'd need to parse the
'diff' page to find any new links, or in the case of 'new' pages, parse
the new page to find all external links. Is there a better way? A live feed that includes
the text that's changed for every article?
Vinay, #mediawiki on Freenode IRC, and possibly also the mediawiki-api
mailing list, will be helpful to you.
Engineering Community Manager