Vinay from the Internet Archive asked me, with reference to http://meta.wikimedia.org/wiki/Help:Recent_changes :
Hi Sumana,
Is there someone I can contact regarding parsing out the URLs from the stream of recent changes? The idea being to grab the text of the recent change and extract out anything that looks like a URL and feed it into a queue at IA's end for archiving.
Looking at the Recent Changes feed, it looks like I'd need to parse the 'diff' page to find any new links, or in the case of 'new' pages, parse the new page to find all external links. Is there a better way? A live feed that includes the text that's changed for every article?
Thanks, Vinay
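The extraction step Vinay describes (pull anything that looks like a URL out of the changed text) could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the regular expression, the trailing-punctuation heuristic, and the function names `extract_urls` and `new_urls` are hypothetical and are not part of MediaWiki or of any Internet Archive tooling. It compares the old and new revision text and keeps only URLs that were not present before; for a new page the old text is simply empty.

```python
import re

# Assumption: a "URL" is http:// or https:// followed by characters that are
# not whitespace or common wikitext delimiters. This is a heuristic only.
URL_PATTERN = re.compile(r'https?://[^\s\[\]<>|}"]+')


def extract_urls(text):
    """Return URL-looking strings found in text, de-duplicated, in order."""
    found = []
    for match in URL_PATTERN.findall(text):
        # Heuristic: drop punctuation that usually belongs to the sentence,
        # not to the URL. This can truncate URLs that really end in ")".
        url = match.rstrip(".,;:)")
        if url not in found:
            found.append(url)
    return found


def new_urls(old_text, new_text):
    """Return URLs present in new_text that were not present in old_text."""
    already_present = set(extract_urls(old_text))
    return [url for url in extract_urls(new_text) if url not in already_present]


if __name__ == "__main__":
    old = "See [http://example.org/a Example]."
    new = "See [http://example.org/a Example] and https://example.com/b."
    for url in new_urls(old, new):
        print(url)  # each URL would be handed to the archiving queue here
```

Comparing whole revisions instead of parsing the rendered diff page avoids depending on the diff HTML, at the cost of fetching both revision texts.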
Vinay, #mediawiki on Freenode IRC, and possibly also the mediawiki-api mailing list, will be helpful to you.
Thanks all.
* Sumana Harihareswara wrote:
Vinay from the Internet Archive asked me, with reference to http://meta.wikimedia.org/wiki/Help:Recent_changes :
Is there someone I can contact regarding parsing out the URLs from the stream of recent changes? The idea being to grab the text of the recent change and extract out anything that looks like a URL and feed it into a queue at IA's end for archiving.
Note that there is an API to get all the external links on a given page (http://www.mediawiki.org/wiki/API:Properties#extlinks_.2F_el), and there may well be bots that already monitor new external links as an anti-spam measure. Also, the IRC feeds are probably a better way to keep track of recent changes.
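For illustration, a request against the external links API mentioned above might be sketched in Python as follows. The endpoint URL, the User-Agent string, and the helper names are assumptions chosen for the example; the query parameters (`action=query`, `prop=extlinks`, `titles`, `ellimit`) are the documented ones. The parser accepts both the older JSON layout, where each link is stored under the `*` key, and the newer `formatversion=2` layout, where it is stored under `url`. Result continuation for pages with very many links is not handled here.

```python
import json
import urllib.parse
import urllib.request

# Assumption: any MediaWiki api.php endpoint can be substituted here.
API_URL = "https://en.wikipedia.org/w/api.php"


def extlinks_query_url(title, api_url=API_URL):
    """Build the api.php URL that lists external links on one page."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": "max",
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def parse_extlinks(response):
    """Collect external link URLs from a decoded prop=extlinks response."""
    pages = response.get("query", {}).get("pages", {})
    if isinstance(pages, dict):  # older layout: pages keyed by page ID
        pages = pages.values()
    links = []
    for page in pages:
        for link in page.get("extlinks", []):
            url = link.get("url") or link.get("*")
            if url:
                links.append(url)
    return links


def fetch_extlinks(title, api_url=API_URL):
    """Fetch and parse the external links of one page (network access)."""
    request = urllib.request.Request(
        extlinks_query_url(title, api_url),
        headers={"User-Agent": "extlinks-sketch/0.1 (example only)"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_extlinks(json.load(resp))
```

Calling `fetch_extlinks("Some page title")` for each title seen in the recent changes stream would give the page's current external links without parsing diff HTML.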
Vinay, could you tell us what you need the diff URLs and external links for? If it is for regular crawling of Wikimedia project pages, you may be interested in our OAI feed for search engines: https://meta.wikimedia.org/wiki/Wikimedia_update_feed_service
Nemo
Vinay, would the OAI feed work? We haven't seen your response onlist. Thanks!
Sumana Harihareswara Engineering Community Manager Wikimedia Foundation
On Tue, Sep 3, 2013 at 11:15 AM, Federico Leva (Nemo) nemowiki@gmail.com wrote:
Vinay, could you tell us what you need the diff URLs and external links for? If it is for regular crawling of Wikimedia project pages, you may be interested in our OAI feed for search engines: https://meta.wikimedia.org/wiki/Wikimedia_update_feed_service
Nemo
_______________________________________________ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
That page mentions that Wikipedia is no longer providing the feed service to new parties. Can it be enabled for the Internet Archive? I'll ask Kul about this since he's the contact listed.
Sumana Harihareswara Engineering Community Manager Wikimedia Foundation