On 10/16/10 8:40 PM, Fred Bauder wrote:
http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refre...
The linked blog post laments the lag between the removal of vandalism on Wikipedia and its removal in Google's indices and cached data.
There is a way to mitigate that problem -- there are protocols to let Google know about recently changed pages. I'm assuming that we have no arrangement in place already for them to crawl recent changes for all of Wikipedia?
In any case, the more interesting goal is not so much to mitigate vandalism, but to increase the coverage and timeliness of the whole collection.
Anyway, the standard way to do this is Sitemaps:
As the name suggests, "Sitemaps" were originally intended as hints about site structure, but search engines like Google now also use them as a sort of feed of recently changed pages.
http://www.sitemaps.org/faq.php#faq_submitting_changes
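For reference, a minimal sitemap file in that protocol looks something like this (the article URL and timestamp below are made up for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://en.wikipedia.org/wiki/Example_article</loc>
    <lastmod>2010-10-16T20:40:00Z</lastmod>
  </url>
</urlset>
```

The key bit for our purposes is <lastmod>: a crawler that trusts it can re-fetch just the pages that changed instead of re-crawling everything.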
They don't accept something sensible like RSS or XMPP even from other top 50 websites, unless you happen to be Six Apart or Twitter. Still, we could ask, since Daniel Kinzler has a working demo of recent changes via XMPP.
Alternatively, we could consume the XMPP stream ourselves and either transform it into a Sitemaps-compatible structure or generate both kinds of files at the same time. I assume -- famous last words -- that the really heavy lifting is already done, since we have a recent changes feature.
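To make the "transform it into a Sitemaps-compatible structure" idea concrete, here is a rough sketch in Python. The page titles and timestamps are invented; a real version would read them from the recent changes feed (or Daniel's XMPP stream) rather than a hard-coded list:

```python
# Sketch: turn a batch of recent changes into a Sitemaps-format file.
# The input data here is hypothetical -- real input would come from
# the recent changes feed or the XMPP stream.
import xml.etree.ElementTree as ET
from urllib.parse import quote

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(changes, base="http://en.wikipedia.org/wiki/"):
    """changes: iterable of (page_title, iso8601_lastmod) tuples."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for title, lastmod in changes:
        url = ET.SubElement(urlset, "url")
        # Wikipedia-style URLs use underscores for spaces.
        ET.SubElement(url, "loc").text = base + quote(title.replace(" ", "_"))
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")

recent = [
    ("Example article", "2010-10-16T20:40:00Z"),
    ("Another page",    "2010-10-16T20:41:00Z"),
]
print(build_sitemap(recent))
```

After writing such a file out, the Sitemaps protocol lets you ping the search engine with the sitemap's URL so it knows to re-fetch it; batching changes into periodic sitemap files like this would be far cheaper than trying to notify Google per-edit.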
I don't know if I can commit any resources to this (I'm still busy with other things for at least the next two months), but I happen to know a lot about this area from an aborted project at another employer, so I have always wanted to actually use that knowledge.