* Neil Kandalgaonkar neilk@wikimedia.org [Sat, 16 Oct 2010 22:20:50 -0700]:
On 10/16/10 8:40 PM, Fred Bauder wrote:
http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refre...
The linked blog post laments the lag between the removal of vandalism on Wikipedia and its removal in Google's indices and cached data.
I am completely disconnected from Wikipedia, though I do use MediaWiki for small projects. However, hasn't FlaggedRevs been deployed at Wikipedia for some time already? If so, how can Google (which should index pages as an anonymous user) receive vandalized pages instead of approved revisions? Only registered users should see vandalism by default.
There is a way to mitigate that problem -- there are protocols to let Google know about recently changed pages. I'm assuming that we have no arrangement in place already for them to crawl recent changes for all of Wikipedia?
In any case, the more interesting goal is not so much to mitigate vandalism, but to increase the coverage and timeliness of the whole collection.
Anyway, the standard way to do this is Sitemaps:
As the name suggests, "Sitemaps" were originally intended as hints about site structure, but search engines like Google now use it as a sort of feed of recently changed pages.
Sitemaps tend to grow huge; however, MediaWiki has a sitemap generator (in /maintenance). I use it for small wikis to improve "coverage". However, it is currently non-incremental, so for such a huge wiki it would take a lot of time to produce the sitemaps.
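To illustrate what an incremental sitemap could look like, here is a rough Python sketch. It is not how the MediaWiki maintenance script works; it only assumes that we already have a list of (URL, last-modified time) pairs, for example taken from recent changes, and it emits entries only for pages modified since a cutoff. The function name build_sitemap and the example.org URLs are made up for illustration.

```python
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(changes, since):
    """Return sitemap XML for pages modified at or after `since`.

    `changes` is an iterable of (url, modified) pairs, where `modified`
    is a timezone-aware datetime. This is a hypothetical helper, not
    part of MediaWiki.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # Newest changes first, so the most recent edits lead the file.
    for url, modified in sorted(changes, key=lambda c: c[1], reverse=True):
        if modified < since:
            continue  # unchanged since the last run: skip it
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        # <lastmod> uses W3C datetime format; normalise to UTC first.
        stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        ET.SubElement(entry, "lastmod").text = stamp
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    demo = [
        ("https://example.org/wiki/A", datetime(2010, 10, 16, tzinfo=timezone.utc)),
        ("https://example.org/wiki/B", datetime(2010, 10, 10, tzinfo=timezone.utc)),
    ]
    print(build_sitemap(demo, datetime(2010, 10, 15, tzinfo=timezone.utc)))
```

A real deployment would also have to respect the protocol's limit of 50,000 URLs per sitemap file and tie the pieces together with a sitemap index, which this sketch leaves out.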
http://www.sitemaps.org/faq.php#faq_submitting_changes
They don't accept something sensible like RSS or XMPP even from other top 50 websites, unless you happen to be Six Apart or Twitter. Still, we could ask, since Daniel Kinzler has a working demo of recent changes via XMPP.
Yahoo has its own sitemap format, which is an extended version of RSS (MRSS). Google should support it as well, shouldn't it? That would be especially useful for Commons media, and probably for ordinary text pages, too.
Dmitriy