On 10/16/10 8:40 PM, Fred Bauder wrote:
http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refre...
The linked blog post laments the lag between the removal of vandalism on Wikipedia and its removal in Google's indices and cached data.
There is a way to mitigate that problem -- there are protocols to let Google know about recently changed pages. I'm assuming that we have no arrangement in place already for them to crawl recent changes for all of Wikipedia?
In any case, the more interesting goal is not so much to mitigate vandalism, but to increase the coverage and timeliness of the whole collection.
Anyway, the standard way to do this is Sitemaps:
http://www.sitemaps.org/
As the name suggests, "Sitemaps" were originally intended as hints about site structure, but search engines like Google now use it as a sort of feed of recently changed pages.
http://www.sitemaps.org/faq.php#faq_submitting_changes
They don't accept something sensible like RSS or XMPP even from other top 50 websites, unless you happen to be Six Apart or Twitter. Still, we could ask, since Daniel Kinzler has a working demo of recent changes via XMPP.
Alternatively, we could use the XMPP stream to either transform it to a Sitemaps-compatible structure or generate both kinds of files at the same time. I assume, famous last words, that the really heavy lifting is already done since we have a recent changes feature.
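To make this concrete, here is a minimal sketch of that transformation. The recentchanges query is the standard MediaWiki API; everything else (regenerating a full urlset per run, the ping step) is just an illustration of the idea, not a finished design:

import json
import urllib.request
from urllib.parse import quote

API = "https://en.wikipedia.org/w/api.php"

def recent_changes(limit=500):
    # Standard MediaWiki API call; rcprop picks just the fields we need.
    url = (API + "?action=query&list=recentchanges"
           "&rcprop=title|timestamp&rclimit=%d&format=json" % limit)
    with urllib.request.urlopen(url) as f:
        return json.load(f)["query"]["recentchanges"]

def sitemap(changes):
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    seen = set()
    for rc in changes:             # newest change first, per the API default
        if rc["title"] in seen:    # emit one <url> per page
            continue
        seen.add(rc["title"])
        loc = "https://en.wikipedia.org/wiki/" + quote(rc["title"].replace(" ", "_"))
        lines.append("  <url><loc>%s</loc><lastmod>%s</lastmod></url>"
                     % (loc, rc["timestamp"]))
    lines.append("</urlset>")
    return "\n".join(lines)

# Publish the output somewhere crawlable, then ping the engine, e.g.
# http://www.google.com/ping?sitemap=<url-of-the-published-file>
print(sitemap(recent_changes()))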
I don't know if I'm committing any resources to this (I'm still busy with other stuff for the next two months at least) but I happen to know a lot about this from an aborted project at another employer, so I have always wanted to actually use that knowledge.
As far as I know, sitemaps are used primarily to inform the search engine of the pages on a website directly, rather than waiting for the search engine to figure them out from links from external sites. I vaguely remember we used to generate sitemaps, but then stopped because google more-or-less totally ignored them, and instead chose to index articles based on their own algorithms and measures of "importance".
I am sure google already taps into recent changes in wikipedia, but it might be worth contacting them officially to see if edits marked as vandalism can be treated with higher priority in their indexing process.
Cheers, Robert
On 10/17/2010 5:40 AM, Robert Stojnic wrote:
I am sure google already taps into recent changes in wikipedia [...]
And what qualifies as vandalism and who gets to decide that?
On 17/10/10 13:34, Q wrote:
On 10/17/2010 5:40 AM, Robert Stojnic wrote: [...]
And what qualifies as vandalism and who gets to decide that?
Good point, I think it also applies to the original blog post about marking pages to be refreshed. I guess the community will have to draft some kind of proposal for which kinds of edits can be marked for fast-track search engine refresh, probably along the lines of what is considered obvious vandalism.
Cheers, r.
Hoi, If you understand the issue, you would know who decides what qualifies as vandalism. It is exactly the same people who already decide what vandalism is. Thanks, GerardM
On 17 October 2010 14:34, Q overlordq@gmail.com wrote:
And what qualifies as vandalism and who gets to decide that?
On 10/17/2010 7:54 AM, Gerard Meijssen wrote:
Hoi, If you understand the issue, [...]
So basically you want anybody who visits the website to be able to tell google to fast-track reindex a page and you think google will go for that?
Google would rather not have any vandalism in their index, but that's not the point. They care about the reindexing schedule. If we create sitemaps that also note the recent velocity of changes, the vandal's edits in a sense work against themselves. Every new change brings new scrutiny.
If you use the protocols they understand, and they think you're a high priority, Google can update their index at a rather fearsome speed. A new link can be in the #1 position before you can finish typing a tweet. Generally this is not the bottleneck.
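For what it's worth, the sitemap format already has optional per-URL hints that could encode exactly that velocity; a minimal sketch, where the thresholds and example values are invented purely for illustration:

def changefreq(edits_last_day):
    # Map recent edit velocity to the optional <changefreq> sitemap hint.
    if edits_last_day >= 20:
        return "hourly"
    if edits_last_day >= 1:
        return "daily"
    return "weekly"

def url_entry(loc, lastmod, edits_last_day):
    return ("  <url><loc>%s</loc><lastmod>%s</lastmod>"
            "<changefreq>%s</changefreq></url>"
            % (loc, lastmod, changefreq(edits_last_day)))

print(url_entry("https://en.wikipedia.org/wiki/Example",
                "2010-10-17T20:42:00Z", 25))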
On 10/17/10 6:23 AM, Q wrote:
On 10/17/2010 7:54 AM, Gerard Meijssen wrote:
Hoi, If you understand the issue, [...]
So basically you [...] and you think google will go for that?
Okay, you both are about to enter into a mini-flamewar, so can we just agree that "who decides what vandalism is" is generally a settled question at Wikipedia, and opening a post with "if you understand the issue" is a little bit aggressive?
On 17.10.2010, 22:42 Neil wrote:
Google would rather not have any vandalism in their index, but that's not the point. [...]
/me suggests turning the problem into "who decides which version is flagged and which is not". This is the only sane way, and in addition it is the one we already have all the technical means for.
On Sun, Oct 17, 2010 at 3:23 PM, Q overlordq@gmail.com wrote:
On 10/17/2010 7:54 AM, Gerard Meijssen wrote:
Hoi, If you understand the issue, [...]
So basically you want anybody who visits the website to be able to tell google to fast-track reindex a page and you think google will go for that?
It seems a bit strange to me to expect Google (or anyone external) to devote more effort to vandalized than to useful Wikipedia content.
I also disagree with the diagnosis in the above-mentioned blog post, which reads "So where does the problem lie? With the search engine information refresh rate."
In my view, the problem lies primarily with Wikipedia, and specifically with vandalism being too voluminous and too visible to the public. Technical solutions for both issues exist, and those addressing the second kind have the potential to be accepted by the community.
Daniel
Hoi, The point of spending time is exactly to prevent vandalised content from remaining available in the search engines. The synchronisation of the changes in Wikipedia and their reflection in search engines is beneficial to us both.
Vandalism is always more voluminous than we would like it to be. Here we are discussing the persistence of our vandalism once it has *already* been dealt with on Wikipedia. I do not see how our community would be opposed to having a solution for this issue. Thanks, GerardM
On 18 October 2010 15:08, Daniel Mietchen daniel.mietchen@googlemail.com wrote:
It seems a bit strange to me to expect Google (or anyone external) to devote more effort to vandalized than to useful Wikipedia content. [...]
* Gerard Meijssen gerard.meijssen@gmail.com [Mon, 18 Oct 2010 17:26:54 +0200]:
Hoi, The point of spending time is exactly to prevent vandalised content from remaining available in the search engines. [...]
If I had to deal with this problem: most of these vandal edits could easily be caught with simple text-match filters, which would automatically flag such revisions as visible only to registered users until they have been approved or rejected for anonymous access. Dmitriy
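P.S. A minimal sketch of what such a filter could look like. The patterns and the Revision/on_save hook are invented for illustration (in spirit this is close to what the AbuseFilter extension already does):

import re

VANDAL_PATTERNS = [
    re.compile(r"(.)\1{15,}"),                    # long runs of one character
    re.compile(r"\b[A-Z]{25,}\b"),                # extended shouting
    re.compile(r"\bbuy (cheap|viagra)\b", re.I),  # common spam phrasing
]

def looks_like_vandalism(added_text):
    # True if the text added by an edit matches any simple pattern.
    return any(p.search(added_text) for p in VANDAL_PATTERNS)

class Revision:
    # Hypothetical stand-in for a saved revision.
    def __init__(self, added_text):
        self.added_text = added_text
        self.visible_to_anonymous = True

def on_save(rev):
    # Hide matching revisions from anonymous readers pending review.
    if looks_like_vandalism(rev.added_text):
        rev.visible_to_anonymous = False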
On 10/17/10 3:40 AM, Robert Stojnic wrote:
I vaguely remember we used to generate sitemaps, but then stopped because google more-or-less totally ignored them, and instead chose to index articles based on their own algorithms and measures of "importance".
We don't have any control over what they do, obviously, but preventing the kind of terrible experience the OP has had is in both our interests.
* Neil Kandalgaonkar neilk@wikimedia.org [Sat, 16 Oct 2010 22:20:50 -0700]:
On 10/16/10 8:40 PM, Fred Bauder wrote:
http://mastersofmedia.hum.uva.nl/2010/10/16/wikipedia-we-have-a-google-refre...
The linked blog post laments the lag between the removal of vandalism on Wikipedia and its removal in Google's indices and cached data.
I am completely disconnected from Wikipedia - I do use MediaWiki for small projects. However, wasn't FlaggedRevs deployed at Wikipedia for some time already? If so, how can Google (which should index pages as an anonymous user does) receive vandalized pages instead of approved revisions? Only registered users should see vandalism by default.
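The behaviour I would expect is roughly the following (a simplified sketch of the FlaggedRevs idea, not the extension's actual code):

class Page:
    def __init__(self, latest_revision, stable_revision=None):
        self.latest_revision = latest_revision
        self.stable_revision = stable_revision  # newest reviewer-approved one

def revision_to_serve(page, is_anonymous):
    # Anonymous readers (and hence crawlers) get the approved revision when
    # one exists; logged-in users can see the newest draft.
    if is_anonymous and page.stable_revision is not None:
        return page.stable_revision
    return page.latest_revision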
There is a way to mitigate that problem -- there are protocols to let Google know about recently changed pages. I'm assuming that we have no arrangement in place already for them to crawl recent changes for all of Wikipedia?
In any case, the more interesting goal is not so much to mitigate vandalism, but to increase the coverage and timeliness of the whole collection.
Anyway, the standard way to do this is Sitemaps:
As the name suggests, "Sitemaps" were originally intended as hints about site structure, but search engines like Google now use it as a sort of feed of recently changed pages.
Sitemaps tend to grow huge; however, MediaWiki has a sitemap generator (in /maintenance). I use it for small wikis to improve "coverage". Currently it is non-incremental, so for such a huge wiki it would take a lot of time to produce the files.
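One way it could be made incremental is to remember the time of the last run and rewrite only the entries for pages touched since then; a sketch with placeholder helpers:

import os
import time

STATE_FILE = "sitemap.last_run"

def pages_touched_since(ts):
    # Placeholder: in MediaWiki this would come from the recentchanges
    # table (or the API) rather than being hard-coded.
    return []

def regenerate_entries(pages):
    # Placeholder: rewrite only the sitemap files containing these pages.
    for page in pages:
        print("would regenerate entry for", page)

def incremental_run():
    last = 0.0
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            last = float(f.read())
    now = time.time()
    regenerate_entries(pages_touched_since(last))
    with open(STATE_FILE, "w") as f:
        f.write(str(now))

incremental_run()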
http://www.sitemaps.org/faq.php#faq_submitting_changes
They don't accept something sensible like RSS or XMPP even from other top 50 websites, unless you happen to be Six Apart or Twitter. Still, we could ask, since Daniel Kinzler has a working demo of recent changes via XMPP.
Yahoo has its own sitemap format, which is an extended version of RSS (MRSS); perhaps Google supports it as well? That would be especially useful for Commons media, and probably for ordinary text pages too. Dmitriy
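P.S. For illustration, an MRSS feed entry could look like the following. The xmlns:media namespace is the real MRSS one; the file name and URLs are invented:

from xml.sax.saxutils import escape

def mrss_item(title, page_url, file_url, pubdate):
    # One feed item for a recently changed media file.
    return ("""  <item>
    <title>%s</title>
    <link>%s</link>
    <pubDate>%s</pubDate>
    <media:content url="%s" />
  </item>""" % (escape(title), escape(page_url), pubdate, escape(file_url)))

feed = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Commons recent media</title>
    <link>https://commons.wikimedia.org/</link>
    <description>Recently changed media files</description>
%s
  </channel>
</rss>""" % mrss_item(
    "File:Example.jpg",
    "https://commons.wikimedia.org/wiki/File:Example.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
    "Mon, 18 Oct 2010 12:00:00 GMT",
)
print(feed)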
Dmitriy Sintsov wrote:
However, wasn't FlaggedRevs deployed at Wikipedia for some time already? If so, how can Google (which should index pages as an anonymous user does) receive vandalized pages instead of approved revisions? [...]
Not all pages have flaggedrevs enabled.
For the problem in this thread, I don't think there's much to do here. If google wants to spider us faster, or use the commercial RC feed, so be it. They are catching vandalism fast, so they should catch the corrections fast, too.