see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Can we add rel="nofollow" to external links, to prevent wikispam?
[[zh:user:shizhao]]
On Jan 18, 2005, at 7:58 PM, shi zhao wrote:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Can we add rel="nofollow" to external links, to prevent wikispam?
I'm inclined to agree. It looks legit from a markup perspective (ie it shouldn't cause HTML validators to bitch at us; rel _is_ a defined attribute for <a>, and the set of values is open-ended.)
Of course it won't prevent wikispam, but with luck it will discourage it by reducing its value.
-- brion vibber (brion @ pobox.com)
Brion Vibber wrote:
I'm inclined to agree. It looks legit from a markup perspective (ie it shouldn't cause HTML validators to bitch at us; rel _is_ a defined attribute for <a>, and the set of values is open-ended.)
Of course it won't prevent wikispam, but with luck it will discourage it by reducing its value.
Would this tag apply to all external links, even to the "yes, this is a good link" links? In the long run, we would hurt not just spammers but also the good guys, by no longer linking to them in a way that search engines consider relevant.
Or is this just PageRankParanoia?
Mathias
Mathias Schindler wrote:
Brion Vibber wrote:
I'm inclined to agree. It looks legit from a markup perspective (ie it shouldn't cause HTML validators to bitch at us; rel _is_ a defined attribute for <a>, and the set of values is open-ended.)
Of course it won't prevent wikispam, but with luck it will discourage it by reducing its value.
Would this tag apply to all external links, even to the "yes, this is a good link" links? In the long run, we would hurt not just spammers but also the good guys, by no longer linking to them in a way that search engines consider relevant.
Or is this just PageRankParanoia?
Mathias
The best proposal I've seen so far is to apply it to all recently-modified pages (with 'recent' determined per project by how fast you think the editors will get it cleaned up - probably a day or so is reasonable). That assumes that it will get cleaned up before the timeout expires, and prevents it from being visible to a robot in the interim. But links that survive for a while still become rankable. And it's fairly simple, unlike tracking, per-link, whether or not it has been verified.
It would also penalize valid pages that just change frequently (like, say, the front page) but I would think most such are really not good sources of PageRank correlation anyway.
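To make the proposal concrete, here is a rough sketch of what such an age-based check might look like. This is not actual MediaWiki code; $wgNoFollowMaxAge, shouldNoFollow() and $lastTouched are made-up names, used purely for illustration:

    # Sketch only -- hypothetical names, not an existing MediaWiki API.
    $wgNoFollowMaxAge = 86400; // "recent" threshold in seconds; tune per project

    function shouldNoFollow( $lastTouched ) {
        global $wgNoFollowMaxAge;
        // Returns true when the page was modified within the threshold,
        // i.e. when its external links should carry rel="nofollow".
        // Pages that survive unedited past the threshold keep rankable links.
        return ( time() - $lastTouched ) < $wgNoFollowMaxAge;
    }

Under such a scheme, a spam edit that gets reverted within the window never yields any PageRank, while long-standing links on stable pages still would.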
On 2005/1/25 09:57, Kevin Puetz wrote:
The best proposal I've seen so far is to apply it to all recently-modified pages (with 'recent' determined per project by how fast you think the editors will get it cleaned up - probably a day or so is reasonable).
This is sensible.
It would also penalize valid pages that just change frequently (like, say, the front page) but I would think most such are really not good sources of PageRank correlation anyway.
Assuming that high frequency translates to "well-patrolled", then maybe only recently AND infrequently modified pages would be "nofollowed". Spammers, of course, might then hit a page multiple times to have it appear well-patrolled.
On Tue, 25 Jan 2005 08:57:02 -0600, Kevin Puetz puetzk@puetzk.org wrote:
The best proposal I've seen so far is to apply it to all recently-modified pages (with 'recent' determined per project by how fast you think the editors will get it cleaned up - probably a day or so is reasonable). That assumes that it will get cleaned up before the timeout expires, and prevents it from being visible to a robot in the interim. But links that survive for a while still become rankable. And it's fairly simple, unlike tracking, per-link, whether or not it has been verified.
Although this *seems* simple, the database request needed to determine "how old is this page" is not, I suspect, a quick one, and server efficiency is a *BIG* issue with the Wikimedia sites. Add to that the complications to do with the cache - the simplest solution might be to block "nofollow" pages from being cached at all, which is an additional waste - and it begins to look like an imbalanced compromise. It would be better to determine more precise circumstances under which the attribute is appropriate, so that at least the costs would be well spent.
Personally, I think a combination with new validation or edit patrol features would be a better route to go down - if a state could easily be associated with the article to say "this version has been approved by a trusted user", that would seem a good criterion for removing the "nofollow"s. Of course, you still get some cache awkwardness, but unlike "age of this version", the software could explicitly purge the cache when the approved status changed (since someone would have done that action), so the version with the "nofollow"s in could stay cached until that happened. So you get less cost, and a better match with the general intentions.
On Fri, 4 Feb 2005 14:56:31 +0000, Rowan Collins rowan.collins@gmail.com wrote:
On Tue, 25 Jan 2005 08:57:02 -0600, Kevin Puetz puetzk@puetzk.org wrote:
The best proposal I've seen so far is to apply it to all recently-modified pages (with 'recent' determined per project by how fast you think the editors will get it cleaned up - probably a day or so is reasonable). That assumes that it will get cleaned up before the timeout expires, and prevents it from being visible to a robot in the interim. But links that survive for a while still become rankable. And it's fairly simple, unlike tracking, per-link, whether or not it has been verified.
Although this *seems* simple, the database request needed to determine "how old is this page" [snip]
This should not be an issue, since this query is already made; look at the bottom of any page on any Wikimedia project and you'll see something like "This page was last modified 00:05, 5 Feb 2005."
On Sun, 6 Feb 2005 21:29:13 +0000, Ævar Arnfjörð Bjarmason avarab@gmail.com wrote:
On Fri, 4 Feb 2005 14:56:31 +0000, Rowan Collins rowan.collins@gmail.com wrote:
Although this *seems* simple, the database request needed to determine "how old is this page" [snip]
This should not be an issue, since this query is already made; look at the bottom of any page on any Wikimedia project and you'll see something like "This page was last modified 00:05, 5 Feb 2005."
Oops! My bad - now I think about it, there's a timestamp right there in the 'cur' table, isn't there? And even in the new schema, you won't be able to get the text of a page without also encountering its revision date, will you? So, yeah, ignore that bit of stupidity!
On Wednesday 19 January 2005 05:58, shi zhao wrote:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Not a good idea. See http://nsk.wikinerds.org/blog/index.php?p=119
On Wednesday 19 January 2005 10:11, NSK wrote:
On Wednesday 19 January 2005 05:58, shi zhao wrote:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Not a good idea. See http://nsk.wikinerds.org/blog/index.php?p=119
Sorry, I mean: http://nsk.wikinerds.org/blog/index.php?p=118
Will it help? It only works when the spammers think/know we have this implemented. And it also harms the 'valid' links that are created.
Andre Engels
On Wed, 19 Jan 2005 03:58:54 +0000 (UTC), shi zhao shizhao@gmail.com wrote:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Can we add rel="nofollow" to external links, to prevent wikispam?
[[zh:user:shizhao]]
Andre Engels wrote:
Will it help? It only works when the spammers think/know we have this implemented. And it also harms the 'valid' links that are created.
We hope it will help. It's one of those things that will only work well if everyone participates; otherwise spammers will just continue to spam every wiki, because they can't be bothered to check which ones are using this feature. That's why it's important that we enable this feature by default.
This may reduce the usefulness of blogs and wikis as a means of ranking sites, but they're not particularly useful as it is, because they are easily spammed. Hopefully this initiative will improve the quality of search rankings, allowing good sites to rise above the spam, rather than harm that quality by ignoring useful information. The major search engines obviously think it will.
Wikipedia has mirrors which will probably not implement this feature. That reduces both the usefulness and the side effects of this for us. However, most spam doesn't seem to be specifically targeted at us -- search Google for a spam link you see on Wikipedia and you'll usually get thousands of hits from wikis, blogs and guestbooks. Spam on Wikipedia is usually cleaned up quickly, so the incentive for them to target us is already questionable. The most important thing is that this feature is implemented in MediaWiki by default, because that reduces the incentive for them to spam everything with a <textarea>.
-- Tim Starling
On Wednesday 19 January 2005 11:22, Tim Starling wrote:
feature is implemented in MediaWiki by default, because that reduces the incentive for them to spam everything with a <textarea>.
The nofollow proposal is not a good idea IMO; people will use it for commercial gain.
I can't see the point. Look at the losses and gains:
Gain: prevents a *tiny* amount of spamming (tiny, because the average lifespan of linkspam on the 'pedia is maybe 2 minutes - our editors watch this stuff like hawks).
Loss:
(1) extra complication to the code. (Small and slim is *always* best.)
(2) even more crud to mess up the HTML output. If we keep on adding "features" in this manner, a page of MediaWiki output HTML is going to be harder to read and consume more bandwidth than a page produced by Frontpage.
To me, that says that, on balance, the best option is fairly clearly to do nothing.
Small is beautiful. Simple is best.
On Wednesday 19 January 2005 11:55, Tony wrote:
I can't see the point. Look at the losses and gains:
I don't see the point either.
On Jan 19, 2005, at 1:55 AM, Tony wrote:
I can't see the point. Look at the losses and gains:
Gain: prevents a *tiny* amount of spamming (tiny, because the average lifespan of linkspam on the 'pedia is maybe 2 minutes - our editors watch this stuff like hawks).
Automated spam attacks will continue to hit us along with everybody else whether _any particular_ target cleans up attacks immediately or not; as a target we have two ways to protect ourselves:
* Try to shield ourselves against particular attacks (blacklists etc)
* Act to discourage spamming in general
The projected gain is from making blogspam/wikispam/guestspam more expensive in general (needing to do much more damage to gain the same effect). Will it work? I have no idea.
Loss:
(1) extra complication to the code. (Small and slim is *always* best.)
Here's the code:

    global $wgNoFollowLinks;
    if( $wgNoFollowLinks ) {
        $style .= ' rel="nofollow"';
    }
I've already added it; if for some reason we change our minds we can turn it off at any time.
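For what it's worth, a wiki operator who wants the attribute off locally would presumably just flip that switch in their LocalSettings.php:

    # LocalSettings.php -- disable rel="nofollow" on external links
    $wgNoFollowLinks = false;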
(2) even more crud to mess up the HTML output. If we keep on adding "features" in this manner, a page of MediaWiki output HTML is going to be harder to read and consume more bandwidth than a page produced by Frontpage.
The rel attribute is standard, validating HTML4 & XHTML1. It's not very big, and it will only appear on external links which are relatively rare compared to the primary mass of code. Compared to the ugly <span class="urlexpansion"> mess it's downright elegant. ;)
(1.5 has experimental code to remove the urlexpansion <span>s in favor of cleaner CSS, with a lightweight JavaScript fallback for Internet Explorer.)
-- brion vibber (brion @ pobox.com)
On Wednesday 19 January 2005 12:11, Brion Vibber wrote:
I've already added it; if for some reason we change our minds we can turn it off at any time.
Could you please turn it off in the non-Wikipedia MediaWiki distribution?
The rel attribute is standard, validating HTML4 & XHTML1.
That's true. However, the values of the rel attribute are open to extension, so maybe someone else already uses "nofollow" for another reason.
Note that as long as search engines don't support this extension, it just uses extra bandwidth for no reason.
On Jan 19, 2005, at 2:25 AM, NSK wrote:
On Wednesday 19 January 2005 12:11, Brion Vibber wrote:
I've already added it; if for some reason we change our minds we can turn it off at any time.
Could you please turn it off in the non-Wikipedia MediaWiki distribution?
Having it on by default is a significant factor in achieving the desired effect (wide distribution of potential spam targets which will produce no gain for the spammer if attacked).
You may of course turn it off on your own wikis (as you may configure or recode them in any way). Unless we decide not to support the campaign, the option will ship on by default.
The rel attribute is standard, validating HTML4 & XHTML1.
That's true. However, the values of the rel attribute are open to extension, so maybe someone else already uses "nofollow" for another reason.
If they have, there's little _we_ can do about it, as we are not a search engine. Talk to the fine folks at Google, MSN, and Yahoo.
Note that as long as search engines don't support this extension, it just uses extra bandwidth for no reason.
Google started the thing, and MSN and Yahoo's search teams have announced they will support it. Whether the other players will hop on the bandwagon remains to be seen (the campaign is young).
-- brion vibber (brion @ pobox.com)
On Wednesday 19 January 2005 12:37, Brion Vibber wrote:
Having it on by default is a significant factor in achieving the desired effect (wide distribution of potential spam targets which will produce no gain for the spammer if attacked).
But the links are still clickable by humans, so the spammers will continue inserting their links in Wikipedia.
Brion Vibber wrote:
You may of course turn it off on your own wikis (as you may configure or recode them in any way). Unless we decide not to support the campaign, the option will ship on by default.
I strongly agree with Brion on this. It should ship on by default.
Whether or not *we* should use it is a tougher question. It is primarily of value to small blogs with small communities which are ripe targets for wikispam. For us, there is direct and indirect value in the fact that our links are human-chosen and _mean something_.
We help the world by reducing the cost of wikispam. We help the world by helping google and other search engines find websites that don't suck. It's a tradeoff.
There's an additional irony if you think about it. One "cost" of this campaign to Google is that they ignore links of value, thus reducing the quality of their search results. If we add the nofollow attribute, we remove our collective wisdom from the Google dataset. This might cause Google to rethink the wisdom of paying attention to this tag.
The general ethical principle seems to be something like this:
A good webmaster will identify links that appear on his or her website which may be of low quality (because not validated by trusted parties) and hint to search engines accordingly. This means that by default, MediaWiki should ship with this feature turned on, because small wikis do have a problem in this area. But it also means that wikipedia itself probably should turn the feature off.
(This is not a decree or anything, just one voice in the discussion.)
--Jimbo
On Wed, 19 Jan 2005 10:52:28 -0800, Jimmy (Jimbo) Wales jwales@wikia.com wrote:
Brion Vibber wrote:
You may of course turn it off on your own wikis (as you may configure or recode them in any way). Unless we decide not to support the campaign, the option will ship on by default.
I strongly agree with Brion on this. It should ship on by default.
Whether or not *we* should use it is a tougher question. It is primarily of value to small blogs with small communities which are ripe targets for wikispam. For us, there is direct and indirect value in the fact that our links are human-chosen and _mean something_.
We help the world by reducing the cost of wikispam. We help the world by helping google and other search engines find websites that don't suck. It's a tradeoff.
I think the latter should be chosen; the overwhelming majority of our links are not spam, and they contribute to the overall search quality of engines like Google. In any case, I think it should be turned off until consensus on the matter is reached; it's currently turned on on the Wikimedia sites.
There's an additional irony if you think about it. One "cost" of this campaign to Google is that they ignore links of value, thus reducing the quality of their search results. If we add the nofollow attribute, we remove our collective wisdom from the Google dataset. This might cause Google to rethink the wisdom of paying attention to this tag.
The general ethical principle seems to be something like this:
A good webmaster will identify links that appear on his or her website which may be of low quality (because not validated by trusted parties) and hint to search engines accordingly. This means that by default, MediaWiki should ship with this feature turned on, because small wikis do have a problem in this area. But it also means that wikipedia itself probably should turn the feature off.
(This is not a decree or anything, just one voice in the discussion.)
--Jimbo
On Wed, 19 Jan 2005 10:52:28 -0800, Jimmy (Jimbo) Wales jwales@wikia.com wrote: <snip>
But it also means that wikipedia itself probably should turn the feature off.
(This is not a decree or anything, just one voice in the discussion.)
Maybe in the 5 or so really big Wikipedias, but I know for example that en.wikibooks has a problem with one particular Chinese wikispammer, who doesn't get reverted for at least a couple of hours, if not half a day sometimes. I presume there are probably similar problems with some of the other smaller projects.
~~~~
Robin Shannon wrote:
But it also means that wikipedia itself probably should turn the feature off.
(This is not a decree or anything, just one voice in the discussion.)
Maybe in the 5 or so really big Wikipedias, but I know for example that en.wikibooks has a problem with one particular Chinese wikispammer, who doesn't get reverted for at least a couple of hours, if not half a day sometimes. I presume there are probably similar problems with some of the other smaller projects.
I think that's probably right.
It's about weighing the value of giving search engines good clues about pages that don't suck versus the value of discouraging wikispam when it is a problem.
--Jimbo
I see no problem with it; I'm just wondering if maybe a compromise would be to have it the default for anon edits (or would that be less elegant coding?), since I'm guessing that most spam is from anons (as users would be banned for it).
On Wed, 19 Jan 2005 12:25:15 +0200, NSK nsk2@wikinerds.org wrote: <snip>
Note that as long as search engines don't support this extension, it just uses extra bandwidth for no reason.
Google is supporting it. That's 90% of the search market already. MSN and Yahoo! are also supporting it; that's 99.999999% of the search market. What other engines are you talking about?
paz y amor [[User:The bellman]]
Robin Shannon wrote:
I see no problem with it; I'm just wondering if maybe a compromise would be to have it the default for anon edits (or would that be less elegant coding?), since I'm guessing that most spam is from anons (as users would be banned for it).
Most spam on Wikipedia seems to be from anonymous users. But spam on blogs and wikis that require logins before editing generally comes from throwaway accounts. Believe me, they can create them faster than you can block them. See for example this revision, contributed by the user DinMo to a wiki which requires logins:
http://www.fedora.us/wiki/ScddBox?version=1
Judging by the URLs, that spammer was blocked from Wikipedia on December 29. As soon as they realise what we're doing (and I have no doubt they will), they'll log in to our wikis just like they log in to everything else.
-- Tim Starling
Tony wrote:
I can't see the point. Look at the losses and gains:
Gain: prevents a *tiny* amount of spamming (tiny, because the average lifespan of linkspam on the 'pedia is maybe 2 minutes - our editors watch this stuff like hawks).
People complain to me all the time that they spend too long reverting spam. I've written 3 anti-spam features in the last 6 months or so, in response to popular demand. Someone in #mediawiki two days ago asked me to help him set up a spam blacklist on his own MediaWiki installation, because his wiki had been spammed to the point where he was left with no choice but to take it offline.
Loss:
(1) extra complication to the code. (Small and slim is *always* best.)
It's just a few extra characters.
(2) even more crud to mess up the HTML output. If we keep on adding "features" in this manner, a page of MediaWiki output HTML is going to be harder to read and consume more bandwidth than a page produced by Frontpage.
XHTML isn't meant to be readable; readability has never been a design goal for the parser. We can afford the extra bandwidth.
To me, that says that, on balance, the best option is fairly clearly to do nothing.
Small is beautiful. Simple is best.
I think it would be better if we joined with the major blog software developers and the three biggest search engines in this effort to improve index quality and reduce spam. It's not as big an issue for Wikipedia as it is for the hundreds of smaller wikis which use our software, but that doesn't mean it shouldn't be implemented.
-- Tim Starling
shi zhao wrote:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Can we add rel="nofollow" to external links, to prevent wikispam?
I think that it should be in a new XML namespace, not in the XHTML one, if we are talking about XHTML documents. Something like:

    <html xmlns:gsp="http://www.google.com/ns/googlespam-1/">
    ...
    <a gsp:nofollow="1">...</a>
    </html>

The reason is that someone else might want to use (or already be using) nofollow for some other purpose (especially since it is not very intuitive as to what it does; perhaps norank would be better).
In terms of using it with the wiki, perhaps this should interact with validation (when it is implemented) or patrolling in some way. An ad hoc approach would be to drop the nofollow on links first added in a change that has already been patrolled. This would, however, require that this state be stored, maintained, and read in some way on a per-link basis, which may be too costly to justify (unless it also offered some other benefit to the wiki users).
[[zh:user:shizhao]]
On Wed, 19 Jan 2005, Andrew Miller wrote:
In terms of using it with the wiki, perhaps this should interact with validation (when it is implemented) or patrolling in some way. An ad hoc approach would be to drop the nofollow on links first added in a change that has already been patrolled. This would, however, require that this state be stored, maintained, and read in some way on a per-link basis, which may be too costly to justify (unless it also offered some other benefit to the wiki users).
I would also make this work with patrolling pages. If patrols are limited to a trusted group you can remove the nofollow flag from patrolled pages. I think a per-page flag would be enough for this.
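A minimal sketch of that per-page check, assuming a hypothetical isLatestRevisionPatrolled() lookup (neither the helper nor the patrol integration exists yet; the names are only for illustration):

    # Sketch only -- isLatestRevisionPatrolled() is a hypothetical helper.
    function externalLinkRel( $title ) {
        global $wgNoFollowLinks;
        if ( !$wgNoFollowLinks ) {
            return '';
        }
        // Once a trusted patroller has approved the current revision,
        // its external links can be treated as endorsed and left rankable.
        if ( isLatestRevisionPatrolled( $title ) ) {
            return '';
        }
        return ' rel="nofollow"';
    }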
Christof
shi zhao <shizhao <at> gmail.com> writes:
see http://www.google.com/googleblog/2005/01/preventing-comment-spam.html
Can we add rel="nofollow" to external links, to prevent wikispam?
[[zh:user:shizhao]]
We could use something similar to the Trackback Pings used by blogs.
[[zh:user:shizhao]]