Brion/Gregory,
Would it be possible to block Brandt's article from being scraped by the search engines via the main site's robots.txt file? It would help alleviate the current conflict and hopefully resolve the remaining issues between Daniel and Wikipedia. Once this final issue is addressed, I feel we will have done all we can to correct Daniel's bio and address his concerns. That said, there is a limit to how far good Samaritanism should go, and I think we have done about all we can here. The rest is up to Daniel.
Jeff
On 07/05/07, Jeff V. Merkey jmerkey@wolfmountaingroup.com wrote:
Would it be possible to block Brandt's article from being scraped by the search engines via the main site's robots.txt file? It would help alleviate the current conflict and hopefully resolve the remaining issues between Daniel and Wikipedia. Once this final issue is addressed, I feel we will have done all we can to correct Daniel's bio and address his concerns. That said, there is a limit to how far good Samaritanism should go, and I think we have done about all we can here. The rest is up to Daniel.
This idea has actually been suggested on wikien-l and met with a mostly positive response - selective noindexing of some living biographies. This would actually cut down the quantity of OTRS complaints tremendously. The response was not unanimous, I must note.
- d.
David Gerard wrote:
On 07/05/07, Jeff V. Merkey jmerkey@wolfmountaingroup.com wrote:
Would it be possible to block Brandt's article from being scraped by the search engines via the main site's robots.txt file? It would help alleviate the current conflict and hopefully resolve the remaining issues between Daniel and Wikipedia. Once this final issue is addressed, I feel we will have done all we can to correct Daniel's bio and address his concerns. That said, there is a limit to how far good Samaritanism should go, and I think we have done about all we can here. The rest is up to Daniel.
This idea has actually been suggested on wikien-l and met with a mostly positive response - selective noindexing of some living biographies. This would actually cut down the quantity of OTRS complaints tremendously. The response was not unanimous, I must note.
- d.
David,
I think in this case we should consider doing it, particularly if the subject of a BLP asks us to do so as a courtesy. I realize we do not have to accommodate anyone, but still, I think it would be the polite and considerate thing to do.
Jeff
On 5/7/07, David Gerard dgerard@gmail.com wrote:
On 07/05/07, Jeff V. Merkey jmerkey@wolfmountaingroup.com wrote:
Would it be possible to block Brandt's article from being scraped by the search engines via the main site's robots.txt file? It would help alleviate the current conflict and hopefully resolve the remaining issues between Daniel and Wikipedia. Once this final issue is addressed, I feel we will have done all we can to correct Daniel's bio and address his concerns. That said, there is a limit to how far good Samaritanism should go, and I think we have done about all we can here. The rest is up to Daniel.
This idea has actually been suggested on wikien-l and met with a mostly positive response - selective noindexing of some living biographies. This would actually cut down the quantity of OTRS complaints tremendously. The response was not unanimous, I must note.
This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. To purge the Google cache, you need to make the page return a 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google's cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
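For reference, the two "directive" mechanisms mentioned above look roughly like this (the article title is just a placeholder, not a real proposal for which pages to list):

    # robots.txt: tell compliant crawlers not to fetch a specific URL
    User-agent: *
    Disallow: /wiki/Example_Article

    <!-- or, emitted per page in the HTML <head>: ask engines not to index it -->
    <meta name="robots" content="noindex" />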
On Mon, May 07, 2007 at 03:51:48PM -0400, Gregory Maxwell wrote:
This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. To purge the Google cache, you need to make the page return a 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google's cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
With all due respect to... oh, whoever the hell thinks they deserve some: aren't we big enough to get a little special handling from Google? I should think that if we have a page get cached that's either been vandalised or in some other way exposes us to liability, then, as big as we are, and with as many highly ranked search results as we return on Google (we're often the top hit, and *very* often in the top 20), perhaps we might be able to access some *slightly* more prompt deindexing facility? At least for, say, our top 10 administrators?
Cheers, -- jra
On 5/7/07, Jay R. Ashworth jra@baylink.com wrote:
With all due respect to... oh, whoever the hell thinks they deserve some: aren't we big enough to get a little special handling from Google? I should think that if we have a page get cached that's either been vandalised or in some other way exposes us to liability, then, as big as we are, and with as many highly ranked search results as we return on Google (we're often the top hit, and *very* often in the top 20), perhaps we might be able to access some *slightly* more prompt deindexing facility? At least for, say, our top 10 administrators?
They've provided a reasonable API: "put something on the page that you want deindexed and tell them to visit it". I don't know what more we could ask for, since anything else would have to have some complex authentication system; using meta tags in the page avoids that problem nicely. We just need to support the API.
On 07/05/07, Gregory Maxwell gmaxwell@gmail.com wrote:
They've provided a reasonable API: "put something on the page that you want deindexed and tell them to visit it". I don't know what more we could ask for, since anything else would have to have some complex authentication system; using meta tags in the page avoids that problem nicely. We just need to support the API.
Add a tickybox that admins can use to noindex a page? When the box is ticked or unticked, the URL is sent to Google.
- d.
On Mon, May 07, 2007 at 10:49:46PM +0100, David Gerard wrote:
On 07/05/07, Gregory Maxwell gmaxwell@gmail.com wrote:
They've provided a reasonable API: "put something on the page that you want deindexed and tell them to visit it". I don't know what more we could ask for, since anything else would have to have some complex authentication system; using meta tags in the page avoids that problem nicely. We just need to support the API.
Add a tickybox that admins can use to noindex a page? When the box is ticked or unticked, the URL is sent to Google.
Well, my point was more "what, *exactly*, happens when you ask Google to index a page?" The issue sounded like "could we please get this pulled offline *RIGHT NOW*!?", and I have no reason to believe that the publicly available Google API for this has anything like that degree of control...
Cheers, -- jra
On 5/7/07, Jay R. Ashworth jra@baylink.com wrote:
Well, my point was more "what, *exactly*, happens when you ask Google to index a page?" The issue sounded like "could we please get this pulled offline *RIGHT NOW*!?", and I have no reason to believe that the publicly available Google API for this has anything like that degree of control...
It does exactly what I said it does in my prior post. It pulls pages *right* *now*, iff you've marked them as not to be indexed (through any of several mechanisms) and you tell it to hit them.
On Mon, May 07, 2007 at 06:40:02PM -0400, Gregory Maxwell wrote:
On 5/7/07, Jay R. Ashworth jra@baylink.com wrote:
Well, my point was more "what, *exactly*, happens when you ask Google to index a page?" The issue sounded like "could we please get this pulled offline *RIGHT NOW*!?", and I have no reason to believe that the publicly available Google API for this has anything like that degree of control...
It does exactly what I said it does in my prior post. It pulls pages *right* *now*, iff you've marked them as not to be indexed (through any of several mechanisms) and you tell it to hit them.
Such has not ever been my understanding of the publically accessible tools for this.
http://www.google.com/support/webmasters/bin/answer.py?answer=61062&topi...
notes that 'priority' requests will take 3-5 days, and that the pages won't reappear in the index for at *least* 180 days. That doesn't sound like a solution to this problem.
The only *other* removal procedures I see are mentioned at
http://www.google.com/support/webmasters/bin/answer.py?answer=35301
which notes that the changes will take effect *the next time Google crawls the site*.
And while there *used* to be a place where you could say "here's my URL, please crawl my site",
http://www.google.com/support/webmasters/bin/answer.py?answer=34397
suggests pretty strongly that there no longer is.
Could you expand on your assertion above, Greg?
Cheers, -- jra
Jay R. Ashworth wrote:
On Mon, May 07, 2007 at 06:40:02PM -0400, Gregory Maxwell wrote:
On 5/7/07, Jay R. Ashworth jra@baylink.com wrote:
Well, my point was more "what, *exactly*, happens when you ask Google to index a page?" The issue sounded like "could we please get this pulled offline *RIGHT NOW*!?", and I have no reason to believe that the publicly available Google API for this has anything like that degree of control...
It does exactly what I said it does in my prior post. It pulls pages *right* *now*, iff you've marked them as not to be indexed (through any of several mechanisms) and you tell it to hit them.
Such has not ever been my understanding of the publically accessible tools for this.
http://www.google.com/support/webmasters/bin/answer.py?answer=61062&topi...
notes that 'priority' requests will take 3-5 days, and that the pages won't reappear in the index for at *least* 180 days. That doesn't sound like a solution to this problem.
After 180 days, Google will start crawling it again. If it's still forbidden via robots.txt or a meta tag, it shouldn't matter.
The only *other* removal procedures I see are mentioned at
http://www.google.com/support/webmasters/bin/answer.py?answer=35301
which notes that the changes will take effect *the next time Google crawls the site*.
And while there *used* to be a place where you could say "here's my URL, please crawl my site",
http://www.google.com/support/webmasters/bin/answer.py?answer=34397
suggests pretty strongly that there no longer is.
On Mon, 07 May 2007 17:12:52 -0400, Jay R. Ashworth wrote:
On Mon, May 07, 2007 at 03:51:48PM -0400, Gregory Maxwell wrote:
This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. To purge the Google cache, you need to make the page return a 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google's cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
With all due respect to... oh, whoever the hell thinks they deserve some: aren't we big enough to get a little special handling from Google? I should think that if we have a page get cached that's either been vandalised or in some other way exposes us to liability, then, as big as we are, and with as many highly ranked search results as we return on Google (we're often the top hit, and *very* often in the top 20), perhaps we might be able to access some *slightly* more prompt deindexing facility? At least for, say, our top 10 administrators?
Cheers, -- jra
Doesn't this assume that:
1) The foundation is willing to self-censor its content.
2) Google will recognize that if a URL is marked like a crawler trap in robots.txt, but obviously isn't one, it means that the corresponding censored article shouldn't be crawled or extracted from the syndication dumps.
3) The foundation wants to set up a private channel of information exclusively for Google.
Misusing robots.txt is somewhat dubious when you don't publish XML dumps for syndication, and seems rather pointless when you do. Having a team of censors maintaining a secret blacklist to be sent to one corporation seems somewhat contrary to the foundation's goals.
There may be better ways to do it, but they wouldn't be as simple as adding a name to a file; and some may consider the ramifications of hiding an article like this to be more serious than deleting it, not less so.
Steve Sanbeg wrote:
On Mon, 07 May 2007 17:12:52 -0400, Jay R. Ashworth wrote:
On Mon, May 07, 2007 at 03:51:48PM -0400, Gregory Maxwell wrote:
This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. To purge the Google cache, you need to make the page return a 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google's cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
With all due respect to... oh, whoever the hell thinks they deserve some: aren't we big enough to get a little special handling from Google? I should think that if we have a page get cached that's either been vandalised or in some other way exposes us to liability, then, as big as we are, and with as many highly ranked search results as we return on Google (we're often the top hit, and *very* often in the top 20), perhaps we might be able to access some *slightly* more prompt deindexing facility? At least for, say, our top 10 administrators?
Cheers, -- jra
Doesn't this assume that:
- The foundation is willing to self-censor its content.
Preventing Google from scraping content at the request of BLP subjects is not censorship, and it sounds reasonable. It does not compromise Wikipedia, just external engines creating biased link summaries.
- Google will recognize that if a URL is marked like a crawler trap in robots.txt, but obviously isn't one, it means that the corresponding censored article shouldn't be crawled or extracted from the syndication dumps.
- The foundation wants to set up a private channel of information exclusively for Google.
Insert a noindex meta tag into the HTML output -- very easy and straightforward.
Jeff
Misusing robots.txt is somewhat dubious when you don't publish XML dumps for syndication, and seems rather pointless when you do. Having a team of censors maintaining a secret blacklist to be sent to one corporation seems somewhat contrary to the foundation's goals.
There may be better ways to do it, but they wouldn't be as simple as adding a name to a file; and some may consider the ramifications of hiding an article like this to be more serious than deleting it, not less so.
On 5/7/07, Gregory Maxwell gmaxwell@gmail.com wrote:
This would be very useful for another use case: sometimes Google will pick up a cached copy of a vandalized page. To purge the Google cache, you need to make the page return a 404 (which deletion doesn't do), put the page into a robots.txt deny, or include some directive in the page that stops indexing.
If we provided some directive to do one of the latter two (ideally the last), we could use it temporarily to purge Google's cached copies of vandalism... so it would even be useful for pages that we normally want to keep indexed.
For cases of vandalism, just changing indexing-related metadata on the page won't help: Googlebot must fetch the page to discover it's a 404, so you might as well just serve a reverted/corrected page instead. It seems that if someone is going to the trouble to flip the "has been vandalized" bit, they could also just revert the page to a pre-vandalism state.
You can learn more about getting pages deindexed from Google here: http://www.google.com/support/webmasters/bin/topic.py?topic=8459 There is a form for submitting URL removal requests, but in that case the URL won't be crawled again for six months. It's intended for emergencies where the content should completely disappear, like "I accidentally put all my customers' credit card numbers in a world-readable directory!" I don't think it's useful for this discussion.
I think the best solution for vandalism is to set up a system that allows pages to be marked as needing expedited recrawling. This wouldn't be for every updated page -- just those that someone with sufficient access (an admin?) had explicitly marked. (It'd be best if this information were *not* pushed directly to Google, because ideally every search engine would be able to make use of it.) We're currently looking into ways to let webmasters provide this sort of information. I'll get back to you when there's news.
(Disclaimer: I'm not an official company spokesman.)
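One way the publishing side of that idea could look from our end, sketched under the assumption that we simply expose a small sitemap of explicitly flagged pages with fresh <lastmod> values for any search engine to poll (the flagged-pages list and URL pattern here are invented for illustration, not a description of anything Google currently supports):

    # sketch: emit a sitemap of pages an admin has flagged for expedited recrawl
    from datetime import datetime, timezone
    from xml.sax.saxutils import escape

    BASE = "http://en.wikipedia.org/wiki/"  # example URL pattern

    def recrawl_sitemap(flagged_titles):
        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        urls = "".join(
            "  <url><loc>%s%s</loc><lastmod>%s</lastmod></url>\n"
            % (BASE, escape(title.replace(" ", "_")), now)
            for title in flagged_titles
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s</urlset>\n" % urls
        )

    print(recrawl_sitemap(["Example article"]))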
On Tue, May 08, 2007 at 11:13:07AM -0700, Evan Martin wrote:
I think the best solution for vandalism is to set up a system that allows pages to be marked as needing expedited recrawling. This wouldn't be for every updated page -- just those that someone with sufficient access (an admin?) had explicitly marked. (It'd be best if this information were *not* pushed directly to Google, because ideally every search engine would be able to make use of it.)
I concur that it would be useful if everyone could make use of it, but, given the target problem, if it *wasn't* pushed to the engines, I don't see how it would be useful... unless you pinged it every 15 minutes or so. And of course, there is a potential for abuse, no matter how it's implemented. Balancing the attendant risks is the underlying problem, as it always is.
Good to know you're here, though, Evan; we promise not to assume you actually work at Google or anything. :-)
Cheers, -- jra
On Tue, 08 May 2007 15:14:52 -0400, Jay R. Ashworth wrote:
On Tue, May 08, 2007 at 11:13:07AM -0700, Evan Martin wrote:
I think the best solution for vandalism is to set up a system that allows pages to be marked as needing expedited recrawling. This wouldn't be for every updated page -- just those that someone with sufficient access (an admin?) had explicitly marked. (It'd be best if this information were *not* pushed directly to Google, because ideally every search engine would be able to make use of it.)
I concur that it would be useful if everyone could make use of it, but, given the target problem, if it *wasn't* pushed to the engines, I don't see how it would be useful... unless you pinged it every 15 minutes or so. And of course, there is a potential for abuse, no matter how it's implemented. Balancing the attendant risks is the underlying problem, as it always is.
I'd think letting sites ping it as often as they want to update should be feasible. An extra request every 15 minutes isn't so much. If we were to do a push model, I think it'd be better to let anyone subscribe to the feed.
On 5/7/07, Jeff V. Merkey jmerkey@wolfmountaingroup.com wrote:
Brion/Gregory,
Would it be possible to block Brandt's article from being scraped by the search engines via the main site's robots.txt file?
I oppose this. We should not have articles invisible to search engines. Either the page gets deleted, or it stays. None of this monkey business.
On 5/7/07, Erik Moeller erik@wikimedia.org wrote:
I oppose this. We should not have articles invisible to search engines. Either the page gets deleted, or it stays. None of this monkey business.
Last I tried google wouldn't remove 'deleted' pages from the index because they still 'existed'. :(
Last I tried google wouldn't remove 'deleted' pages from the index because they still 'existed'. :(
That's a good point. Why doesn't MediaWiki return a 404 when a page isn't found? As far as I know, we could show exactly the same page; it's just a matter of changing the status header from 200 to 404. The same applies to redirects - HTTP has "moved" status codes (301/302), which we should return.
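The HTTP mechanics really are that simple: the status line and the body are independent. A toy WSGI sketch (deliberately not MediaWiki/PHP code, and with the page-existence check faked) of serving the same body with a 404 status:

    # toy example: same HTML body, but a 404 status when the page doesn't exist
    def app(environ, start_response):
        title = environ.get("PATH_INFO", "/").lstrip("/")
        exists = title in {"Main_Page"}  # stand-in for a real page-existence lookup
        body = b"<html><body>...the usual full skin goes here...</body></html>"
        status = "200 OK" if exists else "404 Not Found"
        start_response(status, [("Content-Type", "text/html; charset=utf-8"),
                                ("Content-Length", str(len(body)))])
        return [body]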
On 5/7/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Last I tried google wouldn't remove 'deleted' pages from the index because they still 'existed'. :(
That's a good point. Why doesn't MediaWiki return a 404 when a page isn't found?
http://bugzilla.wikimedia.org/show_bug.cgi?id=2585
It was tried and seemed to cause problems for some users.
Simetrical wrote:
On 5/7/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Last I tried google wouldn't remove 'deleted' pages from the index because they still 'existed'. :(
That's a good point. Why doesn't MediaWiki return a 404 when a page isn't found?
http://bugzilla.wikimedia.org/show_bug.cgi?id=2585
It was tried and seemed to cause problems for some users.
"Several users have reported being persistently unable to access nonexistent pages" is quite vague. Do you know what was the specific problem? Did they get a blank page? A niced 404? A MessageBox?
Platonides wrote:
Simetrical wrote:
On 5/7/07, Thomas Dalton thomas.dalton@gmail.com wrote:
Last I tried google wouldn't remove 'deleted' pages from the index because they still 'existed'. :(
That's a good point. Why doesn't MediaWiki return a 404 when a page isn't found?
http://bugzilla.wikimedia.org/show_bug.cgi?id=2585
It was tried and seemed to cause problems for some users.
"Several users have reported being persistently unable to access nonexistent pages" is quite vague. Do you know what was the specific problem? Did they get a blank page? A niced 404? A MessageBox?
We were unable to reproduce the problem at the time, so cannot say more specifically than that it was disruptive and there were multiple complaints.
-- brion vibber (brion @ wikimedia.org)
On 5/8/07, Brion Vibber brion@wikimedia.org wrote:
"Several users have reported being persistently unable to access nonexistent pages" is quite vague. Do you know what was the specific problem? Did they get a blank page? A niced 404? A MessageBox?
We were unable to reproduce the problem at the time, so cannot say more specifically than that it was disruptive and there were multiple complaints.
I do see that we're sending "<meta name="robots" content="noindex,nofollow" />" on non-existent pages -- that should be enough to allow Google to deindex a page. Hmm.
On 08/05/07, Gregory Maxwell gmaxwell@gmail.com wrote:
I do see that we're sending "<meta name="robots" content="noindex,nofollow" />" on non-existent pages -- that should be enough to allow Google to deindex a page. Hmm.
Get the software to send the URL to Google automatically upon deletion?
I believe en:wp runs about 6000 deleted pages per day at present ...
- d.
David Gerard wrote:
Get the software to send the URL to Google automatically upon deletion?
I believe en:wp runs about 6000 deleted pages per day at present ...
Easy. Set up a bot listening on RC and sending the retrieval URL for each deletion of a page more than X hours old (thus probably cached)... until Google starts asking you for a captcha.
Sending Google the URL should only be done if there's a problem with the Google cache (i.e. complaints, or deletions where you foresee problems), and these few cases can be sent manually by the sysop.
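A minimal sketch of the selection logic, assuming we only queue candidate URLs for manual submission by a sysop rather than automating anything against Google (the event source, age threshold, and URL pattern are placeholders):

    # sketch: collect deletion URLs old enough that Google probably cached them
    import time

    MIN_AGE = 6 * 3600                        # "X hours" before deletion
    BASE = "http://en.wikipedia.org/wiki/"

    def deletions_to_submit(deletion_events):
        """deletion_events: iterable of (title, page_creation_timestamp) pairs,
        however we end up reading them (IRC RC feed, API polling, ...)."""
        now = time.time()
        for title, created in deletion_events:
            if now - created >= MIN_AGE:      # existed long enough to be cached
                yield BASE + title.replace(" ", "_")

    # a sysop would then paste these into Google's URL removal tool by hand
    for url in deletions_to_submit([("Some deleted page", time.time() - 86400)]):
        print(url)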
On 5/8/07, Gregory Maxwell gmaxwell@gmail.com wrote:
On 5/8/07, Brion Vibber brion@wikimedia.org wrote:
"Several users have reported being persistently unable to access nonexistent pages" is quite vague. Do you know what was the specific problem? Did they get a blank page? A niced 404? A MessageBox?
We were unable to reproduce the problem at the time, so cannot say more specifically than that it was disruptive and there were multiple complaints.
I do see that we're sending "<meta name="robots" content="noindex,nofollow" />" on non-existent pages -- that should be enough to allow Google to deindex a page. Hmm.
If you can find a specific example of a page that you think shouldn't be indexed that is getting indexed, please email me directly. It's either a bug on Google's end or some confusion about the proper protocol and both of those can be diagnosed.
(Disclaimer: I'm not an official company spokesman.)
On Tue, May 08, 2007 at 01:30:27AM +0100, Thomas Dalton wrote:
That's a good point. Why doesn't MediaWiki return a 404 when a page isn't found? As far as I know, we could show exactly the same page; it's just a matter of changing the status header from 200 to 404. The same applies to redirects - HTTP has "moved" status codes (301/302), which we should return.
Notably, while I haven't tested it, I suspect that the default IE config's "friendly HTTP error messages" setting would cause problems with this.
Cheers, -- jra
On 5/8/07, Jay R. Ashworth jra@baylink.com wrote:
Notably, while I haven't tested it, I suspect that the default IE config's "friendly HTTP error messages" setting would cause problems with this.
I believe it cleverly figures out that it shouldn't meddle when the page length is greater than some fairly low threshold, so as to only filter out unhelpful default messages and not pages with actual content of any kind.
On Tue, May 08, 2007 at 03:25:20PM -0400, Simetrical wrote:
On 5/8/07, Jay R. Ashworth jra@baylink.com wrote:
Notably, while I haven't tested it, I suspect that the default IE config's "friendly HTTP error messages" setting would cause problems with this.
I believe it cleverly figures out that it shouldn't meddle when the page length is greater than some fairly low threshold, so as to only filter out unhelpful default messages and not pages with actual content of any kind.
I'll assume you're basing that on observed behavior, since I wouldn't expect anyone to assign cleverness to a Microsoft product just off-hand.
:-)
Cheers, -- jra
On 5/8/07, Jay R. Ashworth jra@baylink.com wrote:
On Tue, May 08, 2007 at 03:25:20PM -0400, Simetrical wrote:
On 5/8/07, Jay R. Ashworth jra@baylink.com wrote:
Notably, while I haven't tested it, I suspect that the default IE config's "friendly HTTP error messages" setting would cause problems with this.
I believe it cleverly figures out that it shouldn't meddle when the page length is greater than some fairly low threshold, so as to only filter out unhelpful default messages and not pages with actual content of any kind.
I'll assume you're basing that on observed behavior, since I wouldn't expect anyone to assign cleverness to a Microsoft product just off-hand.
Not personally observed, but I've read it.
Jay R. Ashworth wrote:
Notably, while I haven't tested it, I suspect that the default IE config's "friendly HTTP error messages" setting would cause problems with this.
IE's friendly error messages only show when the page is less than 512 bytes, which is not the case with MediaWiki (just checked: 11991 bytes for a non-existent article).
On 08/05/07, Erik Moeller erik@wikimedia.org wrote:
I oppose this. We should not have articles invisible to search engines. Either the page gets deleted, or it stays. None of this monkey business.
I concur, for a change. ;) Wikipedia is supposed to be an information resource, and search engines are the main means of finding such resources on the web. While I strongly sympathise with any victims of libel, and join the call to tighten up on it, I feel it's important to continue supporting indexing of all articles.
I would suggest a better idea would be to find some means of asking Google to update caches and indexes for a particular page "right now" for these cases.
Rob Church
Rob Church wrote:
On 08/05/07, Erik Moeller erik@wikimedia.org wrote:
I oppose this. We should not have articles invisible to search engines. Either the page gets deleted, or it stays. None of this monkey business.
I concur, for a change. ;) Wikipedia is supposed to be an information resource, and search engines are the main means of finding such resources on the web. While I strongly sympathise with any victims of libel, and join the call to tighten up on it, I feel it's important to continue supporting indexing of all articles.
I would suggest a better idea would be to find some means of asking Google to update caches and indexes for a particular page "right now" for these cases.
Rob Church
Rob,
We need a solution that balances both concerns. How about this: if an article is placed under WP:OFFICE, then it gets removed from the index. That would be manageable.
Erik,
There is such a thing as a First Amendment right of "expressive association". There may be a sound basis for reviewing the issues of search engines and their artificial pay-for-rankings schemes. I mean, if WP is free content, then why, again, should we allow Google to reorder our information for paying customers by scraping every template and corner of WP? There is no "higher morality" in allowing search engines to decide what we publish (since they filter it anyway for a fee), nor is it "more open".
Putting some controls there to get rid of more complaints might be worth the tradeoff, particularly since the whole "Google is divine and open" view is only an illusion - one Google wants all of us to believe.
Jeff
I oppose this. We should not have articles invisible to search engines. Either the page gets deleted, or it stays. None of this monkey business.
I agree. If the information is good enough to be on Wikipedia, it's good enough to appear in a Google search. We have no obligation to protect subjects of articles from getting upset (beyond libel law, of course). While it is always nice to try to keep people happy, we shouldn't be going out of our way to do so.