I just did a search on Google for "Okhrana" and came up with www.wikipedia.org/w/wiki.phtml?title=Okhrana&action=edit as the 10th hit. But that's a link to an edit page (and for some reason one without the "You've followed a link to a page that doesn't exist..." explanation.)
If edit pages show up in Google like this, that must really increase the number of confused nonsense edits. (It was looking for information for a stub to replace one of these that made me do the search in the first place.)
Isn't there some way of excluding edit pages from being seen by Google?
sannse
sannse sannse@delphiforums.com wrote: I just did a search on Google for "Okhrana" and came up with www.wikipedia.org/w/wiki.phtml?title=Okhrana&action=edit as the 10th hit. But that's a link to an edit page.

I think the problem was that Google had cached our page, and I just deleted it. You therefore got sent to a nonexistent entity.

Zoe
On Thu, 2003-03-20 at 22:52, Zoe wrote:
sannse sannse@delphiforums.com wrote: I just did a search on Google for "Okhrana" and came up with www.wikipedia.org/w/wiki.phtml?title=Okhrana&action=edit as the 10th hit. But that's a link to an edit page
I think the problem was that Google had cached our page, and I just deleted it. You therefore got sent to a nonexistent entity.
That wouldn't have gone to an edit page, just to a blank page. The problem here is that an actual edit URL got into google at some point and is still coming up in results.
Sannse, we *do* exclude edit pages from google's and other bots' spiders, doubly:
* robots.txt excludes access to the /w/ subdirectory, and thus all direct script actions (edits, histories, diffs, printable mode, changing options/length on recentchanges, etc), so it shouldn't be touching them at all.
* edit pages and such have meta tags telling robots "noindex,nofollow"; i.e., if they do end up with such a page, they shouldn't index it and shouldn't follow links from it, but should just toss the page out and go back where they came from.
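A minimal sketch of the two exclusion mechanisms described above (the exact paths and markup emitted by the wiki software may differ):

```text
# robots.txt - keeps compliant crawlers out of the script directory
# entirely, and thus away from all edit/history/diff URLs
User-agent: *
Disallow: /w/
```

```html
<!-- sketch of the tag emitted in the <head> of edit pages and other
     script output; a robot that fetches the page anyway is told to
     discard it and not follow its links -->
<meta name="robots" content="noindex,nofollow">
```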
A few have somehow gotten through. I'm not sure how. They may be old and not yet flushed (googlebot is still going over the site and hasn't reindexed every page yet). Note that in the google results there's no summary extract, no cache, no notice of the size. It's just a raw URL sitting there in the results. That's weird and wrong, and to me indicates a problem in their index.
(and for some reason one without the "You've followed a link to a page that doesn't exist.." explanation.)
Hmm, I *do* see that message when I follow the link.
-- brion vibber (brion @ pobox.com)
Brion wrote (in part):
Hmm, I *do* see that message when I follow the link.
So do I now (could have sworn it wasn't there before). That must reduce the confusion factor somewhat.
So I guess we just wait for Google to catch up. In most cases I don't think it will matter anyway; the links will be very low down on the list of hits. "Okhrana" was unusual in that the link was visible on the first page of the search - and that made me wonder if the page had been created by someone following that link.
Thanks for the explanation,
sannse
On 20 Mar 2003, Brion Vibber wrote:
A few have somehow gotten through. I'm not sure how. They may be old and not yet flushed (googlebot is still going over the site and hasn't reindexed every page yet). Note that in the google results there's no summary extract, no cache, no notice of the size. It's just a raw URL sitting there in the results. That's weird and wrong, and to me indicates a problem in their index.
This is more than a few - do a search for "site:wikipedia.org action=edit". That gives an estimated 148,000 hits; only the first two seem NOT to be edit pages. It seems to me that Google works according to one of the following two procedures:

* When a URL is forbidden by robots.txt, they keep the URL in the database but do not follow it. Thus, the page will be in the database with only its URL, without any title or content.
* When a new URL is found, it is added to the database in the manner described above. Only at a later stage is it found to be forbidden by robots.txt and thrown out again - to then be moved back in when a link to it is found.
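The robots.txt side of this can be checked mechanically. Here is a small sketch using Python's standard urllib.robotparser against a hypothetical excerpt of the rule described in this thread (the live file on wikipedia.org may differ):

```python
from urllib import robotparser

# Hypothetical excerpt of the robots.txt described in this thread;
# the actual file may contain more rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# All direct script actions (edit, history, diff, ...) live under /w/,
# so a compliant crawler should refuse to fetch edit URLs...
edit_url = "http://www.wikipedia.org/w/wiki.phtml?title=Okhrana&action=edit"
print(rp.can_fetch("Googlebot", edit_url))  # False

# ...while ordinary article URLs outside /w/ remain crawlable.
print(rp.can_fetch("Googlebot", "http://www.wikipedia.org/wiki/Okhrana"))  # True
```

A crawler that honours these rules should never even request an edit URL - which is consistent with Andre's theory that Google stores forbidden URLs as bare entries (no title, no extract, no cache) rather than fetching them.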
Andre Engels
At 11:41 PM 3/20/03 -0800, Brion wrote:
- robots.txt excludes access to the /w/ subdirectory, and thus all direct script actions (edits, histories, diffs, printable mode, changing options/length on recentchanges, etc), so it shouldn't be touching them at all.
Google is in the habit of ignoring robots.txt files. (This keeps coming up on LiveJournal, where the support volunteers have to explain to people that even if they've asked not to be indexed, they'll have to contact Google and ask to be removed.)
Zoe wrote:
I think the problem was that Google had cached our page, and I just deleted it. You therefore got sent to a nonexistent entity.
I don't think so, because the link on Google was to the edit page itself, not to the article (see the end of the link, "&action=edit") - that's what worried me.
The page only existed for a short time: I noticed it on "recent changes" and blanked it, then Googled to find replacement text. You deleted it while I was gone (quite rightly, of course :) ). So I don't think Google would have been likely to pick it up in that short time anyway.
I did a little more looking around. Google searches for some other non-existent pages have the same problem (I added "wikipedia" to the search to speed things up):
wikipedia "Spinal tumor" www.wikipedia.org/w/wiki.phtml?title=Spinal_tumor&action=edit
wikipedia "Alexander disease" www.wikipedia.org/w/wiki.phtml?title=Alexander's_disease&action=edit
It doesn't come up with all non-existent pages though: wikipedia "Benign Essential Blepharospasm" was OK, for example.
sannse