Hello all,
Googling for something for the Polish Wikipedia, I discovered that the test site test.wikipedia.com is indexed by Google's robots, edit pages included.
Surely we don't want this to happen?
Regards, Kpjas.
On Tue, 2003-04-08 at 08:34, Krzysztof P. Jasiutowicz wrote:
Googling for something for the Polish Wikipedia, I discovered that the test site test.wikipedia.com is indexed by Google's robots, edit pages included.
Surely we don't want this to happen?
I forgot that even exists... That's on the old server, which I have no control over. It should be removed entirely, or at least the DNS pointed at the new server.
-- brion vibber (brion @ pobox.com)
Brion Vibber (brion@pobox.com) wrote: On Tue, 2003-04-08 at 08:34, Krzysztof P. Jasiutowicz wrote:
Googling for something for the Polish Wikipedia, I discovered that the test site test.wikipedia.com is indexed by Google's robots, edit pages included.
Surely we don't want this to happen?
I forgot that even exists... That's on the old server, which I have no control over. It should be removed entirely, or at least the DNS pointed at the new server.
While we're at it, I should probably make a LocalSettings option for test.wikipedia.org to send "noindex" metas for /all/ pages.
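For reference, the "noindex" meta such an option would emit looks like the fragment below; the setting name is hypothetical, only the meta tag itself is the standard mechanism:

```html
<!-- emitted in the page <head> when e.g. $wgNoIndexAll is set
     ($wgNoIndexAll is a hypothetical LocalSettings option name) -->
<meta name="robots" content="noindex,nofollow">
```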
On Tue, 2003-04-08 at 11:29, Lee Daniel Crocker wrote:
While we're at it, I should probably make a LocalSettings option for test.wikipedia.org to send "noindex" metas for /all/ pages.
# robots.txt for http://test.wikipedia.org/
User-agent: *
Disallow: /wiki
Disallow: /w
-- brion vibber (brion @ pobox.com)
I removed the VirtualHost entry for that name from the old server, so surfers going to that name will end up at www.bomis.com. I'm not quite sure why that is... The Apache setup on that machine is a little strange.
do you want the name test.wikipedia.com pointed to the new server?
Jason
On Tue, 8 Apr 2003, Jason Richey wrote:
I removed the VirtualHost entry for that name from the old server, so surfers going to that name will end up at www.bomis.com. I'm not quite sure why that is... The Apache setup on that machine is a little strange.
Thanks! As long as they're not going to an old test site and thinking they can edit articles...
do you want the name test.wikipedia.com pointed to the new server?
Could do; no big rush, though.
One amusing thing I did notice: do a google search on "wikipedia totally offline". Google picks the most annoying times to reindex, doesn't it? :)
-- brion vibber (brion @ pobox.com)
Jason Richey wrote:
I removed the VirtualHost entry for that name from the old server, so surfers going to that name will end up at www.bomis.com. I'm not quite sure why that is... The Apache setup on that machine is a little strange.
do you want the name test.wikipedia.com pointed to the new server?
It seems sensible that anyone who visits http://test.wikipedia.com/* be forwarded to http://www.wikipedia.org/*, rather than to the Bomis homepage, which may astonish them.
--Jimbo
I agree. The problem was that I couldn't figure out why it was originally redirecting to Bomis in the first place...
Turns out it was using a "redirect" directive from the first VirtualHost section (even though it didn't match the hostname). So, that's fixed...
Jason
On Wed, 2003-04-09 at 04:58, Jimmy Wales wrote:
It seems sensible that anyone who visits http://test.wikipedia.com/* be forwarded to http://www.wikipedia.org/*, rather than to the Bomis homepage, which may astonish them.
That would still astonish them, as test.wikipedia.com contained pages in Polish. :) If it must be redirected, it should be redirected to pl.wikipedia.org/*.
-- brion vibber (brion @ pobox.com)
On 8 Apr 2003 at 11:20, Brion Vibber wrote:
On Tue, 2003-04-08 at 08:34, Krzysztof P. Jasiutowicz wrote:
Googling for something for the Polish Wikipedia, I discovered that the test site test.wikipedia.com is indexed by Google's robots, edit pages included.
Surely we don't want this to happen?
I forgot that even exists... That's on the old server, which I have no control over. It should be removed entirely, or at least the DNS pointed at the new server.
I know that this issue is like a yo-yo, but maybe there's something more that admins can do. Somebody at the Polish Wikipedia, searching Google for wibrator site:org, found this: pl.wikipedia.org/w/wiki.phtml?title=Wibrator&action=edit. And this is not the way we want to welcome newcomers, right?
Another issue: looking at http://pl.wikipedia.org/robots.txt we can see the following (caution: the address in the comment below may be misleading, but that's what is actually served!)
# robots.txt for http://www.wikipedia.org/
User-agent: *
Disallow: /wiki/Special:Maintenance
Disallow: /w/
If so, then how are _our_ Special pages excluded from indexing? As a reminder, our namespace is called Specjalna, so I suggest this replacement:
Disallow: /wiki/Specjalna:Maintenance
BTW, why don't we put just
Disallow: /wiki/Specjalna:
Regards Youandme
Currently the edit forms and other things are accessed by a URL similar to the one for viewing pages, but with an added query string: ..wiki.phtml?title=Name&action=edit vs. ..wiki.phtml?title=Name.
It might be easier to play nice with bots and indexers if we change the non-viewing URLs to something entirely different, say ..edit.phtml?title=Name, ..special.phtml?title=Name, etc. This might make it easier to use robots.txt to exclude things, for example, since some bots apparently don't listen to meta tags the way they should.
On the other hand, it would break whatever tools already depend on the old URLs, like Magnus's offline reader, etc. Do the techies think this might be worth the hassle?
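Under such a scheme, the robots.txt exclusions could name the scripts directly; a sketch assuming the file names Lee suggests, which are hypothetical:

```
# robots.txt - hypothetical rules for the proposed URL scheme
User-agent: *
Disallow: /w/edit.phtml
Disallow: /w/special.phtml
```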
On Thu, 2003-04-10 at 18:12, Lee Daniel Crocker wrote:
Currently the edit forms and other things are accessed by a URL similar to the one for viewing pages, but with an added query string: ..wiki.phtml?title=Name&action=edit vs. ..wiki.phtml?title=Name.
Well, not exactly. Viewing pages is done with /wiki/Name, which is very distinct from other operations at /w/wiki.phtml?title=Name&otherstuff. The primary advantage of this is being able to block off all non-view operations from web search indexing in robots.txt.
EXCEPT FOR FRICKING GOOGLE GRRRRGH. :)
But I'm sure it's just a misunderstanding. :)
-- brion vibber (brion @ pobox.com)
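The effect of the /wiki/ vs. /w/ split Brion describes can be checked with Python's standard robots.txt parser; a small sketch using the rule quoted above:

```python
from urllib.robotparser import RobotFileParser

# The live rule: view URLs live under /wiki/, all other operations under /w/.
rules = """User-agent: *
Disallow: /w/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Edit URLs fall under /w/ and are excluded from crawling...
print(rp.can_fetch("Googlebot",
      "http://pl.wikipedia.org/w/wiki.phtml?title=Wibrator&action=edit"))  # False
# ...while plain article views under /wiki/ remain crawlable.
print(rp.can_fetch("Googlebot", "http://pl.wikipedia.org/wiki/Wibrator"))  # True
```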
On Thu, Apr 10, 2003 at 07:50:04PM -0700, Brion Vibber wrote:
On Thu, 2003-04-10 at 18:12, Lee Daniel Crocker wrote:
Currently the edit forms and other things are accessed by a URL similar to the one for viewing pages, but with an added query string: ..wiki.phtml?title=Name&action=edit vs. ..wiki.phtml?title=Name.
Well, not exactly. Viewing pages is done with /wiki/Name, which is very distinct from other operations at /w/wiki.phtml?title=Name&otherstuff. The primary advantage of this is being able to block off all non-view operations from web search indexing in robots.txt.
EXCEPT FOR FRICKING GOOGLE GRRRRGH. :)
But I'm sure it's just a misunderstanding. :)
-- brion vibber (brion @ pobox.com)
I think the misunderstanding is not on Google's part. As far as I can tell, Google isn't indexing that page.
A quick search on google for "wibrator wikipedia" shows a subsection for the edit link. Note that it doesn't have any 'Cached' link. This means that google saw a link to the edit page in a page that could be indexed.
So, stop being mean to my pet. <pats google softly on its head>
On Thu, 2003-04-10 at 19:58, Nick Reinking wrote:
I think the misunderstanding is not on Google's part. As far as I can tell, Google isn't indexing that page.
A quick search on google for "wibrator wikipedia" shows a subsection for the edit link. Note that it doesn't have any 'Cached' link. This means that google saw a link to the edit page in a page that could be indexed.
I didn't say it was being cached, that its content could be word-searched, or that it had been spidered through to other pages. I said it was *indexed*. Now, maybe Google uses some word other than "indexed" to mean "contained in a database of links which are shown to users when they search for words contained in the link". I'll buy that. Maybe the word they use is "florble". In that case, the page is being florbled despite our best efforts to stop it from being florbled.
Is there any way we can tell google not to florble pages that are explicitly excluded by our robots.txt file so that people will stop complaining to *us* about google's overzealous florbling?
Hypothetically we could jimmy the page to not produce edit links if the user agent is googlebot, but that would be very annoying for several reasons:
1) The google-cached page would be missing those links.
2) This would screw with page caching. Google hits a lot of pages, and we'd have to either not cache any of its hits or be very careful in coding around it.
-- brion vibber (brion @ pobox.com)
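The user-agent trick Brion describes (and argues against) would amount to something like the helper below; a minimal Python sketch for illustration, not actual MediaWiki code, and the function name is made up:

```python
def should_show_edit_links(user_agent: str) -> bool:
    """Hide edit links from known crawlers (hypothetical helper, sketching
    the approach Brion rejects: it breaks caching and the cached copy)."""
    crawlers = ("googlebot",)
    return not any(bot in user_agent.lower() for bot in crawlers)

print(should_show_edit_links("Mozilla/5.0"))   # True: ordinary browser sees edit links
print(should_show_edit_links("Googlebot/2.1 (+http://www.googlebot.com/bot.html)"))  # False
```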
On Thu, Apr 10, 2003 at 08:46:21PM -0700, Brion Vibber wrote:
On Thu, 2003-04-10 at 19:58, Nick Reinking wrote:
I think the misunderstanding is not on Google's part. As far as I can tell, Google isn't indexing that page.
A quick search on google for "wibrator wikipedia" shows a subsection for the edit link. Note that it doesn't have any 'Cached' link. This means that google saw a link to the edit page in a page that could be indexed.
I didn't say it was being cached, that its content could be word-searched, or that it had been spidered through to other pages. I said it was *indexed*. Now, maybe Google uses some word other than "indexed" to mean "contained in a database of links which are shown to users when they search for words contained in the link". I'll buy that. Maybe the word they use is "florble". In that case, the page is being florbled despite our best efforts to stop it from being florbled.
Is there any way we can tell google not to florble pages that are explicitly excluded by our robots.txt file so that people will stop complaining to *us* about google's overzealous florbling?
Hypothetically we could jimmy the page to not produce edit links if the user agent is googlebot, but that would be very annoying for several reasons:
- The google-cached page would be missing those links.
- This would screw with page caching. Google hits a lot of pages, and we'd have to either not cache any of its hits or be very careful in coding around it.
-- brion vibber (brion @ pobox.com)
I've always understood 'indexed' to mean 'downloaded the entire page and added its contents to a searchable database.' As far as I know, robots.txt just tells Google (and everybody else) not to download the page; it doesn't say they can't link to it. Since the Masturbacja page says to follow links, but robots.txt says not to index the edit pages, Google does the sensible thing: it creates the link in its database, but doesn't index the content. Go figure; the Google engineers would probably cooperate with you if you asked them nicely. :)
I don't think it would be a good idea, but we could replace the "edit this page" link with a form and button that do the same thing. Surely Google doesn't submit forms when indexing. (BTW, I was quite amused by the florbling discussion.)
Jason
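Jason's form-and-button idea would replace the edit anchor with markup along these lines; a sketch only, on the assumption that crawlers follow links but do not submit forms:

```html
<!-- instead of: <a href="/w/wiki.phtml?title=Name&action=edit">Edit this page</a> -->
<form method="post" action="/w/wiki.phtml">
  <input type="hidden" name="title" value="Name">
  <input type="hidden" name="action" value="edit">
  <input type="submit" value="Edit this page">
</form>
```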
"Brion Vibber" wrote:
I didn't say it was being cached, that its content could be word-searched, or that it had been spidered through to other pages. I said it was *indexed*. Now, maybe Google uses some word other than "indexed" to mean "contained in a database of links which are shown to users when they search for words contained in the link". I'll buy that. Maybe the word they use is "florble". In that case, the page is being florbled despite our best efforts to stop it from being florbled.
Is there any way we can tell google not to florble pages that are explicitly excluded by our robots.txt file so that people will stop complaining to *us* about google's overzealous florbling?
As I understand it:
The problem is that GoogleBot works in two steps.
The first step is collecting URLs and adding them to their database, without checking them or retrieving the pages. This step honors neither robots.txt nor the meta-noindex of the linked pages; the meta-nofollow of the page containing the links is probably honored.
The second step (which can occur some weeks later) is taking URLs from their database and retrieving the pages. When a page is excluded by the respective robots.txt or by a meta-noindex, it is deleted from the database. (At the same time, step one is done with the links on this page.)
Between those two steps, the URL stays in the database, and whenever it contains the search words (in the URL itself) it is shown as a search result.
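The two-step behaviour described above can be sketched as a toy simulation; between the steps, a disallowed URL still surfaces as a bare-link result. This is purely illustrative, not Google's actual code:

```python
# Toy model of the two-step crawler: collect first, filter much later.
url_db = set()

def step1_collect(links):
    """Step 1: harvest URLs into the database without fetching or filtering."""
    url_db.update(links)

def step2_fetch(url, disallowed_prefixes):
    """Step 2 (weeks later): fetch the URL; drop it if an exclusion forbids it."""
    if any(url.startswith(p) for p in disallowed_prefixes):
        url_db.discard(url)

def search(word):
    """Between the steps, a URL matches on words appearing in the URL itself."""
    return sorted(u for u in url_db if word.lower() in u.lower())

edit_url = "http://pl.wikipedia.org/w/wiki.phtml?title=Wibrator&action=edit"
step1_collect([edit_url])
print(search("wibrator"))   # the edit URL shows up as a bare link
step2_fetch(edit_url, ["http://pl.wikipedia.org/w/"])
print(search("wibrator"))   # empty once the exclusion is finally seen
```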
Hypothetically we could jimmy the page to not produce edit links if the user agent is googlebot, but that would be very annoying for several reasons:
- The google-cached page would be missing those links.
- This would screw with page caching. Google hits a lot of pages, and we'd have to either not cache any of its hits or be very careful in coding around it.
What about changing the edit URLs so that they don't contain anything people would search for?
For example
http://pl.wikipedia.org/w/wiki.phtml?title=W.i.b.r.a.t.o.r&action=edit
or
http://pl.wikipedia.org/w/wiki.phtml?articlenum=12345678&action=edit
Paul
Paul Ebermann wrote:
The second step (which can occur some weeks later) is taking URLs from their database and retrieving the pages. When a page is excluded by the respective robots.txt or by a meta-noindex, it is deleted from the database.
I'm trying to understand the causes of this problem, because I don't want it to happen to me. So far it hasn't happened to me, and it has been a mystery to me why it happens to Wikipedia. For susning.nu, I use "meta noindex" only (no robots.txt), and I never see any edit links on Google.
What you just wrote made me think that performance might be the key to the problem: perhaps the Wikipedia server was slow, timed out, or down when Googlebot tried to retrieve the edit URL, so Google never learned that the page was "noindex". Given Wikipedia's large number of pages and sluggish performance, this could easily happen to some pages. Not all, but enough of them to become a problem.
The only real difference that I can think of between susning.nu and Wikipedia is that my site's performance has always been pretty good.
Lee-
On the other hand, it would break whatever tools already depend on the old URLs, like Magnus's offline reader, etc.
It is doubtful whether any new scheme would stop search engines from indexing the content. The thing is, search engines try increasingly to get snapshots of dynamic web content, so they grab almost everything.
In addition, existing links would stop working. I may be the only one who does such a bizarre thing, but I occasionally pass around direct edit URLs to Wikipedia articles to demonstrate how it works ("take a look at _this article_, and if you find an error, you can _edit it_ right now").
Perhaps we should return an HTTP error code for known search-engine agents accessing edit pages? No search engine I know of indexes 403 or 404 pages.
Regards,
Erik
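Erik's suggestion amounts to a user-agent check at the top of the edit handler; a hedged Python sketch (MediaWiki itself is PHP, and the function name and agent list here are assumptions):

```python
def status_for_edit_request(user_agent: str) -> int:
    """Return 403 for known search-engine agents requesting action=edit,
    200 otherwise (hypothetical helper illustrating Erik's idea)."""
    search_engines = ("googlebot", "slurp", "teoma")
    if any(bot in user_agent.lower() for bot in search_engines):
        return 403
    return 200

print(status_for_edit_request("Googlebot/2.1"))  # 403: crawler gets an error page
print(status_for_edit_request("Mozilla/5.0"))    # 200: humans still get the edit form
```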
On Thu, 2003-04-10 at 17:28, Youandme wrote:
I know that this issue is like a yo-yo, but maybe there's something more that admins can do. Somebody at the Polish Wikipedia, searching Google for wibrator site:org, found this: pl.wikipedia.org/w/wiki.phtml?title=Wibrator&action=edit. And this is not the way we want to welcome newcomers, right?
I would *love* it if anyone knows how to fix this and can tell us how to do so.
These pages are already forbidden by robots.txt and by meta tags in the pages themselves. Either Google is turning them up despite this information, or it has somehow indexed them at inopportune moments (like the server being down) and includes them in results despite having gotten no response / a 403 / a 404.
Should I e-mail google about it?
BTW, why don't we put just
Disallow: /wiki/Specjalna:
Not a bad idea, I suppose. (And yes, those ought to be localized.) I have the vague memory that people wanted things like recentchanges to be indexable, but I'm not sure it matters much.
-- brion vibber (brion @ pobox.com)