..even though it knows about them. Daniel Brandt explains the problem here: http://www.google-watch.org/dying.html
Of course that article is hyperbole, but you can see the problem if you search for articles on Wikipedia that do not include the word "MediaWiki" (which occurs on every page), e.g.:
http://www.google.com/search?q=site%3Aen.wikipedia.org+-mediawiki
This lists the pages on en.wikipedia.org that are not indexed, presently 255,000 (Google apparently has a list of URLs with no associated content). That includes dynamic URLs, redirects and other duplicates, but it also includes many pages that should be indexed. E.g. if I search for the text of the English article "Potassium", I currently get plenty of mirrors, but not Wikipedia itself.
I have a few questions, which someone with more Google-Fu than I may be able to answer:
1) Why exactly doesn't Google index these articles?
2) Is there a way we can make them index them?
3) Is there a clever way to get a list of only the actual articles that aren't indexed, as opposed to the redirects, edit pages and dupes? That would help in estimating the scope of the problem.
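On question 3, a rough first pass could be done with a small script. This is only a sketch under assumptions: the URL prefix and the namespace list below are my guesses at the URL layout, not an exhaustive or authoritative list, and a real run would also have to detect redirects by fetching the pages.

```python
# Hypothetical sketch: given a list of URLs from the site: query,
# keep only the ones that look like plain articles. The prefix and
# namespace markers are assumptions, not a complete list.

ARTICLE_PREFIX = "http://en.wikipedia.org/wiki/"

# Namespaces and page types that are not plain articles.
NON_ARTICLE_MARKERS = (
    "Special:", "Talk:", "User:", "Wikipedia:", "Image:",
    "MediaWiki:", "Template:", "Help:", "Category:",
)

def looks_like_article(url: str) -> bool:
    """Heuristically decide whether a URL is a plain article page."""
    if not url.startswith(ARTICLE_PREFIX):
        return False          # edit pages, wiki.phtml?... etc.
    if "?" in url or "&" in url:
        return False          # dynamic URLs (action=edit, oldid=...)
    title = url[len(ARTICLE_PREFIX):]
    return not title.startswith(NON_ARTICLE_MARKERS)

urls = [
    "http://en.wikipedia.org/wiki/Potassium",
    "http://en.wikipedia.org/wiki/Talk:Potassium",
    "http://en.wikipedia.org/w/wiki.phtml?title=Potassium&action=edit",
]
print([u for u in urls if looks_like_article(u)])
# Only the plain Potassium article URL survives the filter.
```

It would not catch redirects (those have normal article URLs), but it would at least strip out the dynamic URLs and non-article namespaces before a manual pass.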
It also makes clear to me that we can't just rely on Google to do our searching for us: its index is not just outdated, it is also quite incomplete. We really need to develop a working search infrastructure. In the meantime, we might want to add some of our "good mirrors" to the Google search query.
Regards,
Erik
I know too little about software development and such to be a reliable adviser, but maybe I can offer a suggestion.
I use glimpse locally to search my server's hard disks for stuff. I understand every page of Wikipedia is held in a cache, or is it the proxy? Couldn't that cache or proxy be indexed by glimpseindex? Or even the database? That would be fast and reliable.
--tic
On Mon, Sep 13, 2004 at 08:56:55AM +0200, tic@tictric.net wrote:
I know too little about software development and such to be a reliable adviser, but maybe I can offer a suggestion.
I use glimpse locally to search my server's hard disks for stuff. I understand every page of Wikipedia is held in a cache, or is it the proxy? Couldn't that cache or proxy be indexed by glimpseindex? Or even the database? That would be fast and reliable.
As far as I know, glimpse can't update documents on the fly; its search index is updated in a batch run.
The current MySQL-based search engine updates on the fly, and with the newly arrived servers it should perform fine.
Regards,
JeLuF
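The distinction JeLuF draws can be illustrated with a toy inverted index. This is not the actual MediaWiki or glimpse code, just a minimal sketch: a batch indexer rebuilds everything from scratch, while an on-the-fly indexer folds each new edit into the existing index as it happens.

```python
# Toy illustration of batch vs. on-the-fly index updates.
# Not the real MediaWiki/glimpse implementation; a simplification
# (e.g. it ignores words removed when a page is edited).
from collections import defaultdict

def build_index(pages):
    """Batch: rebuild the whole inverted index from scratch."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in text.lower().split():
            index[word].add(title)
    return index

def update_index(index, title, text):
    """On the fly: fold one new/edited page into an existing index."""
    for word in text.lower().split():
        index[word].add(title)

pages = {"Potassium": "Potassium is a chemical element"}
index = build_index(pages)                               # batch run
update_index(index, "Sodium", "Sodium is a chemical element")
print(sorted(index["element"]))   # both pages found, no full rebuild
```

With glimpse, new pages would only become searchable after the next batch run; with per-edit updates, a page is searchable as soon as it is saved, which is the property that matters for a wiki.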
On Mon, Sep 13, 2004 at 08:09:00AM +0200, Erik Moeller wrote:
I have a few questions, which someone with more Google-Fu than I may be able to answer:
- Why exactly doesn't Google index these articles?
First, look at our robots.txt: it shows that edit pages and the like are not allowed to be crawled. That is as it should be. The real articles in that list are mostly brand-new or badly linked from other articles.
ciao, tom
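Tom's point about robots.txt can be checked mechanically with Python's standard-library robots.txt parser. The rules below are only an assumed sketch of what a MediaWiki robots.txt might contain (articles under /wiki/, scripts under /w/), not a copy of the real file.

```python
# Illustrative check of the robots.txt behaviour Tom describes.
# The Disallow rule is an assumption about the site layout, not
# the actual en.wikipedia.org robots.txt.
import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /w/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Plain article pages live under /wiki/ and stay crawlable...
print(rp.can_fetch("*", "http://en.wikipedia.org/wiki/Potassium"))
# ...while edit pages go through /w/ and are blocked.
print(rp.can_fetch("*",
      "http://en.wikipedia.org/w/wiki.phtml?title=Potassium&action=edit"))
```

So edit pages being absent from the index is expected; the interesting cases are plain /wiki/ article URLs that a crawler is allowed to fetch and still doesn't index.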
Thomas R. Koll wrote:
On Mon, Sep 13, 2004 at 08:09:00AM +0200, Erik Moeller wrote:
- Why exactly doesn't Google index these articles?
First, look at our robots.txt: it shows that edit pages and the like are not allowed to be crawled. That is as it should be. The real articles in that list are mostly brand-new or badly linked from other articles.
ciao, tom
When you look for [[nl:Oostvaardersplassen]] in Google, using the Wikipedia Google search, you will not find it presented properly. You will, however, find an entry for http://nl.wikipedia.org/wiki/Oostvaardersplassen, which is the article in question, but the result should have been formatted like "Oostvaardersplassen - Wikipedia NL" (as it is, for example, for http://nl.wikipedia.org/wiki/Veluwe). In the local rankings among the articles that contain the word "Oostvaardersplassen", it ranks low even though it is *the* article about the subject.
This article is not new, and there are no problems with its linking that I can see: it has 39 references and was started on 21 January 2004.
This is just to show that there ARE problems with Google, and that they are not limited to new or badly linked articles.
Thanks, GerardM
On Mon, Sep 13, 2004 at 08:09:00AM +0200, Erik Moeller wrote:
It also makes clear to me that we can't just rely on Google to do our searching for us: its index is not just outdated, it is also quite incomplete. We really need to develop a working search infrastructure.
The search servers have arrived and are currently being installed. So this should improve soon.
Regards,
JeLuF
Erik Moeller wrote:
..even though it knows about them. Daniel Brandt explains the problem here: http://www.google-watch.org/dying.html
Has anyone investigated other search engines?
It seems that since around 2001 or so, people have begun to regard Google as if it were the only search engine on the planet...
Timwi