Re: [Foundation-l] excluding Wikipedia clones from searching - Wikimedia-l

10 Dec 2010


      On Fri, Dec 10, 2010 at 9:54 PM,  WJhonson@aol.com wrote:
...
In a message dated 12/10/2010 12:48:31 PM Pacific Standard Time,
jamesmikedupont@googlemail.com writes:
I am not talking about books, just webpages.
Let's take ladygaga.com as an example.
Wayback engine:
http://web.archive.org/web/*/http://www.ladygaga.com
Google cache:
http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladyga...
Here are two copies of copyrighted material. We should make sure that
our referenced webpages are in archive.org or mirrored on some server.
Ideally we would have our own search engine and cache.
mike
I have no problem with the idea of pointing refs to a page on archive.org,
however, you must understand that even previously archived pages *may* be
removed from archive.org at the owner's request or even because of a
robots.txt entry.
The only advantage I see in using archive.org instead of a plain link is
the ability to see what a page *looked* like in the past.  I'm not sure
that's a great advantage.  Why do you think it is?  If a page comes down,
should we not err on the side of assuming the owner no longer wants it
public?  And if the owner doesn't want it public, are we to make sure it
stays public by caching it against their will?
Both Google and Archive.org (much to my utter dismay) obey certain rules set
up by web page owners to not index certain pages, or to remove them from
caching history entirely (even old copies).  Are you suggesting we disregard
those rules?  If not, then I see no advantage in our caching pages which are
available in caches already.
My point is that we should index them ourselves. The pages used as
references should first be listed in an easy-to-use manner, and if
possible we should cache them. If they are not cacheable because of
some restrictions, the references should be marked somehow as lower
quality, so that people can look for better ones. In the end, as with
CiteSeer, you will find that pages that are available, open, and
cacheable will be cited and used more than pages that are not.
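
To make the "mark references that are not cached" idea concrete, here is a
minimal sketch in Python. It assumes the Wayback Machine's public
availability endpoint (not something named in this thread) and its JSON
response shape; the function names and the layout of the returned record
are hypothetical, chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Wayback Machine availability endpoint (not named in the thread).
WAYBACK_API = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query URL for one referenced page."""
    return WAYBACK_API + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(payload):
    """Return the URL of the closest archived copy, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def mark_reference(page_url, payload):
    """Hypothetical record layout: flag a reference as archived or not."""
    snapshot = closest_snapshot(payload)
    return {"url": page_url, "archived": snapshot is not None, "snapshot": snapshot}


def fetch_availability(page_url, timeout=10):
    """Query the endpoint over the network and decode the JSON response."""
    with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
        return json.load(resp)
```

Something like mark_reference(url, fetch_availability(url)) could then be run
over each reference, and the ones with "archived" set to False are the
candidates to be marked as weaker references.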
Right now, I don't know of a simple way to even get this list of
references from Wikipedia. There is a lot of work to do, and if we do it,
Wikipedia will benefit. Another thing to do is to translate the
pages referenced.
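
One possible way to get at such a list, sketched in Python under some
assumptions: it uses the MediaWiki API's prop=extlinks query, which returns
every external link on an article (not only the ones used as references),
and it reads only the first batch of results without following
continuation. The User-Agent string and function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier; a descriptive User-Agent is expected by Wikimedia sites.
USER_AGENT = "reference-list-sketch/0.1 (hypothetical contact)"


def extlinks_url(title, limit=500):
    """Build a query URL asking for the external links of one article."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)


def extract_links(payload):
    """Collect external link URLs from a decoded API response."""
    pages = payload.get("query", {}).get("pages", {})
    links = []
    for page in pages.values():
        for link in page.get("extlinks", []):
            # The default JSON format stores the URL under the "*" key.
            url = link.get("*") or link.get("url")
            if url:
                links.append(url)
    return links


def fetch_links(title, timeout=10):
    """Fetch the first batch of external links for an article over the network."""
    request = urllib.request.Request(
        extlinks_url(title), headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_links(json.load(resp))
```

The list returned by fetch_links("Lady Gaga") would still need filtering to
separate real references from other external links.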
mike