[Foundation-l] excluding Wikipedia clones from searching

Fri Dec 10 21:09:48 UTC 2010

On Fri, Dec 10, 2010 at 9:54 PM,  <WJhonson at aol.com> wrote:
> In a message dated 12/10/2010 12:48:31 PM Pacific Standard Time,
> jamesmikedupont at googlemail.com writes:
>
>
> I am not talking about books, just webpages.
>
> let's take ladygaga.com as an example
>
> Wayback Machine:
> http://web.archive.org/web/*/http://www.ladygaga.com
>
> Google cache:
> http://webcache.googleusercontent.com/search?q=cache:1720lEPHkysJ:www.ladygaga.com/+lady+gaga&cd=1&hl=de&ct=clnk&gl=de&client=firefox-a
>
> here are two copies of copyrighted materials, we should make sure that
> our referenced webpages are in archive.org or mirrored on some server.
> Ideally we would have our own search engine and cache.
>
> mike
>
>
> I have no problem with the idea of pointing refs to a page on archive.org,
> however you must understand that even previously archived pages *may* be
> removed from archive.org at the owner's request or even at the request of a
> robots.txt entry.
>
> The only advantage I see in using archive.org instead of a plain link is
> the ability to see what a page *looked* like in the past.  I'm not sure
> that's a great advantage.  Why do you think it is?  If a page comes down,
> should we not err on the side of assuming the owner no longer wants it
> public, and if the owner doesn't want it public, are we to make sure it stays
> public by caching it against their will?
>
> Both Google and Archive.org (much to my utter dismay) obey certain rules set
> up by web page owners to not index certain pages, or to remove them from
> caching history entirely (even old copies).  Are you suggesting we disregard
> those rules?  If not, then I see no advantage in our caching pages which are
> available in caches already.

My point is that we should index them ourselves. The pages used as
references should first be listed in an easy-to-use manner, and if
possible we should cache them. If they are not cacheable because of
some restriction, the references should be marked as lower quality so
that people can look for better ones. In the end, as with CiteSeer,
you will find that pages that are available, open, and cacheable get
cited and used more than pages that are not.
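
As a rough illustration of how references could be marked, here is a
minimal sketch that checks a site's robots.txt rules before deciding
whether a page may be cached. It uses Python's standard
urllib.robotparser; the function name, the "refcache-bot" user agent,
and the two labels are made up for this example and are not part of
any existing tool.

```python
from urllib.robotparser import RobotFileParser


def is_cacheable(robots_txt, url, user_agent="refcache-bot"):
    """Return True if robots_txt allows user_agent to fetch url.

    robots_txt is the text of the site's robots.txt file; how it is
    downloaded is left out of this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def label_reference(robots_txt, url):
    """Attach a hypothetical quality label to a reference URL."""
    return "cacheable" if is_cacheable(robots_txt, url) else "restricted"
```

A reference labelled "restricted" would be the kind that gets flagged
so that editors can look for a better source.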

Right now, I don't know of a simple way to even get this list of
references from Wikipedia. There is a lot of work to do, and if we do
it, it will benefit Wikipedia. Another thing to do is to translate the
referenced pages.
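
One possible starting point, sketched below, is the MediaWiki API's
prop=extlinks query, which returns the external links on a page. This
is an assumption about how the list might be gathered, not an existing
tool: the function names and the User-Agent string are invented here,
continuation of long result lists is not handled, and extlinks returns
every external link on the page, not only those inside references.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"


def extlinks_url(title, limit=500):
    """Build the API URL that asks for a page's external links."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": title,
        "ellimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return API + "?" + urlencode(params)


def extract_links(response):
    """Collect the link URLs from a decoded API response."""
    links = []
    for page in response.get("query", {}).get("pages", []):
        for link in page.get("extlinks", []):
            # Accept either key, since the name differs by format version.
            url = link.get("url") or link.get("*")
            if url:
                links.append(url)
    return links


def fetch_references(title):
    """Query the live API (needs network access) and return the links."""
    request = Request(extlinks_url(title),
                      headers={"User-Agent": "refs-sketch/0.1 (example)"})
    with urlopen(request) as resp:
        return extract_links(json.load(resp))
```

Calling fetch_references("Lady Gaga") would give a list of URLs that
could then be checked against archive.org or a cache of our own.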

mike