[Foundation-l] excluding Wikipedia clones from searching

Mike Dupont jamesmikedupont at googlemail.com
Fri Dec 10 23:18:10 UTC 2010


On Sat, Dec 11, 2010 at 12:02 AM,  <WJhonson at aol.com> wrote:
> In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time,
> jamesmikedupont at googlemail.com writes:
>
>
> My idea was that you will want to search pages that are already
> referenced by Wikipedia. In my work on Kosovo it would be very
> helpful, because there are lots of bad results on Google, and it
> would be nice to use it to see how many times certain names occur.
> That is why we also need our own indexing engine: I would like to
> count the occurrences of each term and the pages they occur on, and
> to cross-reference that against names on Wikipedia. Wanted pages
> could also be assisted like this: which wanted pages match the most
> common terms in the new reference index or in existing pages?
>
>
>
> Well then all you would need to do is cross-reference the refs themselves.
> You don't need to cache the underlying pages to which they refer.

Well, I was hoping to look at all the pages that Wikipedia considers
valuable enough to be referenced, and to find new information on those
pages for other articles. I don't think it is enough to just look at
the references on Wikipedia itself; we should resolve them and look at
those pages, and also build a list of sites that are candidates for
full indexing, or at least some spidering.
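
To make the collection step concrete, here is a rough sketch in
Python of pulling the external links cited by one article through the
MediaWiki API (prop=extlinks). The article title is only an example,
and continuation of long result sets is ignored:

    # Sketch: list the external URLs referenced by one article via the
    # MediaWiki API. Ignores API continuation, so long articles may be
    # truncated; "Kosovo" is only an example title.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def external_links(title):
        """Yield the external URLs cited by a Wikipedia article."""
        params = {
            "action": "query",
            "prop": "extlinks",
            "titles": title,
            "ellimit": "max",
            "format": "json",
            "formatversion": "2",
        }
        data = requests.get(API, params=params, timeout=30).json()
        for page in data["query"]["pages"]:
            for link in page.get("extlinks", []):
                yield link["url"]

    for url in external_links("Kosovo"):
        print(url)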

>
> So in your new search engine, when you search for "Mary, Queen of
> Scots", you really are saying: show me those external references
> which are mentioned by Wikipedia in connection with Mary, Queen of
> Scots.

Not really. Find all pages referenced anywhere by Wikipedia that
contain the term "Mary, Queen of Scots". Maybe someone added a site to
an article on King Henry that contains the text "Mary, Queen of Scots"
but has not been referenced from the Mary article yet.

Show me the occurrences of the word, their frequency, maybe the
sentence or paragraph each one occurs in, and a link to the page, with
the ability to see a cached version if the site is down. The cache can
also live on another site, as long as it is the same version.
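
As a sketch of that lookup, assuming the referenced pages have already
been fetched into a URL-to-text mapping (the names here are made up):

    # Sketch: given already-fetched pages, report how often a term
    # occurs on each one, plus a sentence of context. `pages` maps
    # URL -> plain text and would come from the spidering step.
    import re

    def occurrences(term, pages):
        """Return (url, count, context sentence), most frequent first."""
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        results = []
        for url, text in pages.items():
            count = len(pattern.findall(text))
            if count == 0:
                continue
            # crude sentence split, enough for a context snippet
            sentences = re.split(r"(?<=[.!?])\s+", text)
            snippet = next((s for s in sentences if pattern.search(s)), "")
            results.append((url, count, snippet.strip()))
        return sorted(results, key=lambda r: -r[1])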
>
> That doesn't require caching the pages to which refs refer.  It only
> requires indexing those refs which currently are used in-world.

Well, indexing normally implies caching as well, public or private:
you need to copy the pages into a computer's memory to index them, and
it is best to store them on disk too.

The first step will of course be to collect all the references; the
second step will be to resolve them. That is also a good opportunity
to check for dead references and mark them as such.
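
A minimal sketch of that resolve-and-cache step, assuming the
references have already been collected (the cache directory name and
timeout are arbitrary choices):

    # Sketch: fetch each collected reference, cache the body on disk,
    # and mark the reference dead on network errors or HTTP errors.
    import hashlib
    import pathlib
    import requests

    CACHE = pathlib.Path("refcache")
    CACHE.mkdir(exist_ok=True)

    def resolve(url):
        """Return 'ok' or 'dead'; cache the page body on success."""
        try:
            resp = requests.get(url, timeout=15)
            resp.raise_for_status()
        except requests.RequestException:
            return "dead"
        name = hashlib.sha1(url.encode()).hexdigest()
        (CACHE / name).write_bytes(resp.content)
        return "ok"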


