In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time, jamesmikedupont@googlemail.com writes:
My idea was that you will want to search the pages that are already referenced by Wikipedia. In my work on Kosovo it would be very helpful, because there are lots of bad results on Google, and it would be nice to use that also to see how many times certain names occur. That is why we also need our own indexing engine: I would like to count the occurrences of each term and the pages they occur on, and to cross-reference that against names on Wikipedia. Wanted pages could also be assisted this way: which are the most wanted pages that match the most common terms in the new ref index, or in existing pages?
Well then all you would need to do is cross-reference the refs themselves. You don't need to cache the underlying pages to which they refer.
So in your new search engine, when you search for "Mary, Queen of Scots", you are really saying: show me the external references that Wikipedia mentions in connection with Mary, Queen of Scots.
That doesn't require caching the pages to which refs refer. It only requires indexing those refs which currently are used in-world.
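To make that concrete, a minimal sketch of what indexing the refs themselves could look like (an illustration only: it assumes the wikitext is available locally and that a simple regex is good enough to pull URLs out of <ref> tags, which in practice it only roughly is):

    import re
    from collections import defaultdict

    # Rough approximation of URLs appearing inside <ref>...</ref> in wikitext.
    REF_URL = re.compile(r'<ref[^>/]*>.*?(https?://[^\s<\]|]+).*?</ref>',
                         re.IGNORECASE | re.DOTALL)

    def index_refs(articles):
        """articles: iterable of (title, wikitext) pairs.
        Returns a map from cited URL to the set of article titles citing it."""
        url_to_articles = defaultdict(set)
        for title, wikitext in articles:
            for url in REF_URL.findall(wikitext):
                url_to_articles[url].add(title)
        return url_to_articles

A search for "Mary, Queen of Scots" then becomes a lookup of the URLs cited by that article, with no need to fetch the pages behind them.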
W
On Sat, Dec 11, 2010 at 12:02 AM, WJhonson@aol.com wrote:
In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time, jamesmikedupont@googlemail.com writes:
My idea was that you will want to search the pages that are already referenced by Wikipedia. In my work on Kosovo it would be very helpful, because there are lots of bad results on Google, and it would be nice to use that also to see how many times certain names occur. That is why we also need our own indexing engine: I would like to count the occurrences of each term and the pages they occur on, and to cross-reference that against names on Wikipedia. Wanted pages could also be assisted this way: which are the most wanted pages that match the most common terms in the new ref index, or in existing pages?
Well then all you would need to do is cross-reference the refs themselves. You don't need to cache the underlying pages to which they refer.
Well, I was hoping to look at all the pages that Wikipedia considers valuable enough to be referenced, and to find new information on those pages for other articles. I don't think it is enough to just look at the references on Wikipedia itself; we should resolve them and look at those pages, and also build a list of sites that are candidates for full indexing, or at least some spidering.
So in your new search engine, when you search for "Mary, Queen of Scots", you are really saying: show me the external references that Wikipedia mentions in connection with Mary, Queen of Scots.
Not really: find all pages referenced anywhere by Wikipedia that contain the term "Mary, Queen of Scots". Maybe someone added a site to an article on King Henry that contains the text "Mary, Queen of Scots" but has not yet been referenced from her article.
Show me the occurrences of the word, the frequency, maybe the sentence or paragraph it occurs in, and a link to the page, with the ability to see the cached version if the site is down. It can also be cached on another site, as long as it is the same version.
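To sketch what such an index might look like (a rough illustration only: it assumes the referenced pages have already been fetched and reduced to plain text, and it only counts single words, not phrases like "Mary, Queen of Scots"):

    import re
    from collections import defaultdict

    def build_term_index(pages):
        """pages: dict mapping URL -> plain text extracted from the cached copy.
        Returns term -> list of (url, count, example sentence)."""
        index = defaultdict(list)
        for url, text in pages.items():
            sentences = re.split(r'(?<=[.!?])\s+', text)
            counts, example = defaultdict(int), {}
            for sentence in sentences:
                for term in re.findall(r"[A-Za-z][A-Za-z'-]+", sentence):
                    term = term.lower()
                    counts[term] += 1
                    example.setdefault(term, sentence)
            for term, count in counts.items():
                index[term].append((url, count, example[term]))
        return index

A query for "kosovo" would then return every referenced page that mentions it, how often, one sample sentence, and the URL through which a cached copy could be served.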
That doesn't require caching the pages to which refs refer. It only requires indexing those refs which currently are used in-world.
Well, indexing normally means caching as well, whether public or private. You need to copy the pages into a computer's memory to index them, and it is best to store them on disk too.
The first step will of course be to collect all references, but the second step will be to resolve them. This is also a good way to check for dead references and mark them as such.
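As a sketch of that second step (assuming the Python requests library; a real checker would also need retries, rate limiting, and politeness toward the sites being resolved):

    import requests

    def check_refs(urls, timeout=10):
        """Resolve each collected reference and split the list into live and dead."""
        alive, dead = [], []
        for url in urls:
            try:
                r = requests.head(url, allow_redirects=True, timeout=timeout)
                if r.status_code >= 400:
                    # Some servers reject HEAD; retry with GET before calling it dead.
                    r = requests.get(url, stream=True, timeout=timeout)
                (alive if r.status_code < 400 else dead).append(url)
            except requests.RequestException:
                dead.append(url)
        return alive, dead

Everything in the dead list could then be marked as such in the articles that cite it.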
I'm in the process of creating a cleanup tool that checks archive.org and webcitation.org. If a URL is not archived, it checks whether the URL is live; if it is, I request that webcitation archive it on demand, and the tool fills in the archiveurl parameter of the cite templates.
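For illustration only, the archive-lookup part could look roughly like this; it uses the Wayback Machine availability endpoint as an assumption (the actual tool may query the services differently), and it leaves out the webcitation.org lookup and the archive-on-demand request:

    import requests

    def find_archived_copy(url):
        """Return a Wayback Machine archive URL for `url`, or None if none exists."""
        r = requests.get("https://archive.org/wayback/available",
                         params={"url": url}, timeout=10)
        snap = r.json().get("archived_snapshots", {}).get("closest")
        if snap and snap.get("available"):
            return snap["url"]   # candidate value for the cite template's archiveurl
        return None              # not archived: check liveness, then request archiving

If that returns nothing and the page is still live, that is the point where the on-demand archiving request and the archiveurl fill-in would happen.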
John
On 10/12/2010 23:51, John Doe wrote:
I'm in the process of creating a cleanup tool that checks archive.org and webcitation.org. If a URL is not archived, it checks whether the URL is live; if it is, I request that webcitation archive it on demand, and the tool fills in the archiveurl parameter of the cite templates.
What is the point of doing that? If a URL goes missing, the information should be re-found from another source. If it can't be re-referenced, then perhaps it wasn't quite as reliable as one first thought; and if URLs aren't stable on a particular site, then maybe one should re-examine the reliability of the originating source.
Most dead URLs that I see, the ones that can't be re-found, come from references to online articles about minor events in BLPs. Simply put, the event was recorded on Monday and was fish-and-chip wrapping by Thursday; or, to put it another way, it was non-notable in the grand scheme of things. In some cases the original source may also have removed the content because it was untrue and could not be substantiated.
Stuffing URLs across to archive.org or webcitation.org simply perpetuates unsubstantiated gossip. One really ought to examine one's motives for doing that.