I remember hearing a couple of times that CorenSearchBot was down, but I just assumed that something so important was being rescued, though I did wonder slightly about the recent net increase in articles on EN wiki. 3,738,826 articles today means we've way overshot the 3 million projection, the 3.5 million prediction is looking distinctly cautious, and even the 4 million by late 2012 http://commons.wikimedia.org/wiki/File:Enwikipediapercgrowth.PNG is looking less and less like a ceiling.
Could we get Google and Bing to make an exception for CorenSearchBot? If not then I'd agree that a spider would make sense, though I've no idea what that would cost. Having our own spider could be useful for other things too, including:
# Bot-adding of {{deadlink}} templates.
# Creating our own wayback machine, showing webpages as they were when they were cited by our articles.
# A "may have moved here" table, so we could add "possibly moved here" and wayback options to {{deadlink}}.
# A bot to update links as sites reorganise and organisations rebrand; without it we could be mostly deadlinked as early as mid-century.
# A bot that lists probable deaths based on obituaries in reliable sources, and even on updates to subjects' own websites.
# Detecting possible breaches of our copyright would be another potential use, but maybe we just need to rename "what links here" as "what links here (internal)" and add a "what links here (external)".
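The {{deadlink}} tagging idea in the list above could be sketched in a few lines. This is a hypothetical illustration only: the function names, user-agent string, and tagging rule are invented for the example, not taken from any existing bot.

```python
# Sketch of a dead-link check for a {{deadlink}}-tagging bot (illustrative,
# not an existing Wikimedia tool).
import urllib.request
import urllib.error

def looks_dead(url, timeout=10):
    """Return True if the URL appears dead (connection failure or HTTP >= 400)."""
    req = urllib.request.Request(
        url, method="HEAD",
        headers={"User-Agent": "deadlink-check-sketch/0.1"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status >= 400
    except urllib.error.HTTPError as e:
        return e.code >= 400
    except (urllib.error.URLError, OSError):
        return True  # DNS failure, refused connection, timeout, etc.

def tag_if_dead(wikitext_link, url):
    """Append {{deadlink}} after a citation whose URL no longer resolves."""
    if looks_dead(url):
        return wikitext_link + "{{deadlink}}"
    return wikitext_link
```

A real bot would of course retry over days before tagging, since sites have transient outages; the single HEAD request here is just the core idea.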
WSC
Message: 5
Date: Wed, 14 Sep 2011 17:09:44 +0200
From: Kim Bruning <kim@bruning.xs4all.nl>
Subject: Re: [Foundation-l] The WikiNews fork - for lack of a copyvio detection bot half a project was lost
To: Wikimedia Foundation Mailing List <foundation-l@lists.wikimedia.org>
Message-ID: <20110914170944.C22787@bruning.lan>
Content-Type: text/plain; charset=us-ascii
On Wed, Sep 14, 2011 at 10:49:06AM -0500, Aaron Adrignola wrote:
CorenSearchBot has not been operational for several months since Yahoo stopped allowing automated queries. Bing's terms of use don't permit this either and apparently the same is true for Google.
It might be useful to have a community operated spider, then? In that way, we could also optimize our database for the kinds of queries we need.
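One example of "the kinds of queries we need": copyvio detection boils down to asking how much text an article shares with a crawled page. A minimal sketch using word-level shingling and Jaccard overlap follows; the function names and the shingle length k=5 are illustrative assumptions, not part of any existing Wikimedia tool.

```python
# Near-duplicate (copyvio) scoring via k-word shingles (illustrative sketch).
def shingles(text, k=5):
    """Set of k-word shingles, lower-cased, for overlap comparison."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(article, web_page, k=5):
    """Jaccard similarity of shingle sets; values near 1.0 suggest copied text."""
    a, b = shingles(article, k), shingles(web_page, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

A spider's database optimised for this query would index shingle hashes rather than whole pages, so a new article can be compared against the whole crawl without rescanning it.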
sincerely, Kim Bruning
On 14 Sep 2011, at 23:05, WereSpielChequers werespielchequers@gmail.com wrote:
I remember hearing a couple of times that CorenSearchBot was down, but I just assumed that something so important was being rescued, though I did wonder slightly about the recent net increase in articles on EN wiki. 3,738,826 articles today means we've way overshot the 3 million projection, the 3.5 million prediction is looking distinctly cautious, and even the 4 million by late 2012 http://commons.wikimedia.org/wiki/File:Enwikipediapercgrowth.PNG is looking less and less like a ceiling.
Could we get Google and Bing to make an exception for CorenSearchbot? If not then I'd agree that a spider would make sense, though I've no idea what that would cost. Having our own spider could be useful for other things though,
If this is the approach taken, we should talk to some of the smaller search operators with their own crawl databases.
Implementing an effective crawler is non-trivial at scale :-)
DuckDuckGo springs to mind; they do their own crawling, and the guy behind it is quite approachable.
Tom
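The caveat about crawling being non-trivial at scale is real: at minimum a crawler has to honour robots.txt and per-host rate limits. A standard-library-only sketch of those two pieces follows; the user-agent string and the default delay are assumptions for illustration.

```python
# Politeness layer for a hypothetical community spider (illustrative sketch).
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "wiki-spider-sketch/0.1"
_robots = {}      # host -> RobotFileParser (or None if robots.txt unreachable)
_last_fetch = {}  # host -> timestamp of last request

def allowed(url):
    """Check robots.txt (cached per host) before fetching a URL."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable; be conservative and skip host
        _robots[host] = rp
    rp = _robots[host]
    return rp is not None and rp.can_fetch(USER_AGENT, url)

def polite_wait(url, delay=1.0):
    """Enforce a minimum delay between requests to the same host."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_fetch.get(host, 0.0)
    if elapsed < delay:
        time.sleep(delay - elapsed)
    _last_fetch[host] = time.monotonic()
```

The hard parts at scale (frontier prioritisation, deduplication, storage) sit on top of this, which is presumably why talking to an operator who already has a crawl database is attractive.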