[Foundation-l] Spiders and bots. Was "The WikiNews fork - for lack of a copyvio detection bot half a project was lost"

Thomas Morton morton.thomas at googlemail.com
Wed Sep 14 22:24:38 UTC 2011


On 14 Sep 2011, at 23:05, WereSpielChequers <werespielchequers at gmail.com> wrote:

> I remember hearing a couple of times that CorenSearchBot was down, but just
> assumed that something so important was being rescued, though I did wonder
> slightly about the recent net increase in articles on EN wiki. 3,738,826
> articles today means we've way overshot the 3 million projection, the 3.5
> million prediction is looking distinctly cautious and even the 4 million
> by late 2012
> http://commons.wikimedia.org/wiki/File:Enwikipediapercgrowth.PNG looks
> somewhat unceiling like.
>
> Could we get Google and Bing to make an exception for CorenSearchbot? If not
> then I'd agree that a spider would make sense, though I've no idea what that
> would cost. Having our own spider could be useful for other things though,

If this is the approach taken, we should talk to some of the smaller
search operators with their own crawl databases.

Implementing an effective crawler is non-trivial at scale :-)

DuckDuckGo springs to mind; they do their own crawling and the guy
behind it is quite approachable.

Tom



