We are having to resort to crawling en.wikipedia.org while we wait
for regular dumps.
What is the minimum crawl delay we can get away with? I figure with a
1-second delay we'd be able to crawl the 2+ million articles in about
a month (2,000,000 seconds comes to roughly 23 days).
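Concretely, I'm picturing something like this minimal sketch of a
throttled fetcher (the title list, User-Agent string, and delay value
are just illustrative assumptions, not recommendations):

    import time
    import urllib.parse
    import urllib.request

    CRAWL_DELAY = 1.0  # assumed politeness delay, in seconds

    # Stand-in for the full list of 2+ million article titles.
    article_titles = ["Albert_Einstein", "Physics"]

    def fetch_article(title):
        # Fetch one article's HTML; the User-Agent is a placeholder.
        url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
        req = urllib.request.Request(
            url, headers={"User-Agent": "ExampleCrawler/0.1"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    for title in article_titles:
        html = fetch_article(title)
        # ... persist html somewhere ...
        time.sleep(CRAWL_DELAY)  # 2,000,000 requests at 1/s is ~23 days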
I know crawling is discouraged, but after looking at robots.txt it
seems a lot of parties still do so. I have to assume that is how
Google et al. are able to keep up to date.
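(For reference, the robots.txt rules can also be checked
programmatically; a quick sketch using Python's standard
urllib.robotparser, with a made-up bot name:)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()
    # True/False depending on what robots.txt allows for this agent/path.
    print(rp.can_fetch("ExampleCrawler/0.1",
                       "https://en.wikipedia.org/wiki/Physics"))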
Are there private data feeds? I noticed a wg_enwiki dump listed.
Christian
On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> That would be great. I second this notion wholeheartedly.
>
>
> On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
>
>> "Brion Vibber" <brion(a)wikimedia.org> wrote in message
>> news:497F9C35.9050500@wikimedia.org...
>>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber<brion(a)wikimedia.org
>>>> wrote:
>>>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>>>>> The way I see it, what we need is to get a really powerful