Brion,
We are having to resort to crawling en.wikipedia.org while we wait for regular dumps. What is the minimum crawling delay we can get away with? I figure that with a 1 second delay we'd be able to crawl the 2+ million articles in about a month.
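(The arithmetic seems to hold: 2,000,000 requests at one per second is about 2,000,000 seconds, or roughly 23 days.) Something along the lines of the Python sketch below is what I had in mind; the title list, User-Agent string, and URL form are placeholders rather than our actual setup:

    import time
    import urllib.parse
    import urllib.request

    DELAY = 1.0  # seconds between requests
    BASE = "https://en.wikipedia.org/wiki/"

    def crawl(titles):
        # Fetch each article in turn, sleeping so that we never
        # issue more than one request per DELAY seconds.
        for title in titles:
            start = time.monotonic()
            url = BASE + urllib.parse.quote(title)
            req = urllib.request.Request(
                url,
                headers={"User-Agent": "example-crawler/0.1 (ops@example.com)"},
            )
            with urllib.request.urlopen(req) as resp:
                html = resp.read()
            # ... hand html off to parsing/storage here ...
            elapsed = time.monotonic() - start
            if elapsed < DELAY:
                time.sleep(DELAY - elapsed)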
I know crawling is discouraged, but after looking at robots.txt it seems a lot of parties still do it, so I have to assume that is how Google et al. are able to keep up to date.
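If it's useful to anyone checking the same thing, the standard-library robot parser can report whatever Crawl-delay a site advertises; this is just a sketch, and I'm not claiming en.wikipedia.org actually sets one for a generic agent:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    agent = "example-crawler"  # hypothetical user agent
    print(rp.can_fetch(agent, "https://en.wikipedia.org/wiki/Main_Page"))
    print(rp.crawl_delay(agent))  # None when no Crawl-delay applies to this agent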
Are there private data feeds? I noticed a wg_enwiki dump listed.
Christian
On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
That would be great. I second this notion wholeheartedly.
On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
"Brion Vibber" brion@wikimedia.org wrote in message news:497F9C35.9050500@wikimedia.org...
On 1/27/09 2:55 PM, Robert Rohde wrote:
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber brion@wikimedia.org wrote:
On 1/27/09 2:35 PM, Thomas Dalton wrote:
The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
The simplest solution is just to kill the current dump job if you have faith that a new architecture can be put in place in less than a year.
We'll probably do that.
-- brion
FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
Russ
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l