Brion,
We are having to resort to crawling en.wikipedia.org while we wait for regular dumps. What is the minimum crawling delay we can get away with? I figure that with a 1 second delay we'd be able to crawl the 2+ million articles in about a month.
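(The arithmetic seems to hold: 2,000,000 requests at one per second is about 2,000,000 seconds, or roughly 23 days.) Something along the lines of the Python sketch below is what I had in mind; the title list, User-Agent string, and URL form are placeholders rather than our actual setup:

    import time
    import urllib.parse
    import urllib.request

    DELAY = 1.0  # seconds between requests
    BASE = "https://en.wikipedia.org/wiki/"

    def crawl(titles):
        # Fetch each article in turn, sleeping so that we never
        # issue more than one request per DELAY seconds.
        for title in titles:
            start = time.monotonic()
            url = BASE + urllib.parse.quote(title)
            req = urllib.request.Request(
                url,
                headers={"User-Agent": "example-crawler/0.1 (ops@example.com)"},
            )
            with urllib.request.urlopen(req) as resp:
                html = resp.read()
            # ... hand html off to parsing/storage here ...
            elapsed = time.monotonic() - start
            if elapsed < DELAY:
                time.sleep(DELAY - elapsed)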
I know crawling is discouraged, but after looking at robots.txt it seems a lot of parties still do it, so I have to assume that is how Google et al. are able to keep up to date.
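If it's useful to anyone checking the same thing, the standard-library robot parser can report whatever Crawl-delay a site advertises; this is just a sketch, and I'm not claiming en.wikipedia.org actually sets one for a generic agent:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()

    agent = "example-crawler"  # hypothetical user agent
    print(rp.can_fetch(agent, "https://en.wikipedia.org/wiki/Main_Page"))
    print(rp.crawl_delay(agent))  # None when no Crawl-delay applies to this agent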
Are there private data feeds? I noticed a wg_enwiki dump listed.
Christian
On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
That would be great. I second this notion wholeheartedly.
On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
"Brion Vibber" brion@wikimedia.org wrote in message news:497F9C35.9050500@wikimedia.org...
On 1/27/09 2:55 PM, Robert Rohde wrote:
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber brion@wikimedia.org wrote:
On 1/27/09 2:35 PM, Thomas Dalton wrote:
The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
The simplest solution is just to kill the current dump job if you have faith that a new architecture can be put in place in less than a year.
We'll probably do that.
-- brion
FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
Russ
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l