We are having to resort to crawling en.wikipedia.org while we wait
for regular dumps.
What is the minimum crawl delay we can get away with? I figure with a
1-second delay we'd be able to crawl the 2+ million articles in about
a month (2,000,000 seconds comes to roughly 23 days).
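Concretely, I'm picturing something like this minimal sketch of a
throttled fetcher (the title list, User-Agent string, and delay value
are just illustrative assumptions, not recommendations):

    import time
    import urllib.parse
    import urllib.request

    CRAWL_DELAY = 1.0  # assumed politeness delay, in seconds

    # Stand-in for the full list of 2+ million article titles.
    article_titles = ["Albert_Einstein", "Physics"]

    def fetch_article(title):
        # Fetch one article's HTML; the User-Agent is a placeholder.
        url = "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title)
        req = urllib.request.Request(
            url, headers={"User-Agent": "ExampleCrawler/0.1"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    for title in article_titles:
        html = fetch_article(title)
        # ... persist html somewhere ...
        time.sleep(CRAWL_DELAY)  # 2,000,000 requests at 1/s is ~23 days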
I know crawling is discouraged, but after looking at robots.txt it
seems a lot of parties still do so. I have to assume that is how
Google et al. are able to keep up to date.
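(For reference, the robots.txt rules can also be checked
programmatically; a quick sketch using Python's standard
urllib.robotparser, with a made-up bot name:)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://en.wikipedia.org/robots.txt")
    rp.read()
    # True/False depending on what robots.txt allows for this agent/path.
    print(rp.can_fetch("ExampleCrawler/0.1",
                       "https://en.wikipedia.org/wiki/Physics"))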
Are there private data feeds? I noticed a wg_enwiki dump listed.
Christian
On Jan 28, 2009, at 10:47 AM, Christian Storm wrote:
> That would be great. I second this notion wholeheartedly.
>
>
> On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
>
>> "Brion Vibber" <brion(a)wikimedia.org> wrote in message
>> news:497F9C35.9050500@wikimedia.org...
>>> On 1/27/09 2:55 PM, Robert Rohde wrote:
>>>> On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber<brion(a)wikimedia.org
>>>> wrote:
>>>>> On 1/27/09 2:35 PM, Thomas Dalton wrote:
>>>>>> The way I see it, what we need is to get a really powerful