On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote: The current enwiki database dump (http://download.wikimedia.org/enwiki/20081008/) has been crawling along since 10/15/2008.
The current dump system is not sustainable on very large wikis and is being replaced. You'll hear about it when we have the new one in place. :) -- brion
Following up on this thread: http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
Brion,
Can you offer any general timeline estimates (weeks, months, half a year)? Are there any alternatives to retrieving the article data beyond directly crawling the site? I know this is verboten, but we are in dire need of this data and don't know of any alternatives. The current estimate of end of year is too long for us to wait. Unfortunately, Wikipedia is a favored source for students to plagiarize from, which makes out-of-date content a real issue.
Is there any way to help this process along? We can donate disk drives, developer time, ...? There is another possibility that we could offer but I would need to talk with someone at the wikimedia foundation offline. Is there anyone I could contact?
Thanks for any information and/or direction you can give.
Christian
I have a decent server that is dedicated to a Wikipedia project that depends on fresh dumps. Can this be used in any way to speed up the process of generating the dumps?
bilal
The problem, as I understand it (and Brion may come by to correct me) is essentially that the current dump process is designed in a way that can't be sustained given the size of enwiki. It really needs to be re-engineered, which means that developer time is needed to create a new approach to dumping.
The main target for improvement is almost certainly parallelizing the process, so that there wouldn't be a single monolithic dump process but rather a lot of little processes working in parallel. That would also ensure that if a single process gets stuck and dies, the entire dump doesn't need to start over.
By way of observation, dewiki's full history dumps in 26 hours with 96% prefetched (i.e. loaded from previous dumps). That suggests that even starting from scratch (prefetch = 0%) it should dump in ~25 days under the current process. enwiki is perhaps 3-6 times larger than dewiki depending on how you do the accounting, which implies dumping the whole thing from scratch would take ~5 months if the process scaled linearly. Of course it doesn't scale linearly, and we end up with a prediction for completion that is currently 10 months away (which amounts to a 13-month total execution). And of course, if there is any serious error in the next ten months the entire process could die with no result.
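To make the arithmetic behind that estimate explicit, here is a back-of-envelope sketch; the 26-hour, 96%, and 3-6x figures are the ones quoted above, and the assumption that the non-prefetched revisions dominate the runtime is mine.

```python
# Back-of-envelope check of the dump-time estimate above.
dewiki_hours = 26        # observed dewiki full-history dump time
prefetch_ratio = 0.96    # fraction of revision text reused from the previous dump

# If only the non-prefetched 4% has to be pulled from the text store, a cold
# dump (prefetch = 0) takes roughly 1 / (1 - prefetch_ratio) times as long.
dewiki_cold_days = dewiki_hours / (1 - prefetch_ratio) / 24
print(f"dewiki from scratch: ~{dewiki_cold_days:.0f} days")        # ~27 days

for factor in (3, 6):    # enwiki is roughly 3-6x dewiki
    months = dewiki_cold_days * factor / 30
    print(f"enwiki at {factor}x, linear scaling: ~{months:.1f} months")
```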
Whether or not we want to let the current process continue to try and finish, I would seriously suggest someone look into redumping the rest of the enwiki files (i.e. logs, current pages, etc.). I am also among the people who care about having reasonably fresh dumps, and it really is a problem that the other dumps (e.g. stubs-meta-history) are frozen while we wait to see if the full history dump can run to completion.
-Robert Rohde
Even if we do let it finish, I'm not sure a dump of what Wikipedia was like 13 months ago is much use... The way I see it, what we need is to get a really powerful server to do the dump just once at a reasonable speed; then we'll have a previous dump to build on, so future ones would be more reasonable.
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibber brion@wikimedia.org wrote:
On 1/27/09 2:35 PM, Thomas Dalton wrote:
The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
I don't know what your timetable is, but what about doing something to address the other aspects of the dump (logs, stubs, etc.) that are in limbo while the full history dump chugs along? All the other enwiki files are now 3 months old, and that is already enough to inconvenience some people.
The simplest solution is just to kill the current dump job if you have faith that a new architecture can be put in place in less than a year.
-Robert Rohde
We'll probably do that.
-- brion
"Brion Vibber" brion@wikimedia.org wrote in message news:497F9C35.9050500@wikimedia.org...
On 1/27/09 2:55 PM, Robert Rohde wrote:
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibberbrion@wikimedia.org wrote:
On 1/27/09 2:35 PM, Thomas Dalton wrote:
The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
The simplest solution is just to kill the current dump job if you have faith that a new architecture can be put in place in less than a year.
We'll probably do that.
-- brion
FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
Russ
Probably wise to poke in a hack to skip the history first. :)
-- brion vibber (brion @ wikimedia.org)
On 1/28/09 8:32 AM, Brion Vibber wrote:
Probably wise to poke in a hack to skip the history first. :)
Done in r46545.
Updated dump scripts and canceled the old enwiki dump.
New dumps will also attempt to generate log output as XML which correctly handles the deletion/oversighting options; we'll see how that goes. :)
-- brion
On Thu, Jan 29, 2009 at 11:20 AM, Brion Vibber brion@wikimedia.org wrote:
Is there somewhere that explains (or at least gives an example of) the new logging format and what has changed?
-Robert Rohde
That would be great. I second this notion wholeheartedly.
On Jan 28, 2009, at 7:34 AM, Russell Blau wrote:
"Brion Vibber" brion@wikimedia.org wrote in message news:497F9C35.9050500@wikimedia.org...
On 1/27/09 2:55 PM, Robert Rohde wrote:
On Tue, Jan 27, 2009 at 2:42 PM, Brion Vibberbrion@wikimedia.org wrote:
On 1/27/09 2:35 PM, Thomas Dalton wrote:
The way I see it, what we need is to get a really powerful server
Nope, it's a software architecture issue. We'll restart it with the new arch when it's ready to go.
The simplest solution is just to kill the current dump job if you have faith that a new architecture can be put in place in less than a year.
We'll probably do that.
-- brion
FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
Russ
Brion,
We are having to resort to crawling en.wikipedia.org while we wait for regular dumps. What is the minimum crawling delay we can get away with? I figure that with a 1-second delay we'd be able to crawl the 2+ million articles in about a month.
I know crawling is discouraged, but it seems a lot of parties still do it; after looking at robots.txt, I have to assume that is how Google et al. are able to keep up to date.
Are there private data feeds? I noticed a wg_enwiki dump listed.
Christian
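A quick sanity check of that one-request-per-second figure, plus a minimal rate-limited fetch sketch against the standard api.php interface. This is an illustration, not an endorsement of crawling; the User-Agent string, the maxlag value, and the idea of reading titles from the (much smaller) all-titles dump are my assumptions, not anything prescribed in the thread.

```python
# Rough throughput check and a minimal polite fetch loop (sketch only).
import time
import urllib.parse
import urllib.request

ARTICLES = 2_500_000
DELAY = 1.0  # seconds between requests
print(f"~{ARTICLES * DELAY / 86400:.0f} days at one request per second")  # ~29 days

def fetch_wikitext(title: str) -> bytes:
    params = urllib.parse.urlencode({
        "action": "query", "prop": "revisions", "rvprop": "content",
        "titles": title, "format": "json", "maxlag": 5,
    })
    req = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + params,
        headers={"User-Agent": "example-fetcher/0.1 (ops@example.org)"},  # hypothetical contact
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# for title in titles_from_all_titles_dump():   # hypothetical title source
#     page = fetch_wikitext(title)
#     time.sleep(DELAY)                          # stay at or below one request per second
```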
Thanks to everyone who got the enwiki dumps going again! Should we expect more regular dumps now? What was the final solution for fixing this?
Christian
Perhaps the toolserver can make you a dump of current en?
Toolserver users don't have access to text.
On 3/25/09 10:08 AM, Christian Storm wrote:
Thanks to everyone who got the enwiki dumps going again! Should we expect more regular dumps now? What was the final solution of fixing this?
Lots of love and upkeep by everyone :)
But really it needs to be more automated and parallelized so that we can spot issues faster, validate inconsistencies, and finish quicker.
Brion and I have met about this and we've even brought it into the Wikimedia dev meetings to brainstorm how the system could change for the better.
I've started drafting some new ideas at http://wikitech.wikimedia.org/view/Data_dump_redesign covering the various problems that we're facing and what kind of job management we can put around it. We're taking this on as a full "should have been done 2 years ago" project, and I'm going to be shepherding it along.
Right now I'm collecting stats on the throughput of the components to see how much of this could be farmed out in parallel under a job management system.
This is a large project that has some distinct problem areas that we'll be isolating and welcoming help on.
--tomasz
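A minimal sketch of the kind of page-ID-range job splitting being discussed. This is my own illustration, not the actual Wikimedia dump code; dump_range(), the chunk size, and the page-ID bound are all hypothetical.

```python
# Splitting a history dump into independent page-ID-range jobs that a job
# manager could schedule, retry, and recombine (illustrative sketch only).
from multiprocessing import Pool

MAX_PAGE_ID = 22_000_000   # assumed upper bound on enwiki page IDs
CHUNK = 100_000            # pages per job

def dump_range(bounds):
    start, end = bounds
    outfile = f"enwiki-pages-meta-history-{start}-{end}.xml.bz2"
    # A real worker would stream all revisions for page_id in [start, end)
    # from the database/text store and write a self-contained XML chunk.
    return outfile

if __name__ == "__main__":
    ranges = [(i, min(i + CHUNK, MAX_PAGE_ID)) for i in range(0, MAX_PAGE_ID, CHUNK)]
    with Pool(processes=8) as pool:
        # If one job dies, only its range is re-run; the rest of the dump survives.
        chunks = pool.map(dump_range, ranges)
    print(f"{len(chunks)} chunk files ready to recombine or publish as-is")
```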
Tomasz Finc wrote:
Quite interesting. Can the images at office.wikimedia.org be moved to somewhere public?
Decompression takes as long as compression with bzip2
I think decompression is *faster* than compression http://tukaani.org/lzma/benchmarks
Let me know if I can help with anything.
On 3/26/09 3:25 PM, Keisial wrote:
Quite interesting. Can the images at office.wikimedia.org be moved to somewhere public?
I've copied those two to the public wiki. :)
Decompression takes as long as compression with bzip2
I think decompression is *faster* than compression http://tukaani.org/lzma/benchmarks
LZMA is nice and fast to decompress... but *insanely* slower to compress, and doesn't seem as parallelizable. :(
-- brion
On 03/27/09 01:14, Brion Vibber wrote:
LZMA is nice and fast to decompress... but *insanely* slower to compress, and doesn't seem as parallelizable. :(
The xz file format should allow for "easy" parallelization, both when compressing and decompressing; see
http://tukaani.org/xz/xz-file-format.txt
3. Block
   3.1. Block Header
        3.1.1. Block Header Size
        3.1.3. Compressed Size
        3.1.4. Uncompressed Size
        3.1.6. Header Padding
   3.3. Block Padding
At least in theory, this "length-prefixing" should make it fairly straightforward to write a multi-threaded decompressor with a splitter that can work from a pipe and is input-bound. I reckon the xz structure will eventually prove useful even for distributed compression/decompression.
lacos
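As an illustration of that block-parallel idea, here is a sketch using Python's lzma module rather than the xz tool itself; it relies on the fact that concatenated .xz streams are still a valid .xz file, trading some compression ratio for parallelism. The block size, preset, and worker count are arbitrary choices.

```python
# Block-parallel xz-style compression via independent streams (sketch only).
import lzma
from multiprocessing import Pool

BLOCK_SIZE = 16 * 1024 * 1024   # 16 MiB per block; bigger blocks compress better

def compress_block(block: bytes) -> bytes:
    return lzma.compress(block, format=lzma.FORMAT_XZ, preset=6)

def parallel_xz(data: bytes, workers: int = 4) -> bytes:
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with Pool(workers) as pool:
        return b"".join(pool.map(compress_block, blocks))

if __name__ == "__main__":
    payload = b"<revision>sample text</revision>\n" * 2_000_000   # ~66 MB of toy data
    xz_bytes = parallel_xz(payload)
    # lzma.decompress handles the concatenated streams transparently.
    assert lzma.decompress(xz_bytes) == payload
```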
On Thu, Mar 26, 2009 at 8:51 PM, ERSEK Laszlo lacos@elte.hu wrote:
It includes an index for random access too. Cool. I wonder what kind of block size you'd need to get a compression ratio approaching that of 7z.
Brion Vibber wrote:
LZMA is nice and fast to decompress... but *insanely* slower to compress, and doesn't seem as parallelizable. :(
I used the lzma benchmark as evidence that decompressing with bzip2 is faster than compressing.
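A small timing sketch of the compress/decompress asymmetry being discussed. Illustration only: the synthetic repetitive input and the preset are my choices, and real dump data will behave differently.

```python
# Compare compression vs. decompression time for bzip2 and lzma/xz (sketch).
import bz2
import lzma
import time

data = b"<revision><text>sample wikitext, repeated</text></revision>\n" * 200_000

for name, comp, decomp in (
    ("bzip2", bz2.compress, bz2.decompress),
    ("lzma/xz", lambda d: lzma.compress(d, preset=6), lzma.decompress),
):
    t0 = time.time()
    blob = comp(data)
    t1 = time.time()
    decomp(blob)
    t2 = time.time()
    print(f"{name:8s} compress {t1 - t0:6.2f}s  decompress {t2 - t1:6.2f}s  "
          f"ratio {len(data) / len(blob):.0f}x")
```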
Russell Blau <russblau <at> hotmail.com> writes:
FWIW, I'll add my vote for aborting the current dump *now* if we don't expect it ever to actually be finished, so we can at least get a fresh dump of the current pages.
I'd like to third/fourth/(other ordinal) this idea too. I've been using the (in comparison tiny) SQL dumps for various purposes, and it's most vexing that these have to wait until the end (or lack of any end...) of the larger XML dumps. (The same data is replicated on the toolserver, of course, but I'd get beaten to death if I tried to run some of the data collection scripts I've been running offline there.)
Cheers, Alai.
Hoi, Two things:
- if we abort the backup now, we do not know if we WILL have something at the time it would have ended
- if the toolserver data can provide a service as a stopgap measure, why not provide that in the meantime
Thanks, GerardM
On Thu, Jan 29, 2009 at 1:52 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
If you want to play the optimist and believe this dump might eventually accomplish something, then the right stopgap would be to hack the dumper so that it periodically regenerates the other files even while the big dump is still running. Such a thing, though definitely a hack, would not be hard to do.
-Robert Rohde
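A sketch of that stopgap: regenerate the cheap dump steps on a fixed schedule while the long full-history job keeps running elsewhere. run_step(), the step names, and the weekly cadence are hypothetical placeholders, not the actual dump tooling.

```python
# Periodically regenerate the lightweight dump files (sketch only).
import time

LIGHT_STEPS = ["stub-meta-history", "pages-meta-current", "logging"]
INTERVAL = 7 * 24 * 3600   # assumed weekly cadence

def run_step(name: str) -> None:
    # Placeholder for invoking whatever actually generates this dump file.
    print(f"regenerating {name} ...")

def main() -> None:
    while True:
        for step in LIGHT_STEPS:
            run_step(step)
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()
```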