Hi, I'm Yuvi, a student looking forward to working with MediaWiki via this year's GSoC.
I want to work on something dump related, and have been bugging apergos (Ariel) for a while now. One of the things that popped into my head is moving the dump process to another language (say, C# or Java, or be very macho and do C++ or C). This would give the dump process quite a speed boost (the profiling I did[1] seems to indicate that the DB is not the bottleneck, though I might be wrong), and it could also be done in a way that makes running distributed dumps easier and more elegant.
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
P.S. I'm just looking for ideas, so if you have specific improvements to the dumping process in mind, please respond with those too. I already have DistributedBZip2 and Incremental Dumps in mind :)
[1]: https://bugzilla.wikimedia.org/show_bug.cgi?id=5303
Thanks :)
On Thu, Mar 24, 2011 at 1:05 PM, Yuvi Panda yuvipanda@gmail.com wrote:
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
The current dump process (which I created in 2004-2005 when we had a LOT less data, and a LOT fewer computers) is very linear, which makes it awkward to scale up:
- pull a list of all page revisions, in page/rev order
- as they go through, pump page/rev data to a linear XML stream
- pull that linear XML stream back in again, as well as the last time's completed linear XML stream
- while going through those, combine the original page text from the last XML dump, or from the current database, and spit out a linear XML stream containing both page/rev data and rev text
- and also stick compression on the end
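To make the shape of that linear pass concrete, here is a minimal Python sketch; the helpers (iter_stub_revisions, PreviousDump, fetch_text_from_db, write_revision_xml) are hypothetical placeholders rather than actual MediaWiki code.

```python
# Hypothetical sketch of the linear pass: one stub stream of page/rev metadata in,
# one compressed full-text XML stream out. All helpers are placeholders.
import bz2

def build_full_dump(stub_path, previous_dump_path, out_path):
    previous = PreviousDump(previous_dump_path)        # last run's completed XML dump
    with bz2.open(out_path, "wt", encoding="utf-8") as out:
        for rev in iter_stub_revisions(stub_path):     # page/rev order, metadata only
            text = previous.text_for(rev.rev_id)       # reuse text from the old dump...
            if text is None:
                text = fetch_text_from_db(rev.rev_id)  # ...or fetch it from external storage
            write_revision_xml(out, rev, text)         # append to the single linear stream
```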
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
Within each of those linear processes, any bottleneck will slow everything down, whether that's bzip2 or 7zip compression/decompression, fetching revisions from the wiki's complex storage systems, the XML parsing, or something in the middle.
What I'd recommend looking at is ways to actually rearrange the data so a) there's less work that needs to be done to create a new dump and b) most of that work can be done independently of other work that's going on, so it's highly scalable.
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
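To illustrate (with an entirely hypothetical per-page index of the previous dump), an incremental run would copy forward the already-compressed chunk for any page whose latest revision is unchanged, and only do real work on the rest:

```python
# Hypothetical incremental pass: straight-copy unchanged pages, reprocess the rest.
# previous_index, out and render_page are placeholders, not existing tools.
def incremental_dump(pages_in_db, previous_index, out):
    for page in pages_in_db:                        # page id + latest rev id, straight from the DB
        old = previous_index.get(page.id)           # what the last dump recorded for this page
        if old is not None and old.latest_rev == page.latest_rev:
            out.copy_chunk(old.chunk)               # unchanged: straight copy, no decompress/recompress
        else:
            out.write_chunk(render_page(page))      # new or changed: the only real work in the run
```

Pages deleted since the last dump simply never show up in the database listing, so they drop out of the new dump without any extra handling.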
This may actually need a fancier structured data file format, or perhaps a sensible directory structure and subfile structure -- ideally one that's friendly to being updated via simple things like rsync.
-- brion
So, thoughts on this? Is 'Move Dumping Process to another language' a good idea at all?
I'd worry a lot less about what languages are used than whether the process itself is scalable.
I'm not a mediawiki / wikipedia developer, but as a developer / sys admin, I'd think that adding another environment stack requirement (in the case of C# or Java) to the overall architecture would be a bad idea in general.
I'm probably stating the obvious here...
Breaking the dump up by article namespace might be a starting point -- have 1 controller process for each namespace. That leaves 85% of the work in the default namespace, which could then be segmented by any combination of factors, maybe as simple as block batches of X number of articles.
When I'm importing the XML dump to MySQL, I have one process that reads the XML file, and X processes (10 usually) working in parallel to parse each article block on a first-available queue system. My current implementation is a bit cumbersome, but maybe the idea could be used for building the dump as well?
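A minimal sketch of that reader-plus-workers pattern: one process reads the XML serially while a pool of workers parses blocks first-come-first-served. The two helpers here are toy stand-ins for the real reading and parsing code.

```python
# One serial XML reader feeding N parallel workers via a first-available queue.
import multiprocessing as mp

def parse_and_insert(block):
    # stand-in for real parsing + MySQL insertion of one <page> block
    return len(block)

def iter_article_blocks(path):
    # stand-in for a streaming XML reader that yields one <page> block at a time
    for i in range(1000):
        yield "<page>dummy %d</page>" % i

def worker(queue):
    while True:
        block = queue.get()
        if block is None:          # sentinel: no more work
            break
        parse_and_insert(block)

if __name__ == "__main__":
    queue = mp.Queue(maxsize=100)  # bounded, so the reader can't outrun the workers
    workers = [mp.Process(target=worker, args=(queue,)) for _ in range(10)]
    for w in workers:
        w.start()
    for block in iter_article_blocks("dump.xml"):
        queue.put(block)           # whichever worker is free next picks it up
    for _ in workers:
        queue.put(None)
    for w in workers:
        w.join()
```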
In general, I'm interested in pitching in some effort on anything related to the dump/import processes.
--------------------------------------
James Linden
kodekrash@gmail.com
--------------------------------------
On Thu, 24-03-2011 at 20:29 -0400, James Linden wrote:
About the only way we can scale it beyond a couple of CPUs (compression/decompression as separate processes from the main PHP stream handler) is to break it into smaller linear pieces and either reassemble them, or require users to reassemble the pieces for linear processing.
TBH I don't think users would have to reassemble the pieces; they might be annoyed at having 400 little (or not so little) files lying around, but any processing they meant to do could, I would think, easily be wrapped in a loop that tosses in each piece in order as input.
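As a concrete example, a consumer could wrap a set of numbered .bz2 pieces as one logical stream; the file-name pattern here is only an assumption.

```python
# Treat a set of split dump pieces as one continuous stream.
# The "enwiki-pages-part*.xml.bz2" naming is only an assumption for illustration.
import bz2
import glob

def lines_from_pieces(pattern="enwiki-pages-part*.xml.bz2"):
    for piece in sorted(glob.glob(pattern)):   # process the pieces in order
        with bz2.open(piece, "rt", encoding="utf-8") as f:
            yield from f                       # caller sees one continuous stream

for line in lines_from_pieces():
    pass  # feed each line to whatever would have read the single big file
```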
Ideally, anything that hasn't changed since the last dump shouldn't need *any* new data processing (right now it'll go through several stages of slurping from a DB, decompression and recompression, XML parsing and re-structuring, etc). A new dump should consist basically of running through appending new data and removing deleted data, without touching the things that haven't changed.
One assumption here is that there is a previous dump to work from; that's not always true, and we should be able to run a dump "from scratch" without it needing to take 3 months for en wiki.
A second assumption is that the previous dump data is sound; we've also seen that fail to be true. This means that we need to be able to check the contents against the database contents in some fashion. Currently we look at revision length for each revision, but that's not foolproof (and it's also still too slow).
However, if verification meant just that -- verification, instead of rewriting a new file with the additional costs that compression imposes on us -- we would see some gains immediately.
Breaking the dump up by article namespace might be a starting point -- have 1 controller process for each namespace. That leaves 85% of the work in the default namespace, which could then be segmented by any combination of factors, maybe as simple as block batches of X number of articles.
We already have the mechanism for running batches of arbitrary numbers of articles. That's what the en history dumps do now.
What we don't have is:
- a way to run easily over multiple hosts
- a way to recombine small pieces into larger files for download that isn't serial, *or* alternatively a format that relies on multiple small pieces so we can skip recombining
- a way to check previous content for integrity *quickly* before folding it into the current dumps (we check each revision separately, much too slow)
- a way to "fold previous content into the current dumps" that consists of making a straight copy of what's on disk with no processing. (What do we do if something has been deleted or moved, or is corrupt? The existing format isn't friendly to those cases.)
In general, I'm interested in pitching in some effort on anything related to the dump/import processes.
Glad to hear it! Drop by irc please, I'm in the usual channels. :-)
Ariel
On 25 March 2011 18:21, Ariel T. Glenn ariel@wikimedia.org wrote:
We already have the mechanism for running batches of arbitrary numbers of articles. That's what the en history dumps do now.
Just a thought: wouldn't it be easier to generate dumps in parallel if we did away with the assumption that the dump would be in database order? The metadata in the dump provides the ordering info for the people who require it.
Andrew Dunbar (hippietrail)
Andrew Dunbar wrote:
Just a thought: wouldn't it be easier to generate dumps in parallel if we did away with the assumption that the dump would be in database order? The metadata in the dump provides the ordering info for the people who require it.
I don't see how doing the dumps in a different order allows greater parallelism. You can already launch several processes at different points of the set. Giving each process one out of every N articles would allow more balanced pieces, but that's not important. You would also skip the work of reading the old dump up to the offset, although that's reasonably fast. The important point of having them in this order is that it keeps the pages in the same order as the previous dump.
On Fri, 25-03-2011 at 21:49 +0100, Platonides wrote:
I don't see how doing the dumps in a different order allows greater parallelism. You can already launch several processes at different points of the set. Giving each process one out of every N articles would allow more balanced pieces, but that's not important. You would also skip the work of reading the old dump up to the offset, although that's reasonably fast. The important point of having them in this order is that it keeps the pages in the same order as the previous dump.
I'm pretty sure there are a lot of folks out there that, like me, have tools which rely on exactly this property (new/changed stuff shows up at the end).
Amusingly, splitting based on some number of articles doesn't really balance out the pieces, at least for history dumps, after the project has been around long enough with enough activity. Splitting by number of revisions is what we really want, and the older pages have many many more revs than later pages.
Ariel
Ariel T. Glenn wrote:
Amusingly, splitting based on some number of articles doesn't really balance out the pieces, at least for history dumps, after the project has been around long enough with enough activity. Splitting by number of revisions is what we really want, and the older pages have many many more revs than later pages.
Right. That would only work for pages-articles, not for pages-history. But splitting the revisions across different files makes no sense. You could, however, get an approximation if, instead of giving out pages in strict order, pages are handed to workers as soon as the workers are free. Workers with pages holding many revisions will take longer, while those with few will come back for more shortly. I think it would correlate quite well with the number of revisions. You would be balancing the time needed between workers (which is what we really care about).
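A small sketch of that first-available hand-out, with a purely hypothetical dump_page() doing the per-page work: one page at a time goes to whichever worker is free, so a page with a huge history ties up only one worker while the others keep churning through small ones.

```python
# First-available scheduling: each idle worker pulls the next page id.
# dump_page() is a placeholder for fetching and writing out one page's revisions.
from multiprocessing import Pool

def dump_page(page_id):
    # placeholder: fetch and write out all revisions of one page
    return page_id

if __name__ == "__main__":
    page_ids = range(1, 100001)
    with Pool(processes=8) as pool:
        # chunksize=1 hands out work one page at a time as workers free up,
        # instead of pre-splitting the set into fixed equal-sized batches
        for _ in pool.imap_unordered(dump_page, page_ids, chunksize=1):
            pass
```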
Yuvi Panda wrote:
Hi, I'm Yuvi, a student looking forward to working with MediaWiki via this year's GSoC.
An idea I have been pondering is to pass the offset of the previous revision to the compressor, so it would need to do much less work in the compression window. You would need something like 7z/xz so that the window can be big enough to contain at least the latest revision (its compression factor is quite impressive, too: 1 TB down to 2.31 GB). Note that I haven't checked how feasible such a modification to the compressor would be.
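The underlying idea (letting the compressor see the previous revision, so a barely-changed revision costs almost nothing) can be sketched with zlib's preset-dictionary support; this only illustrates the principle, not how 7z/xz itself would have to be modified.

```python
# Illustration of the principle only: compress each revision with the previous
# revision as a preset dictionary, so a revision that barely changed costs very little.
import zlib

def compress_revisions(revisions):
    """revisions: list of byte strings, oldest first; returns compressed blobs."""
    blobs = []
    prev = b""
    for text in revisions:
        comp = zlib.compressobj(level=9, zdict=prev[-32768:])  # deflate's window caps the dict at 32 KB
        blobs.append(comp.compress(text) + comp.flush())
        prev = text
    return blobs

rev1 = ("Article text, revision one. " + " ".join(str(n) for n in range(2000))).encode()
rev2 = rev1 + b" A small edit appended in revision two."
print([len(b) for b in compress_revisions([rev1, rev2])])  # second blob is much smaller
```

Decompression would of course need the same previous revision available as the dictionary, which ties back into the file-format discussion above.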
Platonides wrote:
An idea I have been pondering is to pass the offset of the previous revision to the compressor, so it would need to do much less work in the compression window.
Consider using pigz for the compression step.
+ Much (7x?) faster than gzip
+ Straightforward install
+ Stable
+ One or more threads per CPU (settable)
- Only compresses to .gz or .zz formats
- Linux only
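For instance, a dump step could pipe its output through pigz rather than gzip; the file names below are placeholders, and -p sets the number of compression threads.

```python
# Sketch: pipe a dump's output file through pigz instead of gzip.
# Assumes pigz is installed; file names are placeholders for illustration.
import subprocess

def compress_with_pigz(source_path, dest_path, threads=8):
    with open(source_path, "rb") as src, open(dest_path, "wb") as dst:
        # pigz -c writes the compressed stream to stdout; -9 is max compression
        subprocess.run(["pigz", "-9", "-p", str(threads), "-c"],
                       stdin=src, stdout=dst, check=True)

compress_with_pigz("pages-articles.xml", "pages-articles.xml.gz")
```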
Alternately, could GNU make's parallel feature be used? For example, "make -j --load-average=30" will keep adding jobs in parallel until the load average reaches 30.
+ It's make
- It's make
On Fri, Mar 25, 2011 at 7:08 AM, Charles Polisher cpolish@surewest.net wrote:
Consider using pigz for the compression step.
- Much (7x?) faster than gzip
gzip is fairly fast already, but also does a poor job at compressing huge repetitive text dumps compared to the (MUCHHHHHHH slower) bzip2 and (MUCHHHHHH^3 slower) 7zip LZMA, which are what we use.
-- brion