Dear devs,
I would like to initiate a discussion about how to reduce the time required to generate dump files. A while ago Emmanuel Engelhart opened a bug report suggesting that this process be parallelized, and I would like to go through the available options and hopefully determine a course of action.
The current process is, as far as I know, straightforward and sequential: it reads table by table and row by row and stores the output. The drawbacks are that generating a dump takes increasingly long as the different projects continue to grow, and that when the process halts or is interrupted it has to start all over again.
I believe that there are two approaches to parallelizing the export dump: 1) Launch multiple PHP processes that each take care of a particular range of ids (a rough sketch of such a driver is at the end of this mail). This might not be true parallelization, but it achieves the same goal. The reason for this approach is that PHP has very limited (maybe no) support for parallelization / multiprocessing; the only thing PHP can do is fork a process (I might be incorrect about this).
2) Use a different language with built-in support for multiprocessing, like Java or Python. I am not intending to start a heated debate, but I think this is an option that should at least be on the table and be discussed. Obviously, an important reason not to do it is that it's a different language. I am not sure how integral the export functionality is to MediaWiki; if it is, then this is a dead end.
However, if the export functionality is primarily used by Wikimedia and nobody else, then we might consider a different language. Or we could make a standalone app that is not part of MediaWiki and whose use is internal to Wikimedia.
If I am missing other approaches or solutions, then please chime in.
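Concretely, for 1) I am thinking of a small driver along the lines of the sketch below. It is only a sketch: the --start/--end page-id bounds and the gzip output sink for dumpBackup.php are from memory, and the paths and numbers are made up.

#!/usr/bin/env python
# Rough sketch of approach 1: fan the dump out over page-id ranges by
# launching several dumpBackup.php processes side by side.
# Assumptions: dumpBackup.php accepts --start/--end page-id bounds and a
# gzip output sink (option names from memory); paths and sizes are made up.
import subprocess

MEDIAWIKI = "/srv/mediawiki"       # assumed install path
MAX_PAGE_ID = 30000000             # would really come from SELECT MAX(page_id)
WORKERS = 8

def launch(start, end, part):
    out = "pages-meta-history-%02d.xml.gz" % part
    cmd = ["php", "%s/maintenance/dumpBackup.php" % MEDIAWIKI,
           "--full",
           "--start=%d" % start,
           "--end=%d" % end,
           "--output=gzip:%s" % out]
    return subprocess.Popen(cmd)

chunk = MAX_PAGE_ID // WORKERS + 1
procs = [launch(i * chunk + 1, min((i + 1) * chunk, MAX_PAGE_ID), i)
         for i in range(WORKERS)]
for p in procs:
    p.wait()    # a real driver would also detect and re-run failed ranges

Each worker writes its own file, so a range that dies can be re-run on its own instead of restarting the whole dump, which would also address the restart problem mentioned above.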
Best regards,
Diederik
Indeed, I already run parallel dumps based on ranges of ids, although the algorithm still needs tweaking. I expect to get back to looking at that pretty soon.
Ariel
2010/12/15 Diederik van Liere dvanliere@gmail.com:
However, if the export functionality is primarily used by Wikimedia and nobody else, then we might consider a different language. Or we could make a standalone app that is not part of MediaWiki and whose use is internal to Wikimedia.
If I am missing other approaches or solutions, then please chime in.
I have an idea for a Wikimedia-specific performance hack: use lower-level ways of accessing ES so that each blob is only fetched and decompressed once. This should reduce ES hits by a factor of ~50, AFAIK.
<background for people less familiar with the ES setup>
WMF stores revision text in a system called external storage (ES), which is basically a MySQL database that stores blobs of compressed data. Revisions are not stored in order but are grouped per-page, such that each blob only contains revisions belonging to a specific page, in chronological order (up to N revs per blob or M bytes, whichever comes first; if memory serves, N=50 and M=10MB). Because the contents of consecutive versions of the same page are highly similar, compression performs very well in this case, and this type of storage is very space-efficient.
It's not particularly fast for random access, though, because to retrieve a revision you have to look up its text table entry (in the 'normal' wiki database), which will tell you which blob it's in and what its index in the blob is, then you have to fetch the entire blob, decompress it and find your revision. MediaWiki stores the result of each revision text fetch in memcached, probably for this reason.
Instead, if we just fetched and decompressed the entire blob /and used all of it/, we'd have the text of a number of consecutive revisions of the same page basically for the price of one fetch. This seems ideally suited to the dumps, because they output consecutive revisions of the same page.
</background>
I'm not sure how hard this would be to achieve (you'd have to correlate blob parts with revisions manually using the text table; there might be gaps for deleted revs because ES is append-only) or how much it would help (my impression is that ES is one of the slower parts of our system, and reducing the number of ES hits by a factor of 50 should help, but I may be wrong); maybe someone with more relevant knowledge and experience can comment on that (Tim?).
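To make the blob-at-once idea concrete, here is a rough sketch of the per-page loop. Everything in it is illustrative: the DB://cluster/id/item address layout is assumed, and fetch_blob and emit are hypothetical helpers standing in for the real plumbing.

def parse_address(old_text):
    # e.g. "DB://cluster24/12345/6" -> ("cluster24", "12345", "6")
    # (address layout is an assumption; real text table entries vary)
    cluster, blob_id, item = old_text[len("DB://"):].split("/")
    return cluster, blob_id, item

def dump_page_revisions(revisions, fetch_blob, emit):
    """Write all revisions of one page, touching each ES blob only once.

    revisions  : [(rev_id, text_address), ...] in chronological order
    fetch_blob : callable(cluster, blob_id) -> {item: text} for the whole
                 decompressed blob (hypothetical helper)
    emit       : callable(rev_id, text) that writes the <revision> element
    """
    cache = {}
    for rev_id, addr in revisions:
        cluster, blob_id, item = parse_address(addr)
        key = (cluster, blob_id)
        if key not in cache:           # one fetch + one decompress per blob
            cache[key] = fetch_blob(cluster, blob_id)
        emit(rev_id, cache[key][item])

Because the dump walks revisions in page order and ES groups a page's revisions into the same blobs, the cache stays tiny and most revisions come out of a blob that has already been fetched and decompressed.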
Roan Kattouw (Catrope)
Roan Kattouw wrote:
I'm not sure how hard this would be to achieve (you'd have to correlate blob parts with revisions manually using the text table; there might be gaps for deleted revs because ES is append-only) or how much it would help (my impression is that ES is one of the slower parts of our system, and reducing the number of ES hits by a factor of 50 should help, but I may be wrong); maybe someone with more relevant knowledge and experience can comment on that (Tim?).
Roan Kattouw (Catrope)
ExternalStoreDB::fetchBlob() already keeps the last blob around to optimize repeated accesses to the same blob (we would probably want a bigger cache for the dumper, though). On the other hand, I don't think the dumpers should be storing text-id contents in memcached (Revision::loadText), since they fill it with entries that are useless for user queries (having a different locality set) and useless for themselves (since they traverse the full list only once), and - even assuming that memcached can happily handle it and no other data is affected by it - the network delay makes it a non-free operation.
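For the "bigger cache for the dumper" part, something like a small bounded LRU in front of the blob fetch is probably all it would take. A sketch of the shape of it (not MediaWiki code; the fetch callable passed in is a hypothetical helper):

from collections import OrderedDict

class BlobCache(object):
    """Tiny bounded LRU for decompressed blobs (illustration only; a real
    version would live in ExternalStoreDB::fetchBlob or the dump scripts)."""

    def __init__(self, fetch, max_entries=64):
        self.fetch = fetch               # callable(cluster, blob_id) -> blob
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, cluster, blob_id):
        key = (cluster, blob_id)
        if key in self.entries:
            self.entries.move_to_end(key)        # mark as recently used
            return self.entries[key]
        blob = self.fetch(cluster, blob_id)
        self.entries[key] = blob
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)     # drop least recently used
        return blob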
Ariel, do you have in wikitech the step-by-step list of actions needed to set up a WMF dump server? I always forget which scripts are being used and what each of them does. Can xmldumps-phase3 be removed? I'd prefer that it use release/trunk/wmf-deployment; an old copy is a source of problems. If additional changes are needed (it seems unpatched), the appropriate hooks should be added in core.
Most backups run off of trunk. The stuff I have in my branch is the parallel stuff for testing.
http://wikitech.wikimedia.org/view/Dumps details the various scripts.
No, xmldumps-phase3 can't be removed yet. I have yet to make the changes I need to that code (and I won't make them in core immediately; they need to be tested thoroughly before being checked in). Once I think they are OK, I will fold them into trunk. It will be a while yet.
Ariel
To continue the discussion on how to improve performance: would it be possible to distribute the dumps as a 7z / gz / other archive containing multiple smaller XML files? It's quite tricky to split a very large XML file into smaller valid XML files, and if the dumping process is already parallelized, then we do not have to cat the different XML files into one large XML file; instead we can distribute the multiple smaller files.
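To show what producing smaller but still valid XML files takes if one has to split an existing dump, here is a rough sketch. It leans on how the export format happens to lay things out (a <siteinfo> header, then one <page>...</page> block per page, each tag on its own line) rather than doing real XML parsing, so treat it as an illustration, not a robust tool.

import sys

PAGES_PER_FILE = 100000      # arbitrary chunk size

def split_dump(infile, prefix):
    """Cut a pages dump into numbered parts, repeating the <mediawiki> /
    <siteinfo> header so that every part is a well-formed document."""
    header = []
    out = None
    part = pages = 0
    for line in infile:
        if out is None and "<page>" not in line:
            header.append(line)                   # everything before page 1
            continue
        if "<page>" in line and (out is None or pages == PAGES_PER_FILE):
            if out is not None:
                out.write("</mediawiki>\n")       # close the previous part
                out.close()
            part += 1
            pages = 0
            out = open("%s-%04d.xml" % (prefix, part), "w")
            out.writelines(header)
        out.write(line)             # page lines and the final </mediawiki>
        if "</page>" in line:
            pages += 1
    if out is not None:
        out.close()

if __name__ == "__main__":
    split_dump(sys.stdin, sys.argv[1])

But if the dump jobs are already producing separate pieces, simply publishing those pieces and skipping the final cat sidesteps all of this.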
best,
Diederik
Diederik van Liere wrote:
To continue the discussion on how to improve performance: would it be possible to distribute the dumps as a 7z / gz / other archive containing multiple smaller XML files? It's quite tricky to split a very large XML file into smaller valid XML files, and if the dumping process is already parallelized, then we do not have to cat the different XML files into one large XML file; instead we can distribute the multiple smaller files.
That has already been done for enwiki.
Which dump file is offered in smaller sub files?
Diederik van Liere wrote:
Which dump file is offered in smaller sub files?
http://download.wikimedia.org/enwiki/20100904/
Also see http://wikitech.wikimedia.org/view/Dumps/Parallelization
Okay, no clue how I could have missed that. My Google skills failed me :) Thanks for the pointer!
Best,
Diederik
On 20-12-2010, Monday, at 00:21 +0100, Platonides wrote:
http://download.wikimedia.org/enwiki/20100904/
Also see http://wikitech.wikimedia.org/view/Dumps/Parallelization
Expect to see more of this once the new xml server is up and running new jobs.
Ariel
2010/12/17 Platonides Platonides@gmail.com:
- even assuming that memcached can happily handle it and no other data is affected by it - the network delay makes it a non-free operation.
Because memcached uses LRU, I think this'll also flood a lot of stuff out of the cache.
Roan Kattouw (Catrope)