To continue the discussion on how to improve performance: would it be possible to distribute the dumps as a 7z / gz / other-format archive containing multiple smaller XML files? It's quite tricky to split a very large XML file into smaller valid XML files, and if the dumping process is already parallelized then we don't have to cat the different XML files into one large XML file; instead we can distribute multiple smaller, parallel-generated files.
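The splitting problem mentioned above can be sketched with a streaming parse: read the dump incrementally, collect whole <page> elements, and wrap each batch in its own root so every output file is valid XML on its own. This is a minimal illustration, not the actual dump schema or dumper code; the element names and chunking are assumptions.

```python
# Minimal sketch: split a MediaWiki-style XML dump into smaller valid
# XML documents by streaming whole <page> elements. The <mediawiki> and
# <page> element names are illustrative assumptions.
import xml.etree.ElementTree as ET
from io import BytesIO

def split_pages(xml_bytes, pages_per_chunk=2):
    """Return a list of serialized chunks, each a valid XML document."""
    chunks, current = [], []
    for event, elem in ET.iterparse(BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "page":
            current.append(ET.tostring(elem))
            elem.clear()  # free memory as we stream through the dump
            if len(current) == pages_per_chunk:
                chunks.append(b"<mediawiki>" + b"".join(current) + b"</mediawiki>")
                current = []
    if current:  # last, possibly short, chunk
        chunks.append(b"<mediawiki>" + b"".join(current) + b"</mediawiki>")
    return chunks

# Tiny synthetic dump with five pages.
dump = b"<mediawiki>" + b"".join(
    b"<page><title>P%d</title></page>" % i for i in range(5)
) + b"</mediawiki>"
parts = split_pages(dump)
```

Each chunk parses independently, so the archive's member files could be decompressed and processed in parallel without reassembling one giant file first.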
best,
Diederik

On 2010-12-16, at 7:02 PM, Ariel T. Glenn wrote:
On Fri, 17-12-2010, at 00:52 +0100, Platonides wrote:
Roan Kattouw wrote:
I'm not sure how hard this would be to achieve (you'd have to correlate blob parts with revisions manually using the text table; there might be gaps for deleted revs because ES is append-only) or how much it would help (my impression is that ES is one of the slower parts of our system and that reducing the number of ES hits by a factor of 50 should help, but I may be wrong). Maybe someone with more relevant knowledge and experience can comment on that (Tim?).
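The "correlate blob parts with revisions" idea can be sketched as a grouping step: external storage addresses of the shape "DB://cluster/blobid/itemid" (a format assumed here from ExternalStoreDB conventions) are bucketed by blob, so each blob is fetched once and then sliced for all the revisions it contains. This is a hypothetical illustration, not the real text-table layout.

```python
# Hedged sketch: bucket external-storage text addresses by blob so each
# blob is fetched once instead of once per revision. The "DB://..."
# address format is an assumption for illustration.
from collections import defaultdict

def group_by_blob(addresses):
    """Map (cluster, blob_id) -> list of (item_index, original_address)."""
    groups = defaultdict(list)
    for addr in addresses:
        assert addr.startswith("DB://"), "unexpected address scheme"
        parts = addr[len("DB://"):].split("/")
        cluster, blob_id = parts[0], parts[1]
        item = parts[2] if len(parts) > 2 else None
        groups[(cluster, blob_id)].append((item, addr))
    return groups

addrs = ["DB://cluster1/100/0", "DB://cluster1/100/1", "DB://cluster2/7/0"]
g = group_by_blob(addrs)
```

With revisions grouped this way, a dumper that walks revisions in blob order would hit ES once per blob rather than once per revision.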
Roan Kattouw (Catrope)
ExternalStoreDB::fetchBlob() already keeps the last blob around to optimize repeated accesses to the same blob (we would probably want a bigger cache for the dumper, though). On the other hand, I don't think the dumpers should be storing text-id contents in memcached (Revision::loadText), since they fill it with entries that are useless for user queries (having a different locality set), useless for the dumpers themselves (since they traverse the full list only once), and, even assuming that memcached can happily handle it and no other data is affected by it, the network delay makes it a non-free operation.
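The "bigger cache for the dumper" suggestion could look like a process-local LRU keyed by blob, so repeated revisions from the same blob hit memory instead of external storage or memcached. A minimal sketch, assuming a hypothetical fetch callback standing in for the real ES fetch:

```python
# Sketch of a process-local LRU blob cache for a dumper. BlobCache and
# its fetch callback are illustrative names, not MediaWiki APIs.
from collections import OrderedDict

class BlobCache:
    def __init__(self, fetch, capacity=64):
        self.fetch, self.capacity = fetch, capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

# Stub fetch; in a real dumper this would be the external-storage read.
cache = BlobCache(fetch=lambda k: "blob:%s" % (k,), capacity=2)
```

Because dump revisions tend to cluster by blob, even a modest capacity should keep the hit rate high, without polluting the shared memcached with entries no user query will ever want.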
Ariel, do you have on wikitech the step-by-step list of actions to set up a WMF dump server? I always forget which scripts are being used and what each of them does. Can xmldumps-phase3 be removed? I'd prefer that it use the release/trunk/wmf-deployment; an old copy is a source of problems. If additional changes are needed (it seems unpatched), the appropriate hooks should be added in core.
Most backups run off of trunk. The stuff I have in my branch is the parallel stuff for testing.
http://wikitech.wikimedia.org/view/Dumps details the various scripts.
No, xmldumps-phase3 can't be removed yet. I have yet to make the changes I need to that code (and I won't make them in core immediately; they need to be tested thoroughly before being checked in). Once I think they are OK, I will fold them into trunk. It will be a while yet.
Ariel
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l