To continue the discussion on how to improve performance: would it be possible to
distribute the dumps as a 7z / gz / other archive containing multiple smaller XML
files? It's quite tricky to split a very large XML file into smaller valid XML files,
and if the dumping process is already parallelized then we do not have to cat the
different XML files into one large XML file; instead we can distribute the multiple
smaller parallelized files.
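The splitting itself need not be that tricky if the file is streamed rather than loaded whole. A minimal sketch in Python, assuming the dump's revisions hang off `<page>` elements (illustrative only: a real MediaWiki dump uses a namespaced schema and a `<siteinfo>` header that each chunk would also need to carry):

```python
import io
import xml.etree.ElementTree as ET

def split_dump(source, pages_per_chunk=1000):
    """Yield smaller, valid XML documents from a large MediaWiki-style dump.

    Streams the input with iterparse so the whole file never sits in
    memory, and wraps each chunk in its own <mediawiki> root element.
    """
    pages = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            pages.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the subtree we just serialized
            if len(pages) == pages_per_chunk:
                yield "<mediawiki>" + "".join(pages) + "</mediawiki>"
                pages = []
    if pages:  # final, possibly short chunk
        yield "<mediawiki>" + "".join(pages) + "</mediawiki>"

# Tiny usage example on an in-memory three-page dump:
sample = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title></page>"
    "<page><title>B</title></page>"
    "<page><title>C</title></page>"
    "</mediawiki>"
)
chunks = list(split_dump(sample, pages_per_chunk=2))
```

Since each chunk is a standalone document, the parallel dump workers could each compress and publish their own chunks without a final concatenation step.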
best,
Diederik
On 2010-12-16, at 7:02 PM, Ariel T. Glenn wrote:
On 17-12-2010, Fri, at 00:52 +0100, Platonides wrote:
Roan Kattouw wrote:
I'm not sure how hard this would be to achieve (you'd have to
correlate blob parts with revisions manually using the text table;
there might be gaps for deleted revs because ES is append-only) or how
much it would help (my impression is that ES is one of the slower parts of
our system, and reducing the number of ES hits by a factor of 50 should
help, but I may be wrong); maybe someone with more relevant knowledge
and experience can comment on that (Tim?).
Roan Kattouw (Catrope)
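The correlation step Roan describes could start from the addresses in the text table. A hypothetical sketch (not MediaWiki code), assuming ES addresses of the form `DB://cluster/blobid` or, for concatenated multi-revision blobs, `DB://cluster/blobid/part`:

```python
from collections import defaultdict

def group_by_blob(addresses):
    """Group external-storage text addresses by the blob that holds them.

    Grouping by (cluster, blob id) before fetching means a blob that
    concatenates ~50 revisions is pulled from ES once, not once per
    revision.  (Hypothetical helper; address layout is an assumption.)
    """
    groups = defaultdict(list)
    for addr in addresses:
        path = addr[len("DB://"):].split("/")
        cluster, blob_id = path[0], path[1]
        part = path[2] if len(path) > 2 else None  # index within the blob
        groups[(cluster, blob_id)].append((addr, part))
    return dict(groups)

# Two revisions in the same blob, one in another cluster:
groups = group_by_blob([
    "DB://cluster24/1234/0",
    "DB://cluster24/1234/1",
    "DB://cluster25/99",
])
```

The gaps for deleted revisions Roan mentions would show up here as blob parts that no surviving text-table row points at; the grouping simply never requests them.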
ExternalStoreDB::fetchBlob() already keeps the last blob to optimize
repeated accesses to the same blob (we would probably want a bigger
cache for the dumper, though).
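Such a bigger cache could be as simple as an LRU wrapper around the fetch. A sketch in Python, where `fetch_blob` is a hypothetical stand-in for the real fetcher:

```python
from functools import lru_cache

@lru_cache(maxsize=100)  # dumper-sized: keep the last 100 blobs, not just 1
def fetch_blob(cluster, blob_id):
    """Stand-in for ExternalStoreDB::fetchBlob(), which in MediaWiki
    caches only the single most recently fetched blob."""
    return "contents of %s/%s" % (cluster, blob_id)  # would query the ES DB

first = fetch_blob("cluster24", 1234)
again = fetch_blob("cluster24", 1234)  # served from the cache
```

A single-entry cache already helps when revisions of one page arrive in blob order; a larger one would also absorb the interleaving that parallel dump workers introduce.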
On the other hand, I don't think the dumpers should be storing text-id
contents in memcached (Revision::loadText): they fill it with entries
that are useless for users' queries (the dumps have a different locality
set) and useless for the dumpers themselves (since they traverse the full
list exactly once), and, even assuming memcached can happily handle it
and no other data is affected by it, the network delay makes it a
non-free operation.
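The argument can be made concrete with a cache-bypass flag on the text loader. A sketch, where `populate_cache` is a hypothetical parameter (MediaWiki's Revision::loadText has no such flag):

```python
def load_text(text_id, cache, fetch, populate_cache=True):
    """Fetch revision text via a shared cache.

    populate_cache is a hypothetical flag: a dumper reads each text
    exactly once, so writing its results into the shared cache evicts
    entries useful to real user queries and spends a network round-trip
    for nothing.
    """
    text = cache.get(text_id)
    if text is None:
        text = fetch(text_id)
        if populate_cache:  # dumpers would pass False here
            cache[text_id] = text
    return text

# Usage: a dump-mode load that leaves the shared cache untouched.
shared_cache = {}
texts = {42: "revision text"}
result = load_text(42, shared_cache, texts.get, populate_cache=False)
```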
Ariel, do you have on wikitech the step-by-step list of actions to set up
a WMF dump server? I always forget which scripts are being used and what
each of them does.
Can xmldumps-phase3 be removed? I'd prefer that it use
release/trunk/wmf-deployment; an old copy is a source of problems. If
additional changes are needed (it seems unpatched), the appropriate hooks
should be added in core.
Most backups run off of trunk. The stuff I have in my branch is the
parallel stuff for testing.
http://wikitech.wikimedia.org/view/Dumps details the various scripts.
No, xmldumps-phase3 can't be removed yet. I have yet to make the
changes I need to that code (and I won't make them in core immediately;
they need to be tested thoroughly before being checked in). Once I think
they are OK, I will fold them into trunk. It will be a while yet.
Ariel
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l