To continue the discussion on how to improve performance: would it be possible to
distribute the dumps as a 7z / gz / other archive containing multiple smaller XML
files? It's quite tricky to split a very large XML file into smaller valid XML files,
and if the dumping process is already parallelized then we do not have to cat the
different XML files into one large XML file; instead we can distribute the multiple
smaller parallelized files.
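The splitting itself need not be that tricky if the file is streamed rather than loaded whole. A minimal sketch in Python, assuming the dump's revisions hang off `<page>` elements (illustrative only: a real MediaWiki dump uses a namespaced schema and a `<siteinfo>` header that each chunk would also need to carry):

```python
import io
import xml.etree.ElementTree as ET

def split_dump(source, pages_per_chunk=1000):
    """Yield smaller, valid XML documents from a large MediaWiki-style dump.

    Streams the input with iterparse so the whole file never sits in
    memory, and wraps each chunk in its own <mediawiki> root element.
    """
    pages = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "page":
            pages.append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # release the subtree we just serialized
            if len(pages) == pages_per_chunk:
                yield "<mediawiki>" + "".join(pages) + "</mediawiki>"
                pages = []
    if pages:  # final, possibly short chunk
        yield "<mediawiki>" + "".join(pages) + "</mediawiki>"

# Tiny usage example on an in-memory three-page dump:
sample = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title></page>"
    "<page><title>B</title></page>"
    "<page><title>C</title></page>"
    "</mediawiki>"
)
chunks = list(split_dump(sample, pages_per_chunk=2))
```

Since each chunk is a standalone document, the parallel dump workers could each compress and publish their own chunks without a final concatenation step.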
best,
Diederik
On 2010-12-16, at 7:02 PM, Ariel T. Glenn wrote:
On 17-12-2010, Fri, at 00:52 +0100, Platonides wrote:
Roan Kattouw wrote:
I'm not sure how hard this would be to achieve (you'd have to
correlate blob parts with revisions manually using the text table;
there might be gaps for deleted revs because ES is append-only) or how
much it would help (my impression is that ES is one of the slower parts of
our system, and reducing the number of ES hits by a factor of 50 should
help, but I may be wrong); maybe someone with more relevant knowledge
and experience can comment on that (Tim?).
Roan Kattouw (Catrope)
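The correlation step Roan describes could start from the addresses in the text table. A hypothetical sketch (not MediaWiki code), assuming ES addresses of the form `DB://cluster/blobid` or, for concatenated multi-revision blobs, `DB://cluster/blobid/part`:

```python
from collections import defaultdict

def group_by_blob(addresses):
    """Group external-storage text addresses by the blob that holds them.

    Grouping by (cluster, blob id) before fetching means a blob that
    concatenates ~50 revisions is pulled from ES once, not once per
    revision.  (Hypothetical helper; address layout is an assumption.)
    """
    groups = defaultdict(list)
    for addr in addresses:
        path = addr[len("DB://"):].split("/")
        cluster, blob_id = path[0], path[1]
        part = path[2] if len(path) > 2 else None  # index within the blob
        groups[(cluster, blob_id)].append((addr, part))
    return dict(groups)

# Two revisions in the same blob, one in another cluster:
groups = group_by_blob([
    "DB://cluster24/1234/0",
    "DB://cluster24/1234/1",
    "DB://cluster25/99",
])
```

The gaps for deleted revisions Roan mentions would show up here as blob parts that no surviving text-table row points at; the grouping simply never requests them.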
ExternalStoreDB::fetchBlob() already keeps the last blob to optimize
repeated accesses to the same blob (we would probably want a bigger
cache for the dumper, though).
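Such a bigger cache could be as simple as an LRU wrapper around the fetch. A sketch in Python, where `fetch_blob` is a hypothetical stand-in for the real fetcher:

```python
from functools import lru_cache

@lru_cache(maxsize=100)  # dumper-sized: keep the last 100 blobs, not just 1
def fetch_blob(cluster, blob_id):
    """Stand-in for ExternalStoreDB::fetchBlob(), which in MediaWiki
    caches only the single most recently fetched blob."""
    return "contents of %s/%s" % (cluster, blob_id)  # would query the ES DB

first = fetch_blob("cluster24", 1234)
again = fetch_blob("cluster24", 1234)  # served from the cache
```

A single-entry cache already helps when revisions of one page arrive in blob order; a larger one would also absorb the interleaving that parallel dump workers introduce.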
On the other hand, I don't think the dumpers should be storing text-id
contents in memcached (Revision::loadText): they fill it with entries
that are useless for users' queries (the dumps have a different locality
set) and useless for the dumpers themselves (since they traverse the full
list exactly once), and, even assuming memcached can happily handle it
and no other data is affected by it, the network delay makes it a
non-free operation.
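The argument can be made concrete with a cache-bypass flag on the text loader. A sketch, where `populate_cache` is a hypothetical parameter (MediaWiki's Revision::loadText has no such flag):

```python
def load_text(text_id, cache, fetch, populate_cache=True):
    """Fetch revision text via a shared cache.

    populate_cache is a hypothetical flag: a dumper reads each text
    exactly once, so writing its results into the shared cache evicts
    entries useful to real user queries and spends a network round-trip
    for nothing.
    """
    text = cache.get(text_id)
    if text is None:
        text = fetch(text_id)
        if populate_cache:  # dumpers would pass False here
            cache[text_id] = text
    return text

# Usage: a dump-mode load that leaves the shared cache untouched.
shared_cache = {}
texts = {42: "revision text"}
result = load_text(42, shared_cache, texts.get, populate_cache=False)
```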
Ariel, do you have on wikitech the step-by-step list of actions to set up
a WMF dump server? I always forget which scripts are being used and what
each of them does.
Can xmldumps-phase3 be removed? I'd prefer that it use
release/trunk/wmf-deployment; an old copy is a source of problems. If
additional changes are needed (it seems unpatched), the appropriate hooks
should be added in core.
Most backups run off of trunk. The stuff I have in my branch is the
parallel stuff for testing.
http://wikitech.wikimedia.org/view/Dumps details the various scripts.
No, xmldumps-phase3 can't be removed yet. I have yet to make the
changes I need to that code (and I won't make them in core immediately;
they need to be tested thoroughly before being checked in). Once I think
they are OK, I will fold them into trunk. It will be a while yet.
Ariel
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l