On Sat, Aug 2, 2008 at 9:49 PM, Platonides <Platonides@gmail.com> wrote:
Magnus Manske wrote:
Independent of that,
- Run several parallel processes on several servers (assuming we have several)
- Each process generates the complete history dump of a single
article, or a small group of them, bzip2-compressed to keep intermediate disk usage down
- Success/failure is checked, so each process can be rerun if needed
- At the end, all these files are appended into a single bzip2/7zip file
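For concreteness, a minimal sketch of that scheme in Python (not real dump code): dump_article_history() is a hypothetical stand-in for whatever actually produces one article's full revision history, and the chunk/output file names are made up. It relies on the fact that concatenated bzip2 streams are themselves a valid bzip2 file, so the per-group chunks can simply be appended:

    import bz2, shutil
    from concurrent.futures import ProcessPoolExecutor

    def dump_article_history(page_id):
        # Hypothetical stand-in: the real worker would fetch the complete
        # revision history of page_id and return it as <page>...</page> XML.
        return "<page id='%d'>...</page>\n" % page_id

    def dump_group(group_no, page_ids):
        # One small process: dump a group of articles into its own bzip2 file.
        path = "chunk-%06d.xml.bz2" % group_no
        with bz2.open(path, "wt", encoding="utf-8") as f:
            for page_id in page_ids:
                f.write(dump_article_history(page_id))
        return path

    def run(groups, workers=8):
        chunks = [None] * len(groups)
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(dump_group, i, ids): i for i, ids in enumerate(groups)}
            for fut, i in futures.items():
                chunks[i] = fut.result()   # raises if that group failed, so only it needs a rerun
        # Concatenated bzip2 streams form one valid .bz2 file,
        # so the final step is a plain append of all the chunks.
        with open("full-history.xml.bz2", "wb") as out:
            for path in chunks:
                with open(path, "rb") as part:
                    shutil.copyfileobj(part, out)

Stock bunzip2 handles such a multi-stream file fine, though some older libraries only read the first stream.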
The system we use is not exactly that. It writes compressed data while reading from the compressed previous dump and from the revision snapshot; it never handles uncompressed data. The little processes would need to know where in the last file the section they're doing is. However, if you knew where it was in the old dump... it's worth considering.
Why is it using the old dump instead of the "real" storage? For performance reasons?
Does that mean that if there's an error in an old dump, it will stay there forever?
How does this cope with deleted revisions?
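If the splitting idea were pursued, one way the little processes could know where in the old file their section sits (a sketch only, not what the current dumper does) would be to write each group as its own bzip2 stream and record the byte offset where that stream starts in a small index beside the dump. In Python, with made-up file names and index format:

    import bz2, json

    def write_indexed_dump(groups, dump_path, index_path):
        # groups: iterable of (first_page_id, xml_text) pairs -- hypothetical input.
        index = {}
        with open(dump_path, "wb") as out:
            for first_page_id, xml_text in groups:
                index[first_page_id] = out.tell()   # offset where this stream starts
                out.write(bz2.compress(xml_text.encode("utf-8")))
        with open(index_path, "w") as f:
            json.dump(index, f)

    def read_group(dump_path, index_path, first_page_id):
        # Pull back just one group by seeking to its recorded offset.
        with open(index_path) as f:
            offset = json.load(f)[str(first_page_id)]   # JSON keys come back as strings
        decomp = bz2.BZ2Decompressor()                  # stops at the end of one stream
        data = bytearray()
        with open(dump_path, "rb") as f:
            f.seek(offset)
            while not decomp.eof:
                chunk = f.read(65536)
                if not chunk:
                    break
                data.extend(decomp.decompress(chunk))
        return data.decode("utf-8")

A later run could then seek straight to its group's offset in the previous dump instead of scanning the whole file.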
This will need more disk space while the entire thing is running, as small text files compress less well than larger ones. Also, it eats more CPU cycles, first for starting all these processes and then for re-bzip2ing the intermediate files.
Not necessarily. If the number of files per bzip2 group is large enough, there is almost no difference.
Yes, we'd have to find a balance between many fast processes with lots of overhead and a few slow ones that, when they fail, set the dump back by weeks.
At work, I'm using a computing farm with several thousand cores, and the suggested time per process is < 2h. That may be worth contemplating, even though the technical situation at Wikimedia is very different.
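To gauge the compression-overhead side of that balance, a quick informal check in Python (texts standing in for a list of per-page history strings):

    import bz2

    def compare(texts, group_size):
        # Compare one big bzip2 stream against many per-group streams for the same text.
        one = len(bz2.compress("".join(texts).encode("utf-8")))
        grouped = sum(
            len(bz2.compress("".join(texts[i:i + group_size]).encode("utf-8")))
            for i in range(0, len(texts), group_size)
        )
        print("group_size=%d: one stream %d bytes, per-group streams %d bytes (%+.1f%%)"
              % (group_size, one, grouped, 100.0 * (grouped - one) / one))

Since bzip2 compresses in blocks of at most 900 KB anyway, once each group holds a few megabytes of text the per-stream overhead should be in the noise.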
Magnus