Magnus Manske wrote:
Knowing little about the current dump generation process, but a bit about terabyte-scale data handling (actually, we here are well into the petabyte range by now ;-), how about this:
* Set up the usual MySQL replication slave
* At one point in time, disconnect it from the MySQL master, but leave
it running in read-only mode
* Use that as the dump base
This should result in a single-point-in-time snapshot.
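For illustration, a minimal sketch of that proposal in Python, assuming pymysql and placeholder host/credentials; freezing a replica this way is ordinary MySQL administration, and it is not how the dumps are actually produced today:

import pymysql

# Placeholder connection details; a real setup would point at a dedicated dump slave.
conn = pymysql.connect(host="dump-slave.example", user="dump", password="...")
try:
    with conn.cursor() as cur:
        # Stop applying events from the master: the data freezes at a
        # single point in time until replication is started again.
        cur.execute("STOP SLAVE")
        # Refuse writes from ordinary clients while the dump runs.
        cur.execute("SET GLOBAL read_only = ON")
finally:
    conn.close()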
Why? As I already said, the revision status is a snapshot; it's done in a transaction.
Also, it will reduce load on the rest of the system.
Not sure if IDs will change internally, though.
IDs won't change, but you don't need the disconnected slave. Once you have the revisions, you will be querying external storage. That's where the load goes.
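(For readers unfamiliar with external storage: revision text lives on separate clusters, addressed by blob URLs roughly of the form "DB://cluster/id". The helper below is a hypothetical illustration of why those fetches are naturally batched per cluster, not the actual dumper code.)

from collections import defaultdict

def group_by_cluster(addresses):
    """Group external-storage blob addresses (e.g. "DB://cluster5/12345")
    so each storage cluster is queried in batches instead of doing one
    round trip per revision."""
    per_cluster = defaultdict(list)
    for addr in addresses:
        _, rest = addr.split("://", 1)          # "DB" scheme, then "cluster/id"
        cluster, blob_id = rest.split("/", 1)
        per_cluster[cluster].append(blob_id)
    return per_cluster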
Independently of that:
* Run several parallel processes on several servers (assuming we have several)
* Each process generates the complete history dump of a single article, or a small group of them, bzip2-compressed to save intermediate disk space
* Success/failure is checked, so each process can be rerun if needed
* At the end, all these files are appended into a single bzip2/7zip file
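A minimal sketch of that per-page scheme in Python, with dump_history() as a placeholder for whatever would actually pull and serialize a page's revisions; the one real trick it relies on is that concatenated bzip2 streams still form a valid .bz2 file, so the per-page chunks can simply be appended:

import bz2
from multiprocessing import Pool

def dump_history(title):
    # Placeholder: the real worker would fetch every revision of `title`
    # and serialize it as export XML.
    return "<page><title>%s</title></page>\n" % title

def compress_page(title):
    # One independent bzip2 stream per page; a failed page can be redone
    # in isolation without touching the rest of the dump.
    return bz2.compress(dump_history(title).encode("utf-8"))

def build_dump(titles, out_path, workers=8):
    with Pool(workers) as pool, open(out_path, "wb") as out:
        # Concatenated bzip2 streams are a valid multi-stream .bz2 file,
        # so the per-page chunks are just appended in order.
        for chunk in pool.imap(compress_page, titles, chunksize=64):
            out.write(chunk)

if __name__ == "__main__":
    build_dump(["Foo", "Bar", "Baz"], "history-sketch.xml.bz2", workers=4)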
The system we use is not exactly that. It writes compressed data from a compressed read of the last dump and from the revision snapshot; it never uses uncompressed data.
The little processes would need to know where in the last file the section they're handling is.
However, if you knew which part of the old dump it was in... it's worth considering.
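One hedged way to tell each small process where its section is would be a single sequential pass over the previous dump that records the uncompressed offset of every page; a sketch, assuming the usual export XML layout where <title> directly follows <page>:

import bz2

def index_old_dump(path):
    """Map page title -> uncompressed byte offset of its <page> element
    in the previous full-history dump."""
    index = {}
    page_offset = None
    with bz2.open(path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            if b"<page>" in line:
                page_offset = offset
            elif b"<title>" in line and page_offset is not None:
                title = line.split(b"<title>")[1].split(b"</title>")[0]
                index[title.decode("utf-8")] = page_offset  # title kept XML-escaped
            offset = f.tell()
            line = f.readline()
    return index

Note that seeking to such an offset in a single-stream .bz2 still means decompressing everything before it, so the index only pays off if the old dump is split into, or recompressed as, many independent streams.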
This will need more disk space while the entire thing is running, as small text files compress less well than larger ones. Also, it eats more CPU cycles for starting all these processes, and then for re-bzip2ing the intermediate files.
Not necessarily. If the number of files per bzip2 group is large enough, there is almost no difference.
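That claim is easy to check empirically; a throwaway comparison like the one below (synthetic revision texts, purely illustrative) measures total compressed size as the number of pages per bzip2 stream grows:

import bz2

def grouped_size(chunks, group_size):
    """Total compressed size when `group_size` chunks share one bzip2 stream."""
    total = 0
    for i in range(0, len(chunks), group_size):
        total += len(bz2.compress(b"".join(chunks[i:i + group_size])))
    return total

if __name__ == "__main__":
    # Synthetic revision texts, only to make the sketch runnable.
    chunks = [(b"Revision text %d " % i) * 300 for i in range(500)]
    for k in (1, 10, 100, 500):
        print("%4d pages per stream -> %d bytes" % (k, grouped_size(chunks, k)))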
But it is a lot less error-prone (if a process or a bunch of them fail, just restart them), and it scales better (just throw more machines at it to make it faster, or use the apaches during low-traffic hours). Individual processes should be less memory-intensive, so several of them can run on the same machine.
My 2c
Magnus
We are talking very happily here, but what is slowing the dump process? Brion, Tim, is there some profiling information about that? Is it I/O, waiting for the revisions fetched from external storage? Is it disk speed when reading/writing? Is it CPU for decompressing the previous dump? Is it CPU for compressing? How is dbzip2 helping with it?*
*I thought you were using dbzip2, but I now see mw:Dbzip2 says "dbzip2 is not ready for public use yet". Has it been indefinitely postponed?
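In the absence of real numbers, a crude way to answer those questions would be to wrap each stage of the dump loop (fetching from external storage, decompressing the old dump, compressing the output) in a timer and compare wall-clock against CPU time; a hedged sketch, with the stage names and wrapped functions purely illustrative:

import time
from collections import defaultdict

wall = defaultdict(float)
cpu = defaultdict(float)

def timed(stage, fn, *args, **kwargs):
    """Accumulate wall-clock and CPU time per stage; a large wall/CPU gap
    for a stage suggests it is mostly waiting on I/O or the network."""
    w0, c0 = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    wall[stage] += time.perf_counter() - w0
    cpu[stage] += time.process_time() - c0
    return result

# Usage (hypothetical functions): text = timed("fetch", fetch_revision_text, rev_id)
#                                 out = timed("compress", bz2.compress, text)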