Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it? Incremental backups would be very useful for people like me who have a monthly transfer limit.
xdelta seems to be good for small files, but it has a 2 GB file size limit. As I didn't manage to install xdelta2, I can't tell whether that version can handle larger files. With two uncompressed files of 1.6 and 1.9 GB, xdelta starts, but it needs more than 500 MB of main memory and is very slow (10 MB of output after approx. 2 hours, when I stopped it; my computer has 1 GB of main memory, so swapping is not the problem).
There may also be other suitable programs such as rdiff, but users would have to install them (and may fail, as I did with xdelta2). Some of these programs may not be available for non-Unix operating systems.
It's also possible to extract the difference information for the old table from the binary log, but for cur this is impractical because there are often multiple updates of the same article within a short period of time.
In my opinion, the most natural file format is a sequence of SQL statements that can be sent directly to the database to update it, whether they are extracted from the binary log or from the dumps.
I've written two simple programs that use the dumps (see attachment); only Perl is needed to run them. "wsqldiff" computes the difference between two dumps, and "wsqlpatch" is for users whose programs read the dumps directly: it applies the difference to the old dump and produces an exact copy of the newer dump that wsqldiff was given, so users can compare the MD5 sums to check whether the update was successful.
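For illustration only, here is a rough Perl sketch of the same idea. It is not one of the attached scripts: it assumes the dumps are already decompressed and were written with one row per INSERT statement (mysqldump --skip-extended-insert), and it holds both tables in memory, which the real dumps do not allow.

#!/usr/bin/perl -w
# Hypothetical sketch of a dump-to-dump diff.  The output is a sequence
# of SQL statements (DELETE + INSERT) that could be sent directly to the
# database.  Assumption: one row per INSERT line, with the numeric
# primary key (cur_id or old_id) as the first column.
use strict;

my ($old_dump, $new_dump, $table) = @ARGV;
die "usage: sqldiff-sketch OLD.sql NEW.sql TABLE\n" unless defined $table;

# Collect "INSERT INTO <table> VALUES (<id>,...);" lines, keyed by id.
sub read_rows {
    my ($file) = @_;
    my %rows;
    open my $fh, '<', $file or die "$file: $!\n";
    while (my $line = <$fh>) {
        $rows{$1} = $line if $line =~ /^INSERT INTO `?$table`? VALUES \((\d+),/;
    }
    close $fh;
    return \%rows;
}

my $old = read_rows($old_dump);
my $new = read_rows($new_dump);

# Rows that disappeared: delete them by primary key.
for my $id (sort { $a <=> $b } keys %$old) {
    print "DELETE FROM $table WHERE ${table}_id=$id;\n"
        unless exists $new->{$id};
}

# New or changed rows: drop the old version (if any) and re-insert.
for my $id (sort { $a <=> $b } keys %$new) {
    next if exists $old->{$id} && $old->{$id} eq $new->{$id};
    print "DELETE FROM $table WHERE ${table}_id=$id;\n" if exists $old->{$id};
    print $new->{$id};
}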
Here are the results of some tests:
Test files (German database):
20040109_cur_table.sql.bz2: 33.7 MB
20040117_cur_table.sql.bz2: 34.9 MB
20040109_old_table.sql.bz2: 429.9 MB
20040117_old_table.sql.bz2: 450.1 MB
export LANG=C # otherwise the umlauts get converted :-(
wsqldiff 20040109_cur_table.sql.bz2 20040117_cur_table.sql.bz2 | bzip2 -c9 > cur_diff.sql.bz2 (17 min, Athlon XP 2400+)
wsqldiff 20040109_old_table.sql.bz2 20040117_old_table.sql.bz2 | bzip2 -c9 > old_diff.sql.bz2 (117 min)
cur_diff.sql.bz2: 7.8 MB
old_diff.sql.bz2: 22.1 MB
wsqlpatch 20040109_cur_table.sql.bz2 cur_diff.sql.bz2 | bzip2 -c9 > 20040117_cur_table2.sql.bz2 (10 min)
wsqlpatch 20040109_old_table.sql.bz2 old_diff.sql.bz2 | bzip2 -c9 > 20040117_old_table2.sql.bz2 (80 min)
Memory usage of wsqldiff is moderate (64 MB for the German old table, so probably about 300 MB for the English old table). wsqlpatch needs much less memory (about 10 MB, largely independent of the file sizes).
The Perl scripts are quite slow, but maybe fast enough for the cur tables. I'd like to know what is planned for the old tables: will they be dumped periodically in the future, or is that only a temporary solution? If I understand it correctly, dumping means that the database must be locked, which is bad unless the dump is made on a replicated server.
As an alternative, the binary log could be archived. But in that case someone with access to it on the Wikipedia server would have to write a program that extracts the statements that alter the old table, which probably involves nothing more than selecting the statements that start with "INSERT INTO old", "DELETE FROM old" or "UPDATE old" from the output of mysqlbinlog.
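A minimal filter of that kind might look like this (a sketch only, not tested against real binary logs; note that a statement in the mysqlbinlog output can span several lines):

#!/usr/bin/perl -w
# Hypothetical filter: keep only the statements that alter the old table.
# Example usage:
#   mysqlbinlog binlog.001 | perl old-filter.pl | bzip2 -c9 > old_updates.sql.bz2
use strict;

my $keep = 0;
while (my $line = <STDIN>) {
    # start of a statement that touches the old table
    $keep = 1 if $line =~ /^(INSERT INTO old|DELETE FROM old|UPDATE old)\b/;
    print $line if $keep;
    # statements may span several lines and end with ";"
    $keep = 0 if $keep && $line =~ /;\s*$/;
}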
BTW, when I wrote the programs I noticed that in maintenance/tables.sql the last two fields of the cur table are cur_touched and inverse_timestamp in that order, whereas in the dumps the order is reversed.
de:Benutzer:El
elwp@gmx.de writes:
Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it?
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
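A sketch of such an import script in Perl, for what it's worth (the file layout, table and column names here are assumptions, and REPLACE assumes a unique index on cur_namespace, cur_title):

#!/usr/bin/perl -w
# Hypothetical client-side import: turn the files of a local CVS checkout
# into SQL for the local database; the output could be piped into
# "mysql wikidb".  The subdirectory scheme and title escaping are ignored
# here, and only namespace 0 is handled.
use strict;
use File::Find;

my $checkout = shift || '.';
find(sub {
    return unless -f $_;
    return if $File::Find::dir =~ m{/CVS$};          # skip CVS bookkeeping dirs
    my $title = $_;
    open my $fh, '<', $_ or die "$File::Find::name: $!\n";
    my $text = do { local $/; <$fh> };                # slurp the whole file
    close $fh;
    for ($title, $text) { s/\\/\\\\/g; s/'/\\'/g; }   # minimal SQL escaping
    print "REPLACE INTO cur (cur_namespace, cur_title, cur_text) ",
          "VALUES (0, '$title', '$text');\n";
}, $checkout);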
Also, rsync can save bandwidth, as we all know.
Otherwise your solution looks interesting.
Karl Eichwalder wrote:
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database? That would be a great thing for offline editing, but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Alwin Meschede
Alwin Meschede wrote:
Karl Eichwalder wrote:
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database? That would be a great thing for offline editing, but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Oh, I see it was the other way round... Writing to the *local* database is OK.
Alwin Meschede
Alwin Meschede ameschede@gmx.de writes:
Alwin Meschede wrote:
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database?
At the moment I'm only interested in read-only access via CVS.
That would be a great thing for offline editing,
Yes :) Even now I do most of my editing outside the web browser's editor (via cut-and-paste; so far I don't know how to plug an external editor into Mozilla on Linux).
but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Thus far I'm not sure about the commits - some kind of verifying proxy in between might be appropriate.
Oh, I see it was the other way round... Writing to the *local* database is OK.
Yes, my initial proposal is pretty innocent.
elwp@gmx.de writes:
Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it?
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
I wonder how fast CVS is with so many small files. And I doubt that the scripts that would have to be developed to mirror the database to CVS on the server and to put the files back into the database on the client machine are much simpler than my programs, but you are welcome to write them, of course. :-)
de:Benutzer:El
elwp@gmx.de writes:
I wonder how fast CVS is with so many small files.
I guess one has to spread the files over subdirs, e.g. as follows:
Nancy -> N/a/Nancy
New York -> N/e/New York
Nuremberg -> N/u/Nuremberg
or
Nancy -> N/Na/Nancy
New York -> N/Ne/New York
Nuremberg -> N/Nu/Nuremberg
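For the second scheme, the mapping from title to path could be computed with something like this (a sketch that ignores characters not allowed in file names):

#!/usr/bin/perl -w
# Hypothetical: map an article title to a path like "N/Na/Nancy".
# Real titles would also need escaping of "/" and other characters
# that cannot appear in file names.
use strict;

sub title_to_path {
    my ($title) = @_;
    my $first  = uc(substr($title, 0, 1));
    my $prefix = ucfirst(lc(substr($title, 0, 2)));
    return "$first/$prefix/$title";
}

print title_to_path($_), "\n" for ('Nancy', 'New York', 'Nuremberg');
# prints N/Na/Nancy, N/Ne/New York and N/Nu/Nuremberg (one per line)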
And I doubt that the scripts that would have to be developed to mirror the database to CVS on the server and to put the files back into the database on the client machine are much simpler than my programs, but you are welcome to write them, of course. :-)
But I expect the scripts to be trivial - like yours ;)
I'd simply write every successful commit into a separate table (or a separate database) from which you can dump to CVS - all this could be done during the server's idle time.
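A rough sketch of that server-side step (table, column and directory names are made up; the real schema would have to be defined first):

#!/usr/bin/perl -w
# Hypothetical: read an assumed commit-queue table and write one file per
# article into a CVS working copy.  A cron job could then run "cvs commit"
# there during idle time; newly created files would still need "cvs add".
use strict;
use DBI;

my $wc  = '/var/lib/wiki-cvs-checkout';        # assumed CVS working copy
my $dbh = DBI->connect('DBI:mysql:database=wikidb', 'wikiuser', 'secret',
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    'SELECT title, text FROM commit_queue ORDER BY queued_at');
$sth->execute;
while (my ($title, $text) = $sth->fetchrow_array) {
    (my $file = $title) =~ s{/}{%2F}g;         # crude escaping of "/"
    open my $fh, '>', "$wc/$file" or die "$wc/$file: $!\n";
    print $fh $text;
    close $fh;
}
$dbh->disconnect;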