Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it? Incremental backups would be very useful for people like me who have a monthly transfer limit.
xdelta seems to be good for small files, but it has a 2 GB file size limit. As I didn't manage to install xdelta2, I can't tell whether that version can handle larger files. With two uncompressed files of 1.6 and 1.9 GB, xdelta starts, but it needs more than 500 MB of main memory and is very slow (10 MB of output after approx. 2 hours, when I stopped it; my computer has 1 GB of main memory, so swapping is not the problem).
There may also be other suitable programs like rdiff, but users would have to install them (and may fail, as I did with xdelta2). Some of these programs may not be available for non-Unix operating systems.
It's also possible to extract the difference information for the old table from the binary log, but for cur this is impractical because there are often multiple updates of the same article within a short period of time.
In my opinion, the most natural file format is a sequence of SQL statements that can be sent directly to the database to update it, whether extracted from the binary log or from the dumps.
I've written two simple programs that use the dumps (see attachment). Only Perl is needed to run them. "wsqldiff" computes the difference between two dumps as a sequence of SQL statements. "wsqlpatch" is for users who read the dumps directly: it applies the diff to the older dump and produces an exact copy of the newer dump that was used by wsqldiff. So users can compare the MD5 sums to check whether the update was successful.
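The actual scripts are in Perl and parse the INSERT statements in the dumps; the table diff idea behind them can be sketched in a few lines of Python. The table name, column names, and row representation below are illustrative assumptions, not taken from the scripts:

```python
# Hypothetical sketch of the dump-diff idea: rows keyed by primary key,
# with the diff emitted as SQL statements (column names are assumptions).

def sql_quote(s):
    """Escape a string for a single-quoted MySQL literal."""
    return s.replace("\\", "\\\\").replace("'", "\\'")

def diff(old_rows, new_rows):
    """old_rows/new_rows: dicts mapping primary key -> row text.
    Yields SQL statements that turn the old table into the new one."""
    for key in old_rows:
        if key not in new_rows:
            yield "DELETE FROM cur WHERE cur_id=%d;" % key
    for key, text in new_rows.items():
        if key not in old_rows:
            yield ("INSERT INTO cur (cur_id,cur_text) VALUES (%d,'%s');"
                   % (key, sql_quote(text)))
        elif old_rows[key] != text:
            yield ("UPDATE cur SET cur_text='%s' WHERE cur_id=%d;"
                   % (sql_quote(text), key))

# Tiny demonstration: row 2 vanished, row 3 is new, row 1 is unchanged.
for stmt in diff({1: "foo", 2: "bar"}, {1: "foo", 3: "baz"}):
    print(stmt)
```

The output is itself valid SQL, so it can be piped straight into the database, which is the point of choosing this diff format.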
Here are the results of some tests:
Test files (German database):
20040109_cur_table.sql.bz2: 33.7 MB
20040117_cur_table.sql.bz2: 34.9 MB
20040109_old_table.sql.bz2: 429.9 MB
20040117_old_table.sql.bz2: 450.1 MB
export LANG=C # otherwise the umlauts get converted :-(
wsqldiff 20040109_cur_table.sql.bz2 20040117_cur_table.sql.bz2 | bzip2 -c9 > cur_diff.sql.bz2 (17 min, Athlon XP 2400+)
wsqldiff 20040109_old_table.sql.bz2 20040117_old_table.sql.bz2 | bzip2 -c9 > old_diff.sql.bz2 (117 min)
cur_diff.sql.bz2: 7.8 MB
old_diff.sql.bz2: 22.1 MB
wsqlpatch 20040109_cur_table.sql.bz2 cur_diff.sql.bz2 | bzip2 -c9 > 20040117_cur_table2.sql.bz2 (10 min)
wsqlpatch 20040109_old_table.sql.bz2 old_diff.sql.bz2 | bzip2 -c9 > 20040117_old_table2.sql.bz2 (80 min)
Memory usage of wsqldiff is moderate (64 MB for the German old table, so probably about 300 MB for the English old table). wsqlpatch needs much less memory (about 10 MB, largely independent of the file sizes).
The Perl scripts are quite slow, but maybe fast enough for the cur tables. I'd like to know what is planned for the old tables: will they be dumped periodically in the future, or is that only a temporary solution? If I understand correctly, dumping means that the database must be locked, which is bad unless the dump is made on a replicated server.
As an alternative, the binary log could be archived. But in that case, someone with access to it on the Wikipedia server would have to write a program that extracts the statements altering the old table, which probably involves nothing more than selecting the lines that start with "INSERT INTO old", "DELETE FROM old", or "UPDATE old" (using mysqlbinlog, of course).
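The filtering step on the textual output of mysqlbinlog could look roughly like this; the sample lines are invented, and real binlog output can spread one statement over several lines, which this one-liner ignores:

```python
# Sketch: keep only the statements that touch the old table in the
# text output of mysqlbinlog (prefixes taken from the post).
PREFIXES = ("INSERT INTO old", "DELETE FROM old", "UPDATE old")

def filter_old_statements(lines):
    """Return the lines that begin with one of the old-table prefixes."""
    return [line for line in lines if line.lstrip().startswith(PREFIXES)]

# Invented sample of binlog text output:
sample = [
    "INSERT INTO old (old_id,old_text) VALUES (42,'...');",
    "UPDATE cur SET cur_text='x' WHERE cur_id=7;",
    "DELETE FROM old WHERE old_id=13;",
]
print("\n".join(filter_old_statements(sample)))
```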
BTW, when I wrote the programs I noticed that in maintenance/tables.sql the last two fields of the cur table are cur_touched and inverse_timestamp in that order, whereas in the dumps the order is reversed.
de:Benutzer:El