Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it? Incremental backups would be very useful for people like me who have a monthly transfer limit.
xdelta seems to be good for small files, but it has a 2 GB file size limit. As I didn't manage to install xdelta2, I can't tell whether that version can handle larger files. With two uncompressed files of 1.6 and 1.9 GB, xdelta starts, but it needs more than 500 MB of main memory and is very slow (10 MB of output after approx. 2 hours, when I stopped it; my computer has 1 GB of main memory, so swapping is not the problem).
There may also be other suitable programs such as rdiff, but users would have to install them (and may fail, as I did with xdelta2). Some of these programs may not be available for non-Unix operating systems.
It's also possible to extract the difference information for the old table from the binary log, but for cur this is impractical because there are often multiple updates of the same article within a short period of time.
In my opinion, the most natural file format is a sequence of SQL statements that can be sent directly to the database to update it, whether they are extracted from the binary log or from the dumps.
I've written two simple programs that use the dumps (see attachment); only Perl is needed to run them. "wsqldiff" computes the difference between two dumps, and "wsqlpatch" is for users whose programs read the dumps directly: it applies the difference to the old dump and produces an exact copy of the newer dump that wsqldiff was given, so users can compare the MD5 sums to check whether the update was successful.
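For illustration only, here is a rough Perl sketch of the same idea. It is not one of the attached scripts: it assumes the dumps are already decompressed and were written with one row per INSERT statement (mysqldump --skip-extended-insert), and it holds both tables in memory, which the real dumps do not allow.

#!/usr/bin/perl -w
# Hypothetical sketch of a dump-to-dump diff.  The output is a sequence
# of SQL statements (DELETE + INSERT) that could be sent directly to the
# database.  Assumption: one row per INSERT line, with the numeric
# primary key (cur_id or old_id) as the first column.
use strict;

my ($old_dump, $new_dump, $table) = @ARGV;
die "usage: sqldiff-sketch OLD.sql NEW.sql TABLE\n" unless defined $table;

# Collect "INSERT INTO <table> VALUES (<id>,...);" lines, keyed by id.
sub read_rows {
    my ($file) = @_;
    my %rows;
    open my $fh, '<', $file or die "$file: $!\n";
    while (my $line = <$fh>) {
        $rows{$1} = $line if $line =~ /^INSERT INTO `?$table`? VALUES \((\d+),/;
    }
    close $fh;
    return \%rows;
}

my $old = read_rows($old_dump);
my $new = read_rows($new_dump);

# Rows that disappeared: delete them by primary key.
for my $id (sort { $a <=> $b } keys %$old) {
    print "DELETE FROM $table WHERE ${table}_id=$id;\n"
        unless exists $new->{$id};
}

# New or changed rows: drop the old version (if any) and re-insert.
for my $id (sort { $a <=> $b } keys %$new) {
    next if exists $old->{$id} && $old->{$id} eq $new->{$id};
    print "DELETE FROM $table WHERE ${table}_id=$id;\n" if exists $old->{$id};
    print $new->{$id};
}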
Here are the results of some tests:
Test files (German database):
20040109_cur_table.sql.bz2: 33.7 MB
20040117_cur_table.sql.bz2: 34.9 MB
20040109_old_table.sql.bz2: 429.9 MB
20040117_old_table.sql.bz2: 450.1 MB
export LANG=C # otherwise the umlauts get converted :-(
wsqldiff 20040109_cur_table.sql.bz2 20040117_cur_table.sql.bz2 | bzip2 -c9 > cur_diff.sql.bz2 (17 min, Athlon XP 2400+)
wsqldiff 20040109_old_table.sql.bz2 20040117_old_table.sql.bz2 | bzip2 -c9 > old_diff.sql.bz2 (117 min)
cur_diff.sql.bz2: 7.8 MB
old_diff.sql.bz2: 22.1 MB
wsqlpatch 20040109_cur_table.sql.bz2 cur_diff.sql.bz2 | bzip2 -c9 > 20040117_cur_table2.sql.bz2 (10 min)
wsqlpatch 20040109_old_table.sql.bz2 old_diff.sql.bz2 | bzip2 -c9 > 20040117_old_table2.sql.bz2 (80 min)
Memory usage of wsqldiff is moderate (64 MB for the German old table, so probably about 300 MB for the English old table). wsqlpatch needs much less memory (about 10 MB, largely independent of the file sizes).
The Perl scripts are quite slow, but maybe fast enough for the cur tables. I'd like to know what is planned for the old tables: will they be dumped periodically in the future, or is that only a temporary solution? If I understand it correctly, dumping means that the database must be locked, which is bad unless the dump is made on a replicated server.
As an alternative, the binary log could be archived. But in that case someone with access to it on the Wikipedia server would have to write a program that extracts the statements that alter the old table, which probably involves nothing more than selecting the statements that start with "INSERT INTO old", "DELETE FROM old" or "UPDATE old" from the output of mysqlbinlog.
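A minimal filter of that kind might look like this (a sketch only, not tested against real binary logs; note that a statement in the mysqlbinlog output can span several lines):

#!/usr/bin/perl -w
# Hypothetical filter: keep only the statements that alter the old table.
# Example usage:
#   mysqlbinlog binlog.001 | perl old-filter.pl | bzip2 -c9 > old_updates.sql.bz2
use strict;

my $keep = 0;
while (my $line = <STDIN>) {
    # start of a statement that touches the old table
    $keep = 1 if $line =~ /^(INSERT INTO old|DELETE FROM old|UPDATE old)\b/;
    print $line if $keep;
    # statements may span several lines and end with ";"
    $keep = 0 if $keep && $line =~ /;\s*$/;
}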
BTW, when I wrote the programs I noticed that in maintenance/tables.sql the last two fields of the cur table are cur_touched and inverse_timestamp in that order, whereas in the dumps the order is reversed.
de:Benutzer:El
elwp@gmx.de writes:
Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it?
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
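A sketch of such an import script in Perl, for what it's worth (the file layout, table and column names here are assumptions, and REPLACE assumes a unique index on cur_namespace, cur_title):

#!/usr/bin/perl -w
# Hypothetical client-side import: turn the files of a local CVS checkout
# into SQL for the local database; the output could be piped into
# "mysql wikidb".  The subdirectory scheme and title escaping are ignored
# here, and only namespace 0 is handled.
use strict;
use File::Find;

my $checkout = shift || '.';
find(sub {
    return unless -f $_;
    return if $File::Find::dir =~ m{/CVS$};          # skip CVS bookkeeping dirs
    my $title = $_;
    open my $fh, '<', $_ or die "$File::Find::name: $!\n";
    my $text = do { local $/; <$fh> };                # slurp the whole file
    close $fh;
    for ($title, $text) { s/\\/\\\\/g; s/'/\\'/g; }   # minimal SQL escaping
    print "REPLACE INTO cur (cur_namespace, cur_title, cur_text) ",
          "VALUES (0, '$title', '$text');\n";
}, $checkout);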
Also, rsync can save bandwidth, as we all know.
Otherwise your solution looks interesting.
Karl Eichwalder wrote:
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database? That would be a great thing for offline editing, but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Alwin Meschede
Alwin Meschede wrote:
Karl Eichwalder wrote:
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database? That would be a great thing for offline editing, but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Oh, I see it was the other way round... Writing to the *local* database is OK.
Alwin Meschede
Alwin Meschede ameschede@gmx.de writes:
Alwin Meschede wrote:
Your proposal sounds very interesting to me. But did I get it right that you also want to use CVS as a way of writing to the database?
At the moment I'm only interested in read-only access via CVS.
That would be a great thing for offline editing,
Yes :) Even now I do most of my editing outside the web browser's editor (via cut-and-paste; so far I don't know how to plug an external editor into Mozilla on Linux).
but it makes mass vandalism possible: simply fill your local files with random crap, say "cvs commit", and voilà, you've ruined a whole wiki...
Thus far I'm not sure about the commits - some kind of verifying proxy in between might be appropriate.
Oh, I see it was the other way round... Writing to the *local* database is OK.
Yes, my initial proposal is pretty innocent.
elwp@gmx.de writes:
Two recent posts about incremental backups:
http://mail.wikipedia.org/pipermail/wikitech-l/2003-September/006036.html
http://mail.wikipedia.org/pipermail/wikitech-l/2003-December/007134.html
Is anyone still working on it?
This is all much too complicated and resource-hungry. Simply dump the raw articles (plus meta info) into a CVS repository and let users fetch the files with a simple "cvs up". And a simple shell script (awk/sed/python) can feed the local CVS files into the database.
I wonder how fast CVS is with so many small files. And I doubt that the scripts that would have to be developed to mirror the database to CVS on the server and to put the files back into the database on the client machine are much simpler than my programs, but you are welcome to write them, of course. :-)
de:Benutzer:El
elwp@gmx.de writes:
I wonder how fast CVS is with so many small files.
I guess one has to spread the files over subdirs, e.g. as follows:
Nancy -> N/a/Nancy
New York -> N/e/New York
Nuremberg -> N/u/Nuremberg
or
Nancy -> N/Na/Nancy
New York -> N/Ne/New York
Nuremberg -> N/Nu/Nuremberg
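For the second scheme, the mapping from title to path could be computed with something like this (a sketch that ignores characters not allowed in file names):

#!/usr/bin/perl -w
# Hypothetical: map an article title to a path like "N/Na/Nancy".
# Real titles would also need escaping of "/" and other characters
# that cannot appear in file names.
use strict;

sub title_to_path {
    my ($title) = @_;
    my $first  = uc(substr($title, 0, 1));
    my $prefix = ucfirst(lc(substr($title, 0, 2)));
    return "$first/$prefix/$title";
}

print title_to_path($_), "\n" for ('Nancy', 'New York', 'Nuremberg');
# prints N/Na/Nancy, N/Ne/New York and N/Nu/Nuremberg (one per line)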
And I doubt that the scripts that would have to be developed to mirror the database to CVS on the server and to put the files back into the database on the client machine are much simpler than my programs, but you are welcome to write them, of course. :-)
But I expect the scripts to be trivial - like yours ;)
I'd simply write every successful commit into a separate table (or a separate database) from which you can dump to CVS - all this could be done during the server's idle time.
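A rough sketch of that server-side step (table, column and directory names are made up; the real schema would have to be defined first):

#!/usr/bin/perl -w
# Hypothetical: read an assumed commit-queue table and write one file per
# article into a CVS working copy.  A cron job could then run "cvs commit"
# there during idle time; newly created files would still need "cvs add".
use strict;
use DBI;

my $wc  = '/var/lib/wiki-cvs-checkout';        # assumed CVS working copy
my $dbh = DBI->connect('DBI:mysql:database=wikidb', 'wikiuser', 'secret',
                       { RaiseError => 1 });
my $sth = $dbh->prepare(
    'SELECT title, text FROM commit_queue ORDER BY queued_at');
$sth->execute;
while (my ($title, $text) = $sth->fetchrow_array) {
    (my $file = $title) =~ s{/}{%2F}g;         # crude escaping of "/"
    open my $fh, '>', "$wc/$file" or die "$wc/$file: $!\n";
    print $fh $text;
    close $fh;
}
$dbh->disconnect;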