The following is a discussion which started on the village pump; I thought it would be more appropriate to continue it here, given the closely related discussion that has been going on on this list.
------------------------------ [Start village pump quote]
Once the full Wikipedia is downloaded, can smaller periodic updates covering new stuff and changes be obtained and used to synch the local? --[[User:Ted Clayton|Ted Clayton]] 04:26, 13 Sep 2003 (UTC)
:No, you can't. I've been thinking the same thing myself. I think we need to:
:*Allow incremental updates for all types of download
:*Allow bulk image downloads
:*Package a stripped-down version of the old table in with the cur dumps, where the revision history (users, times, comments etc.) is included, but the old text itself is not
:*Develop a method of compressing the old table so that the similarity between adjacent revisions can be used to full advantage
: -- [[User:Tim Starling|Tim Starling]] 04:38, Sep 13, 2003 (UTC)
Would it be easier to have incremental updates on something like a subscription basis? The server packages dailies or weeklies and shoots them out to everyone on the list? During off hours, mass-mail fashion?
Can you suggest sources or search terms for treatments of table manipulation, as background for the stripping and compressing? --[[User:Ted Clayton|Ted Clayton]] 03:14, 14 Sep 2003 (UTC)
[End village pump quote] -------------------------------
Regarding sending them to a list: do you mean by email? That would depend on size -- anything more than a couple of megs a week and we'll max out people's inboxes. I think we'd be better off making a series of patches available via HTTP, and providing a client tool (probably just a PHP script) to download the required patches and merge them into the local database.
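
To make that concrete, here's the sort of thing I have in mind for the client side. It's only a rough sketch: the patch URL, the file naming, the gzip packaging and the idea of piping plain SQL into the mysql command-line client are all assumptions for illustration, not decisions.

  <?php
  // Hypothetical incremental-update client: fetch SQL patches over HTTP
  // and pipe them into the local database via the mysql command-line client.
  // The URL scheme, file naming and gzip packaging are illustrative only.

  $base = 'http://download.wikipedia.org/patches/';          // hypothetical location
  $db   = 'wikidb';
  $last = (int) trim(file_get_contents('last_patch.txt'));   // last patch applied locally

  for ($n = $last + 1; ; $n++) {
      $patch = @file_get_contents($base . "patch-$n.sql.gz");
      if ($patch === false) {
          break;                      // no more patches published yet
      }
      $sql  = gzdecode($patch);       // patches assumed to be gzipped SQL
      $pipe = popen('mysql ' . escapeshellarg($db), 'w');
      fwrite($pipe, $sql);
      pclose($pipe);
      file_put_contents('last_patch.txt', $n);   // remember how far we got
  }
  ?>

Something along those lines, plus error handling, would let people run it from cron and stay in sync.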
Regarding sources: I'm not aware of anyone having done exactly this before, although I haven't really looked. I imagine we would just roll our own: perhaps dump the data using the mysql client, do some text processing, and finally run it through gzip. Assuming the text processing can be done fast enough, a PHP script would probably be best for this part too, for consistency.
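
By way of illustration, that processing step could be as simple as this. The query, the namespace filter and the output filename are all made up for the example:

  <?php
  // Hypothetical dump step: run a query through the mysql client,
  // do some line-by-line text processing, and gzip the result.
  // The query and the filtering rule are illustrative only.

  $db  = 'wikidb';
  $sql = 'SELECT cur_namespace, cur_title, cur_timestamp FROM cur';

  $in  = popen('mysql --batch -N ' . escapeshellarg($db) .
               ' -e ' . escapeshellarg($sql), 'r');
  $out = gzopen('cur-stripped.txt.gz', 'w9');

  while (($line = fgets($in)) !== false) {
      $fields = explode("\t", $line);
      if ((int) $fields[0] % 2 == 1) {
          continue;                   // e.g. skip talk namespaces
      }
      gzwrite($out, $line);
  }
  pclose($in);
  gzclose($out);
  ?>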
We could use the binary log, as Brion suggested -- I've been meaning to reply to that message. We can convert the log to runnable SQL using "mysqlbinlog". Then we'd have to parse each query to determine which table it writes to, just like MySQL slaves do when tables are excluded. It wouldn't be perfectly efficient, because multiple writes to the same row would all be included, so a cur dump might have 100 copies of the village pump in it. But that's better than including 100,000 unchanged articles. If we get ambitious we can always filter out the redundant writes.
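
To show the kind of filtering I mean, here's a naive sketch. It only looks at the statement keyword and the table name, handles multi-line statements very crudely, and doesn't yet drop redundant writes to the same row; the binlog filename is made up:

  <?php
  // Hypothetical binary-log filter: read SQL produced by mysqlbinlog and
  // keep only statements that touch the tables we want to distribute.
  // The statement parsing here is deliberately naive.

  $wanted  = array('cur', 'links', 'brokenlinks');   // tables to include
  $pattern = '/^\s*(INSERT\s+INTO|UPDATE|REPLACE\s+INTO|DELETE\s+FROM)\s+`?(\w+)`?/i';

  $in   = popen('mysqlbinlog binlog.000042', 'r');   // filename is illustrative
  $skip = false;
  while (($line = fgets($in)) !== false) {
      if (preg_match($pattern, $line, $m)) {
          // New statement: decide whether to keep or drop it.
          $skip = !in_array(strtolower($m[2]), $wanted);
      }
      if (!$skip) {
          echo $line;                // pass the statement (or SET/comment) through
      }
      if ($skip && preg_match('/;\s*$/', $line)) {
          $skip = false;             // the dropped statement has ended
      }
  }
  pclose($in);
  ?>

Filtering out the redundant writes would mean parsing out the primary key as well and keeping only the last write per row -- more work, but the same general idea.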
-- Tim Starling <t`starling`physics`unimelb`edu`au>