The following is a discussion which started on the village pump; I thought it would be more appropriate to continue it here, given the closely related discussion that has been going on on this list.
------------------------------ [Start village pump quote]
Once the full Wikipedia is downloaded, can smaller periodic updates covering new stuff and changes be obtained and used to synch the local? --[[User:Ted Clayton|Ted Clayton]] 04:26, 13 Sep 2003 (UTC)
:No, you can't. I've been thinking the same thing myself. I think we need to:
:*Allow incremental updates for all types of download
:*Allow bulk image downloads
:*Package a stripped-down version of the old table in with the cur dumps, where the revision history (users, times, comments etc.) is included, but the old text itself is not
:*Develop a method of compressing the old table so that the similarity between adjacent revisions can be used to full advantage
: -- [[User:Tim Starling|Tim Starling]] 04:38, Sep 13, 2003 (UTC)
Would it be easier to have incremental updates on something like a subscription basis? The server packages dailies or weeklies and shoots them out to everyone on the list? During off hours, mass-mail fashion?
Can you suggest sources or search terms for treatments of table manipulation, as background for the stripping and compressing? --[[User:Ted Clayton|Ted Clayton]] 03:14, 14 Sep 2003 (UTC)
[End village pump quote] -------------------------------
Regarding sending them to a list: do you mean by email? That would depend on size -- anything more than a couple of megs a week and we'll max out people's inboxes. I think we'd be better off making a series of patches available via HTTP, and providing a client tool (probably just a PHP script) to download the required patches and merge them into the local database.
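
To make that concrete, here's the sort of thing I have in mind for the client side. It's only a rough sketch: the patch URL, the file naming, the gzip packaging and the idea of piping plain SQL into the mysql command-line client are all assumptions for illustration, not decisions.

  <?php
  // Hypothetical incremental-update client: fetch SQL patches over HTTP
  // and pipe them into the local database via the mysql command-line client.
  // The URL scheme, file naming and gzip packaging are illustrative only.

  $base = 'http://download.wikipedia.org/patches/';          // hypothetical location
  $db   = 'wikidb';
  $last = (int) trim(file_get_contents('last_patch.txt'));   // last patch applied locally

  for ($n = $last + 1; ; $n++) {
      $patch = @file_get_contents($base . "patch-$n.sql.gz");
      if ($patch === false) {
          break;                      // no more patches published yet
      }
      $sql  = gzdecode($patch);       // patches assumed to be gzipped SQL
      $pipe = popen('mysql ' . escapeshellarg($db), 'w');
      fwrite($pipe, $sql);
      pclose($pipe);
      file_put_contents('last_patch.txt', $n);   // remember how far we got
  }
  ?>

Something along those lines, plus error handling, would let people run it from cron and stay in sync.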
Regarding sources: I'm not aware of anyone having done exactly this before, although I haven't really looked. I imagine we would just roll our own: perhaps dump the data using the mysql client, do some text processing, and finally run it through gzip. Assuming the text processing can be done fast enough, a PHP script would probably be best for this part too, for consistency.
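
By way of illustration, that processing step could be as simple as this. The query, the namespace filter and the output filename are all made up for the example:

  <?php
  // Hypothetical dump step: run a query through the mysql client,
  // do some line-by-line text processing, and gzip the result.
  // The query and the filtering rule are illustrative only.

  $db  = 'wikidb';
  $sql = 'SELECT cur_namespace, cur_title, cur_timestamp FROM cur';

  $in  = popen('mysql --batch -N ' . escapeshellarg($db) .
               ' -e ' . escapeshellarg($sql), 'r');
  $out = gzopen('cur-stripped.txt.gz', 'w9');

  while (($line = fgets($in)) !== false) {
      $fields = explode("\t", $line);
      if ((int) $fields[0] % 2 == 1) {
          continue;                   // e.g. skip talk namespaces
      }
      gzwrite($out, $line);
  }
  pclose($in);
  gzclose($out);
  ?>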
We could use the binary log, as Brion suggested -- I've been meaning to reply to that message. We can convert the log to runnable SQL using "mysqlbinlog". Then we'd have to parse each query to determine which table it writes to, just like MySQL slaves do when tables are excluded. It wouldn't be perfectly efficient, because multiple writes to the same row would all be included, so a cur dump might have 100 copies of the village pump in it. But that's better than including 100,000 unchanged articles. If we get ambitious we can always filter out the redundant writes.
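
To show the kind of filtering I mean, here's a naive sketch. It only looks at the statement keyword and the table name, handles multi-line statements very crudely, and doesn't yet drop redundant writes to the same row; the binlog filename is made up:

  <?php
  // Hypothetical binary-log filter: read SQL produced by mysqlbinlog and
  // keep only statements that touch the tables we want to distribute.
  // The statement parsing here is deliberately naive.

  $wanted  = array('cur', 'links', 'brokenlinks');   // tables to include
  $pattern = '/^\s*(INSERT\s+INTO|UPDATE|REPLACE\s+INTO|DELETE\s+FROM)\s+`?(\w+)`?/i';

  $in   = popen('mysqlbinlog binlog.000042', 'r');   // filename is illustrative
  $skip = false;
  while (($line = fgets($in)) !== false) {
      if (preg_match($pattern, $line, $m)) {
          // New statement: decide whether to keep or drop it.
          $skip = !in_array(strtolower($m[2]), $wanted);
      }
      if (!$skip) {
          echo $line;                // pass the statement (or SET/comment) through
      }
      if ($skip && preg_match('/;\s*$/', $line)) {
          $skip = false;             // the dropped statement has ended
      }
  }
  pclose($in);
  ?>

Filtering out the redundant writes would mean parsing out the primary key as well and keeping only the last write per row -- more work, but the same general idea.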
-- Tim Starling <t`starling`physics`unimelb`edu`au>