On 10/20/07, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:
> On 10/20/07, Platonides <Platonides(a)gmail.com> wrote:
>> Gregory Maxwell wrote:
>>> Bleh. Someone pulling increments couldn't build a point-in-time
>>> snapshot, they would need to always pull the full. And we want people
>>> using point-in-time versions of the site, not mangled mixes.
>>
>> They'd use the stubs version.
> Okay, you didn't mention that.... but please no: I have had a hard
> enough time explaining to people that the separate SQL dumps aren't
> consistent with the history dumps.
>
> I don't want to end up in a situation where the only way to get a sane
> copy of the site is stitching together dozens of files on the
> recipient's side.... people will do it wrong, or just skip building a
> point-in-time version at all.. and make a big mess.
>
> I'd rather go back to having separate metadata and text dumps than end
> up with people needing to combine an old full dump, N large
> incremental files, and a new stub dump through a bunch of complex
> manipulation in order to arrive at a consistent copy of the site.
> If we wanted to do that on the back end.. fine.
Additionally I just don't see a lot of demand for incremental
full-history dumps. For research purposes you're generally going to
have to download the whole dump anyway, and even if it takes a few
days or you have to get someone to make you a few DVD-Rs it's no big
deal (*). For mirror/fork purposes you want a live feed and/or some
sort of API access.
API access would be great. Reasonably priced live feeds would be
great too. But incremental full-history dumps would be a lot of work
for little benefit, IMO.
(*) In my experience it takes about 3-5 times as long to uncompress
and import the dump as it does to download it, and that's for the .bz2
dump; if something could be done to cut down *that* component, I'd be
all for it. I guess incremental dumps would help that part too,
though some sort of index file would probably be a better solution.
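
To make the index-file idea concrete, here's a rough sketch (not an
existing tool; the names and layout assumptions are mine): stream
through a pages dump once and record, for each <page>, its title and
its byte offset in the uncompressed XML. The offsets only really pay
off if the dump is kept uncompressed or recompressed in seekable
chunks, but with an index like this an importer could jump straight to
the pages it wants instead of grinding through everything before them.

import bz2
import sys

def build_index(dump_path, index_path):
    """Write one "uncompressed-byte-offset<TAB>title" line per <page>."""
    offset = 0            # byte offset within the *uncompressed* XML stream
    page_offset = None    # offset of the most recent "<page>" start tag
    with bz2.BZ2File(dump_path) as dump, \
         open(index_path, "w", encoding="utf-8") as idx:
        for line in dump:
            stripped = line.strip()
            if stripped == b"<page>":
                page_offset = offset
            elif (page_offset is not None
                  and stripped.startswith(b"<title>")
                  and stripped.endswith(b"</title>")):
                title = stripped[len(b"<title>"):-len(b"</title>")]
                idx.write("%d\t%s\n" % (page_offset, title.decode("utf-8")))
                page_offset = None
            offset += len(line)

if __name__ == "__main__":
    build_index(sys.argv[1], sys.argv[2])

Something like "python build_index.py pages-meta-history.xml.bz2
pages.idx" would then give you a flat offset/title table to seek with.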
>>> Also, I expect that once 7zed the increments will not be too much
>>> smaller than the full, especially if partitioned by revid.
>>
>> I wasn't proposing a file per revid, but a file per N revisions, where
>> N is a number which fits our needs ;-)
> Partition by revid doesn't necessarily mean one rev per file... and
> that's certainly not what I thought you were suggesting.
>
> You will screw compression if you partition by revid (i.e. in groups
> of revs, failing to keep all revs of a single article in one place).
> If you don't want to take my word for it, try it yourself.
Very good point, but if you still grouped the revs by article, surely
it'd be a smaller file with fewer revs. Pathological cases aside,
prepending extra data to a file makes the compressed file bigger, not
smaller, right?
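
In the spirit of "try it yourself", here's a quick synthetic test
(entirely made-up articles and edit sizes, and bz2 instead of 7z simply
because it's in the Python standard library, so it only illustrates the
direction of the effect, not realistic ratios). It fakes some revision
histories, then compresses the very same revisions once in global
rev-id order and once with all revs of each article kept together:

import bz2
import random

random.seed(0)

ARTICLES = 50
TOTAL_REVS = 2000

# Each fake article starts from its own base text; an "edit" tweaks a few words.
texts = [["article%d-word%d" % (a, w) for w in range(400)] for a in range(ARTICLES)]

revisions = []   # (rev_id, article_id, full text) in global edit order
for rev_id in range(TOTAL_REVS):
    a = random.randrange(ARTICLES)        # edits hit the articles in random order
    for _ in range(5):
        texts[a][random.randrange(400)] = "edit%d" % random.randrange(10**6)
    revisions.append((rev_id, a, " ".join(texts[a])))

def bz2_size(revs):
    blob = "\n".join(text for _, _, text in revs).encode("utf-8")
    return len(bz2.compress(blob, 9))

by_revid = revisions                                        # rev-id (timestamp) order
by_article = sorted(revisions, key=lambda r: (r[1], r[0]))  # all revs of an article together

print("ordered by rev id:  %d bytes compressed" % bz2_size(by_revid))
print("grouped by article: %d bytes compressed" % bz2_size(by_article))

Since bz2 compresses in ~900 KB blocks, I'd expect the grouped-by-article
ordering to come out much smaller: each block is then full of
near-identical revisions of one article, whereas the rev-id ordering
scatters an article's revisions across many blocks.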
A key question is what's the demand? Who wants dumps, and for what
purposes? Are you willing to pay for them? If 10-15 people each
chipped in $10/month toward a dedicated server, the possibilities are
fairly endless. Each person could create a custom dump geared toward
their particular needs, if necessary.