On 10/20/07, Gregory Maxwell <gmaxwell(a)gmail.com> wrote:
> On 10/20/07, Platonides <Platonides(a)gmail.com> wrote:
>> Gregory Maxwell wrote:
>>> Bleh. Someone pulling increments couldn't build a point-in-time
>>> snapshot, they would need to always pull the full. And we want people
>>> using point-in-time versions of the site, not mangled mixes.
>>
>> They'd use the stubs version.
> Okay, you didn't mention that.... but please no: I have had a hard
> enough time explaining to people that the separate SQL dumps aren't
> consistent with the history dumps.
>
> I don't want to end up in a situation where the only way to get a sane
> copy of the site is stitching together dozens of files on the
> recipient's side.... people will do it wrong, or just skip building a
> point-in-time version at all.. and make a big mess.
>
> I'd rather go back to having separate metadata and text dumps than end
> up with people needing to combine an old full dump, N large
> incremental files, and a new stub dump through a bunch of complex
> manipulation in order to arrive at a consistent copy of the site.
> If we wanted to do that on the back end.. fine.
Additionally I just don't see a lot of demand for incremental
full-history dumps. For research purposes you're generally going to
have to download the whole dump anyway, and even if it takes a few
days or you have to get someone to make you a few DVD-Rs it's no big
deal (*). For mirror/fork purposes you want a live feed and/or some
sort of API access.
API access would be great. Reasonably priced live feeds would be
great too. But incremental full-history dumps would be a lot of work
for little benefit, IMO.
(*) In my experience it takes about 3-5 times as long to uncompress
and import the dump as it does to download it, and that's for the .bz2
dump; if something could be done to cut down *that* component, I'd be
all for it. I guess incremental dumps would help that part too,
though some sort of index file would probably be a better solution.
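
To make the index-file idea concrete, here's a rough sketch (not an
existing tool; the names and layout assumptions are mine): stream
through a pages dump once and record, for each <page>, its title and
its byte offset in the uncompressed XML. The offsets only really pay
off if the dump is kept uncompressed or recompressed in seekable
chunks, but with an index like this an importer could jump straight to
the pages it wants instead of grinding through everything before them.

import bz2
import sys

def build_index(dump_path, index_path):
    """Write one "uncompressed-byte-offset<TAB>title" line per <page>."""
    offset = 0            # byte offset within the *uncompressed* XML stream
    page_offset = None    # offset of the most recent "<page>" start tag
    with bz2.BZ2File(dump_path) as dump, \
         open(index_path, "w", encoding="utf-8") as idx:
        for line in dump:
            stripped = line.strip()
            if stripped == b"<page>":
                page_offset = offset
            elif (page_offset is not None
                  and stripped.startswith(b"<title>")
                  and stripped.endswith(b"</title>")):
                title = stripped[len(b"<title>"):-len(b"</title>")]
                idx.write("%d\t%s\n" % (page_offset, title.decode("utf-8")))
                page_offset = None
            offset += len(line)

if __name__ == "__main__":
    build_index(sys.argv[1], sys.argv[2])

Something like "python build_index.py pages-meta-history.xml.bz2
pages.idx" would then give you a flat offset/title table to seek with.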
>>> Also, I expect that once 7zed the increments will not be too much
>>> smaller than the full, especially if partitioned by revid.
>>
>> I wasn't proposing a file per revid, but a file per N revisions, where
>> N is a number which fits our needs ;-)
> Partition by revid doesn't necessarily mean one rev per file... and
> that's certainly not what I thought you were suggesting.
>
> You will screw compression if you partition by revid (i.e. in groups
> of revs, failing to keep all revs of a single article in one place).
> If you don't want to take my word for it, try it yourself.
Very good point, but if you still grouped the revs by article, surely
it'd be a smaller file with fewer revs. Pathological cases aside,
prepending extra data to a file makes the compressed file bigger, not
smaller, right?
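
In the spirit of "try it yourself", here's a quick synthetic test
(entirely made-up articles and edit sizes, and bz2 instead of 7z simply
because it's in the Python standard library, so it only illustrates the
direction of the effect, not realistic ratios). It fakes some revision
histories, then compresses the very same revisions once in global
rev-id order and once with all revs of each article kept together:

import bz2
import random

random.seed(0)

ARTICLES = 50
TOTAL_REVS = 2000

# Each fake article starts from its own base text; an "edit" tweaks a few words.
texts = [["article%d-word%d" % (a, w) for w in range(400)] for a in range(ARTICLES)]

revisions = []   # (rev_id, article_id, full text) in global edit order
for rev_id in range(TOTAL_REVS):
    a = random.randrange(ARTICLES)        # edits hit the articles in random order
    for _ in range(5):
        texts[a][random.randrange(400)] = "edit%d" % random.randrange(10**6)
    revisions.append((rev_id, a, " ".join(texts[a])))

def bz2_size(revs):
    blob = "\n".join(text for _, _, text in revs).encode("utf-8")
    return len(bz2.compress(blob, 9))

by_revid = revisions                                        # rev-id (timestamp) order
by_article = sorted(revisions, key=lambda r: (r[1], r[0]))  # all revs of an article together

print("ordered by rev id:  %d bytes compressed" % bz2_size(by_revid))
print("grouped by article: %d bytes compressed" % bz2_size(by_article))

Since bz2 compresses in ~900 KB blocks, I'd expect the grouped-by-article
ordering to come out much smaller: each block is then full of
near-identical revisions of one article, whereas the rev-id ordering
scatters an article's revisions across many blocks.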
A key question is what's the demand? Who wants dumps, and for what
purposes? Are you willing to pay for them? If 10-15 people each
chipped in $10/month toward a dedicated server, the possibilities are
fairly endless. Each person could create a custom dump geared toward
their particular needs, if necessary.