Timwi:
> Two questions spring to mind:
>
> Firstly, when you say "unchanged sections are stored only once", does
> this apply even if someone changes a section and someone else reverts
> it,

Yes, if both revision texts reside in the same history blob. Up to
20 consecutive revisions are stored in one blob.

> or if someone copies a section to another page?

No.

> Maybe all the pages
> should be split into sections, and all the sections stored individually?

I doubt that this would improve the compression much, because texts aren't
copied that often.

> Secondly, how great will the dependence between a revision and the
> previous revision be? In other words, how many (compressed) revisions
> will have to be retrieved in order to reconstruct the (uncompressed)
> text of just one revision?

The complete history blob must be decompressed, of course, but no previous
revisions need to be reconstructed. The uncompressed history blob begins
with a list of section numbers for each revision, followed by a
(position, length) pair for each section. So when a revision text is
to be extracted, this is what happens:
* uncompress history blob
* look up section list for the requested revision
* look up section offsets and lengths
* concatenate sections
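
The four steps above can be sketched roughly as follows. This is only an
illustration, not the actual implementation: zlib as the compressor, the
function name extract_revision, the already-parsed index structures and
the toy data are all assumptions, and offsets are taken to be relative to
the start of the section data.

```python
import zlib


def extract_revision(compressed, revision_sections, section_table, rev):
    """Rebuild one revision text from a history blob (hypothetical layout)."""
    # Step 1: uncompress the history blob (zlib is an assumption here).
    data = zlib.decompress(compressed)
    # Step 2: look up the section list for the requested revision.
    section_ids = revision_sections[rev]
    # Step 3: look up the offset and length of each of those sections.
    spans = [section_table[s] for s in section_ids]
    # Step 4: concatenate the sections.
    return b"".join(data[pos:pos + length] for pos, length in spans)


# Toy blob with three stored sections (made-up data).
sections = [b"Intro\n", b"== A ==\nold\n", b"== A ==\nnew\n"]
table = []
offset = 0
for s in sections:
    table.append((offset, len(s)))
    offset += len(s)
blob = zlib.compress(b"".join(sections))

# Revision 2 reverts revision 1, so it reuses the sections of revision 0.
revisions = [[0, 1], [0, 2], [0, 1]]

print(extract_revision(blob, revisions, table, 2).decode())
```

No earlier revision is reconstructed along the way; only the section list
of the requested revision is consulted.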
This is an example header (first 20 revisions of the German article
"Stern"):
00000020 00000025 00000142 00000260 00000001 # 20 revisions, 25 different sections
0 # first revision has no heading: only one section
1 2 3
4 5 6
4 5 6 # conversion script: nothing changed
7 5 8
7 9 8
10 9 8
7 9 8
7 11 8
12 11 8
12 13 8
12 14 8
15 14 8
16 14 8
17 14 8
18 14 8
19 14 8
19 20 8
21 22 23
21 24 23
0 2579 # offset and length of the first section
2579 1176
...
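
To see how much sharing this header implies, the section lists from the
example can be counted with a few lines of Python. The variable names are
made up for illustration; the numbers are copied from the header above.

```python
# Section lists copied from the example header, one row per revision.
HEADER_ROWS = """\
0
1 2 3
4 5 6
4 5 6
7 5 8
7 9 8
10 9 8
7 9 8
7 11 8
12 11 8
12 13 8
12 14 8
15 14 8
16 14 8
17 14 8
18 14 8
19 14 8
19 20 8
21 22 23
21 24 23
"""

rows = [[int(n) for n in line.split()] for line in HEADER_ROWS.splitlines()]

distinct = {s for row in rows for s in row}   # sections actually stored
references = sum(len(row) for row in rows)    # sections the revisions refer to

print(len(rows), len(distinct), references)   # 20 25 58

# The 8th revision has the same section list as the 6th (a revert of the
# 7th), so the revert adds no new section text to the blob.
print(rows[7] == rows[5])                     # True
```

So the 20 revisions refer to 58 sections in total, but only 25 section
texts are stored.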