Netocrat wrote:
> On Mon, 12 Sep 2005 01:56:53 -0700, Brion Vibber wrote:
>> See Tim's presentation from 21C3: http://zwinger.wikimedia.org/berlin/
> That's exactly the sort of info I was looking for. Was any attempt made to compress the diffs? I'd be interested to know how the result compares, in compression ratio and overall speed, with the compressed concatenated revisions.
No, no work has been done along these lines that I'm aware of.
> The three main reasons given for finding an improvement to RCS diffs were:
> - moved paragraphs
> - reverted edits
> - minor changes within a line
> The first and third could be handled by a customised diff format, and the second by links in the database. Have those possibilities been considered, and what are the pros and cons of this approach versus the current compression scheme?
No, these possibilities have not been rigorously examined. Note that those aren't really reasons; they're illustrative only, and addressing each of them in turn does not guarantee that your compression algorithm is effective. I was just describing my train of thought in arriving at the idea that LZ77 might be worth a try. If you have your own idea, please download a dump and try it out.
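For anyone who wants to try it, here's a rough sketch of the experiment in Python (illustrative only; a real test would run over a full dump, and compare_schemes is just a name I made up):

import zlib

def compare_schemes(revisions):
    """Compare compressing each revision separately against compressing
    the concatenation of all revisions of one page. `revisions` is a
    list of revision texts as bytes, oldest first."""
    # Scheme 1: each revision compressed on its own (roughly what
    # per-revision gzip storage does).
    separate = sum(len(zlib.compress(r)) for r in revisions)
    # Scheme 2: adjacent revisions concatenated and compressed as one
    # stream. LZ77 back-references encode a later revision largely as
    # copies of earlier text, so reverts and small edits cost little.
    concatenated = len(zlib.compress(b"".join(revisions)))
    return separate, concatenated

Note that zlib's LZ77 window is only 32KB, so back-references can't reach further than that; on long articles a real test would need a compressor with a larger window, or chunking, to see the full benefit.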
The main thing that put me off implementing diff-based compression was the complexity, in particular the required schema change. If you need to load some large number of diffs in order to generate a revision, those diffs need to be loaded in a single database query if any kind of efficiency is to be achieved.
In other words, don't do a proof of principle and then nag me to write the real thing, as if that were the easy part.
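To illustrate the kind of access pattern I mean, here's a hypothetical sketch. The diff-chain layout is invented, db.query() and apply_diff() are assumed helpers; only the old_id/old_text column names are borrowed from the real text table:

def fetch_revision_text(db, chain_ids):
    """Rebuild one revision from a chain of stored rows: a full-text
    base first, then the diffs leading up to the wanted revision."""
    placeholders = ",".join(["%s"] * len(chain_ids))
    # One round trip for the whole chain, not one query per diff.
    rows = db.query(
        "SELECT old_id, old_text FROM text WHERE old_id IN (%s)"
        % placeholders, chain_ids)
    by_id = dict((row["old_id"], row["old_text"]) for row in rows)
    text = by_id[chain_ids[0]]          # the full-text base
    for diff_id in chain_ids[1:]:       # apply diffs in stored order
        text = apply_diff(text, by_id[diff_id])
    return text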
Since that talk, we've addressed the scalability issue by implementing external storage, allowing us to store text on the terabytes of Apache-server hard drive space that were previously unused. Because of this, we're now less concerned about size and more concerned about performance and manageability: we'd like faster backups and much simpler administration. Effective use of the existing compression and external storage features has been hampered by high system administration overhead, and any new storage proposal needs to be evaluated in this context.
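For the curious, external storage amounts to a level of indirection in the text table. A simplified sketch follows; the DB://cluster/id pointer form matches what we use, but fetch_blob is an assumed helper and the raw-deflate detail is from memory:

import zlib

def resolve_text(old_text, old_flags, fetch_blob):
    """Resolve a text row that may point elsewhere. A row flagged
    'external' stores a pointer like 'DB://cluster1/12345' instead of
    the text itself; fetch_blob(cluster, blob_id) is an assumed helper
    that reads the blob from that cluster."""
    flags = old_flags.split(",")
    data = old_text
    if "external" in flags:
        store_type, path = data.split("://", 1)  # 'DB', 'cluster1/12345'
        cluster, blob_id = path.split("/", 1)
        data = fetch_blob(cluster, blob_id)
    if "gzip" in flags:
        # Assumed: the payload is raw deflate, hence the negative wbits.
        data = zlib.decompress(data, -zlib.MAX_WBITS)
    return data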
> The disadvantage of the current compression scheme seems to me to be that the wiki software must work with the full text of a set of revisions at a time, i.e. the whole group has to be uncompressed to access any one of them.
The advantage is that when a number of adjacent revisions are required (such as during a backup), those revisions can be loaded quickly with a minimum of seeking.
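To make the trade-off concrete, here's a toy model of the concatenated scheme in Python (in the spirit of our ConcatenatedGzipHistoryBlob, but simplified, not the real class):

import zlib

class ConcatBlob:
    """Toy model of concatenated-revision storage: N adjacent revisions
    of one page are compressed together as a single unit."""

    def __init__(self, revisions):
        # Remember each item's length so it can be sliced back out.
        self.lengths = [len(r) for r in revisions]
        self.blob = zlib.compress(b"".join(revisions))

    def get(self, i):
        # Random access: the whole blob must be decompressed to reach
        # any one revision -- the disadvantage described above.
        data = zlib.decompress(self.blob)
        start = sum(self.lengths[:i])
        return data[start:start + self.lengths[i]]

    def get_all(self):
        # Sequential access (e.g. a backup pass): one decompression and
        # one disk read yield every revision -- the advantage.
        data = zlib.decompress(self.blob)
        out, start = [], 0
        for n in self.lengths:
            out.append(data[start:start + n])
            start += n
        return out

Random access to a single revision pays for decompressing the whole blob, while a backup that walks the entire history pays that cost only once per blob.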
--
Tim Starling