What is/are the reason/s for storing the full text of page revisions in the database as opposed to some form of differential? Am I correct in assuming that speed has been given priority over storage space requirements, and if so, has any benchmarking been done to find out how much overhead would be added by storing revisions as diffs and how much space would be saved?
Also, has there been any discussion of the possibility of branching a page (as is possible in e.g. a CVS repository)?
I haven't seen anything on either of these issues after a quick search through the list archives and some non-exhaustive reading of the main wiki, so pointers to anything I've missed are welcome.
Netocrat wrote:
What is/are the reason/s for storing the full text of page revisions in the database as opposed to some form of differential?
Expedience; it hasn't been written yet.
Am I correct in assuming that speed has been given priority over storage space requirements, and if so, has any benchmarking been done to find out how much overhead would be added by storing revisions as diffs and how much space would be saved?
See Tim's presentation from 21C3: http://zwinger.wikimedia.org/berlin/
Also, has there been any discussion of the possibility of branching a page (as is possible in e.g. a CVS repository)?
Not really. Tagging of revisions is likely to happen soonish, branching not so likely.
-- brion vibber (brion @ pobox.com)
On Mon, 12 Sep 2005 01:56:53 -0700, Brion Vibber wrote:
Netocrat wrote:
What is/are the reason/s for storing the full text of page revisions in the database as opposed to some form of differential?
Expedience; it hasn't been written yet.
Am I correct in assuming that speed has been given priority over storage space requirements, and if so, has any benchmarking been done to find out how much overhead would be added by storing revisions as diffs and how much space would be saved?
See Tim's presentation from 21C3: http://zwinger.wikimedia.org/berlin/
That's exactly the sort of info I was looking for. Was any attempt made to compress the diffs? I'd be interested to know how the result compared to the compressed concatenated revisions in terms of compression and overall speed.
The three main reasons to find an improvement to rcs diffs were stated as:
- moved paragraphs
- reverted edits
- minor changes within a line
The 1st and 3rd could be handled by a customised diff format and the 2nd could be handled by links in the database - have those possibilities been considered and what pros/cons are there to this approach vs the current compression scheme?
The disadvantage of the current compression scheme seems to me to be that the wiki software must decompress and then work on the full text of an entire set of revisions at a time.
Also, has there been any discussion of the possibility of branching a page (as is possible in e.g. a CVS repository)?
Not really. Tagging of revisions is likely to happen soonish, branching not so likely.
Being able to specify a particular revision in a link would be useful - I presume that's why tagging is being considered.
On 12/09/05, Netocrat netocrat@dodo.com.au wrote: [snip]
The three main reasons to find an improvement to rcs diffs were stated as:
- moved paragraphs
- reverted edits
- minor changes within a line
The 1st and 3rd could be handled by a customised diff format and the 2nd could be handled by links in the database - have those possibilities been considered and what pros/cons are there to this approach vs the current compression scheme?
There was discussion about this when the new DB schema was created (live in v1.5) - a "revision" is now stored as metadata, distinct from the "text" it references, so you can have null revisions (appear in the edit history but point at the same piece of text), and potentially store reverts by pointing back at an earlier piece of text.
Seems to me, this would be trivial to implement (if it isn't already) for the admin "rollback" button, pretty easy when people edit an out-of-date revision (diff their edit against the version they're editing; reuse the same text object if same), but slightly harder for people manually reverting a small change as a normal edit (you'd need to run diffs against the last few versions to see if any of them were identical to the new one).
Of course, the same logic needed for the second and third cases would be what allowed a diff-based system to pick a better version to diff against rather than assuming the previous. (It could then ignore major changes which were later reverted, even if the text was also changed during the revert)
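To make that concrete, here's a rough Python/SQLite sketch of the idea - purely illustrative, with made-up table and column names rather than the real 1.5 schema - showing a revision row as metadata pointing at a text row, and a revert reusing an existing text row instead of storing a new copy:

import hashlib
import sqlite3

# Illustrative stand-in for the 1.5-style split between revision metadata
# and revision text; names are invented, not the actual MediaWiki schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE text     (text_id INTEGER PRIMARY KEY, content TEXT, sha1 TEXT);
    CREATE TABLE revision (rev_id  INTEGER PRIMARY KEY, page TEXT,
                           text_id INTEGER REFERENCES text(text_id),
                           comment TEXT);
""")

def save_revision(page, new_text, comment, lookback=5):
    """Store a revision; if new_text is identical to one of the last few
    revisions of the page (e.g. a manual revert), reuse that text row."""
    digest = hashlib.sha1(new_text.encode("utf-8")).hexdigest()
    recent = db.execute(
        """SELECT t.text_id, t.sha1
           FROM revision r JOIN text t ON r.text_id = t.text_id
           WHERE r.page = ? ORDER BY r.rev_id DESC LIMIT ?""",
        (page, lookback)).fetchall()
    text_id = next((tid for tid, sha in recent if sha == digest), None)
    if text_id is None:  # genuinely new text: store a fresh copy
        cur = db.execute("INSERT INTO text (content, sha1) VALUES (?, ?)",
                         (new_text, digest))
        text_id = cur.lastrowid
    db.execute("INSERT INTO revision (page, text_id, comment) VALUES (?, ?, ?)",
               (page, text_id, comment))

save_revision("Example", "first version", "create")
save_revision("Example", "vandalism!", "blanked by vandal")
save_revision("Example", "first version", "revert")  # reuses text row 1
print(db.execute("SELECT rev_id, text_id, comment FROM revision").fetchall())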
Not really. Tagging of revisions is likely to happen soonish, branching not so likely.
Being able to specify a particular revision in a link would be useful - I presume that's why tagging is being considered.
You can already specify a particular revision numerically (as of 1.5; before that, the current revision wasn't stored as a revision) - indeed, there is now a "permanent link" in the toolbox to bookmark the current revision.
The point about tagging is for things like releasing "stable" snapshots of a project - you need to have a database marker on the "stable" revision of each page.
On Mon, 12 Sep 2005 13:38:57 +0100, Rowan Collins wrote:
On 12/09/05, Netocrat netocrat@dodo.com.au wrote: [snip]
The three main reasons to find an improvement to rcs diffs were stated as:
- moved paragraphs
- reverted edits
- minor changes within a line
The 1st and 3rd could be handled by a customised diff format and the 2nd could be handled by links in the database - have those possibilities been considered and what pros/cons are there to this approach vs the current compression scheme?
There was discussion about this when the new DB schema was created (live in v1.5) - a "revision" is now stored as metadata, distinct from the "text" it references, so you can have null revisions (appear in the edit history but point at the same piece of text), and potentially store reverts by pointing back at an earlier piece of text.
Seems to me, this would be trivial to implement (if it isn't already) for the admin "rollback" button, pretty easy when people edit an out-of-date revision (diff their edit against the version they're editing; reuse the same text object if same),
Yup, that's the sort of thing I was thinking of.
but slightly harder for people manually reverting a small change as a normal edit (you'd need to run diffs against the last few versions to see if any of them were identical to the new one).
That would have to be done on every save as far as I can see; the previous revisions would need to be called up from the database. It may be slow, but it may also improve compression.
Of course, the same logic needed for the second and third cases would be what allowed a diff-based system to pick a better version to diff against rather than assuming the previous. (It could then ignore major changes which were later reverted, even if the text was also changed during the revert)
Right; the neat thing is that you would generate some metadata to indicate which recent version an edit is most similar to. So in the case of edit wars, you could kind of have a branched page without explicit support for branching.
[...]
You can already specify a particular revision numerically (as of 1.5; before that, the current revision wasn't stored as a revision) - indeed, there is now a "permanent link" in the toolbox to bookmark the current revision.
I meant in a wiki link, i.e. something like [[Some Page?Revision]].
As far as I know that isn't possible.
On 12/09/05, Netocrat netocrat@dodo.com.au wrote:
On Mon, 12 Sep 2005 13:38:57 +0100, Rowan Collins wrote: [...]
You can already specify a particular revision numerically (as of 1.5; before that, the current revision wasn't stored as a revision) - indeed, there is now a "permanent link" in the toolbox to bookmark the current revision.
I meant in a wiki link, i.e. something like [[Some Page?Revision]].
As far as I know that isn't possible.
Ah... http://bugzilla.wikimedia.org/268 :p
Netocrat wrote:
On Mon, 12 Sep 2005 01:56:53 -0700, Brion Vibber wrote:
See Tim's presentation from 21C3: http://zwinger.wikimedia.org/berlin/
That's exactly the sort of info I was looking for. Was any attempt made to compress the diffs? I'd be interested to know how the result compared to the compressed concatenated revisions in terms of compression and overall speed.
No, no work has been done along these lines that I'm aware of.
The three main reasons to find an improvement to rcs diffs were stated as:
- moved paragraphs
- reverted edits
- minor changes within a line
The 1st and 3rd could be handled by a customised diff format and the 2nd could be handled by links in the database - have those possibilities been considered and what pros/cons are there to this approach vs the current compression scheme?
No, these possibilities have not been rigorously examined. Note that those aren't really reasons; they're illustrative only, and addressing each of them in turn does not guarantee that your compression algorithm is effective. I was just describing my train of thought in arriving at the idea that LZ77 might be worth a try. If you have your own idea, please download a dump and try it out.
The main thing which put me off implementing diff-based compression was the complexity, in particular the required schema change. If you need to load some large number of diffs in order to generate a revision, those diffs need to be loaded in a single database query, if any kind of efficiency is to be reached.
In other words, don't do a proof of principle and then nag me to write the real thing, as if that were the easy part.
Since that talk, we've addressed the scalability issue by implementing external storage, allowing us to store text on the terabytes of apache hard drive space which were previously unused. Because of this, we're less concerned about size now, and more about performance and manageability. We'd like to have faster backups and much simpler administration. Effective use of the existing compression and external storage features has been hampered by high system administration overhead. Any new storage proposal needs to be evaluated in this context.
The disadvantage of the current compression scheme seems to me to be that the wiki software must decompress and then work on the full text of an entire set of revisions at a time.
The advantage is that when a number of adjacent revisions are required (such as during a backup), those revisions can be loaded quickly with a minimum of seeking.
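Schematically, the current scheme amounts to something like the following Python sketch (illustrative only, not the actual MediaWiki code): a batch of adjacent revisions is packed into one compressed blob, so a single fetch and a single decompression yields the whole batch.

import json
import zlib

# Illustrative only: pack a batch of adjacent revisions into one blob.
def pack_revisions(revisions):
    """revisions: dict mapping revision id -> wikitext."""
    return zlib.compress(json.dumps(revisions).encode("utf-8"), 9)

def unpack_revision(blob, rev_id):
    """One fetch and one decompression gives access to every revision in
    the batch, which is what makes backups and adjacent reads cheap."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))[str(rev_id)]

revisions = {1: "first draft", 2: "first draft, expanded",
             3: "first draft, expanded and copy-edited"}
blob = pack_revisions(revisions)
print(len(blob), unpack_revision(blob, 2))

The flip side, as you say, is that any single read pays for decompressing the whole batch.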
-- Tim Starling
On Mon, 12 Sep 2005 23:40:55 +1000, Tim Starling wrote:
[...]
No, these possibilities have not been rigorously examined. Note that those aren't really reasons; they're illustrative only, and addressing each of them in turn does not guarantee that your compression algorithm is effective. I was just describing my train of thought in arriving at the idea that LZ77 might be worth a try. If you have your own idea, please download a dump and try it out.
I have ideas; I may do that at some point. My instinct is that a diff scheme that took account of intra-line changes as well as moved blocks of text would give the most compression, and that if that alone didn't, then compressing those diffs in the way the entire revision set is currently compressed would. The question I haven't delved into is how difficult that would be, and how computation-intensive compared to what is currently done.
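As a crude first test of that instinct, something along these lines could be run over real page histories (a Python sketch on toy data, with line-based unified diffs and zlib standing in for a smarter intra-line/block-move diff format):

import difflib
import zlib

# Toy revision history; in a real test these would come from a dump.
revisions = [
    "The cat sat on the mat.\nIt was a sunny day.\n",
    "The cat sat on the mat.\nIt was a very sunny day.\n",
    "It was a very sunny day.\nThe cat sat on the mat.\n",  # paragraphs moved
    "The cat sat on the mat.\nIt was a very sunny day.\n",  # reverted
]

def diff(old, new):
    # Line-based unified diff; a stand-in for a format that also handled
    # intra-line changes and moved blocks.
    return "".join(difflib.unified_diff(old.splitlines(True),
                                        new.splitlines(True)))

# First revision stored in full, later ones as diffs against the previous.
deltas = [revisions[0]] + [diff(a, b) for a, b in zip(revisions, revisions[1:])]

print("gzip of concatenated revisions:   ",
      len(zlib.compress("".join(revisions).encode("utf-8"), 9)))
print("gzip of concatenated diffs:       ",
      len(zlib.compress("".join(deltas).encode("utf-8"), 9)))
print("sum of individually gzipped diffs:",
      sum(len(zlib.compress(d.encode("utf-8"), 9)) for d in deltas))

On data this small the numbers mean nothing, but on real histories it would show whether diff-then-compress beats compressing the concatenated revisions.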
The main thing which put me off implementing diff-based compression was the complexity, in particular the required schema change. If you need to load some large number of diffs in order to generate a revision, those diffs need to be loaded in a single database query, if any kind of efficiency is to be reached.
Yes, I realise that.
In other words, don't do a proof of principle and then nag me to write the real thing, as if that were the easy part.
To start with I want to get an idea of what's currently done and why, and any ideas previously proposed and/or rejected and why. I understand that the main wikis are huge and that performance issues are important.
On Tue, Sep 13, 2005 at 12:24:27AM +1000, Netocrat wrote:
I have ideas; I may do that at some point. My instinct is that a diff scheme that took account of intra-line changes as well as moved blocks of text would give the most compression, and that if that alone didn't, then compressing those diffs in the way the entire revision set is currently compressed would. The question I haven't delved into is how difficult that would be, and how computation-intensive compared to what is currently done.
Maybe download a dump and try with xdelta?
On Mon, 12 Sep 2005 22:51:28 +0200, Tomasz Wegrzanowski wrote:
Maybe download a dump and try with xdelta?
Looks promising. The only drawback I can see is that it stores an MD5 sum, which for very small changes can make it less space-efficient than an ordinary RCS-format diff, and which is simply unnecessary for MediaWiki. I'll see if there's a way to disable the MD5 sum; perhaps the source will need to be hacked.
Now I have a David and Goliath problem... 56k dial-up vs a 31 GB XML download. Can anyone suggest a source for a smaller data set in English with some representative multiple-revision articles, preferably with a few edit wars?
Modified repost: the first message apparently reached the moderator (I received an email), but it never appeared on gmane, nor did I receive a rejection email.
On Mon, 12 Sep 2005 22:51:28 +0200, Tomasz Wegrzanowski wrote:
On Tue, Sep 13, 2005 at 12:24:27AM +1000, Netocrat wrote:
I have ideas; I may do that at some point. My instinct is that a diff scheme that took account of intra-line changes as well as moved blocks of text would give the most compression, and that if that alone didn't, then compressing those diffs in the way the entire revision set is currently compressed would. The question I haven't delved into is how difficult that would be, and how computation-intensive compared to what is currently done.
Maybe download a dump and try with xdelta?
Looks promising. xdelta stores an MD5 sum in the delta, but xdelta3 allows this to be omitted, which for very small changes makes a big difference. There seems to be some other extraneous data included (e.g. the reconstructed filename) that could perhaps be avoided by using the API instead of the command line. Other than that it does more or less what I described, although it produces binary output and compromises optimality for speed.
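For reference, this is roughly how I'm driving it - a thin Python wrapper around the command line. The -e/-s/-d/-f invocations below are the basic documented usage; anything beyond that (e.g. flags to drop the checksum) would need checking against the xdelta3 docs.

import os
import subprocess
import tempfile

def _write_tmp(data):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path

def xdelta_encode(old, new):
    """Return a binary delta turning old into new (both bytes)."""
    old_p, new_p, out_p = _write_tmp(old), _write_tmp(new), _write_tmp(b"")
    subprocess.run(["xdelta3", "-e", "-f", "-s", old_p, new_p, out_p], check=True)
    with open(out_p, "rb") as f:
        delta = f.read()
    for p in (old_p, new_p, out_p):
        os.unlink(p)
    return delta

def xdelta_decode(old, delta):
    """Reconstruct the new revision from old plus the delta."""
    old_p, delta_p, out_p = _write_tmp(old), _write_tmp(delta), _write_tmp(b"")
    subprocess.run(["xdelta3", "-d", "-f", "-s", old_p, delta_p, out_p], check=True)
    with open(out_p, "rb") as f:
        new = f.read()
    for p in (old_p, delta_p, out_p):
        os.unlink(p)
    return new

d = xdelta_encode(b"first version\n", b"first version, expanded\n")
print(len(d), xdelta_decode(b"first version\n", d))

The API would obviously avoid the temporary files and process overhead; this is just the quickest way to get numbers.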
A preliminary test on *very* limited data shows that it may be a fair improvement over gzip for space.
xdelta3 supports internally compressing the diff with a method such as gzip, which again yields improvements but does not seem to be worth the (significant) extra time required - if this were to be done, it should be done on a group of diffs rather than individually. Testing shows that this group approach offers a huge improvement in compression and performance over the one-at-a-time approach. As I expected, it also greatly reduces final space requirements compared to the current gzip-a-set-of-revisions approach.
In terms of speed, xdelta3 slightly outperforms gzip when compressing a single revision. From what I understand (and I may have this wrong), the current approach requires gzip to decompress an entire set of revisions, add the latest revision to the set, and recompress it, whereas xdelta3 need only compress the latest revision, so this would probably yield an overall improvement.
For decompression speed, xdelta3 is again faster in the case of a single revision. Here, though, the fact that gzip works on a whole set of revisions at a time goes in its favour: on average gzip would be faster, since xdelta3 must perform multiple sequential operations to reconstruct an arbitrary revision, whereas gzip requires only one.
The other related issue that Tim Starling raised - accessing all the data in one query - remains open. A stored procedure seems appropriate; I know the MySQL team have proposed them, but I don't know whether they have materialised yet. The other option is, as described above, to gzip a set of xdelta3 diffs; this is the most space-efficient method but also the most computation-intensive (aside from the compress-each-individual-diff approach).
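To illustrate both points with toy code (a made-up line-based delta format built on difflib, standing in for xdelta3): reading revision N means applying every delta in the chain up to N, whereas gzipping the deltas as one group keeps the space advantage of compressing a whole batch.

import difflib
import json
import zlib

def make_delta(old, new):
    # A delta is a list of ops that rebuilds the new revision from the old
    # one: "copy" ranges of old lines, "insert" new or changed lines.
    a, b = old.splitlines(True), new.splitlines(True)
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])
        else:
            ops.append(["insert", b[j1:j2]])
    return ops

def apply_delta(old, ops):
    a, out = old.splitlines(True), []
    for op in ops:
        out.extend(a[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(out)

revisions = ["one\ntwo\n", "one\ntwo\nthree\n", "one\n2\nthree\n", "one\ntwo\nthree\n"]
deltas = [make_delta(a, b) for a, b in zip(revisions, revisions[1:])]

# Random access: reconstructing the latest revision walks the whole chain.
text = revisions[0]
for d in deltas:
    text = apply_delta(text, d)
assert text == revisions[-1]

# Group compression: one blob for a whole set of deltas, loadable in one query.
group_blob = zlib.compress(json.dumps(deltas).encode("utf-8"), 9)
print(len(group_blob))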
So overall there's potential, but I need to work on some representative data to present a proper summary of the pros and cons. Which presents a David and Goliath problem... 56k dial-up vs a 31 GB XML download. Can anyone suggest a source for a smaller data set in English with some representative multiple-revision articles, preferably with a few edit wars?
On 14/09/05, Netocrat netocrat@dodo.com.au wrote:
So overall there's potential, but I need to work on some representative data to present a proper summary of the pros and cons. Which presents a David and Goliath problem... 56k dial-up vs a 31 GB XML download. Can anyone suggest a source for a smaller data set in English with some representative multiple-revision articles, preferably with a few edit wars?
You can use the Special:Export page to export the complete histories of individual pages, so you could pick a few which seemed suitably representative and play with them.
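Something like this will pull a page's export XML from the command line (current revision only; full history is an option on the Special:Export form itself) - a minimal Python sketch, with a made-up User-Agent string:

import urllib.parse
import urllib.request

def export_page(title, site="https://en.wikipedia.org"):
    # Special:Export/<title> returns the page as XML; change the site URL
    # to point at your own wiki.  The User-Agent string here is made up.
    url = site + "/wiki/Special:Export/" + urllib.parse.quote(title.replace(" ", "_"))
    req = urllib.request.Request(url, headers={"User-Agent": "history-size-experiment/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(export_page("Wiki")[:200])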
On Wed, 14 Sep 2005 14:22:42 +0100, Rowan Collins wrote:
You can use the Special:Export page to export the complete histories of individual pages, so you could pick a few which seemed suitably representative and play with them.
The complete history option doesn't have any effect on the articles I tried on Wikipedia, although it works on my server - is it disabled?
Netocrat wrote:
On Wed, 14 Sep 2005 14:22:42 +0100, Rowan Collins wrote:
You can use the Special:Export page to export the complete histories of individual pages, so you could pick a few which seemed suitably representative and play with them.
The complete history option doesn't have any effect on the articles I tried on Wikipedia, although it works on my server - is it disabled?
Yeah, that's temporarily disabled as it tends to die on very long pages and that was causing some problems. Needs some work...
-- brion vibber (brion @ pobox.com)
On Wed, 14 Sep 2005 13:19:49 -0700, Brion Vibber wrote:
Netocrat wrote:
On Wed, 14 Sep 2005 14:22:42 +0100, Rowan Collins wrote:
You can use the Special:Export page to export the complete histories of individual pages, so you could pick a few which seemed suitably representative and play with them.
The complete history option doesn't have any effect on the articles I tried on Wikipedia, although it works on my server - is it disabled?
Yeah, that's temporarily disabled as it tends to die on very long pages and that was causing some problems. Needs some work...
OK, is there any other way to access a few revision histories, short of manually editing and copy-and-pasting each one?
Netocrat wrote:
On Wed, 14 Sep 2005 13:19:49 -0700, Brion Vibber wrote:
Netocrat wrote:
On Wed, 14 Sep 2005 14:22:42 +0100, Rowan Collins wrote:
You can use the Special:Export page to export the complete histories of individual pages, so you could pick a few which seemed suitably representative and play with them.
The complete history option doesn't have any effect on the articles I tried on Wikipedia, although it works on my server - is it disabled?
Yeah, that's temporarily disabled as it tends to die on very long pages and that was causing some problems. Needs some work...
OK, is there any other way to access a few revision histories, short of manually editing and copy-and-pasting each one?
You can have my test set if you like. I put it at
http://wikimedia.org/junk/history_expt.tar.bz2
It's 7.5 MB. I used that test set for the proof of principle detailed at
http://meta.wikimedia.org/wiki/History_compression
Once I had a working implementation within MediaWiki, I downloaded the full dump for ca.wikipedia.org, which was about 40 MB. I had a 56K modem at the time too, so that's no excuse. It's no big deal that it's not in English; you can still tell if it's working.
-- Tim Starling