Hi!
You didn't address his idea one iota. Isn't this the relevant doc? http://upload.wikimedia.org/wikipedia/commons/4/41/Mediawiki-database-schema...
It is relevant for the mediawiki-l@ audience, not for wikimedia-tech@ (when it comes to Wikimedia technology, we don't rely on the default settings).
Maybe you could explain how the storage class renders his idea irrelevant?
Tim can probably explain this much better, but 'text' just provides pointers into a "storage cloud", which can be whatever you want (different ES implementations can do different things).
It can point to sub-entries in bigger blobs, and supports two methods:
a) DiffHistoryBlob - differential storage with compression on top, plus some adjustments for page blankings, etc.
b) ConcatenatedGzipHistoryBlob - just plain concatenation of revisions, with compression on top
Both already deal efficiently not only with identical but also with similar text in subsequent revisions.
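To make the concatenation-plus-compression point concrete, here is a rough Python sketch (illustrative only, not MediaWiki's actual code) of why near-duplicate revisions cost almost nothing once they share a compressed blob:

```python
import zlib

# Three near-identical revisions of a page; only one trailing line differs.
revisions = [
    "== Intro ==\nSome article text that stays the same.\n" * 20
    + f"Revision note {i}\n"
    for i in range(3)
]

# Compressing each revision separately vs. concatenating first
# (roughly the ConcatenatedGzipHistoryBlob approach).
separate = sum(len(zlib.compress(r.encode())) for r in revisions)
together = len(zlib.compress("".join(revisions).encode()))

print(separate, together)
# The text shared between revisions compresses away almost entirely,
# so the single concatenated blob is far smaller than the sum of parts.
assert together < separate
```

The same effect is why storing each revision's full text separately, even compressed, wastes space compared to blob-level storage.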
There are some other optimizations we could do (optimized packing of pointers/flags in the text table), but keep in mind that every time you edit a page:
~180 bytes are added to the revision table (plus another 200 bytes of indexes)
~300 bytes are added to recentchanges (plus another 400 bytes of indexes)
~370 bytes are added to cu_changes (300 bytes of indexes; these two tables are ring buffers, though)
text gets just 85 bytes with no additional indexing (and even that figure was skewed by a few cases where we wrote directly to it)
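Adding those figures up shows how small text's share of per-edit growth really is; a quick back-of-the-envelope in Python, using the numbers quoted above:

```python
# Approximate per-edit growth in bytes: (row data, index data),
# taken from the figures above.
per_edit = {
    "revision":      (180, 200),
    "recentchanges": (300, 400),
    "cu_changes":    (370, 300),  # ring buffer, so not permanent growth
    "text":          (85,    0),
}

total = sum(row + idx for row, idx in per_edit.values())
text_share = sum(per_edit["text"]) / total

print(total)                  # 1835 bytes per edit overall
print(round(text_share, 3))   # 0.046 -- text is under 5% of the growth
```

So even eliminating text's pointer rows entirely would barely dent the per-edit footprint.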
Even if it were possible to reduce the number of pointers in text by reusing them (one can point multiple revisions at the same text entry, as was already noted), it would make maintenance/batch operations much more complicated. Also, as blobs can get migrated, transformed, etc., it is better to do that in a separate table, without touching the bigger 'revision' monster in the long run.
Also, if one wanted to know 'which revision does this text belong to', another index would have to be added to revision, which isn't needed with our one-direction join approach. There are lots and lots of things you really don't want to do for a 1/7 storage cut. If we had always put storage cuts first, MediaWiki would not be able to do what it can do now.
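For completeness, the proposed pointer reuse amounts to deduplicating text rows by content hash, roughly like this (a sketch of the idea only, with made-up table structures, not anything MediaWiki actually does):

```python
import hashlib

text_table = {}       # text_id -> content
text_by_hash = {}     # sha1 hex digest -> text_id
revision_table = {}   # rev_id -> text_id

def save_revision(rev_id, content):
    """Point identical revisions at the same text row instead of adding a new one."""
    digest = hashlib.sha1(content.encode()).hexdigest()
    text_id = text_by_hash.get(digest)
    if text_id is None:
        text_id = len(text_table) + 1
        text_table[text_id] = content
        text_by_hash[digest] = text_id
    revision_table[rev_id] = text_id

save_revision(1, "same text")
save_revision(2, "same text")   # e.g. a revert: reuses the existing text row
save_revision(3, "new text")

print(len(text_table))  # 2 text rows for 3 revisions
# ...but answering "which revisions use this text row?" now needs a
# reverse index on revision, which the one-direction join avoids.
```

The comments show exactly where the maintenance cost comes in: once rows are shared, batch migrations of blobs and any text-to-revision lookups get harder.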
I am not against efficiency overall, but there are always tradeoffs.
Anyway, here's a somewhat more visual representation of our data sizes within the core databases: http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg