Hi!
This is relevant for the mediawiki-l@ audience, but not for wikimedia-
tech@ (when it comes to Wikimedia technology, we don't rely on default
settings)
> Maybe you could explain how the storage class renders his idea
> irrelevant?
Tim can probably explain this much better, but 'text' just provides
pointers into a "storage cloud", which can be whatever you want
(different ES implementations can do different things).
It can point to sub-entries in bigger blobs, and supports two methods:
a) DiffHistoryBlob - differential storage with compression applied,
plus some adjustments for page blankings, etc.
b) ConcatenatedGzipHistoryBlob - plain concatenation of revisions,
with compression on top
Both already guard against not just identical but also similar text in
subsequent revisions.
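To make the concatenation idea concrete, here's a toy Python sketch (the real ConcatenatedGzipHistoryBlob lives in MediaWiki's PHP; the class and method names here are made up for illustration):

```python
import zlib

# Toy model of ConcatenatedGzipHistoryBlob-style storage: revisions are
# concatenated and compressed as one blob; the text table would only
# hold a pointer (here, a list index). Illustrative only.
class ConcatBlob:
    def __init__(self):
        self.items = []  # uncompressed revision texts

    def add(self, text):
        self.items.append(text)
        return len(self.items) - 1  # the "pointer" a text row would store

    def get(self, idx):
        return self.items[idx]

    def compressed_size(self):
        return len(zlib.compress("\x00".join(self.items).encode()))

blob = ConcatBlob()
base = "Lorem ipsum dolor sit amet. " * 50
p1 = blob.add(base)
p2 = blob.add(base + "One small edit.")  # near-duplicate revision
raw = sum(len(t) for t in blob.items)
print(raw, blob.compressed_size())
```

Because subsequent revisions are usually near-duplicates, compressing them together shrinks them to a small fraction of their raw size - that is the "guard against similar text" in practice.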
There are some other optimizations we could do (optimized packing of
pointers/flags in the text table), but keep in mind that every time
you edit a page:
~180 bytes are added to the revision table (plus ~200 bytes of index
overhead)
~300 bytes are added to recentchanges (plus ~400 bytes of index
overhead)
~370 bytes are added to cu_changes (plus ~300 bytes of index overhead;
these two tables are ring buffers, though)
text gets only ~85 bytes with no additional indexing (and even that
figure was skewed by a few cases where we wrote directly to it)
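Adding those figures up (rough arithmetic on the approximate numbers above):

```python
# Rough per-edit byte accounting using the figures above (approximate).
per_edit = {
    "revision":      180 + 200,  # row bytes + index overhead
    "recentchanges": 300 + 400,
    "cu_changes":    370 + 300,
    "text":          85,         # pointer row, no extra indexing
}
total = sum(per_edit.values())
share = per_edit["text"] / total
print(total, round(share, 3))  # text is under 5% of per-edit growth
```

So the text pointer row is a small slice of what an edit actually costs the core database.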
Even if it were possible to reduce the number of pointers in text by
reusing them (multiple revisions can point at the same text entry, as
was already noted), it would make maintenance/batch operations much
more complicated.
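For illustration, that pointer reuse could look roughly like this (a hypothetical sketch, not actual MediaWiki code; all table and function names are invented):

```python
import hashlib

# Hypothetical pointer reuse: key text rows by content hash, so a
# revert reuses an existing row instead of writing a new one.
text_rows = {}   # text_id -> content
by_hash = {}     # sha1 -> text_id
revisions = {}   # rev_id -> text_id

def save_revision(rev_id, content):
    h = hashlib.sha1(content.encode()).hexdigest()
    if h not in by_hash:                # unseen content: new text row
        by_hash[h] = len(text_rows) + 1
        text_rows[by_hash[h]] = content
    revisions[rev_id] = by_hash[h]      # otherwise reuse the pointer

save_revision(1, "Some article text.")
save_revision(2, "Vandalized!")
save_revision(3, "Some article text.")  # a revert: no new text row
print(len(text_rows))  # 2
```

The catch: a text row can no longer be dropped, migrated, or transformed without reference-counting every revision that points at it, which is exactly the maintenance complication mentioned above.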
Also, as blobs can get migrated, transformed, etc., it is better to do
that in a separate table, without touching the bigger 'revision'
monster in the long run.
Also, if one wanted to know 'which revision does this text belong
to', another index would have to be added to revision, which is
unnecessary with our one-direction join approach. There are lots and
lots of things you really don't want to do for a 1/7 storage cut. If
we were always chasing storage cuts, MediaWiki would not be able to
do what it can do now.
I am not against efficiency overall, but there are always tradeoffs.
Anyway, here is a more visual representation of our data sizes within
the core databases:
http://spreadsheets.google.com/pub?key=pfjIQrTbpVkaIStok1hWAdg
--
Domas Mituzas --
http://dammit.lt/ -- [[user:midom]]
P.S. I just came back from Berlin, which brought back all the memories
of the 2004 presentation Tim and Brion gave there on how we treat text
storage - that was the pre-ES era ;-)
P.P.S. Not only were we sitting in the same c-base, where some of the
original parties happened back then (it was the first MediaWiki dev
meetup), but we also went to the same falafel place at 3AM (though
last time I remember going there at 5AM :-)