On 7/26/06, Domas Mituzas <midom.lists(a)gmail.com> wrote:
[snip]
> With all
> revision pages its around 3 TB total.
> That really requires advanced tech. At Wikipedia revision pages are
> compressed, and a proper compression run contracts whole dataset into
> 0.5T or so (or less).
A minor nit... with braindead stupid compression (toasted columns in
PGsql, which use a modified LZ algorithm that gets less compression
than gzip -3 but is much faster and compresses a single row at a time)
you can get the whole of English Wikipedia into 0.4 TB, including the
needed indexes and the (not insubstantial) DB overhead.
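(Purely for illustration, here's a toy Python sketch of why per-row
compression gives up so much ground to a single pass over the whole
stream; zlib at level 3 just stands in for the TOAST LZ variant and
gzip -3, and the fake rows stand in for revision text, so the numbers
are only indicative.)

import zlib

# Hypothetical sample data: lots of short, highly similar "rows",
# much like successive revisions of the same article.
rows = [("Revision %d of an article about compression." % i).encode()
        for i in range(1000)]

# Per-row compression (the TOAST-style approach): each row is
# compressed on its own, so redundancy *between* rows is never seen.
per_row = sum(len(zlib.compress(r, 3)) for r in rows)

# Whole-stream compression: one pass over the concatenated rows can
# reuse matches across rows and does far better.
whole = len(zlib.compress(b"".join(rows), 3))

print("per-row total:", per_row, "bytes; whole stream:", whole, "bytes")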
With state-of-the-art compression (LZMA) you can get all the revisions
into 6 GB, but you lose random access. At the Wikimania tech days I'll
be presenting a system which achieves similar compression performance
but preserves random access... which is at least a mildly interesting
subject, although perhaps without practical implications for Wikimedia
until the disk/CPU performance gap widens a bit further. :)
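(Not the system I'll be showing, just a toy Python sketch of the
general block-plus-index idea: compress revisions in fixed-size groups
with LZMA and remember where each compressed block starts, so fetching
one revision only costs decompressing its own block. The block size,
the \x00 separator and all the names here are made up for the example.)

import lzma

def build_block_archive(revisions, block_size=64):
    # Compress revisions in groups of block_size and record where each
    # compressed block starts, so a single revision can be fetched
    # later without decompressing the whole archive.
    blobs, offsets, pos = [], [], 0
    for start in range(0, len(revisions), block_size):
        blob = lzma.compress(b"\x00".join(revisions[start:start + block_size]))
        offsets.append(pos)
        pos += len(blob)
        blobs.append(blob)
    return b"".join(blobs), offsets

def read_revision(archive, offsets, n, block_size=64):
    # Random access: decompress only the block that holds revision n.
    b = n // block_size
    end = offsets[b + 1] if b + 1 < len(offsets) else len(archive)
    block = lzma.decompress(archive[offsets[b]:end])
    return block.split(b"\x00")[n % block_size]

# Toy data; real revision text must not contain the \x00 separator
# (a length-prefixed layout would lift that restriction).
revisions = [("revision %d: some article text" % i).encode() for i in range(500)]
archive, offsets = build_block_archive(revisions)
assert read_revision(archive, offsets, 123) == revisions[123]

The tradeoff is the obvious one: bigger blocks compress better but
cost more CPU per random read, which is exactly the disk/CPU gap
question above.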