Tim Starling:
The nice thing about the revision table is that it's small enough to be
altered quickly. I think you can expect to see a VARCHAR(255) field
containing something similar to a URI.
Is such a field really needed? The software could assemble a URI
from existing information, e.g. http://$server/$project/$rev_id. This
would also be more flexible: if data needs to be moved to another
server, there is no need to update every row in the revision table.
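
A minimal sketch of that idea, in Python for brevity rather than PHP.
The function name and the example values are hypothetical; only the
http://$server/$project/$rev_id pattern comes from the paragraph above.

```python
def build_revision_uri(server: str, project: str, rev_id: int) -> str:
    """Assemble a revision URI from values the software already knows.

    Nothing is stored per row, so moving the data to another server
    only means changing the `server` value passed in here.
    """
    return f"http://{server}/{project}/{rev_id}"


# Hypothetical example values, not real server or project names.
uri = build_revision_uri("storage1.example.org", "enwiki", 12345)
```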
But I don't want to give unsolicited advice here. The background of my
question is that I have written a Perl program that compresses page
histories much better than the algorithm currently in use. Now I want
to write PHP code so that MediaWiki can access the data, but
HistoryBlobStubs make this more complicated.
This is how my method works: All revision texts are split into sections
(the delimiter is "\n=="). Unchanged sections are stored only once.
Sections are sorted by their headings. Then everything is compressed
with deflate().
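
The steps above can be sketched roughly as follows. This is not the
Perl program itself, only an illustration in Python under several
assumptions: all function names are made up, the container format (a
JSON object holding the unique sections plus one index list per
revision) is invented for the example, and zlib.compress() writes a
zlib-wrapped deflate stream rather than raw deflate.

```python
import json
import zlib

DELIMITER = "\n=="


def split_sections(text):
    """Split at every "\n==" so that joining the pieces restores the text."""
    parts = text.split(DELIMITER)
    # Re-attach the delimiter to every piece except the first one.
    return [parts[0]] + [DELIMITER + p for p in parts[1:]]


def heading(section):
    """Sort key: the first line of a section (its heading line)."""
    return section.lstrip("\n").split("\n", 1)[0]


def pack_history(revisions):
    """Store each distinct section once, sort by heading, then deflate."""
    index_of = {}    # section text -> position in `unique`
    unique = []      # distinct sections in first-seen order
    manifests = []   # per revision: list of section indices
    for text in revisions:
        manifest = []
        for section in split_sections(text):
            if section not in index_of:
                index_of[section] = len(unique)
                unique.append(section)
            manifest.append(index_of[section])
        manifests.append(manifest)

    # Sorting by heading puts versions of the same section next to each
    # other, which should help the compressor find the repetition.
    order = sorted(range(len(unique)), key=lambda i: (heading(unique[i]), i))
    new_index = {old: new for new, old in enumerate(order)}
    payload = {
        "sections": [unique[i] for i in order],
        "revisions": [[new_index[i] for i in m] for m in manifests],
    }
    return zlib.compress(json.dumps(payload).encode("utf-8"), 9)


def unpack_history(blob):
    """Inverse of pack_history(): rebuild every revision text."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    sections = payload["sections"]
    return ["".join(sections[i] for i in m) for m in payload["revisions"]]
```

In this sketch a revision is rebuilt by concatenating its sections in
the recorded order, so unpack_history(pack_history(revs)) returns the
original list of texts.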
This has several advantages:
* Less memory is used because many sections don't change from revision
to revision. This is especially true for discussion pages.
* Compression and decompression are fast because deflate() is used and
there is less data to (de)compress.
* If the pages are larger than 32kB, compression is much better than
can be achieved by simply concatenating revisions and compressing them
with deflate() or even bzip2. On average, my method compresses about
five times better than the one currently in use. I have listed some
figures at
http://meta.wikipedia.org/wiki/User:El/History_compression
Now that I've read that the revision texts are about to be relocated
to external storage servers, I wonder whether it is worth writing
the PHP classes at all. What software is going to run on the storage
servers? Is my method still competitive there?