Tim Starling:
The nice thing about the revision table is that it's small enough to be altered quickly. I think you can expect to see a VARCHAR(255) field containing something similar to a URI.
Is such a field really needed? The software could assemble a URI from existing information, like http://$server/$project/$rev_id. This would also be more flexible: if data needs to be transferred to another server, there is no need to update all rows in the revision table.
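To illustrate what I mean, here is a minimal sketch in Python (not MediaWiki code). The function name, the config dictionary and the host names are made up for the example. The point is only that the server lives in configuration, so moving data means changing one config entry and no rows.

```python
# Hypothetical mapping from project name to storage server (an assumption,
# not an existing MediaWiki setting).
STORAGE_SERVERS = {
    "enwiki": "storage1.example.org",
    "dewiki": "storage2.example.org",
}


def revision_uri(project, rev_id, servers=STORAGE_SERVERS):
    """Assemble http://$server/$project/$rev_id from data we already have."""
    server = servers[project]
    return f"http://{server}/{project}/{rev_id}"


# Relocating a project only changes the mapping, never the revision table.
relocated = dict(STORAGE_SERVERS, enwiki="storage9.example.org")
old_uri = revision_uri("enwiki", 12345)
new_uri = revision_uri("enwiki", 12345, relocated)
```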
But I don't want to give unwanted advice here. The background of my question is that I have written a Perl program that compresses page histories much better than the currently used algorithm. And now I want to write PHP code so that MediaWiki can access the data. But HistoryBlobStubs make this more complicated.
This is how my method works: All revision texts are split into sections (the delimiter is "\n=="). Unchanged sections are stored only once. Sections are sorted by their headings. Then everything is compressed with deflate().
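The steps above can be sketched roughly as follows. This is a Python illustration, not my Perl program: the function names and the JSON container that records which sections make up each revision are assumptions made for the example. Note that zlib.compress() produces a deflate stream with a small zlib header and checksum around it.

```python
import json
import zlib

DELIMITER = "\n=="


def split_sections(text):
    """Split a revision at "\n=="; every section but the first keeps the delimiter."""
    parts = text.split(DELIMITER)
    return [parts[0]] + [DELIMITER + p for p in parts[1:]]


def heading_of(section):
    """Return the heading line of a section ("" for the lead section)."""
    if not section.startswith(DELIMITER):
        return ""
    return section[len(DELIMITER):].split("\n", 1)[0]


def pack_history(revisions):
    """Store each distinct section once, sort by heading, then deflate everything."""
    unique = {}
    for text in revisions:
        for section in split_sections(text):
            unique.setdefault(section, None)
    ordered = sorted(unique, key=lambda s: (heading_of(s), s))
    index = {section: i for i, section in enumerate(ordered)}
    # For each revision, remember which stored sections it consists of.
    layout = [[index[s] for s in split_sections(text)] for text in revisions]
    payload = json.dumps({"sections": ordered, "revisions": layout})
    return zlib.compress(payload.encode("utf-8"), 9)


def unpack_revision(blob, n):
    """Rebuild revision number n (0-based) from a packed history."""
    data = json.loads(zlib.decompress(blob).decode("utf-8"))
    return "".join(data["sections"][i] for i in data["revisions"][n])
```

For example, if two revisions differ in only one section, pack_history() stores the unchanged sections once, and unpack_revision(blob, 1) returns the second revision text exactly as it was.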
This has several advantages:
* Less memory is used because many sections don't change from revision to revision. This is especially true for discussion pages.
* Compression and decompression are fast because deflate() is used and there is less data to (de)compress.
* If the pages are larger than 32kB, compression is much better than can be achieved by simply concatenating revisions and compressing them with deflate() or even bzip2.
On average, my method compresses about five times better than the currently used one. I have listed some values on http://meta.wikipedia.org/wiki/User:El/History_compression
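The 32kB point follows from deflate's window, which can only refer back 32kB. A small Python experiment shows the effect; this is an illustration, not my measurement. Random bytes stand in for a 40kB page, which exaggerates the effect compared to real wiki text, and the function name is made up.

```python
import random
import zlib


def compare_sizes(page_size=40_000, seed=0):
    """Compress one 'page' and two identical copies concatenated together."""
    rng = random.Random(seed)
    page = bytes(rng.getrandbits(8) for _ in range(page_size))
    stored_once = len(zlib.compress(page, 9))
    # The second copy starts more than 32kB after the first one, so deflate
    # cannot point back at it and has to encode the same data again.
    concatenated = len(zlib.compress(page + page, 9))
    return stored_once, concatenated


stored_once, concatenated = compare_sizes()
print(stored_once, concatenated)
```

The concatenated stream comes out at roughly twice the size of the single page even though the second half is an exact repeat, whereas storing the unchanged data once costs nothing extra.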
Now that I've read that the revision texts are about to be relocated to external storage servers, I wonder whether it would be worth writing the PHP classes at all. What software is going to run on the storage servers? Is my method still competitive there?