"Anthony" wikimail@inbox.org wrote in message news:AANLkTi=UK+UF3y_B+ZLd57WCfUEF_7rf-Bt8TNvtg+2f@mail.gmail.com...
No, that's not the question. The question is why are you uncompressing and undiffing (from DiffHistoryBlobs) only to recompress (to bz2) and then uncompress and recompress (to 7z) when you can get roughly the same compression by just extracting the blobs and removing any non-public data.
That's probably not nearly as straightforward as it sounds. RevDel'd and suppressed revisions are not removed from the text storage. Even Oversighted revisions are left there; only the entry in the revision table is removed or altered. I don't know OTTOMH how regularly the DiffHistoryBlob system stores a 'key frame', or how easy it would be to break diff chains in order to snip non-public data out of them, but I'd guess a) not very, and b) that the current code gives no consideration to doing so, because it has never had a reason to. So refactoring it to incorporate that, while not impossible, is a non-trivial amount of work.
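To illustrate the general problem, here is a toy sketch in Python. It is NOT the real DiffHistoryBlob format, which I haven't checked; every name in it (make_delta, encode_chain, drop_revision and so on) is made up for the example. It assumes a chain that stores the first revision whole as the 'key frame' and each later revision as a delta against its predecessor.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Encode `new` as copy/insert operations against `old` (toy format)."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # literal new text
        # "delete" needs no op: the skipped text is simply never copied
    return ops


def apply_delta(old, delta):
    """Rebuild a revision from its predecessor and a delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def encode_chain(revisions):
    """Store revision 0 whole (the key frame), the rest as deltas."""
    key = revisions[0]
    deltas = [make_delta(a, b) for a, b in zip(revisions, revisions[1:])]
    return key, deltas


def decode_chain(key, deltas):
    """Replay the deltas in order to recover every revision."""
    texts = [key]
    for delta in deltas:
        texts.append(apply_delta(texts[-1], delta))
    return texts


def drop_revision(key, deltas, index):
    """Remove one revision: decode everything, delete it, re-encode."""
    texts = decode_chain(key, deltas)
    del texts[index]
    return encode_chain(texts)


revisions = ["hello world", "hello SECRET world", "hello world again"]
key, deltas = encode_chain(revisions)
clean_key, clean_deltas = drop_revision(key, deltas, 1)
```

In this model the delta for the third revision is expressed in terms of the second, so simply deleting the second revision's delta leaves the third pointing at offsets in text that no longer exists. Removing one revision cleanly means decoding the chain and re-encoding what follows it, which is exactly the uncompress-and-recompress work the question hoped to avoid.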
And there are lots of lower-priority things that are being done. And lots of dollars sitting on the sidelines doing nothing.
Low-priority interesting things tend to get done when you have volunteers doing them. While the value of some of the Foundation's expenditure is commonly debated, I think you'd struggle to argue that many of the WMF's dollars are not doing *anything*.
--HM