On Wed, Jul 29, 2015 at 3:48 PM, Kent/wp mirror wpmirrordev@gmail.com wrote:
> When I build a mirror, I would like to compress the <text ...>plaintext</text> to get:
>
>   old_text:  ciphertext
>   old_flags: utf-8,gzip
>
> I would like this done for every text revision, so as to save both disk space...
Maybe https://www.mediawiki.org/wiki/Manual:Reduce_size_of_the_database will help. maintenance/storage/compressOld.php will compress older revisions, optionally using gzip, and you can set the parameters to compress every revision.
Did you set $wgCompressRevisions in your installation before importing? I'm not sure whether it takes effect when building a mirror. It feels like it should, and/or importDump.php should have an option to compress all imported revisions; you could file a bug in Phabricator.
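For what it's worth, the setting itself is one line. A minimal sketch (assuming a stock LocalSettings.php; the compressOld.php option names below are from memory, so check the script's --help on your version):

    <?php
    // LocalSettings.php — compress the text of revisions saved from now on.
    // Only applies if PHP was built with zlib, i.e. gzdeflate() exists.
    $wgCompressRevisions = true;

and afterwards something like `php maintenance/storage/compressOld.php -t gzip` for the revisions that are already in the text table.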
> and communication bandwidth between web server and browser.
If I understand you correctly, that's a separate issue. MediaWiki doesn't send compressed page data to the browser; it sends HTML. However, most browsers send the Accept-Encoding: gzip, deflate HTTP header, and in response most web servers will gzip the HTML of MediaWiki pages and other web content.

To verify, load a page from your wiki and look in your browser's developer tools' Network tab for the request and response headers; the latter will probably include Content-Encoding: gzip. Or you could run something like `curl -H 'Accept-Encoding: gzip, deflate' --dump-header - http://localhost/wiki/Main_Page | less` and see what you get.
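Roughly, the exchange looks like this (illustrative values, trimmed to the relevant headers):

    GET /wiki/Main_Page HTTP/1.1
    Host: localhost
    Accept-Encoding: gzip, deflate

    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    Content-Encoding: gzip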
> 2) Problem
>
> There is little relevant documentation on https://www.mediawiki.org. So I have run a few experiments.
>
> exp1) I pipe the plaintext through gzip, escape for MySQL, and build the mirror.
I wouldn't try to do this yourself. If importing with $wgCompressRevisions = true doesn't do what you want and you don't want to run a compressOld.php maintenance step afterwards, I would suggest modifying some PHP somewhere, solely during the import to your mirror, to encourage MediaWiki to compress every revision.
> Please provide documentation as to how mediawiki handles compressed old_text.
>
> a) How is plaintext compressed?
From looking at core/includes/Revision.php: if $wgCompressRevisions is set and PHP's gzdeflate() exists, MediaWiki will use it to compress the contents of old_text. http://php.net/manual/en/function.gzdeflate.php has some documentation on how the function works.
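Paraphrasing the relevant bit (simplified, not the verbatim core code):

    <?php
    // Sketch of Revision::compressRevisionText(): $text is passed by
    // reference and replaced with the compressed bytes; the return
    // value becomes old_flags.
    function compressRevisionTextSketch( &$text ) {
        $flags = [ 'utf-8' ];            // revision text is stored as UTF-8
        // In core this branch is also guarded by $wgCompressRevisions:
        if ( function_exists( 'gzdeflate' ) ) {
            $text = gzdeflate( $text );  // raw DEFLATE, despite the gz prefix
            $flags[] = 'gzip';           // ...but the flag is still called "gzip"
        }
        return implode( ',', $flags );   // becomes old_flags, e.g. "utf-8,gzip"
    }

So the answer to a) seems to be: with gzdeflate(), i.e. a raw DEFLATE stream rather than a full gzip container, even though the flag says gzip.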
> b) Is the ciphertext escaped for MySQL after compression?
No idea; old_text is a mediumblob storing binary data. As I understand it, escaping applies only to the transfer in and out of the DB, not to what is stored.
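To illustrate the point (plain PDO here, not MediaWiki's own Database layer, and the connection details are made up): with a parameterised query you never escape the compressed bytes by hand; the driver handles them as binary on the way in.

    <?php
    // Hypothetical direct insert into MediaWiki's text table.
    $plaintext = '== Some wikitext ==';
    $pdo = new PDO( 'mysql:host=localhost;dbname=wikidb', 'wikiuser', 'secret' );
    $stmt = $pdo->prepare( 'INSERT INTO text (old_text, old_flags) VALUES (?, ?)' );
    $stmt->execute( [ gzdeflate( $plaintext ), 'utf-8,gzip' ] );

What lands in old_text is the raw compressed bytes, unescaped.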
> c) How does mediawiki handle old_flags=utf-8,gzip?
>
> d) How are the contents of old_text unescaped and decompressed for rendering?
>
> e) Where in the mediawiki code should I be looking to understand this better?
As above: PHP's gzdeflate()/gzinflate(), in Revision::compressRevisionText() and Revision::decompressRevisionText() in core/includes/Revision.php.
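The read path, again paraphrased rather than verbatim:

    <?php
    // Sketch of Revision::decompressRevisionText(): old_flags drives
    // how old_text is decoded.
    function decompressRevisionTextSketch( $text, $flagsCsv ) {
        $flags = explode( ',', $flagsCsv );  // e.g. "utf-8,gzip"
        if ( in_array( 'gzip', $flags ) ) {
            $text = gzinflate( $text );      // undo gzdeflate()
        }
        // 'utf-8' means no legacy-encoding conversion is needed; the real
        // method also knows about an 'object' flag, among other things.
        return $text;
    }

Which I think answers c) and d): old_flags is just a comma-separated list telling MediaWiki which decoding steps to apply, and there is no unescaping step because nothing stored was escaped.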
Hope this helps. I didn't know anything about this 25 minutes ago :)