On Wed, Jul 29, 2015 at 3:48 PM, Kent/wp mirror <wpmirrordev(a)gmail.com>
wrote:
> When I build a mirror, I would like to compress the
> <text ...>plaintext</text> to get:
> old_text: ciphertext
> old_flags: utf-8,gzip
> I would like this done for every text revision, so as to save both disk
> space...
Maybe
https://www.mediawiki.org/wiki/Manual:Reduce_size_of_the_database
will help. maintenance/storage/compressOld.php will compress older
revisions, optionally using gzip, and you can set the parameters to
compress every revision.
Did you set $wgCompressRevisions in your installation before importing? I'm
not sure if that has effect when building a mirror. It feels like it
should, and/or importDump.php should have some option to compress all
revisions imported; you could file a bug in Phabricator.
> ... and communication bandwidth between web server and browser.
If I understand you correctly, that's a separate issue. MediaWiki doesn't
send compressed page data to the browser, it sends HTML. However, most
browsers send the
Accept-Encoding: gzip, deflate
HTTP header, and in response most web servers will gzip the HTML of
MediaWiki pages and other web content. To verify, load a page from your
wiki in your browser and look in your web browser's developer tools'
Network tab for the request and response headers; the latter will probably
have
Content-Encoding: gzip
Or you could run something like
`curl -H 'Accept-Encoding: gzip, deflate' --dump-header - http://localhost/wiki/Main_Page | less`
and see what you get.
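The same header check can be scripted. Below is a minimal Python sketch, assuming the wiki is reachable over plain HTTP at the URL used above. `response_is_gzipped` and `fetch_headers` are hypothetical helper names for illustration and are not part of MediaWiki.

```python
import urllib.request


def response_is_gzipped(headers):
    """Return True if the Content-Encoding header value lists gzip."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    encodings = lowered.get("content-encoding", "")
    return "gzip" in [token.strip().lower() for token in encodings.split(",")]


def fetch_headers(url):
    """Request url while advertising gzip support; return the response headers."""
    request = urllib.request.Request(
        url, headers={"Accept-Encoding": "gzip, deflate"}
    )
    with urllib.request.urlopen(request) as response:
        return dict(response.headers.items())


# Example call (assumed URL, adjust to wherever your mirror is served):
#   response_is_gzipped(fetch_headers("http://localhost/wiki/Main_Page"))
```

If this returns False, the web server is probably not configured to compress responses. That is a web server setting and not something old_text compression affects.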
> 2) Problem
> There is little relevant documentation on <https://www.mediawiki.org>. So I
> have run a few experiments.
> exp1) I pipe the plaintext through gzip, escape for MySQL, and build the
> mirror.
I wouldn't try to do this yourself. If import with $wgCompressRevisions =
true doesn't do what you want and you don't want to run a compressOld.php
maintenance step afterwards, I would suggest modifying some PHP somewhere
solely during the import to your mirror to get MediaWiki to compress every
revision.
> Please provide documentation as to how mediawiki handles compressed
> old_text.
> a) How is plaintext compressed?
From looking at core/includes/Revision.php, if PHP's gzdeflate() exists
then MediaWiki will use it to compress the contents of old_text.
http://php.net/manual/en/function.gzdeflate.php has some documentation on
how the function works.
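One detail that matters for experiments like exp1: gzdeflate() emits a raw DEFLATE stream with no gzip or zlib header, so piping text through the gzip command-line tool does not produce the same bytes. Below is a rough Python sketch of an equivalent round trip, assuming that Python's zlib with a negative window-bits value reads and writes the same raw format as gzdeflate()/gzinflate(). The function names are made up for illustration.

```python
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream (no gzip or zlib wrapper)."""
    # A negative wbits value tells zlib to omit the header and checksum.
    compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()


def raw_inflate(blob: bytes) -> bytes:
    """Decompress a raw DEFLATE stream back to the original bytes."""
    return zlib.decompress(blob, -zlib.MAX_WBITS)
```

A blob produced this way would still need to be checked against a real MediaWiki install before it is trusted for building a mirror.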
> b) Is the ciphertext escaped for MySQL after compression?
No idea. old_text is a mediumblob storing binary data; as I understand it,
escaping applies only to transfer in and out of the DB.
> c) How does mediawiki handle old_flags=utf-8,gzip?
> d) How are the contents of old_text unescaped and decompressed for
> rendering?
> e) Where in the mediawiki code should I be looking to understand this
> better?
As above: PHP's gzdeflate()/gzinflate(), called from
Revision::compressRevisionText() and Revision::decompressRevisionText() in
core/includes/Revision.php.
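For (c) and (d), the decompression side appears to split old_flags on commas and inflate the text only when gzip is among the flags. Here is a simplified Python sketch of that logic. It is not a port of the PHP, the function name is hypothetical, any other flags MediaWiki may use are ignored, and the final UTF-8 decode is an assumption about what the utf-8 flag implies.

```python
import zlib


def decode_old_text(old_text: bytes, old_flags: str) -> str:
    """Turn a stored old_text blob into a string, guided by old_flags."""
    flags = [flag.strip() for flag in old_flags.split(",") if flag.strip()]
    if "gzip" in flags:
        # Despite the flag name, assume a raw DEFLATE stream (gzdeflate output).
        old_text = zlib.decompress(old_text, -zlib.MAX_WBITS)
    # Assumption: the text is UTF-8; legacy encodings are not handled here.
    return old_text.decode("utf-8")
```

No unescaping step appears because escaping only applies to the SQL statement on its way into the database. The blob read back out is already the raw compressed bytes.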
Hope this helps. I didn't know anything about this 25 minutes ago :)
--
=S Page WMF Tech writer