0) Context
In the XML dump files, I get <text ...>plaintext</text>. When I build a mirror using XML dump files, I get:
old_text: plaintext old_flags: utf-8
However, when I then create a new page on my mirror, I get:
old_text: ciphertext old_flags: utf-8,gzip
1) Objective
When I build a mirror, I would like to compress the <text ...>plaintext</text> to get:
old_text: ciphertext old_flags: utf-8,gzip
I would like this done for every text revision, so as to save both disk space and communication bandwidth between web server and browser.
2) Problem
There is little relevant documentation on https://www.mediawiki.org. So I have run a few experiments.
exp1) I pipe the plaintext through gzip, escape for MySQL, and build the mirror. However, when I browse, I get the message:
``The revision #165770 of the page named "Main Page" does not exist''
When I look in the database, some kind of ciphertext does indeed exist.
3) Variants
Many utilities compress plaintext with LZ77 plus Huffman coding (DEFLATE), but they differ in the headers and trailers they wrap around the compressed stream; a raw deflate stream has no header at all. (A short sketch of these format differences follows the experiments below.) So I try four more experiments:
exp2) gzip, but throw away the 10-byte header (to simulate deflate)
/bin/gzip | tail -c +11
exp3) perl compress
/usr/bin/perl -MCompress::Zlib -e 'undef $/; print compress(<>)'
exp4) python compress, then throw away the single-quotes
/usr/bin/python -c "import zlib,sys;print repr(zlib.compress(sys.stdin.read()))" | /bin/sed 's/^.//; s/.$//'
exp5) zlib-flate from the qpdf DEB package
/usr/bin/zlib-flate -compress
For all experiments, the browser gives the same error message.
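For concreteness, here is a quick Python 3 sketch of the three common containers around the same DEFLATE stream (illustration only, using the standard gzip and zlib modules; this is not MediaWiki code):

# Same payload, three containers: only the headers and trailers differ.
import gzip
import zlib

data = b"plaintext revision text"

as_gzip = gzip.compress(data)   # 10-byte header + DEFLATE + CRC-32/size trailer
as_zlib = zlib.compress(data)   # 2-byte header + DEFLATE + Adler-32 trailer
c = zlib.compressobj(-1, zlib.DEFLATED, -zlib.MAX_WBITS)
as_raw = c.compress(data) + c.flush()   # bare DEFLATE stream, no header or trailer

for name, blob in (("gzip", as_gzip), ("zlib", as_zlib), ("raw deflate", as_raw)):
    print(name, len(blob), blob[:4].hex())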
4) Reading compressed old_text
It should be possible to read the old_text ciphertext using command-line tools. I created a user page, which MediaWiki stored compressed. The browser displays it correctly, but when I tried to read it directly from the database, there were problems.
(shell) mysql --host=localhost --user=root --password simplewiki --skip-column-names --silent --execute 'select old_text from simplewiki.text where old_id=5146705' | zlib-flate -uncompress
Enter password:
flate: inflate: data: incorrect data check
5) Request
Please provide documentation on how MediaWiki handles compressed old_text.
a) How is the plaintext compressed?
b) Is the ciphertext escaped for MySQL after compression?
c) How does MediaWiki handle old_flags=utf-8,gzip?
d) How are the contents of old_text unescaped and decompressed for rendering?
e) Where in the MediaWiki code should I be looking to understand this better?
Sincerely yours, Kent
On Wed, Jul 29, 2015 at 3:48 PM, Kent/wp mirror <wpmirrordev@gmail.com> wrote:
> When I build a mirror, I would like to compress the <text ...>plaintext</text> to get:
> old_text: ciphertext old_flags: utf-8,gzip
> I would like this done for every text revision, so as to save both disk space...
Maybe https://www.mediawiki.org/wiki/Manual:Reduce_size_of_the_database will help. maintenance/storage/compressOld.php will compress older revisions, optionally using gzip, and you can set the parameters to compress every revision.
Did you set $wgCompressRevisions in your installation before importing? I'm not sure if that has an effect when building a mirror. It feels like it should, and/or importDump.php should have some option to compress all revisions as they are imported; you could file a bug in Phabricator.
> and communication bandwidth between web server and browser.
If I understand you correctly, that's a separate issue. MediaWiki doesn't send compressed page data to the browser; it sends HTML. However, most browsers send the Accept-Encoding: gzip, deflate HTTP header, and in response most web servers will gzip the HTML of MediaWiki pages and other web content. To verify, load a page from your wiki and look in your browser's developer tools' Network tab at the request and response headers; the latter will probably include Content-Encoding: gzip. Or you could do something like `curl -H 'Accept-Encoding: gzip, deflate' --dump-header - http://localhost/wiki/Main_Page | less` and see what you get.
> 2) Problem
> There is little relevant documentation on https://www.mediawiki.org. So I have run a few experiments.
> exp1) I pipe the plaintext through gzip, escape for MySQL, and build the mirror.
I wouldn't try to do this yourself. If importing with $wgCompressRevisions = true doesn't do what you want and you don't want to run a compressOld.php maintenance step afterwards, I would suggest modifying some PHP, solely during the import to your mirror, to encourage MediaWiki to compress every revision.
> Please provide documentation on how MediaWiki handles compressed old_text.
> a) How is the plaintext compressed?
From looking at core/includes/Revision.php, if PHP's gzdeflate() exists then MediaWiki will use it to compress the contents of old_text. http://php.net/manual/en/function.gzdeflate.php has some documentation on how the function works.
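gzdeflate() emits a raw DEFLATE stream: no gzip header, no zlib header, no checksum trailer. That is probably why your gzip- and zlib-wrapped experiments fail, since gzinflate() cannot parse those wrappers. If you want to produce equivalent bytes outside PHP, here is a hedged Python 3 sketch (the helper name is mine, not MediaWiki's):

import zlib

def compress_like_gzdeflate(text):
    # Raw DEFLATE stream, as PHP's gzdeflate() produces; negative wbits = no wrapper.
    c = zlib.compressobj(-1, zlib.DEFLATED, -zlib.MAX_WBITS)
    return c.compress(text.encode("utf-8")) + c.flush()

# Bytes suitable for old_text when old_flags is "utf-8,gzip".
blob = compress_like_gzdeflate("plaintext revision text")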
> b) Is the ciphertext escaped for MySQL after compression?
No idea; old_text is a mediumblob storing binary data. As I understand it, escaping applies only to the transfer in and out of the DB.
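In other words, the stored bytes themselves are never transformed for MySQL; escaping (or parameter binding) only protects them on the way into the SQL statement. A sketch with a parameterized insert, assuming the pymysql driver and the table/column names from your mail (the connection details are placeholders, and this ignores the surrounding revision bookkeeping):

import pymysql

# Stand-in for gzdeflate() output; any bytes value behaves the same way here.
blob = bytes(range(256))

conn = pymysql.connect(host="localhost", user="root", password="secret", db="simplewiki")
try:
    with conn.cursor() as cur:
        # The driver escapes the parameter for the wire protocol only;
        # the mediumblob ends up holding exactly the bytes of `blob`.
        cur.execute(
            "INSERT INTO text (old_text, old_flags) VALUES (%s, %s)",
            (blob, "utf-8,gzip"),
        )
    conn.commit()
finally:
    conn.close()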
> c) How does MediaWiki handle old_flags=utf-8,gzip?
> d) How are the contents of old_text unescaped and decompressed for rendering?
> e) Where in the MediaWiki code should I be looking to understand this better?
As above: PHP's gzdeflate()/gzinflate(), used by Revision::compressRevisionText() and Revision::decompressRevisionText() in core/includes/Revision.php.
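And for your section 4: because the stored blob is a raw DEFLATE stream, zlib-flate -uncompress (which, as far as I can tell, expects the zlib wrapper and its Adler-32 check) is the wrong tool, and piping binary old_text through the mysql command-line client can additionally mangle bytes such as newlines and NULs in batch output. A sketch of reading a row back, again assuming pymysql and using the old_id from your mail:

import zlib
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="secret", db="simplewiki")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT old_text, old_flags FROM text WHERE old_id = %s", (5146705,))
        old_text, old_flags = cur.fetchone()
finally:
    conn.close()

flags = old_flags.decode("ascii") if isinstance(old_flags, bytes) else old_flags
if "gzip" in flags.split(","):
    # gzdeflate() output: inflate as a raw DEFLATE stream (negative wbits).
    old_text = zlib.decompress(old_text, -zlib.MAX_WBITS)
print(old_text.decode("utf-8"))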
Hope this helps. I didn't know anything about this 25 minutes ago :)