0) Context
The XML dump files contain <text ...>plaintext</text>. When I build a mirror from those dumps, each row of the text table holds:
old_text: plaintext old_flags: utf-8
However, when I then create a new page on my mirror, I get:
old_text: ciphertext old_flags: utf-8,gzip
1) Objective
When I build a mirror, I would like to compress the <text ...>plaintext</text> to get:
old_text: ciphertext old_flags: utf-8,gzip
I would like this done for every text revision, so as to save both disk space and communication bandwidth between web server and browser.
2) Problem
There is little relevant documentation on https://www.mediawiki.org, so I have run a few experiments.
exp1) I pipe the plaintext through gzip, escape for MySQL, and build the mirror. However, when I browse, I get the message:
"The revision #165770 of the page named "Main Page" does not exist"
When I look in the database, some kind of ciphertext does indeed exist.
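Whether gzip framing is what MediaWiki expects is exactly the open question, but the escaping step of exp1 can at least be taken out of the equation: a MySQL hex literal X'...' carries arbitrary binary with no escaping at all. A Python 3 sketch (the sample text is made up; the table and column names are the ones shown above):

```python
import binascii
import gzip

# Hypothetical revision text, standing in for one <text> element from the dump.
plaintext = "Some wikitext for the Main Page".encode("utf-8")
ciphertext = gzip.compress(plaintext)

# X'...' is a MySQL hexadecimal literal: valid in any INSERT, needs no escaping,
# so any remaining failure cannot be blamed on shell or MySQL quoting.
hex_blob = binascii.hexlify(ciphertext).decode("ascii")
sql = ("INSERT INTO text (old_text, old_flags) "
       "VALUES (X'%s', 'utf-8,gzip');" % hex_blob)
print(sql[:50] + " ...")
```

This only rules out escaping as the failure mode; the framing of the blob itself is still whatever the compressor produced.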
3) Variants
Many utilities compress plaintext using LZ77 plus Huffman coding (DEFLATE), but they differ in the header and trailer wrapped around the compressed stream. Raw deflate has no header at all. So I try four more experiments:
exp2) gzip, but throw away the 10-byte header (to approximate raw deflate; note the 8-byte gzip trailer of CRC32 and size is still attached)
/bin/gzip | tail -c +11
exp3) perl compress
/usr/bin/perl -MCompress::Zlib -e 'undef $/; print compress(<>)'
exp4) python (2) compress with the zlib wrapper, then throw away the surrounding single quotes
/usr/bin/python -c "import zlib,sys;print repr(zlib.compress(sys.stdin.read()))" | /bin/sed 's/^.//; s/.$//'
exp5) zlib-flate from the qpdf DEB package
/usr/bin/zlib-flate -compress
For all experiments, the browser gives the same error message.
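The five attempts above differ mostly in framing. For reference, the three common framings of the same DEFLATE stream can be produced and told apart in a few lines of Python 3 (magic bytes per RFC 1952 and RFC 1950; nothing here is MediaWiki-specific):

```python
import gzip
import zlib

data = b"the same plaintext, framed three different ways"

gz = gzip.compress(data)   # RFC 1952: 10-byte header + deflate + 8-byte trailer (CRC32, size)
zl = zlib.compress(data)   # RFC 1950: 2-byte header + deflate + 4-byte Adler-32 trailer
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = co.compress(data) + co.flush()   # bare deflate stream: no header, no trailer

print(gz[:2].hex())   # 1f8b -- the gzip magic number
print(zl[:1].hex())   # 78   -- the zlib CMF byte (deflate, 32K window)
```

Negative wbits (-15) is how Python's zlib module requests headerless raw deflate, for both compression and decompression.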
4) Reading compressed old_text
It should be possible to read the old_text ciphertext using command-line tools. I created a user page which MediaWiki stored compressed. The browser displays it correctly, but reading it directly from the database fails:
(shell) mysql --host=localhost --user=root --password simplewiki \
          --skip-column-names --silent \
          --execute 'select old_text from simplewiki.text where old_id=5146705' \
        | zlib-flate -uncompress
Enter password:
flate: inflate: data: incorrect data check
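The "incorrect data check" from zlib-flate means the Adler-32 trailer did not match, which could indicate either a different framing or bytes mangled in transit; note that without the --raw option the mysql client escapes newlines, tabs, NULs, and backslashes in batch output, which by itself corrupts a binary blob. Once clean bytes are in hand, a small Python 3 sniffer (sniff_deflate is a hypothetical helper name) can at least identify the framing:

```python
import zlib

def sniff_deflate(blob):
    """Try the three common DEFLATE framings on a raw old_text blob.

    wbits 31 = gzip wrapper, 15 = zlib wrapper, -15 = headerless raw deflate.
    Returns (framing_name, plaintext_bytes), or (None, None) if none fit.
    """
    for name, wbits in [("gzip", 31), ("zlib", 15), ("raw deflate", -15)]:
        try:
            return name, zlib.decompress(blob, wbits)
        except zlib.error:
            continue
    return None, None
```

For example, sniff_deflate(open("blob.bin", "rb").read()) on a dumped old_text column would report which wrapper, if any, the stored bytes carry.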
5) Request
Please provide documentation on how MediaWiki handles compressed old_text.
a) How is the plaintext compressed?
b) Is the ciphertext escaped for MySQL after compression?
c) How does MediaWiki handle old_flags=utf-8,gzip?
d) How are the contents of old_text unescaped and decompressed for rendering?
e) Where in the MediaWiki code should I be looking to understand this better?
Sincerely yours, Kent