On Tue, Nov 10, 2020 at 5:50 PM Gergo Tisza <gtisza(a)wikimedia.org> wrote:
On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler <dkinzler(a)wikimedia.org> wrote:
TemplateData already uses JSON serialization, but then compresses the JSON
output, to make the data fit into the page_props table. This results
in binary data in ParserOutput, which we can't directly put into JSON.
I'm not sure I understand the problem. Binary data can be trivially
represented as JSON, by treating it as a string. Is it an issue of storage
size? JSON escaping of the control characters is (assuming binary data with
a somewhat random distribution of bytes) a ~50% size increase, and UTF-8
encoding the top half of the byte range is another 50%, so it will
approximately double the length: certainly worse than the ~33% increase for
base64, but
not tragic. (And if size increase matters that much, you probably shouldn't
be using base64 either.)
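
As a rough check of those size estimates, here is a minimal sketch in Python (not the PHP code under discussion). It assumes "treating binary as a string" means mapping each byte to the code point of the same value (Latin-1), uses an exactly uniform byte distribution as a stand-in for random data, and relies on the escaping rules of Python's own json module. Other encoders may escape differently, so the ratios are illustrative only.

```python
import base64
import json

# Every byte value once, repeated 16 times: an exactly uniform stand-in
# for "binary data with a somewhat random distribution of bytes".
raw = bytes(range(256)) * 16

# Assumption: each byte becomes the code point of the same value (Latin-1).
as_text = raw.decode("latin-1")

# Escape control characters, quotes and backslashes, keep everything else
# literal, then serialize the JSON document itself as UTF-8.
json_bytes = json.dumps(as_text, ensure_ascii=False).encode("utf-8")

# The base64 alternative, for comparison.
b64_bytes = base64.b64encode(raw)

json_ratio = len(json_bytes) / len(raw)
b64_ratio = len(b64_bytes) / len(raw)

# The string representation is lossless: decoding recovers the original bytes.
round_trip_ok = json.loads(json_bytes.decode("utf-8")).encode("latin-1") == raw

print(f"JSON string: {json_ratio:.2f}x, base64: {b64_ratio:.2f}x, "
      f"round trip ok: {round_trip_ok}")
```

With Python's encoder the JSON-string form comes out at roughly 2.06 times the input and base64 at roughly 1.33 times, which is in line with the "approximately double" versus "~33%" estimate.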
The binary aspect here refers to the gzip output buffer. While that buffer
is represented in PHP as a string, the string is not valid UTF-8 and so
cannot be encoded as JSON. Attempting to do so results in a PHP JSON error,
with boolean false returned instead of a JSON string.
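
The underlying property is not specific to PHP. The following Python sketch (an analogue, not the MediaWiki code) shows that gzip output is never valid UTF-8, because every gzip stream starts with the magic bytes 0x1f 0x8b, and that wrapping the buffer in base64 is one way to carry it inside JSON. The payload and the "templatedata" key are hypothetical placeholders.

```python
import base64
import gzip
import json

# Hypothetical payload standing in for a TemplateData JSON blob.
payload = json.dumps({"description": "example template data"}).encode("utf-8")
compressed = gzip.compress(payload)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The second magic byte, 0x8b, cannot start a UTF-8 sequence,
# so a gzip buffer fails validation no matter what it contains.
utf8_ok = is_valid_utf8(compressed)

# One possible workaround: base64 yields pure ASCII, which JSON accepts.
wrapped = json.dumps(
    {"templatedata": base64.b64encode(compressed).decode("ascii")}
)
restored = gzip.decompress(
    base64.b64decode(json.loads(wrapped)["templatedata"])
)

print(f"valid UTF-8: {utf8_ok}, round trip ok: {restored == payload}")
```

Python's json module rejects bytes objects outright with a TypeError, so the UTF-8 validity check above stands in for the point at which a PHP string containing this buffer would be rejected.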
Condensed example: