On Tue, Nov 10, 2020 at 5:50 PM Gergo Tisza <gtisza@wikimedia.org> wrote:
On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler <dkinzler@wikimedia.org> wrote:
TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON.

I'm not sure I understand the problem. Binary data can be trivially represented as JSON by treating it as a string. Is it an issue of storage size? Assuming binary data with a roughly uniform distribution of bytes, JSON-escaping the control characters adds about 50%, and UTF-8 encoding the top half of the byte range adds another 50%, so the length approximately doubles. That is certainly worse than the ~33% increase for base64, but not tragic. (And if the size increase matters that much, you probably shouldn't be using base64 either.)
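To put rough numbers on that estimate, here is a small Python sketch (Python rather than PHP, purely for illustration). It assumes the "treat it as a string" approach means mapping each byte to the code point U+0000..U+00FF (Latin-1), and it uses random bytes as a stand-in for compressed data. Both are assumptions, not something stated in the thread.

```python
import base64
import json
import os

# Random bytes as a stand-in for compressed data (assumption: roughly
# uniform byte distribution, as in the estimate above).
raw = os.urandom(64 * 1024)

# Option 1: map every byte to the code point with the same value (Latin-1),
# JSON-encode that string, then store the JSON text as UTF-8.
as_text = raw.decode("latin-1")
json_utf8 = json.dumps(as_text, ensure_ascii=False).encode("utf-8")

# Option 2: base64-encode the bytes and JSON-encode the ASCII result.
b64_json = json.dumps(base64.b64encode(raw).decode("ascii")).encode("ascii")

json_ratio = len(json_utf8) / len(raw)
b64_ratio = len(b64_json) / len(raw)

# Option 1 is lossless: decoding reverses each step.
round_trip = json.loads(json_utf8.decode("utf-8")).encode("latin-1")

print(f"byte-string JSON: {json_ratio:.2f}x, base64 JSON: {b64_ratio:.2f}x")
```

On random input the first ratio should come out at roughly 2x and the second at roughly 1.33x, in line with the figures above. Real gzip output is not perfectly uniform, so actual numbers may differ somewhat.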

The binary aspect here refers to the gzip output buffer. While it is represented in PHP as a string, that string is not valid UTF-8, so it cannot be encoded as JSON directly. Attempting to do so makes json_encode() return boolean false and report a JSON error.

Condensed example: https://3v4l.org/cJttU
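
For readers who don't have PHP at hand, the same underlying issue can be sketched in Python. This is not the content of the linked example, only a rough analogue: it checks that a gzip buffer is not valid UTF-8 and then shows one possible workaround, base64-wrapping the buffer before putting it into JSON. The sample payload and the "templatedata" key are hypothetical names chosen for illustration.

```python
import base64
import gzip
import json

# Hypothetical payload standing in for a TemplateData JSON blob.
payload = b'{"description": "example template data"}'
compressed = gzip.compress(payload)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# gzip output starts with the magic bytes 0x1f 0x8b; 0x8b cannot start a
# UTF-8 sequence, so the buffer as a whole is not valid UTF-8.
utf8_ok = is_valid_utf8(compressed)

# One possible workaround: base64 yields plain ASCII, which is JSON-safe.
wrapped = json.dumps({"templatedata": base64.b64encode(compressed).decode("ascii")})
restored = gzip.decompress(base64.b64decode(json.loads(wrapped)["templatedata"]))

print(utf8_ok, restored == payload)
```

The base64 step costs about a third in size, which is the trade-off discussed earlier in the thread.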