On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler <dkinzler@wikimedia.org> wrote:

> TemplateData already uses JSON serialization, but then compresses the JSON output, to make the data fit into the page_props table. This results in binary data in ParserOutput, which we can't directly put into JSON.
I'm not sure I understand the problem. Binary data can be trivially represented as JSON by treating it as a string. Is it an issue of storage size? JSON escaping of the control characters is (assuming binary data with a somewhat random distribution of bytes) a ~50% size increase, and UTF-8 encoding the top half of bytes adds another 50%, so it will approximately double the length; certainly worse than the ~33% increase for base64, but not tragic. (And if the size increase matters that much, you probably shouldn't be using base64 either.)
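For what it's worth, here is a quick back-of-the-envelope check (Python, nothing MediaWiki-specific; the random payload and the byte-to-string mapping are just assumptions for the estimate):

```python
import base64
import json
import os

# Random payload approximates "binary data with a somewhat random
# distribution of bytes" (an assumption; real compressed output may differ).
raw = os.urandom(100_000)

# base64: the familiar ~33% overhead.
b64_len = len(base64.b64encode(raw))

# JSON string: map each byte to U+0000..U+00FF, let the encoder escape
# control characters, quotes and backslashes, then measure the UTF-8
# bytes that would actually be stored.
escaped = json.dumps(raw.decode('latin-1'), ensure_ascii=False)
json_len = len(escaped.encode('utf-8'))

print(f"base64:      {b64_len / len(raw):.2f}x original size")
print(f"JSON string: {json_len / len(raw):.2f}x original size")
# Prints roughly 1.33x and about 2x, in line with the estimate above.
```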
> * Don't write the data to page_props, treat it as extension data in ParserOutput. Compression would become unnecessary. However, batch loading of the data becomes much slower, since each ParserOutput needs to be loaded from ParserCache. Would it be too slow?
It would also mean that fetching template data or some other page property might require a parse, since parser cache entries expire. And it would mean the properties could not be searched, which I think is a dealbreaker.
> * Apply compression for page_props, but not for the data in ParserOutput. We would have to introduce some kind of serialization mechanism into PageProps and LinksUpdate. Do we want to encourage this use of page_props?
IMO we don't want to. page_props is for page *properties*, not arbitrary structured data. It is also somewhat problematic in that it is per-page data representing the result of a parse, so it doesn't necessarily match the current revision, nor what a user with non-canonical parser options sees. New features should probably use MCR for structured data.
> * Introduce a dedicated database table for templatedata. Cleaner, but schema changes and data migration take a long time.
That seems like a decent solution to me, and probably the one I would pick (unless there are more extensions in a similar situation). This is secondary data, so it doesn't really need to be migrated: just have TemplateData write to the new table and fall back to the old one when reading (see the sketch below). Creating new tables should also not be time-consuming.
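To illustrate the read path, something like this (purely a sketch; the table accessors, the property name and the compression scheme are made-up stand-ins, not actual TemplateData or MediaWiki interfaces):

```python
import json
import zlib


def load_template_data(page_id, new_table, page_props):
    """Prefer the dedicated table; fall back to the legacy page_props row.

    `new_table` and `page_props` are hypothetical accessors standing in
    for whatever database layer would actually be used.
    """
    row = new_table.get(page_id)
    if row is not None:
        return json.loads(row)

    legacy = page_props.get(page_id, 'templatedata')
    if legacy is None:
        return None
    # Legacy rows hold compressed JSON; zlib is assumed here for illustration.
    return json.loads(zlib.decompress(legacy))
```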
> * Put templatedata into the BlobStore, and just the address into page_props. Makes loading slower, maybe even slower than the solution that relies on ParserCache.
Doesn't BlobStore support batch loading, unlike ParserCache?
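If it does, the loading path could still be batched in two round trips, roughly like this (again just a sketch with invented interfaces, not the actual BlobStore API):

```python
import json


def batch_load_template_data(page_ids, page_props, blob_store):
    """Two round trips: addresses from page_props, then a batched blob fetch.

    `page_props.get_addresses` and `blob_store.get_batch` are hypothetical
    stand-ins for the real lookup and blob-fetching interfaces.
    """
    # One query: page id -> blob address stored in page_props.
    addresses = page_props.get_addresses(page_ids, prop='templatedata')
    # One batched fetch of all blobs by address.
    blobs = blob_store.get_batch(list(addresses.values()))
    return {pid: json.loads(blobs[addr]) for pid, addr in addresses.items()}
```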
> * Convert TemplateData to MCR. This is the cleanest solution, but would require us to create an editing interface for templatedata, and migrate out existing data from wikitext. This is a long term perspective.
MCR has fairly different semantics from parser metadata. There are many ways TemplateData data can be generated for a page without having a <templatedata> tag in the wikitext (e.g. a doc subpage, or a template which generates both documentation HTML and hidden TemplateData). Switching to MCR should be thought of as a workflow adjustment for contributors, not just a data migration.