On Tue, Nov 3, 2020 at 1:59 AM Daniel Kinzler <dkinzler(a)wikimedia.org> wrote:
> TemplateData already uses JSON serialization, but then compresses the JSON
> output, to make the data fit into the page_props table. This results in
> binary data in ParserOutput, which we can't directly put into JSON.
I'm not sure I understand the problem. Binary data can be trivially
represented as JSON, by treating it as a string. Is it an issue of storage
size? JSON escaping of the control characters is (assuming binary data with
a somewhat random distribution of bytes) an ~50% size increase, UTF-8
encoding the top half of bytes is another 50%, so it will approximately
double the length - certainly worse than the ~33% increase for base64, but
not tragic. (And if size increase matters that much, you probably shouldn't
be using base64 either.)
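The size arithmetic above can be checked with a quick sketch (uniformly random bytes are an assumption; real compressed data is close to that):

```python
import base64
import json
import random

# Rough check of the size trade-off: JSON-escaping binary data as a string
# versus base64, assuming a uniform distribution of byte values.
random.seed(0)
raw = bytes(random.randrange(256) for _ in range(90_000))

# JSON route: treat each byte as a Latin-1 code point, escape, encode as UTF-8.
json_bytes = json.dumps(raw.decode("latin-1"), ensure_ascii=False).encode("utf-8")
b64_bytes = base64.b64encode(raw)

# Control characters escape to \uXXXX (or \n-style pairs), and the top half
# of byte values take two bytes in UTF-8, so the total roughly doubles.
print(round(len(json_bytes) / len(raw), 2))  # roughly 2, i.e. ~100% increase
print(round(len(b64_bytes) / len(raw), 2))   # ~1.33, i.e. ~33% increase
```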
> * Don't write the data to page_props, treat it as extension data in
> ParserOutput. Compression would become unnecessary. However, batch loading
> of the data becomes much slower, since each ParserOutput needs to be loaded
> from ParserCache. Would it be too slow?
It would also mean that fetching template data or some other page property
might require a parse, as parser cache entries expire.
It would also also mean the properties could not be searched, which I think
is a dealbreaker.
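For comparison, page_props is a plain indexed table, so "which pages have this property?" is a single query; with ParserCache you would have to load and inspect every page's ParserOutput. A rough sketch, using SQLite and a simplified stand-in for MediaWiki's actual schema:

```python
import sqlite3

# Simplified stand-in for MediaWiki's page_props table (pp_page, pp_propname,
# pp_value), just to illustrate the kind of search query it supports.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value BLOB)")
conn.executemany(
    "INSERT INTO page_props VALUES (?, ?, ?)",
    [(1, "templatedata", b"{}"), (2, "templatedata", b"{}"), (3, "noindex", b"1")])

# One indexed query answers "which pages define TemplateData?".
pages = [row[0] for row in conn.execute(
    "SELECT pp_page FROM page_props WHERE pp_propname = ?", ("templatedata",))]
print(pages)  # [1, 2]
```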
> * Apply compression for page_props, but not for the data in ParserOutput.
> We would have to introduce some kind of serialization mechanism into
> PageProps and LinksUpdate. Do we want to encourage this use of page_props?
IMO we don't want to. page_props is for page *properties*, not arbitrary
structured data. Also it's somewhat problematic in that it is per-page data
but it represents the result of a parse, so it doesn't necessarily match
the current revision, nor what a user with non-canonical parser options
sees. New features should probably use MCR for structured data.
> * Introduce a dedicated database table for templatedata. Cleaner, but
> schema changes and data migration take a long time.
That seems like a decent solution to me, and probably the one I would pick
(unless there are more extensions in a similar situation). This is
secondary data so it doesn't really need to be migrated; just make
TemplateData write to the new table and fall back to the old one when
reading. Creating new tables should also not be time-consuming.
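The write-new/read-with-fallback pattern could look roughly like this (the templatedata table and its columns are hypothetical; SQLite stands in for the real database):

```python
import sqlite3

# Hedged sketch of write-new/read-with-fallback. The "templatedata" table and
# its td_* columns are invented for illustration; page_props mirrors the
# existing MediaWiki table.
def load_templatedata(conn, page_id):
    row = conn.execute(
        "SELECT td_data FROM templatedata WHERE td_page = ?", (page_id,)).fetchone()
    if row is not None:
        return row[0]  # page already written to the new table
    row = conn.execute(
        "SELECT pp_value FROM page_props"
        " WHERE pp_page = ? AND pp_propname = 'templatedata'", (page_id,)).fetchone()
    return row[0] if row else None  # legacy fallback, or no TemplateData at all

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE templatedata (td_page INTEGER, td_data TEXT)")
conn.execute(
    "CREATE TABLE page_props (pp_page INTEGER, pp_propname TEXT, pp_value TEXT)")
conn.execute("INSERT INTO templatedata VALUES (1, '{\"new\": true}')")
conn.execute("INSERT INTO page_props VALUES (2, 'templatedata', '{\"old\": true}')")

print(load_templatedata(conn, 1))  # served from the new table
print(load_templatedata(conn, 2))  # falls back to page_props
```

Pages not yet re-parsed keep working via the fallback, and the old page_props rows can simply age out as pages are re-rendered.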
> * Put templatedata into the BlobStore, and just the address into
> page_props. Makes loading slower, maybe even slower than the solution that
> relies on ParserCache.
Doesn't BlobStore support batch loading, unlike ParserCache?
> * Convert TemplateData to MCR. This is the cleanest solution, but would
> require us to create an editing interface for templatedata, and migrate out
> existing data from wikitext. This is a long term perspective.
MCR has fairly different semantics from parser metadata. There are many
ways TemplateData data can be generated for a page without having a
<templatedata> tag in the wikitext (e.g. a doc subpage, or a template which
generates both documentation HTML and hidden TemplateData). Switching to
MCR should be thought of as a workflow adjustment for contributors, not
just a data migration.