It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
(a) "the current page content model, as recorded in page_content_model"
or should it mean
(b) "the default for this title, no matter what page_content_model says"?
Kunal and I have had an unintentional edit war about this question in Revision.php:
Kunal changed it from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/222043/ I later changed it from (b) to (a) in https://gerrit.wikimedia.org/r/#/c/297787/ Kunal reverted me from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/298239/
So, which way do we want it?
The conflict seems to arise from (at least) three competing use cases:
I) re-interpreting page content. For instance, a user may move a misnamed User:Foo.jss to User:Foo.js. In this case, the content should be re-interpreted as JavaScript, including all old revisions. This would be in favor of behavior (a), though it still works with (b), because the default model changes based on the suffix ".js". I think it would however be better to only rely on title parsing magic once, when creating the page, not later, when rendering old revisions.
II) converting page content. For instance, if a talk page gets converted to using Flow, new revisions (and page_content_model) will have the Flow model, while old revisions need to keep their original wikitext model (even though their rev_content_model is null). That would need behavior (b).
III) changing a namespace's default content model. E.g. when installing an extension that changes the default content model of a namespace (such as Wikibase with Items in the main namespace, or Flow-per-default for Talk pages), existing pages that were already in that namespace should still be readable. With (b), this would fail: even though page_content_model has the correct model for reading the page, rev_content_model is null, so the new namespace default is used, which will fail. With (a), this would simply work: the page will be rendered according to page_content_model.
In all cases it's possible to resolve the issue by replacing the NULL entries for all revisions of a page with the current model id. The question is just when and how we do that, and when and how we can even detect that this needs doing.
There is also an in-between option, let's call it a/b: fall back to page_content_model for the latest revision (that should *always* be right), but to ignore page_content_model for older revisions. That would cater to use case III at least in so far as it would be possible to view the "misplaced" pages. But viewing old revisions or diffs would still fail with a nasty error. This option may look better on the surface, but I fear it will just add to the confusion.
There's another fix: never write null into rev_content_model. Always put the actual model ID there. That's pretty wasteful, but it's robust and reliable. When we decided to use null as a placeholder for the default, we assumed the default would never change. But as we now see, it sometimes does...
So, what should it be, option (a) or (b)? And how do we address the use case that is then broken? What should we write into rev_content_model in the future?
I personally think that option (a) makes more sense, because the resolutions of defaults is then local to the database. It could even be done within the SQL query. It's easier to maintain consistency that way. For use case II, that would require us to "fill in" all the rev_content_model fields in old revisions when converting a page. I think it would be a good thing to do that. If we have the content model change between revisions, it seems prudent to record it explicitly.
wikitech-l@lists.wikimedia.org