It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
(a) "the current page content model, as recorded in page_content_model"
or should it mean
(b) "the default for this title, no matter what page_content_model says"?
Kunal and I have had an unintentional edit war about this question in Revision.php:
Kunal changed it from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/222043/
I later changed it from (b) to (a) in https://gerrit.wikimedia.org/r/#/c/297787/
Kunal reverted me from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/298239/
So, which way do we want it?
The conflict seems to arise from (at least) three competing use cases:
I) re-interpreting page content. For instance, a user may move a misnamed User:Foo.jss to User:Foo.js. In this case, the content should be re-interpreted as JavaScript, including all old revisions. This would be in favor of behavior (a), though it still works with (b), because the default model changes based on the suffix ".js". I think it would however be better to only rely on title parsing magic once, when creating the page, not later, when rendering old revisions.
II) converting page content. For instance, if a talk page gets converted to using Flow, new revisions (and page_content_model) will have the Flow model, while old revisions need to keep their original wikitext model (even though their rev_content_model is null). That would need behavior (b).
III) changing a namespace's default content model. E.g. when installing an extension that changes the default content model of a namespace (such as Wikibase with Items in the main namespace, or Flow-per-default for Talk pages), existing pages that were already in that namespace should still be readable. With (b), this would fail: even though page_content_model has the correct model for reading the page, rev_content_model is null, so the new namespace default is used, which will fail. With (a), this would simply work: the page will be rendered according to page_content_model.
In all cases it's possible to resolve the issue by replacing the NULL entries for all revisions of a page with the current model id. The question is just when and how we do that, and when and how we can even detect that this needs doing.
There is also an in-between option, let's call it a/b: fall back to page_content_model for the latest revision (that should *always* be right), but ignore page_content_model for older revisions. That would cater to use case III at least insofar as it would be possible to view the "misplaced" pages. But viewing old revisions or diffs would still fail with a nasty error. This option may look better on the surface, but I fear it will just add to the confusion.
There's another fix: never write null into rev_content_model. Always put the actual model ID there. That's pretty wasteful, but it's robust and reliable. When we decided to use null as a placeholder for the default, we assumed the default would never change. But as we now see, it sometimes does...
So, what should it be, option (a) or (b)? And how do we address the use case that is then broken? What should we write into rev_content_model in the future?
I personally think that option (a) makes more sense, because the resolution of defaults is then local to the database. It could even be done within the SQL query. It's easier to maintain consistency that way. For use case II, that would require us to "fill in" all the rev_content_model fields in old revisions when converting a page. I think it would be a good thing to do that. If we have the content model change between revisions, it seems prudent to record it explicitly.
On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
What should we write into rev_content_model in the future
Content model handling is pending a refactoring: https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage Once that happens, they should never be NULL.
Hi Jaime, thanks for the pointer! I had completely forgotten about that.
A few thoughts about that RFC:
* I have long thought that content_format is pretty pointless and redundant. I haven't seen any content model that uses different serialization formats (I wrote a few that support two, but only ever used one). If the serialization does need to change for some reason, it's usually easy to detect from the first few bytes.
* What we need instead is versioning on the content model. It happens quite often that the data structure you store changes slightly. Knowing what version you are dealing with is quite helpful when deserializing and processing. These differences are much harder to auto-detect than the serialization format.
* Per-page and per-revision content model will become redundant with Multi-Content-Revisions. We will instead have this info in the revision_slot table (multiple per revision). The same design still applies, but changing the page and revision table would be pointless. We would just ignore the content model (and format) in the page and revision table, and rely on the info for the slot table instead. At some point, we can then drop this info from page and revision.
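As a side note on the "easy to detect from the first few bytes" claim above, such sniffing could look like this (hypothetical function and format names, not actual MediaWiki code):

```python
def sniff_format(blob: bytes) -> str:
    """Guess the serialization format from the first non-blank byte.

    Hypothetical heuristic: JSON payloads start with '{' or '[';
    anything else is treated as the legacy text serialization.
    """
    head = blob.lstrip()[:1]
    return "application/json" if head in (b"{", b"[") else "text/x-wiki"

print(sniff_format(b'{"type": "item"}'))   # application/json
print(sniff_format(b"'''Hello''' world"))  # text/x-wiki
```

This only distinguishes gross serialization formats; per the point above, subtle schema versions within one format cannot be sniffed this way, which is why explicit model versioning would still be needed.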
I propose to introduce the content_model (and maybe also content_format) tables, but not touch the page and revision table for now. Instead, we introduce revision_slots for Multi-Content-Revisions first, using the content_model table, and introduce model versioning; maybe drop the format in the process.
What do you think?
Am 11.07.2016 um 14:27 schrieb Jaime Crespo:
On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
What should we write into rev_content_model in the future
Content model handling is pending a refactoring: https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage Once that happens, they should never be NULL.
Wikitech-l mailing list Wikitech-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikitech-l
On Monday, July 11, 2016, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Hi Jaime, thanks for the pointer! I had completely forgotten about that.
A few thoughts about that RFC:
- I have long thought that content_format is pretty pointless and redundant. I haven't seen any content model that uses different serialization formats (I wrote a few that support two, but only ever used one). If the serialization does need to change for some reason, it's usually easy to detect from the first few bytes.
As an aside, I've recently (as in literally last week) been doing some stuff using multiple serialization formats (specifically, I wanted the user to be able to choose what format to edit in, but always save in the canonical format). It's working pretty well for my use case. Two issues I encountered were that the show-diff button on the edit page is totally broken (T139249), and that there is no way to separate the default format for editing from the default format for the DB.
(Sorry if this is off topic, I just wanted to mention I'm actually using content format, albeit not the DB part of it.)
-- bawolff
Addendum, after sleeping over this:
Do we really want to manage something that is essentially configuration, namely the set of available content models and formats, in a database table? How is it maintained?
For context:
* As per T113034, we are moving away from managing interwiki prefixes in the database, in favor of configuration files.
* Namespace IDs are defined in LocalSettings.php.
The original design of ContentHandler used integer IDs for content models and formats in the DB. A mapping to human readable names is only needed for logging and error messages anyway. Such a mapping could be maintained in LocalSettings.php, just like we do for namespaces. This would also serve to avoid ID clashes. My idea back then was to have a sort of registry on mediawiki.org where extensions could reserve an ID for themselves, so that the same ID would stand for the same model everywhere.
The disadvantage is of course that the model and format are not obvious when eyeballing the result of an SQL query. It also makes database dumps more brittle, since they cannot be interpreted without knowledge of the format and model identifiers. That's an argument for having these in the DB.
Still... configuration in the database is nasty to maintain by hand, and also annoying for extensions that define content models. Do we introduce a simple hook that makes sure the content model and format gets registered in the database?
Am 11.07.2016 um 21:26 schrieb Daniel Kinzler:
Hi Jaime, thanks for the pointer! I had completely forgotten about that.
A few thoughts about that RFC:
- I have long thought that content_format is pretty pointless and redundant. I haven't seen any content model that uses different serialization formats (I wrote a few that support two, but only ever used one). If the serialization does need to change for some reason, it's usually easy to detect from the first few bytes.
- What we need instead is versioning on the content model. It happens quite often that the data structure you store changes slightly. Knowing what version you are dealing with is quite helpful when deserializing and processing. These differences are much harder to auto-detect than the serialization format.
- Per-page and per-revision content model will become redundant with Multi-Content-Revisions. We will instead have this info in the revision_slot table (multiple per revision). The same design still applies, but changing the page and revision table would be pointless. We would just ignore the content model (and format) in the page and revision table, and rely on the info for the slot table instead. At some point, we can then drop this info from page and revision.
I propose to introduce the content_model (and maybe also content_format) tables, but not touch the page and revision table for now. Instead, we introduce revision_slots for Multi-Content-Revisions first, using the content_model table, and introduce model versioning; maybe drop the format in the process.
What do you think?
Am 11.07.2016 um 14:27 schrieb Jaime Crespo:
On Mon, Jul 11, 2016 at 2:07 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
What should we write into rev_content_model in the future
Content model handling is pending a refactoring: https://www.mediawiki.org/wiki/Requests_for_comment/Content_model_storage Once that happens, they should never be NULL.
Your last question is a non issue for me- I do not care if things are on the database or on configuration- that is not the issue I have been complaining about.
What I blocked is having 6000 million rows (x40 due to redundancy) with the same column value "gzip; version 3 (1-2-3-testing-testing. It seems to work)" when it can be summarized as a 1-byte or less id (and that id be explained somewhere else). The difference between both options is extremely cheap to code, and not only would it save thousands of dollars in server cost, it would also minimize maintenance cost and dramatically increase performance (or at least not decrease it) on one of the largest bottlenecks for large wikis, as it could fit fully into memory (yes, we have 515 GB servers now).
To give you an idea of how bad things are currently: WMF's architecture technically does not store any data on the main database servers (a lot of asterisks here, allow me to be inexact for the sake of simplicity), only metadata, as the wiki content is stored on the "external storage" subsystem. I gave InnoDB compression a try [0] (which has a very low compression ratio and a very small block size, as it is for real-time purposes only), yet I was able to reduce the disk usage to less than half by compressing only the top 10 tables: [1]. If this is not an objective measurement of how inefficient the mediawiki schema is, I do not know how I can convince you otherwise.
Of course there are a lot of history and legacy and maintenance issues, but when the guy that would actually spend days of his life running schema changes so they do not affect production is the one begging for them to happen, you know there is an issue. And this is not a "mediawiki is bad" complaint- I think mediawiki is a very good piece of software- I only want to make it better with very, very small maintenance-like changes.
The disadvantage is of course that the model and format are not obvious when eyeballing the result of an SQL query.
Are you serious? Because this is super-clear already :-P:
MariaDB db1057 enwiki > SELECT * FROM revision LIMIT 1000,1\G
*************************** 1. row ***************************
       rev_text_id: 1161   -- what?
[...]
 rev_content_model: NULL   -- what?
rev_content_format: NULL
1 row in set (0.00 sec)

MariaDB db1057 enwiki > SELECT * FROM text WHERE old_id=1161;  -- WTF, old_id?
+--------+---------------------+----------------+
| old_id | old_text            | old_flags      |
+--------+---------------------+----------------+
|   1161 | DB://rc1/15474102/0 | external,utf-8 |  -- WTF is this?
+--------+---------------------+----------------+
1 row in set (0.03 sec)
I am joking at this point, but emulating what someone that looks at the db would say. My point is that mediawiki is no longer simple.
More recommended reading (not for you, for many developers that still are afraid of them- and I really found many cases in the wild for otherwise good contributors): https://en.wikipedia.org/wiki/Join_(SQL)
[0] https://phabricator.wikimedia.org/T139055
[1] https://grafana.wikimedia.org/dashboard/db/server-board?panelId=17&fullscreen&from=1467294350779&to=1467687175941&var-server=db1073&var-network=eth0
On Tue, Jul 12, 2016 at 10:40 AM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Addendum, after sleeping over this:
Do we really want to manage something that is essentially configuration, namely the set of available content models and formats, in a database table? How is it maintained?
Am 12.07.2016 um 12:25 schrieb Jaime Crespo:
Your last question is a non issue for me- I do not care if things are on the database or on configuration- that is not the issue I have been complaining about.
Yea, still something we need to figure out :)
I'm fine with the DB based solution, if we have decent tooling for extensions to register their content models, etc.
What I blocked is having 6000 million rows (x40 due to redundancy) with the same column value "gzip; version 3 (1-2-3-testing-testing. It seems to work)" when it can be summarized as a 1-byte or less id (and that id be explained somewhere else).
Yea, that's not what I would recommend either. What I meant is that we can now, as a stepping stone and without blocking on a schema change, fill in the null values in the revision table for the revisions of a page that is being converted to a new model, to avoid confusion. Converting pages to a different model is relatively rare, so I think it would not have much of an impact on the big picture.
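The "fill in the null values" step for a page being converted could be as simple as a single UPDATE over that page's revisions, run before page_content_model is switched. A sketch with SQLite and invented data (not actual MediaWiki code):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER,
                           rev_content_model TEXT);
    INSERT INTO revision VALUES (100, 1, NULL);
    INSERT INTO revision VALUES (101, 1, NULL);
    INSERT INTO revision VALUES (200, 2, NULL);
""")

# Before switching page 1 (say, a talk page being converted to Flow) to a
# new content model, pin the old model onto every revision of that page
# that still relies on the NULL fallback. Other pages are left untouched.
db.execute("""
    UPDATE revision
    SET rev_content_model = 'wikitext'
    WHERE rev_page = ? AND rev_content_model IS NULL
""", (1,))

rows = db.execute(
    "SELECT rev_id, rev_content_model FROM revision ORDER BY rev_id"
).fetchall()
print(rows)  # [(100, 'wikitext'), (101, 'wikitext'), (200, None)]
```

Since page conversions are rare, the write volume of such a backfill should indeed be negligible compared to filling in the column for every revision.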
Of course there are a lot of history and legacy and maintenance issues, but when the guy that would actually spend days of his life running schema changes so they do not affect production is the one begging for them to happen, you know there is an issue. And this is not a "mediawiki is bad" complaint- I think mediawiki is a very good piece of software- I only want to make it better with very, very small maintenance-like changes.
I'm all for it!
The disadvantage is of course that the model and format are not obvious when eyeballing the result of an SQL query.
Are you serious? Because this is super-clear already :-P:
That was, if I remember correctly, one of the arguments for using readable strings there, instead of int values and a config variable, as I originally proposed. This was discussed at the last Berlin hackathon, which must have been 2012. Tim may remember more details. We should probably re-consider the pros and cons we discussed back then when planning to change the schema now.
-- daniel
On Tue, Jul 12, 2016 at 12:40 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Yea, still something we need to figure out :)
That was, if I remember correctly, one of the arguments for using readable strings there, instead of int values and a config variable, as I originally proposed. This was discussed at the last Berlin hackathon, which must have been 2012. Tim may remember more details. We should probably re-consider the pros and cons we discussed back then when planning to change the schema now.
But that was already re-reviewed and discussed and approved by Tim himself (among others) in 2015: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.html
Am 12.07.2016 um 13:23 schrieb Jaime Crespo:
On Tue, Jul 12, 2016 at 12:40 PM, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Yea, still something we need to figure out :)
That was, if I remember correctly, one of the arguments for using readable strings there, instead of int values and a config variable, as I originally proposed. This was discussed at the last Berlin hackathon, which must have been 2012. Tim may remember more details. We should probably re-consider the pros and cons we discussed back then when planning to change the schema now.
But that was already re-reviewed and discussed and approved by Tim himself (among others) in 2015: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-07-29-20.59.html
Yes, I saw that. And I'm happy about it! But the aspect of maintenance and tooling seems to be completely absent from the discussion and proposal. From a DB perspective, it looks fine. I just feel it is missing a few crucial bits. Like, how does anything ever get into these tables?
-- daniel
On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Do we really want to manage something that is essentially configuration, namely the set of available content models and formats, in a database table? How is it maintained?
One simple method: assign the numeric IDs by making the numeric ID column auto-increment, and insert the model strings into the table as needed. PageAssessments uses this model for tracking its project tags.[1]
The disadvantage is that there wouldn't be any cross-wiki mapping between model names and ids, which can be mitigated somewhat by never exposing the ids externally.
[1]: https://phabricator.wikimedia.org/diffusion/EPAS/browse/master/PageAssessmen...
Such a mapping could be maintained in LocalSettings.php, just like we do for namespaces. This would also serve to avoid ID clashes. My idea back then was to have a sort of registry on mediawiki.org where extensions could reserve an ID for themselves, so that the same ID would stand for the same model everywhere.
Does the registry idea work all that smoothly for namespaces, though?
Am 12.07.2016 um 17:02 schrieb Brad Jorsch (Anomie):
On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
One simple method: assign the numeric IDs by making the numeric ID column auto-increment, and insert the model strings into the table as needed.
When exactly? When update.php runs? Should work fine, but I'd like a nice interface that extensions can use for this. Or should we check and auto-insert on every page edit?
To answer my own question about config in the database: unlike interwiki/sites and namespaces, this isn't really configuration, it's a registry used by extensions. Users may freely define namespaces for their wiki, but they can't freely define content models.
The disadvantage is that there wouldn't be any cross-wiki mapping between model names and ids, which can be mitigated somewhat by never exposing the ids externally.
Yes, we should definitely not expose those!
Does the registry idea work all that smoothly for namespaces, though?
I don't think it was ever really tried for namespaces. But it's not a perfect solution. Just a possibility.
-- daniel
On Tue, Jul 12, 2016 at 11:47 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Am 12.07.2016 um 17:02 schrieb Brad Jorsch (Anomie):
On Tue, Jul 12, 2016 at 4:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
One simple method: assign the numeric IDs by making the numeric ID column auto-increment, and insert the model strings into the table as needed.
When exactly? When update.php runs? Should work fine, but I'd like a nice interface that extensions can use for this. Or should we check and auto-insert on every page edit?
The linked example is inserting (if necessary) on every page edit. The check part needs to happen on every edit anyway because it needs to fetch the ID for the name.
update.php would work too as long as things blow up clearly when someone didn't run update.php recently enough. That could also allow us to let the extension suggest an ID, so the registrar would only have to assign a "random" ID in case of a conflict.
Does the registry idea work all that smoothly for namespaces, though?
I don't think it was ever really tried for namespaces. But it's not a perfect solution. Just a possibility.
https://www.mediawiki.org/wiki/Extension_default_namespaces?
On Tue, Jul 12, 2016 at 8:02 AM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
One simple method: assign the numeric IDs by making the numeric ID column auto-increment, and insert the model strings into the table as needed. PageAssessments uses this model for tracking its project tags.[1]
The disadvantage is that there wouldn't be any cross-wiki mapping between model names and ids, which can be mitigated somewhat by never exposing the ids externally.
Could you explain this idea in a way that doesn't require diving into the codebase to figure out what you mean? Cloaking the mapping of local ids (e.g. auto incremented in the DB) to global ids ("model names") seems to suggest a new way of making our system behave in an inscrutable way.
On Tue, Jul 12, 2016 at 9:00 AM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
[Does this namespace registry idea work?]
https://www.mediawiki.org/wiki/Extension_default_namespaces?
That doesn't seem like a good model to emulate. We're not iana.org, and we don't have anywhere near the rigor defined in IETF RFC 5226. I may put further thoughts on this topic in the Interwiki map RFC task (T113034).
Rob
Am 12.07.2016 um 21:02 schrieb Rob Lanphier:
On Tue, Jul 12, 2016 at 8:02 AM, Brad Jorsch (Anomie) bjorsch@wikimedia.org wrote:
One simple method: assign the numeric IDs by making the numeric ID column auto-increment, and insert the model strings into the table as needed. PageAssessments uses this model for tracking its project tags.[1]
The disadvantage is that there wouldn't be any cross-wiki mapping between model names and ids, which can be mitigated somewhat by never exposing the ids externally.
Could you explain this idea in a way that doesn't require diving into the codebase to figure out what you mean? Cloaking the mapping of local ids (e.g. auto incremented in the DB) to global ids ("model names") seems to suggest a new way of making our system behave in an inscrutable way.
The idea is that in API responses (and requests), in XML dumps, etc, the content model for wikitext will be represented as the string "wikitext", even if the internal ID is 1 in the database of one wiki, and 37 on another. Clients have to know the canonical names, they are not concerned with the internal ids. They are considered an internal optimization, an implementation detail.
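A toy illustration of that separation (invented ids and field names, not actual MediaWiki code): the internal id is translated to the canonical name at the serialization boundary, so two wikis with different internal ids produce identical external output.

```python
# Per-wiki internal id -> canonical model name mappings. The ids differ,
# but clients never see them.
wiki_a = {1: "wikitext", 37: "flow-board"}
wiki_b = {37: "wikitext", 1: "flow-board"}

def serialize_revision(id_to_name, rev):
    """Replace the internal model id with the canonical name before the
    record leaves the storage layer (for the API, dumps, etc.)."""
    return {"rev_id": rev["rev_id"],
            "contentmodel": id_to_name[rev["model_id"]]}

# The same wikitext revision, stored under different internal ids:
out_a = serialize_revision(wiki_a, {"rev_id": 100, "model_id": 1})
out_b = serialize_revision(wiki_b, {"rev_id": 100, "model_id": 37})
print(out_a == out_b)  # True
```

The reverse mapping (name back to local id) would be applied on import, which is what keeps dumps portable between wikis despite divergent internal ids.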
On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
Do we really want to manage something that is essentially configuration, namely the set of available content models and formats, in a database table? How is it maintained?
For context:
- As per T113034, we are moving away from managing interwiki prefixes in the database, in favor of configuration files.
- Namespace IDs are defined in LocalSettings.php.
The original design of ContentHandler used integer IDs for content models and formats in the DB. A mapping to human readable names is only needed for logging and error messages anyway.
This oversimplifies things greatly. Integer IDs need to be mapped to some well-specified, non-local (global?) identifier for many many purposes (reading exports, writing exports, reading site content, displaying site content for many contexts, etc)
As Jaime points out, we don't want or need 6 billion copies of the same identifier in our database. However, relegating that information to LocalSettings.php means that we'll have to manually sync that critical configuration data for use by non-PHP implementations interacting with the information.
On Tue, Jul 12, 2016 at 3:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
I'm fine with the DB based solution, if we have decent tooling for extensions to register their content models, etc.
We need to put a lot of thought into content model management generally. This statement implies managing content models outside of the database is easy.
Rob
Am 12.07.2016 um 18:00 schrieb Rob Lanphier:
On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
The original design of ContentHandler used integer IDs for content models and formats in the DB. A mapping to human readable names is only needed for logging and error messages anyway.
This oversimplifies things greatly. Integer IDs need to be mapped to some well-specified, non-local (global?) identifier for many many purposes (reading exports, writing exports, reading site content, displaying site content for many contexts, etc)
Yea, sorry. That we only need this for logging is what I assumed back then. Not exposing the numeric ID at all, and using the canonical name in dumps, the API, etc, avoids a lot of trouble (but doesn't come free).
We need to put a lot of thought into content model management generally. This statement implies managing content models outside of the database is easy.
Well, it's the same as namespaces: they are easy to set up, but also too easy to change, so it's easy to create a mess...
As explained in my earlier response, I now realized that content models differ from namespaces in that they are not really configured by people, but rather registered by extensions. That makes it a lot less awkward to have them in the database. We still have to agree on a good trigger for the registration, but it doesn't seem to be a tricky issue.
What we still need to figure out is how to solve the chicken-and-egg situation with Multi-Content-Rev. At the moment, I'm thinking this might work:
* introduce content model (and format) registry in the DB, and populate it.
* leave page and revision table as they are for now.
* introduce slots table, use the new content_model (and content_format) table.
* stop using the content model (and format) from the page and revision tables.
* drop the content model (and format) from the page and revision tables.
Does that sound like a good plan? Let's for a moment assume we can get slots fully rolled out by the end of the year.
On Tuesday, July 12, 2016, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
Am 12.07.2016 um 18:00 schrieb Rob Lanphier:
On Tue, Jul 12, 2016 at 1:40 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
The original design of ContentHandler used integer IDs for content models and formats in the DB. A mapping to human readable names is only needed for logging and error messages anyway.
This oversimplifies things greatly. Integer IDs need to be mapped to some well-specified, non-local (global?) identifier for many many purposes (reading exports, writing exports, reading site content, displaying site content for many contexts, etc)
Yea, sorry. That we only need this for logging is what I assumed back then. Not exposing the numeric ID at all, and using the canonical name in dumps, the API, etc, avoids a lot of trouble (but doesn't come free).
Yes, numeric ids are internal and ideally never to be exposed. We should've done the same with namespaces but got dragged into compat hell. :)
We need to put a lot of thought into content model management generally. This statement implies managing content models outside of the database is easy.
Well, it's the same as namespaces: they are easy to set up, but also too easy to change, so it's easy to create a mess...
As explained in my earlier response, I now realized that content models differ from namespaces in that they are not really configured by people, but rather registered by extensions. That makes it a lot less awkward to have them in the database. We still have to agree on a good trigger for the registration, but it doesn't seem to be a tricky issue.
Yeah, an auto-insert if needed is good in theory, though I worry about write contention on the central mapping table. If no write locks are kept in the common case of no insertion needed, then I think the ideas proposed should work.
What we still need to figure out is how to solve the chicken-and-egg situation with Multi-Content-Rev. At the moment, I'm thinking this might work:
- introduce content model (and format) registry in the DB, and populate it.
- leave page and revision table as they are for now.
- introduce slots table, use the new content_model (and content_format) table.
- stop using the content model (and format) from the page and revision tables
- drop the content model (and format) from the page and revision tables
Does that sound like a good plan? Let's for a moment assume we can get slots fully rolled out by the end of the year.
This sounds good to me - lets us introduce a more space efficient model mapping and drop the extra fields from page and rev later.
-- brion
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
Daniel Kinzler wrote:
What we still need to figure out is how to solve the chicken-and-egg situation with Multi-Content-Rev. At the moment, I'm thinking this might work:
- introduce content model (and format) registry in the DB, and populate it.
- leave page and revision table as they are for now.
- introduce slots table, use the new content_model (and content_format) table.
- stop using the content model (and format) from the page and revision tables
- drop the content model (and format) from the page and revision tables
Does that sound like a good plan? Let's for a moment assume we can get slots fully rolled out by the end of the year.
I just read some chatter about slots and multiplexing(?). It seems vaguely interesting, but I don't have enough context or knowledge to understand much of the discussion currently. Is there a request for comments page or some kind of documentation that defines and explains these concepts?
MZMcBride
Am 14.07.2016 um 06:54 schrieb MZMcBride:
I just read some chatter about slots and multiplexing(?). It seems vaguely interesting, but I don't have enough context or knowledge to understand much of the discussion currently. Is there a request for comments page or some kind of documentation that defines and explains these concepts?
Currently, there is only https://phabricator.wikimedia.org/T107595. I plan to move it to a wiki page and update it with the current draft and open questions from the various discussions.
On Mon, Jul 11, 2016 at 8:07 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de> wrote:
There's another fix: never write null into rev_content_model. Always put the actual model ID there. That's pretty wasteful, but it's robust and reliable.
This. We probably would have done this a long time ago, except it's blocked on T105652 so it won't significantly expand the size of the revision table, and you blocked that on T107595.
Both your (a) and (b) are wrong in some cases. Until we really fix it, we should probably just stick with the current (b) instead of dealing with the hassle of switching between one bad option and another.
On 07/11/2016 10:10 AM, Brad Jorsch (Anomie) wrote:
On Mon, Jul 11, 2016 at 8:07 AM, Daniel Kinzler <daniel.kinzler@wikimedia.de
wrote:
There's another fix: never write null into rev_content_model. Always put the actual model ID there. That's pretty wasteful, but it's robust and reliable.
This. We probably would have done this a long time ago, except it's blocked on T105652 (so it won't significantly expand the size of the revision table), and you blocked that on T107595.
Both your (a) and (b) are wrong in some cases. Until we really fix it, we should probably just stick with the current (b) instead of dealing with the hassle of switching between one bad option and another.
Yes, I think we should leave it as is until it's stored explicitly for every revision. I don't have a strong opinion between "implement Legoktm's RFC now" and "implement the mapping tables, then the slots RFC".
It's hard to predict how long slots will take, but if it's going to happen promptly that would probably make more sense.
Matt
On Monday, July 11, 2016, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
(a) "the current page content model, as recorded in page_content_model"
or should it mean
(b) "the default for this title, no matter what page_content_model says"?
Kunal and I have had an unintentional edit war about this question in Revision.php:
Kunal changed it from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/222043/
I later changed it from (b) to (a) in https://gerrit.wikimedia.org/r/#/c/297787/
Kunal reverted me from (a) to (b) in https://gerrit.wikimedia.org/r/#/c/298239/
So, which way do we want it?
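The two interpretations can be sketched as a small resolution function. This is a hypothetical illustration, not MediaWiki's actual code; the function and parameter names are assumptions:

```python
def resolve_model(rev_model, page_model, title_default, interpretation):
    """Resolve a revision's effective content model when
    rev_content_model is NULL, under interpretation (a) or (b)."""
    if rev_model is not None:
        return rev_model  # an explicit model always wins
    # (a): fall back to the page's recorded model (page_content_model).
    # (b): fall back to the title's current default, ignoring page_content_model.
    return page_model if interpretation == "a" else title_default

# Use case III: an extension changed the namespace default to
# "wikibase-item", but this old page was created as wikitext.
print(resolve_model(None, "wikitext", "wikibase-item", "a"))  # wikitext
print(resolve_model(None, "wikitext", "wikibase-item", "b"))  # wikibase-item
```

Under (a) the old page stays readable; under (b) it would be handed to the wrong content handler.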
The conflict seems to arise from (at least) three competing use cases:
I) re-interpreting page content. For instance, a user may move a misnamed User:Foo.jss to User:Foo.js. In this case, the content should be re-interpreted as JavaScript, including all old revisions. This would be in favor of behavior (a), though it still works with (b), because the default model changes based on the suffix ".js". I think it would however be better to only rely on title parsing magic once, when creating the page, not later, when rendering old revisions.
II) converting page content. For instance, if a talk page gets converted to using Flow, new revisions (and page_content_model) will have the Flow model, while old revisions need to keep their original wikitext model (even though their rev_content_model is null). That would need behavior (b).
III) changing a namespace's default content model. E.g. when installing an extension that changes the default content model of a namespace (such as Wikibase with Items in the main namespace, or Flow-per-default for Talk pages), existing pages that were already in that namespace should still be readable. With (b), this would fail: even though page_content_model has the correct model for reading the page, rev_content_model is null, so the new namespace default is used, which will fail. With (a), this would simply work: the page will be rendered according to page_content_model.
In all cases it's possible to resolve the issue by replacing the NULL entries for all revisions of a page with the current model id. The question is just when and how we do that, and when and how we can even detect that this needs doing.
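Such a backfill could be a single correlated UPDATE, run before anything changes page_content_model. A minimal sketch using SQLite, with simplified stand-ins for the page and revision tables (the real tables have many more columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_content_model TEXT)")
cur.execute("""CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    rev_page INTEGER,
    rev_content_model TEXT)""")
cur.execute("INSERT INTO page VALUES (1, 'wikitext')")
cur.executemany("INSERT INTO revision VALUES (?, 1, NULL)", [(1,), (2,), (3,)])

# Backfill: replace each NULL with the owning page's current model,
# making every revision's model explicit.
cur.execute("""
    UPDATE revision
    SET rev_content_model = (SELECT page_content_model
                             FROM page WHERE page_id = rev_page)
    WHERE rev_content_model IS NULL
""")
models = [r[0] for r in cur.execute("SELECT rev_content_model FROM revision")]
print(models)  # ['wikitext', 'wikitext', 'wikitext']
```

The ordering matters: once page_content_model has been changed (e.g. by a Flow conversion), the information needed to backfill old revisions correctly is gone.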
There is also an in-between option, let's call it a/b: fall back to page_content_model for the latest revision (that should *always* be right), but ignore page_content_model for older revisions. That would cater to use case III at least in so far as it would be possible to view the "misplaced" pages. But viewing old revisions or diffs would still fail with a nasty error. This option may look better on the surface, but I fear it will just add to the confusion.
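The a/b hybrid can be sketched the same way as the other two interpretations; again, a hypothetical illustration with assumed names, not actual MediaWiki code:

```python
def resolve_hybrid(rev_model, page_model, title_default, is_latest):
    """Option a/b: NULL falls back to page_content_model only for the
    latest revision; for older revisions it falls back to the title's
    current default."""
    if rev_model is not None:
        return rev_model
    return page_model if is_latest else title_default

# After a namespace default change ("wikitext" -> "wikibase-item"):
print(resolve_hybrid(None, "wikitext", "wikibase-item", True))   # wikitext
print(resolve_hybrid(None, "wikitext", "wikibase-item", False))  # wikibase-item
```

This shows the failure mode described above: the current revision renders correctly, but an old revision (or a diff against one) resolves to the wrong model and breaks.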
There's another fix: never write null into rev_content_model. Always put the actual model ID there. That's pretty wasteful, but it's robust and reliable.
When we decided to use null as a placeholder for the default, we assumed the default would never change. But as we now see, it sometimes does...
So, what should it be, option (a) or (b)? And how do we address the use case that is then broken? What should we write into rev_content_model in the future?
I personally think that option (a) makes more sense, because the resolution of defaults is then local to the database. It could even be done within the SQL query. It's easier to maintain consistency that way. For use case II, that would require us to "fill in" all the rev_content_model fields in old revisions when converting a page. I think it would be a good thing to do that. If we have the content model change between revisions, it seems prudent to record it explicitly.
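"Within the SQL query" would presumably amount to a COALESCE over the join of revision and page. A sketch with SQLite and simplified stand-in tables (illustrative only, not MediaWiki's actual query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_content_model TEXT)")
cur.execute("""CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    rev_page INTEGER,
    rev_content_model TEXT)""")
cur.execute("INSERT INTO page VALUES (1, 'flow-board')")
cur.execute("INSERT INTO revision VALUES (1, 1, 'wikitext')")  # explicit model kept
cur.execute("INSERT INTO revision VALUES (2, 1, NULL)")        # NULL -> page model

# Interpretation (a) as a single query: the NULL resolves to
# page_content_model, with no config lookup needed.
rows = cur.execute("""
    SELECT rev_id, COALESCE(rev_content_model, page_content_model)
    FROM revision JOIN page ON rev_page = page_id
    ORDER BY rev_id
""").fetchall()
print(rows)  # [(1, 'wikitext'), (2, 'flow-board')]
```

Interpretation (b), by contrast, cannot be expressed this way, since the title default lives in global configuration rather than in the database.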
-- Daniel Kinzler Senior Software Developer
Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.
To me, (b) makes more sense, as all the other fields in page represent the info for the current revision. Additionally, all the fields in revision (except rev_deleted) are immutable and never change, and definitely don't change interpretation based on other DB fields. Having old revisions have a dependency on the page table (especially a dependency going in the direction revision->page) seems wrong to me.
-- bawolff
On 11.07.2016 at 17:43, Brian Wolff wrote:
To me, (b) makes more sense, as all the other fields in page represent the info for the current revision. Additionally, all the fields in revision (except rev_deleted) are immutable and never change, and definitely don't change interpretation based on other DB fields. Having old revisions have a dependency on the page table (especially a dependency going in the direction revision->page) seems wrong to me.
The question is whether you want the interpretation of that field to depend on another database field related to the same page, or on global configuration. Both seem wrong, but depending on config seems worse: in the cases where it happens, there is no way to fix it. A database field can at least be updated.
On 11.07.2016 at 16:10, Brad Jorsch (Anomie) wrote:
Both your (a) and (b) are wrong in some cases. Until we really fix it, we should probably just stick with the current (b) instead of dealing with the hassle of switching between one bad option and another.
Yea, I agree that it's generally better to stick with the evil you know. But then, if one kind of wrongness has a lot more impact than the other, that may tip the scale the other way...
But in any case, it seems we have to just fix the data in the database to get around the issue. My problem is now, when installing Wikibase, how do we detect which revisions need rewriting?
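One possible detection heuristic (an assumption on my part, not an established procedure) is to look for revisions whose NULL would resolve differently after the install: NULL rev_content_model in the affected namespace where page_content_model disagrees with the incoming default. Sketched with SQLite and simplified tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_content_model TEXT)""")
cur.execute("""CREATE TABLE revision (
    rev_id INTEGER PRIMARY KEY,
    rev_page INTEGER,
    rev_content_model TEXT)""")
# A wikitext page in the main namespace (0), predating Wikibase.
cur.execute("INSERT INTO page VALUES (1, 0, 'wikitext')")
cur.executemany("INSERT INTO revision VALUES (?, 1, NULL)", [(1,), (2,)])

new_default = "wikibase-item"  # the default Wikibase would impose on ns 0
to_fix = [r[0] for r in cur.execute("""
    SELECT rev_id FROM revision JOIN page ON rev_page = page_id
    WHERE page_namespace = 0
      AND rev_content_model IS NULL
      AND page_content_model != ?
""", (new_default,))]
print(to_fix)  # [1, 2]
```

The catch is that this must run before (or atomically with) the configuration change, since afterwards nothing in the database records what the old default was.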
On Monday, July 11, 2016, Daniel Kinzler daniel.kinzler@wikimedia.de wrote:
On 11.07.2016 at 17:43, Brian Wolff wrote:
To me, (b) makes more sense, as all the other fields in page represent the info for the current revision. Additionally, all the fields in revision (except rev_deleted) are immutable and never change, and definitely don't change interpretation based on other DB fields. Having old revisions have a dependency on the page table (especially a dependency going in the direction revision->page) seems wrong to me.
The question is whether you want the interpretation of that field to depend on another database field related to the same page, or on global configuration. Both seem wrong, but depending on config seems worse: in the cases where it happens, there is no way to fix it. A database field can at least be updated.
I guess this ultimately comes down to fairly arbitrary opinions, but I do actually think making the interpretation dependency graph of the database that convoluted is a bigger evil than depending on global config.
-- bawolff
Hi!
It seems there is disagreement about what the correct interpretation of NULL in the rev_content_model column is. Should NULL there mean
(a) "the current page content model, as recorded in page_content_model"
or should it mean
(b) "the default for this title, no matter what page_content_model says"?
As I understand, NULL is there as a space-saving measure. So I guess we want to ask ourselves if we want to go to so much trouble to save space...
Abstractly, a) looks better than b) to me, since the scenario where the default changed and all pages using the default are now broken is avoided there. OTOH, if the pages are updated together with the default, that must have caused page_content_model to update too, so in this case a) should work too.
There is also an in-between option, let's call it a/b: fall back to page_content_model for the latest revision (that should *always* be right), but to ignore page_content_model for older revisions. That would cater to use case
This may be even better, since the page record is supposed to match the latest revision, but not prior revisions. That still leaves prior revisions broken in case of a default change, but at least the current one isn't.