On interface stability and forward compatibility


Daniel Kinzler

5 Feb 2016

7:10 p.m.

Hi all!

In the context of introducing the new "math" and "external-id" data types, the question came up whether this introduction constitutes a breaking change to the data model. The answer to this depends on whether you take the "English" or the "German" approach to interpreting the format: According to https://en.wikipedia.org/wiki/Everything_which_is_not_forbidden_is_allowed, in England, "everything which is not forbidden is allowed", while, in Germany, the opposite applies, so "everything which is not allowed is forbidden".

In my mind, the advantage of formats like JSON, XML and RDF is that they provide good discovery by eyeballing, and that they use a mix-and-match approach. In this context, I favour the English approach: anything not explicitly forbidden in the JSON or RDF is allowed.

So I think clients should be written in a forward-compatible way: they should handle unknown constructs or values gracefully.

In this vein, I would like to propose a few guiding principles for the design of client libraries that consume Wikibase RDF and particularly JSON output:

* When encountering an unknown structure, such as an unexpected key in a JSON encoded object, the consumer SHOULD skip that structure. Depending on context and use case, a warning MAY be issued to alert the user that some part of the data was not processed.

* When encountering a malformed structure, such as missing a required key in a JSON encoded object, the consumer MAY skip that structure, but then a warning MUST be issued to alert the user that some part of the data was not processed. If the structure is not skipped, the consumer MUST fail with a fatal error.

* Clients MUST clearly distinguish between data types and value types: a Snak's data type determines the interpretation of the value, while the type of the Snak's data value (the value type) specifies the structure of the value representation.

* Clients SHOULD be able to process a Snak about a Property of unknown data type, as long as the value type is known. In such a case, the client SHOULD fall back to the behaviour defined for the value type. If this is not possible, the Snak MUST be skipped and a warning SHOULD be issued to alert the user that some part of the data could not be interpreted.

* When encountering an unknown type of data value (value type), the client MUST either ignore the respective Snak, or fail with a fatal error. A warning SHOULD be issued to alert the user that some part of the data could not be processed.
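As a rough illustration of the five points above, here is a minimal Python sketch of a Snak reader. It assumes the usual Wikibase JSON keys ("snaktype", "property", "datatype", "datavalue"); the handler tables, the KNOWN_KEYS set and the read_snak function are hypothetical names made up for this example, not part of any existing client library.

```python
import logging

log = logging.getLogger("snak_reader")

# Keys this hypothetical client knows about; anything else is "unknown structure".
KNOWN_KEYS = {"snaktype", "property", "datatype", "datavalue", "hash"}

# Hypothetical handlers keyed by value type (the structure of the value).
VALUE_TYPE_HANDLERS = {
    "string": lambda v: v,
    "quantity": lambda v: v["amount"],
}

# Hypothetical handlers keyed by data type (the interpretation of the value).
DATA_TYPE_HANDLERS = {
    "url": lambda v: "<" + v + ">",
}


def read_snak(snak):
    """Return (property, rendered value), or None if the snak was skipped."""
    # Unknown structure: skip it; a notice is optional.
    for key in sorted(snak.keys() - KNOWN_KEYS):
        log.info("ignoring unknown key %r", key)

    # Malformed structure: skipping is allowed, but a warning is mandatory.
    if "property" not in snak or "snaktype" not in snak:
        log.warning("skipping malformed snak: %r", snak)
        return None

    # "somevalue" and "novalue" snaks carry no data value.
    if snak["snaktype"] != "value":
        return snak["property"], None

    datavalue = snak.get("datavalue")
    if not isinstance(datavalue, dict) or "type" not in datavalue or "value" not in datavalue:
        log.warning("skipping snak with malformed datavalue: %r", snak)
        return None

    # Unknown value type: ignore the snak (or fail), and warn.
    value_type = datavalue["type"]
    if value_type not in VALUE_TYPE_HANDLERS:
        log.warning("skipping snak with unknown value type %r", value_type)
        return None

    # Unknown data type but known value type: fall back to the value type.
    handler = DATA_TYPE_HANDLERS.get(snak.get("datatype"), VALUE_TYPE_HANDLERS[value_type])
    return snak["property"], handler(datavalue["value"])
```

With this sketch, a Snak of data type "math" or "external-id" is still read through the "string" value type handler, even though the client has never heard of those data types.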

Do you think these guidelines are reasonable? It seems to me that adopting them should save everyone some trouble.

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.


Markus Krötzsch

5 Feb 2016

8:24 p.m.

Hi Daniel,

I feel that this tries to evade the real issue by making formal rules about what kind of "breaking" you have to care about. It would be better to define "breaking change" based on its consequences: if important services will stop working, then you should make sure you announce it in time so this will not happen. This requires you to talk to people on this list. I think the whole proposal below is mainly trying to give you some justification to avoid communication with your stakeholders. This is not the way to go.

This said, it is always nice to have some guidelines as to what is likely to change and what isn't. It is probably enough to give some warnings about this ("there might be additional keys in this map in the future" or "there might be additional datatype URIs in the future"). However, this is no recipe to avoid breaking changes. In particular, the guideline to ignore snaks of properties that have no understandable declaration is just codifying a controlled way of failing, not avoiding failure:

* Browsing interfaces (e.g., Reasonator, Miga Class & Property Browser) are expected to show all data to users. If they don't, this is breaking them.

* Query services are expected to use all data. If you do an aggregate query to count all properties on Wikidata, then the number returned will not be incomplete but simply wrong if the service ignores half of the data.

* Editing tools (including bot frameworks) are most heavily affected, since they might create duplicates of statements if they fail to see some of the data following your guideline.
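To make the query-service case concrete, here is a small Python sketch. The count_statements function and the can_read predicate are hypothetical names for this example; the point is only that a consumer which silently skips Snaks returns a lower bound, not the true count, unless it also tracks what it skipped.

```python
def count_statements(snaks, can_read):
    """Count the snaks a client can read, and how many it had to skip."""
    counted = 0
    skipped = 0
    for snak in snaks:
        if can_read(snak):
            counted += 1
        else:
            skipped += 1
    return counted, skipped


# Three snaks, but this hypothetical client only understands "string".
snaks = [{"datatype": "string"}, {"datatype": "math"}, {"datatype": "external-id"}]
counted, skipped = count_statements(snaks, lambda s: s["datatype"] == "string")
print(counted, skipped)  # 1 2
```

Reporting only the first number would present 1 as the answer, although the data contains 3 statements.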

This does not mean that your guideline is unreasonable -- in fact, I think this is what most tools are doing anyway. But as the examples show, it's not enough to prevent major service disruptions that would affect many people. The guideline that tools should sometimes raise an alert or issue a warning does not work in many cases, since we have a complex ecosystem with many inter-dependent services (for example, how should a SPARQL Web service communicate problems that occurred when importing the data? All of them, or somehow only the ones that might have affected the query result?).

Our tools rely on being able to use all data, and the easiest way to ensure that they will work is to announce technical changes to the JSON format well in advance using this list. For changes that affect a particular subset of widely used tools, it would also be possible to seek the feedback from the main contributors of these tools at design/development time. I am sure everybody here is trying their best to keep up with whatever changes you implement, but it is not always possible for all of us to sacrifice part of our weekend on short notice for making a new release before next Wednesday.

Cheers,

Markus


Daniel Kinzler

9:22 p.m.

On 05.02.2016 at 14:24, Markus Krötzsch wrote:

> I feel that this tries to evade the real issue by making formal rules about what kind of "breaking" you have to care about. It would be better to define "breaking change" based on its consequences: if important services will stop working, then you should make sure you announce it in time so this will not happen. This requires you to talk to people on this list. I think the whole proposal below is mainly trying to give you some justification to avoid communication with your stakeholders. This is not the way to go.

It's a way to prevent unpleasant surprises, and avoid unnecessary work.

Talking about planned changes early on is certainly good, and we should get more organized at this.

However, I would like to avoid having to treat *any* change like a breaking change. Breaking changes should be communicated a lot earlier, and a lot more carefully, than, say, additions and extensions.

I tried to write down what clients *shouldn't* rely on. As Tom pointed out, these are really general design principles. They are not really specific to Wikibase, except for the "data type vs. value type" thing. Any software processing third party data should follow them.

> how should a SPARQL Web service communicate problems that occurred when importing the data?

By informing whoever maintains the import, by writing to a log file or sending mail. That's the person who can fix the problem. That's who should be informed.
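A minimal sketch of that idea in Python, using the standard logging module. The import_entities function and its load callback are hypothetical; whether the messages end up in a file or in a mail depends only on which handler the maintainer attaches to the logger (for example logging.FileHandler or logging.handlers.SMTPHandler).

```python
import logging
from collections import Counter


def import_entities(entities, load, logger):
    """Load entities one by one; log every skipped entity plus a final summary."""
    problems = Counter()
    loaded = 0
    for entity in entities:
        try:
            load(entity)  # hypothetical callback that raises ValueError on bad data
            loaded += 1
        except ValueError as err:
            # One log line per problem, so the import maintainer can follow up.
            problems[str(err)] += 1
            logger.warning("skipped %s: %s", entity.get("id", "?"), err)
    logger.info("import finished: %d loaded, %d skipped", loaded, sum(problems.values()))
    return loaded, problems
```

The returned counter groups the problems by message, which gives the maintainer a quick overview of what kind of data the import could not handle.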

> Our tools rely on being able to use all data, and the easiest way to ensure that they will work is to announce technical changes to the JSON format well in advance using this list. For changes that affect a particular subset of widely used tools, it would also be possible to seek the feedback from the main contributors of these tools at design/development time.

And we do that for breaking changes. I did not expect additional data types to cause any trouble. After all, you can still ingest the data, since the value type is known. For a long time, our dumps didn't even mention the data type at all.

> I am sure everybody here is trying their best to keep up with whatever changes you implement, but it is not always possible for all of us to sacrifice part of our weekend on short notice for making a new release before next Wednesday.

To avoid this problem in the future, I tried to spell out what guarantees we *don't* give, so that a simple addition doesn't break things horribly.

That doesn't mean we don't plan to communicate such changes at all, or to communicate them better than we did this time. We do. But this kind of thing is clearly distinct from actual "breaking changes" in my mind.


Tom Morris

8:55 p.m.

Sounds a lot like a restatement of Postel's Law

https://en.wikipedia.org/wiki/Robustness_principle

Tom


Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech

Daniel Kinzler

9:09 p.m.

On 05.02.2016 at 14:55, Tom Morris wrote:

> Sounds a lot like a restatement of Postel's Law
>
> https://en.wikipedia.org/wiki/Robustness_principle

Yes indeed: "Be conservative in what you send, be liberal in what you accept"

