Hi,
It seems that some changes have been made to the JSON serialization recently:
https://github.com/Wikidata/Wikidata-Toolkit/issues/237
Could somebody from the dev team please comment on this? Is this going to be in the dumps as well or just in the API? Are further changes coming up? Are we ever going to get email notifications of API changes implemented by the team, rather than having to fix the damage after the fact?
Markus
Hey Markus,
Sorry. You are right in that I should have announced the addition. It slipped through. As we've said before, we don't consider adding fields a breaking change. Nonetheless, I should have announced it.
For your particular use case, the addition actually seems useful, because the entity ID no longer needs to be created from the entity type and the numeric ID field. https://stackoverflow.com/questions/5455014/ignoring-new-fields-on-json-obje... might also be helpful.
Cheers Lydia
On 04.08.2016 11:45, Lydia Pintscher wrote:
For your particular use case, the addition actually seems useful, because the entity ID no longer needs to be created from the entity type and the numeric ID field. https://stackoverflow.com/questions/5455014/ignoring-new-fields-on-json-obje... might also be helpful.
Well, I know how to fix it. I just need a week or so of time to implement and release the fix (not because it takes so long, but because I need to find a slot of time to do it).
Markus
Hi Markus!
I would like to elaborate a little on what Lydia said.
On 04.08.2016 at 09:27, Markus Kroetzsch wrote:
It seems that some changes have been made to the JSON serialization recently:
This specific change has been announced in our JSON spec for as long as that document has existed. https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON#wikibase-entityid says:
WARNING: wikibase-entityid may in the future change to be represented as a single string literal, or may even be dropped in favor of using the string value type to reference entities.
NOTE: There is currently no reliable mechanism for clients to generate a prefixed ID or a URL from the information in the data value.
That was the problem: with the old format, all clients needed a hard-coded mapping from entity types to ID prefixes in order to construct ID strings from the JSON serialization of ID values. That meant no new entity types could be added without breaking clients. This has now been fixed.
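To illustrate (a minimal sketch of mine, not normative client code), here is what the hard-coded reconstruction looks like with Jackson's tree model, and how the added "id" field avoids it:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class EntityIdExample {
        public static void main(String[] args) throws Exception {
            JsonNode value = new ObjectMapper().readTree(
                "{\"entity-type\":\"item\",\"numeric-id\":42,\"id\":\"Q42\"}");

            // Old approach: a hard-coded mapping from entity type to ID
            // prefix, which breaks as soon as an unknown type shows up.
            String prefix;
            switch (value.get("entity-type").asText()) {
                case "item":     prefix = "Q"; break;
                case "property": prefix = "P"; break;
                default: throw new IllegalArgumentException("unknown entity type");
            }
            String reconstructedId = prefix + value.get("numeric-id").asLong();

            // New approach: read the added "id" field directly, falling
            // back to the reconstruction only for old serializations.
            String id = value.has("id") ? value.get("id").asText() : reconstructedId;
            System.out.println(id); // prints Q42
        }
    }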
Of course, it would have been good to announce this in advance. However, it is not a breaking change, and we do not plan to treat additions as breaking changes.
Adding something to a public interface is not a breaking change. Adding a method to an API isn't, adding an element to XML isn't, and adding a key to JSON isn't - unless there is a spec that explicitly states otherwise.
These are "mix and match" formats, in which anything that isn't forbidden is allowed. It's the responsibility of the client to accommodate such changes. This is simple best practice - a HTTP client shouldn't choke on header fields it doesn't know, etc. See https://en.wikipedia.org/wiki/Robustness_principle.
If you use a library that is strict about extra data by default, configure it to be more accommodating; see for instance https://stackoverflow.com/questions/14343477/how-do-you-globally-set-jackson-to-ignore-unknown-properties-within-spring.
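With Jackson, for instance, the global switch is a one-liner; a minimal sketch:

    import com.fasterxml.jackson.databind.DeserializationFeature;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class TolerantMapper {
        // A mapper that ignores JSON fields it has no Java binding for,
        // instead of throwing UnrecognizedPropertyException.
        public static final ObjectMapper MAPPER = new ObjectMapper()
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
    }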
Could somebody from the dev team please comment on this? Is this going to be in the dumps as well or just in the API?
Yes, we use the same basic serialization for the API and the dumps. For the future, note that some parts (such as sitelink URLs) are optional, and we plan to add more optional bits (such as normalized quantities) soonish.
Are further changes coming up?
Yes. The next one in the pipeline is Quantities without upperBound and lowerBound, see https://phabricator.wikimedia.org/T115270. That IS a breaking change, and the implementation is thus blocked on announcing it, see https://gerrit.wikimedia.org/r/#/c/302248/.
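Client bindings can prepare for this already by treating the bounds as optional rather than required. A minimal Jackson sketch (the field names follow the JSON spec linked above; the class itself is illustrative):

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;

    @JsonIgnoreProperties(ignoreUnknown = true)
    public class QuantityValue {
        public String amount;      // e.g. "+10.38"
        public String unit;
        // After T115270 these two may be absent; modelling them as
        // nullable references instead of required fields keeps
        // deserialization working before and after the change.
        public String upperBound;
        public String lowerBound;

        public boolean hasBounds() {
            return upperBound != null && lowerBound != null;
        }
    }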
Furthermore, we will probably remove the entity-type and numeric-id fields from the serialization of EntityIdValues eventually. But there is no concrete plan for that at the moment.
When we remove the old fields for ItemId and PropertyId, that IS a breaking change, and will be announced as such.
Are we ever going to get email notifications of API changes implemented by the team, rather than having to fix the damage after the fact?
We aspire to communicate early, and we are sorry we did not announce this change ahead of time.
However, this is not a breaking change by the common understanding of the term, and will not be treated as such. We have argued about that on this list before, see https://www.mail-archive.com/wikidata-tech@lists.wikimedia.org/msg00902.html. I have made it clear back then what we consider a breaking change and what not, and I have advised you that being accommodating in what your client code accepts will avoid headaches in the future.
To make this even more clear, we will enact and document something similar to my email from February as official policy soon. Watch for an announcement on this list.
Daniel,
You present arguments on issues that I would never even bring up. I think we fully agree on many things here. Main points of misunderstanding:
* I was not talking about the WMDE definition of "breaking change". I just meant "a change that breaks things". You can define this term for yourself as you like and I won't argue with this.
* I would never say that it is "right" that things break in this case. It's annoying. However, it is the standard behaviour of widely used JSON parsing libraries. We won't discuss it away.
* I am not arguing that the change as such is bad. I just need to know about it to fix things before they break.
* I am fully aware of many places where my software should be improved, but I cannot fix all of them just to be prepared if a change should eventually happen (if it ever happens). I need to know about the next thing that breaks so I can prioritize this.
* The best way to fix this problem is to annotate all Jackson classes with the respective switch individually (see the sketch after this list). The global approach you linked to requires that all users of the classes implement the fix themselves, which does not work for a library.
* When I asked for announcements, I did not mean information of the type "we plan to add more optional bits soonish". That ancient wiki page of yours, which mentions that some kind of change should happen at some point, is even more vague. It is more helpful to learn about changes once you know what they will look like and when they will happen. My assumption is that this is a "low cost" improvement that is not too much to ask for.
* I did not follow what you want to make an "official policy" for. Software won't behave any differently just because there is a policy saying that it should.
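For the record, the per-class switch I mean is Jackson's standard annotation. A minimal sketch (class and field names are illustrative only):

    import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
    import com.fasterxml.jackson.annotation.JsonProperty;

    // Annotating the class itself means every user of the library gets
    // the tolerant behaviour, without reconfiguring their ObjectMapper.
    @JsonIgnoreProperties(ignoreUnknown = true)
    public class EntityIdValueBean {
        @JsonProperty("id")
        public String id;           // e.g. "Q42"
        @JsonProperty("entity-type")
        public String entityType;
        @JsonProperty("numeric-id")
        public long numericId;
    }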
Markus
Hi Markus!
You are asking us to communicate changes to our serialization better, even when they are not breaking changes according to the spec. I agree we should do that. We are trying to improve our processes to achieve this.
Can we ask you in return to try to make your software more robust, by not making unwarranted assumptions about the serialization format?
With regards to communicating more: it's very hard to tell which changes might break something for someone. For instance, some software might rely on the order of fields in a JSON object, even though JSON says this is unspecified, just like you rely on no fields being added, even though there is no guarantee about this. Similarly, some software might rely on non-ASCII characters being represented as Unicode escape sequences, and will break if we use the more compact UTF-8. Or they may break on whitespace changes. Who knows. We cannot possibly know what kind of change will break some third-party software.
I don't think announcing any and all changes is feasible. So I think an official policy about what we announce can be useful. Something like "This is what we consider a breaking change, and we will definitely announce it. And these are some kinds of changes we will also communicate ahead of time. And these are some things that can happen unannounced."
You are right that policies don't change the behavior of software. But perhaps they can change the behavior of programmers, by telling them what they can (and can't) safely rely on.
It boils down to this: we can try to be more verbose, but if you make assumptions beyond the spec, things will break sooner or later. Writing robust software requires more time and thought initially, but it saves a lot of headaches later.
-- daniel
I side firmly with Markus here.
Consumers of data generally cannot tell whether the addition of a new field to a data encoding is a breaking change or not. Given this, code that consumes encoded data should at least produce warnings when it encounters encodings that it is not expecting and preferably should refuse to produce output in such circumstances. Producers of data thus should signal in advance any changes to the encoding, even if they know that the changes can be safely ignored.
I would view software that consumes Wikidata information and silently ignores fields that it is not expecting as deficient and would counsel against using such software.
Peter F. Patel-Schneider Nuance Communications
PS: JSON is a particularly problematic encoding for data because many aspects of the data that a particular JSON text is meant to encode are left unspecified by the JSON standards.
On 05.08.2016 at 15:02, Peter F. Patel-Schneider wrote:
I side firmly with Markus here.
Consumers of data generally cannot tell whether the addition of a new field to a data encoding is a breaking change or not.
Without additional information, they cannot know, though for "mix and match" formats like JSON and XML, it's common practice to assume that ignoring additions is harmless.
In any case, we had communicated before that we do not consider the addition of a field a breaking change. It only becomes a breaking change when it impacts the interpretation of other fields, in which case we would announce it well in advance.
Given this, code that consumes encoded data should at least produce warnings when it encounters encodings that it is not expecting and preferably should refuse to produce output in such circumstances.
Depends on the circumstances. For a web browser, for example, this would be very annoying behavior. Nearly all websites would be unusable. Similarly, most email would become unreadable if mail clients were that strict.
Producers of data thus should signal in advance any changes to the encoding, even if they know that the changes can be safely ignored.
I disagree on "any". For example, do you want announcements about changes to the order of attributes in XML tags? Why? In case someone uses a regex to process the XML? Should you not be able to rely on your clients conforming the to XML spec, which says that the order of attributes is undefined?
In the case at hand (adding a field), it would have been good to communicate it in advance. But since it wasn't tagged as "breaking", it slipped through. We are sorry for that. Clients should still not choke on an addition like this.
I would view software that consumes Wikidata information and silently ignores fields that it is not expecting as deficient and would counsel against using such software.
Is this just for Wikidata, or does that extend to other kinds of data too? Why, or why not?
By definition, any extensible format or protocol (HTTP, SMTP, HTML, XML, XMPP, IRC, etc) can contain parts (headers, elements, attributes) that the client does not know about, and should ignore. Of course, the spec will tell clients where to expect and allow extra bits. That's why I'm planning to put up a document saying clearly what kinds of changes clients should be prepared to see in Wikidata output:
Clients need to be prepared to encounter entity types and data types they don't know. But they should also allow additional fields in any JSON object. We guarantee that extra fields do not impact the interpretation of fields they know about - unless we have announced and documented a breaking change.
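To illustrate what such a tolerant client can look like, here is a minimal sketch using Jackson's tree model (the class and method names are made up; the default branch is the point):

    import com.fasterxml.jackson.databind.JsonNode;

    public class TolerantValueReader {
        // Extracts a printable value from a Wikibase datavalue node,
        // degrading gracefully instead of failing on unknown types.
        public static String readValue(JsonNode dataValue) {
            JsonNode value = dataValue.path("value");
            switch (dataValue.path("type").asText()) {
                case "string":
                    return value.asText();
                case "wikibase-entityid":
                    return value.path("id").asText();
                default:
                    // Unknown value type: keep the raw JSON so nothing is
                    // silently lost, and let the caller decide what to do.
                    return value.toString();
            }
        }
    }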
On 08/05/2016 06:46 AM, Daniel Kinzler wrote:
On 05.08.2016 at 15:02, Peter F. Patel-Schneider wrote:
I side firmly with Markus here.
Consumers of data generally cannot tell whether the addition of a new field to a data encoding is a breaking change or not.
Without additional information, they cannot know, though for "mix and match" formats like JSON and XML, it's common practice to assume that ignoring additions is harmless.
The assumption that ignoring additions is harmless is a very dangerous practice, even if it is common.
In any case, we had communicated before that we do not consider the addition of a field a breaking change. It only becomes a breaking change when it impacts the interpretation of other fields, in which case we would announce it well in advance.
So some additions are breaking changes then. What is a system that consumes this information supposed to do? If the system doesn't monitor announcements then it has to assume that any new field can be a breaking change and thus should not accept data that has any new fields.
Given this, code that consumes encoded data should at least produce warnings when it encounters encodings that it is not expecting and preferably should refuse to produce output in such circumstances.
Depends on the circumstances. For a web browser, for example, this would be very annoying behavior. Nearly all websites would be unusable. Similarly, most email would become unreadable if mail clients were that strict.
I assume that you are referring to the common practice of adding extra fields in HTTP and email transport and header structures under the assumption that these extra fields will just be passed on to downstream systems and then silently ignored when content is displayed. I view these as special cases where there is at least an implicit contract that no additional field will change the meaning of the existing fields and data. When such contracts are in place systems can indeed expect to see additional fields, and are permitted to ignore these extra fields.
Producers of data thus should signal in advance any changes to the encoding, even if they know that the changes can be safely ignored.
I disagree on "any". For example, do you want announcements about changes to the order of attributes in XML tags?
No.
Why?
Because XML specifically states that the order of attributes is not significant. Therefore changing the order of XML attributes does not change the encoding.
In case someone uses a regex to process the XML? Should you not be able to rely on your clients conforming to the XML spec, which says that the order of attributes is undefined?
Yes indeed. And there would be no problem in changing the order of entities in the JSON dump as this order is deemed to be insignificant in well-behaved JSON texts.
In the case at hand (adding a field), it would have been good to communicate it in advance. But since it wasn't tagged as "breaking", it slipped through. We are sorry for that. Clients should still not choke on an addition like this.
Here is where I disagree. As there is no contract that new fields in the Wikidata JSON dumps are not breaking, clients need to treat all new fields as potentially breaking and thus should not accept data with unknown fields.
I would view software that consumes Wikidata information and silently ignores fields that it is not expecting as deficient and would counsel against using such software.
Is this just for Wikidata, or does that extend to other kinds of data too? Why, or why not?
I say this for any data, except where there is a contract that such additional fields are not meaning-changing.
By definition, any extensible format or protocol (HTTP, SMTP, HTML, XML, XMPP, IRC, etc) can contain parts (headers, elements, attributes) that the client does not know about, and should ignore. Of course, the spec will tell clients where to expect and allow extra bits.
Yes, these standards have explicit wording that there are certain places where additional bits are allowed, and that these additional bits can be safely ignored. Consumers of data in these standards can verify that the data has not been corrupted and then safely ignore extra bits in certain places, because they have a contract that the encoding of the data that they care about is not affected by these extra bits. However, I don't see this contract with respect to the Wikidata JSON encoding.
That's why I'm planning to put up a document saying clearly what kinds of changes clients should be prepared to see in Wikidata output:
Clients need to be prepared to encounter entity types and data types they don't know. But they should also allow additional fields in any JSON object. We guarantee that extra fields do not impact the interpretation of fields they know about - unless we have announced and documented a breaking change.
Is this the contract that is going to be put forward? At some time in the not too distant future I hope that my company will be using Wikidata information in its products. This contract is likely to be problematic for development groups, who want some notion of how long they have to prepare for changes that can silently break their products.
Peter F. Patel-Schneider Nuance Communications
On 05.08.2016 at 17:34, Peter F. Patel-Schneider wrote:
So some additions are breaking changes then. What is a system that consumes this information supposed to do? If the system doesn't monitor announcements then it has to assume that any new field can be a breaking change and thus should not accept data that has any new fields.
The only way to avoid breakage is to monitor announcements. The format is not final, so changes can happen (not just additions, but also removals), and things will break for consumers who are unaware of them. We tend to be careful and conservative, and announce any breaking changes in advance, but we do not guarantee full backwards compatibility forever.
The only alternative is a fully versioned interface, which we don't currently have for JSON, though it has been proposed, see https://phabricator.wikimedia.org/T92961.
I assume that you are referring to the common practice of adding extra fields in HTTP and email transport and header structures under the assumption that these extra fields will just be passed on to downstream systems and then silently ignored when content is displayed.
Indeed.
I view these as special cases where there is at least an implicit contract that no additional field will change the meaning of the existing fields and data.
In the name of the Robustness Principle, I would consider this the normal case, not the exception.
When such contracts are in place systems can indeed expect to see additional fields, and are permitted to ignore these extra fields.
Does this count? https://mail-archive.com/wikidata-tech@lists.wikimedia.org/msg00902.html
Because XML specifically states that the order of attributes is not significant. Therefore changing the order of XML attributes does not change the encoding.
That's why I'm proposing to formalize the same kind of contract for us, see https://phabricator.wikimedia.org/T142084.
Here is where I disagree. As there is no contract that new fields in the Wikidata JSON dumps are not breaking, clients need to treat all new fields as potentially breaking and thus should not accept data with unknown fields.
While you are correct that there is no formal contract yet, the topic had been explicitly discussed before, in particular with Markus.
I say this for any data, except where there is a contract that such additional fields are not meaning-changing.
Quote me on it:
For Wikibase serializations, additional fields are not meaning-changing. Changes to the format or interpretation of fields will be announced as a breaking change.
Clients need to be prepared to encounter entity types and data types they don't know. But they should also allow additional fields in any JSON object. We guarantee that extra fields do not impact the interpretation of fields they know about - unless we have announced and documented a breaking change.
Is this the contract that is going to be put forward? At some time in the not too distant future I hope that my company will be using Wikidata information in its products. This contract is likely to be problematic for development groups, who want some notion of how long they have to prepare for changes that can silently break their products.
This is indeed the gist of what I want to establish as a stability policy. Please comment on https://phabricator.wikimedia.org/T142084.
I'm not sure how this could be made less problematic. Even with a fully versioned JSON interface, the available data types etc. are a matter of configuration. All we can do is announce such changes, and advise consumers that they can safely ignore unknown things.
You raise a valid point about due notice. What do you think would be a good notice period? Two weeks? A month?
On 08/05/2016 08:57 AM, Daniel Kinzler wrote:
Am 05.08.2016 um 17:34 schrieb Peter F. Patel-Schneider:
So some additions are breaking changes then. What is a system that consumes this information supposed to do? If the system doesn't monitor announcements then it has to assume that any new field can be a breaking change and thus should not accept data that has any new fields.
The only way to avoid breakage is to monitor announcements. The format is not final, so changes can happen (not just additions, but also removals), and things will break for consumers who are unaware of them. We tend to be careful and conservative, and announce any breaking changes in advance, but we do not guarantee full backwards compatibility forever.
The only alternative is a fully versioned interface, which we don't currently have for JSON, though it has been proposed, see https://phabricator.wikimedia.org/T92961.
I assume that you are referring to the common practice of adding extra fields in HTTP and email transport and header structures under the assumption that these extra fields will just be passed on to downstream systems and then silently ignored when content is displayed.
Indeed.
I view these as special cases where there is at least an implicit contract that no additional field will change the meaning of the existing fields and data.
In the name of the Robustness Principle, I would consider this the normal case, not the exception.
When such contracts are in place systems can indeed expect to see additional fields, and are permitted to ignore these extra fields.
Does this count? https://mail-archive.com/wikidata-tech@lists.wikimedia.org/msg00902.html
This email message is not a contract about how the Wikidata JSON data format can change. It instead describes how consumers of that (and other) data are supposed to act. My view is that without guarantees of what sort of changes will be made to the Wikidata JSON data format, these are dangerous behaviours for its consumers.
Because XML specifically states that the order of attributes is not significant. Therefore changing the order of XML attributes does not change the encoding.
That's why I'm proposing to formalize the same kind of contract for us, see https://phabricator.wikimedia.org/T142084.
This contract guarantees that new fields will not change the interpretation of pre-existing ones, which is a strong guarantee, but I don't see where it guarantees that the meaning of entire structures will not change, which leaves the contract very weak.
Consider the rank field. This doesn't change the interpretation of existing fields. However, it changes how the entire claim is to be considered.
Here is where I disagree. As there is no contract that new fields in the Wikidata JSON dumps are not breaking, clients need to treat all new fields as potentially breaking and thus should not accept data with unknown fields.
While you are correct that there is no formal contract yet, the topic had been explicitly discussed before, in particular with Markus.
I say this for any data, except where there is a contract that such additional fields are not meaning-changing.
Quote me on it:
For Wikibase serializations, additional fields are not meaning-changing. Changes to the format or interpretation of fields will be announced as a breaking change.
Clients need to be prepared to encounter entity types and data types they don't know. But they should also allow additional fields in any JSON object. We guarantee that extra fields do not impact the interpretation of fields they know about - unless we have announced and documented a breaking change.
Is this the contract that is going to be put forward? At some time in the not too distant future I hope that my company will be using Wikidata information in its products. This contract is likely to be problematic for development groups, who want some notion of how long they have to prepare for changes that can silently break their products.
This is indeed the gist of what I want to establish as a stability policy. Please comment on https://phabricator.wikimedia.org/T142084.
I'm not sure how this could be made less problematic. Even with a fully versioned JSON interface, the available data types etc. are a matter of configuration. All we can do is announce such changes, and advise consumers that they can safely ignore unknown things.
You raise a valid point about due notice. What do you think would be a good notice period? Two weeks? A month?
Human-only due notice can only be a part of a well-behaved software ecosystem. Software ends up being used in places separated from its initial developers, indeed from any developer. Requiring software to silently accept breaking additions means that breaking additions will usually break something, even with a long notice period.
There can also be no fixed notice period. Sometimes software can be changed and re-deployed in a day or a week. Often, however, change and re-deployment can take several months. Right now I would be leery of a two-week notice period, as it is entirely possible that this would fall within a vacation period for a group.
Peter F. Patel-Schneider Nuance Communications
Hi!
Consumers of data generally cannot tell whether the addition of a new field to a data encoding is a breaking change or not. Given this, code that consumes encoded data should at least produce warnings when it encounters encodings that it is not expecting and preferably should refuse to produce output in such circumstances. Producers of data thus should signal in advance any changes to the encoding, even if they know that the changes can be safely ignored.
I don't think this approach is always warranted. In some cases, yes, but when you are importing data from an external system using a generic data exchange format like JSON, I don't think it is. It will only lead to software being more brittle, without any additional benefit to the user. Formats like JSON make it easy to accommodate backwards-compatible incremental change, so there's no reason not to take advantage of that.
I would view software that consumes Wikidata information and silently ignores fields that it is not expecting as deficient and would counsel against using such software.
I think this approach is way too restrictive. Wikidata is a database that does not have a fixed schema, and even its underlying data representations are not yet fixed, and probably won't be completely fixed for a long time. Having software break each time a field is added would lead to software that breaks often and does not serve its users well. You also need to consider that Wikidata is a huge database with a very wide mission, and many users may not be interested in all the details of the data representation, but only in some aspects of it. Having the software refuse to operate on the data that is relevant to the user because some part that is not relevant to the user changed does not look like the best approach to me.
For Wikidata specifically, I think a better approach would be to ignore fields, types and other structures that are not known to the software, provided that the ones that are known do not change their semantics with additions - and I understand that's the promise from Wikidata (at least excepting cases of specially announced BC-breaking changes). Maybe inform the user that some information is not understood and thus may not be available, but do not refuse to function completely.
My view is that any tool that imports external data has to be very cautious about additions to the format of that data absent strong guarantees about the effects of these additions.
Consider a tool that imports the Wikidata JSON dump, extracts base facts from the dump, and outputs these facts in some other format (perhaps in RDF, but it doesn't really matter what format). This tool fits into the "importing data from [an] external system using a generic exchange format".
My view is that this tool should be extremely cautious when it sees new data structures or fields. The tool should certainly not continue to output facts without some indication that something is suspect, and preferably should refuse to produce output under these circumstances.
What can happen if the tool instead continues to operate without complaint when new data structures are seen? Consider what would happen if the tool was written for a version of Wikidata that didn't have rank, i.e., claim objects did not have a rank name/value pair. If ranks were then added, consumers of the output of the tool would have no way of distinguishing deprecated information from other information.
Of course this is an extreme case. Most changes to the Wikidata JSON dump format will not cause such severe problems. However, given the current situation with how the Wikidata JSON dump format can change, the tool cannot determine whether any particular change will affect the meaning of what it produces. Under these circumstances it is dangerous for a tool that extracts information from the Wikidata JSON dump to continue to produce output when it sees new data structures.
This does make consuming tools sensitive to changes to the Wikidata JSON dump format that are "non-breaking". To overcome this problem there should be a way for tools to distinguish changes to the Wikidata JSON dump format that do not change the meaning of existing constructs in the dump from those that can. Consuming tools can then continue to function without problems for the former kind of change.
Human-only signalling, e.g., an announcement on some web page, is not adequate because there is no guarantee that consuming tools will be changed in response.
Peter F. Patel-Schneider Nuance Communications
Hi!
My view is that this tool should be extremely cautious when it sees new data structures or fields. The tool should certainly not continue to output facts without some indication that something is suspect, and preferably should refuse to produce output under these circumstances.
I don't think I agree. I find tools that are too picky about details that are not important to me hard to use, and I'd very much prefer a tool where I am in control of which information I need and which I don't need.
What can happen if the tool instead continues to operate without complaint when new data structures are seen? Consider what would happen if the tool was written for a version of Wikidata that didn't have rank, i.e., claim objects did not have a rank name/value pair. If ranks were then added, consumers of the output of the tool would have no way of distinguishing deprecated information from other information.
Ranks are a bit unusual because they are not just an informational change; they are a semantic change. They introduce the concept of a statement that has different semantics than the rest. Of course, such a change needs to be communicated - it's as if I made the format change "each string beginning with the letter X needs to be read backwards" but didn't tell the clients. Of course this is a breaking change if it changes semantics.
What I was talking about are changes that don't break semantics, and the majority of additions are just that.
Of course this is an extreme case. Most changes to the Wikidata JSON dump format will not cause such severe problems. However, given the current situation with how the Wikidata JSON dump format can change, the tool cannot determine whether any particular change will affect the meaning of what it produces. Under these circumstances it is dangerous for a tool that extracts information from the Wikidata JSON dump to continue to produce output when it sees new data structures.
The tool cannot. It's not possible to write a tool that would derive semantics just from the JSON dump, or even detect semantic changes. Semantic changes can be anywhere; they don't have to be an additional field - they can take the form of a change to the meaning of a field, or its format, or its datatype, etc. Of course the tool cannot know that - people should know that and communicate it. Again, that's why I think we need to distinguish changes that break semantics from changes that don't, and make the tools robust against the latter - but not the former, because that's impossible. For dealing with the former, there is a known and widely used solution: format versioning.
This does make consuming tools sensitive to changes to the Wikidata JSON dump format that are "non-breaking". To overcome this problem there should be a way for tools to distinguish changes to the Wikidata JSON dump format that do not change the meaning of existing constructs in the dump from those that can. Consuming tools can then continue to function without problems for the former kind of change.
As I said, format versioning. Maybe even semver, or some suitable modification of it. RDF exports, by the way, already carry a version. Maybe JSON exports should too.
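To illustrate, a minimal sketch of the check such versioning would enable, assuming a semver-style version string were added to the dump (the field and class names here are hypothetical):

    public class FormatVersionCheck {
        // Newest major version this consumer was written and tested for.
        private static final int SUPPORTED_MAJOR = 1;

        // Semver-style rule: minor/patch bumps (e.g. 1.3.0) only add
        // ignorable things; a major bump (e.g. 2.0.0) may change meaning
        // and requires an updated consumer.
        public static void requireCompatible(String formatVersion) {
            int major = Integer.parseInt(formatVersion.split("\\.")[0]);
            if (major > SUPPORTED_MAJOR) {
                throw new IllegalStateException("dump format " + formatVersion
                    + " is newer than supported major version " + SUPPORTED_MAJOR
                    + "; refusing to risk silently misreading the data");
            }
        }
    }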
On 08/11/2016 01:35 PM, Stas Malyshev wrote:
Hi!
My view is that this tool should be extremely cautious when it sees new data structures or fields. The tool should certainly not continue to output facts without some indication that something is suspect, and preferably should refuse to produce output under these circumstances.
I don't think I agree. I find tools that are too picky about details that are not important to me hard to use, and I'd very much prefer a tool where I am in control of which information I need and which I don't need.
My point is that the tool has no way of determining what is important and what is not important, at least under the current state of affairs with respect to the Wikidata JSON dump. Given this, a tool that ignores what could easily be an important change is a dangerous tool.
What can happen if the tool instead continues to operate without complaint when new data structures are seen? Consider what would happen if the tool was written for a version of Wikidata that didn't have rank, i.e., claim objects did not have a rank name/value pair. If ranks were then added, consumers of the output of the tool would have no way of distinguishing deprecated information from other information.
Ranks are a bit unusual because they are not just an informational change; they are a semantic change. They introduce the concept of a statement that has different semantics than the rest. Of course, such a change needs to be communicated - it's as if I made the format change "each string beginning with the letter X needs to be read backwards" but didn't tell the clients. Of course this is a breaking change if it changes semantics.
What I was talking about are changes that don't break semantics, and the majority of additions are just that.
Yes, the majority of changes are not of this sort, but tools currently can't determine which changes are of this sort and which are not.
Of course this is an extreme case. Most changes to the Wikidata JSON dump format will not cause such severe problems. However, given the current situation with how the Wikidata JSON dump format can change, the tool cannot determine whether any particular change will affect the meaning of what it produces. Under these circumstances it is dangerous for a tool that extracts information from the Wikidata JSON dump to continue to produce output when it sees new data structures.
The tool cannot. It's not possible to write a tool that would derive semantics just from the JSON dump, or even detect semantic changes. Semantic changes can be anywhere; they don't have to be an additional field - they can take the form of a change to the meaning of a field, or its format, or its datatype, etc. Of course the tool cannot know that - people should know that and communicate it. Again, that's why I think we need to distinguish changes that break semantics from changes that don't, and make the tools robust against the latter - but not the former, because that's impossible. For dealing with the former, there is a known and widely used solution: format versioning.
Yes, if a suitable versioning contract were implemented, then things would change dramatically. Tools could depend on "breaking" changes always being accompanied by a version bump, and they might then be able to ignore new fields whenever the version does not change. However, this is not the current state of affairs with the Wikidata JSON dump format.
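A sketch of what such a contract could look like on the consuming side, assuming a hypothetical top-level "formatversion" field with semver-like semantics (no such field exists in the dumps today): the tool refuses dumps whose major version it does not know, and within a known major version it may safely ignore additions.

    KNOWN_MAJOR = 1  # the major format version this tool was written against

    def check_dump_version(header: dict) -> None:
        """Refuse to run on a dump whose breaking-change counter is unknown."""
        version = header.get("formatversion")  # hypothetical field
        if version is None:
            raise RuntimeError("dump carries no format version; cannot assess safety")
        major = int(version.split(".")[0])
        if major != KNOWN_MAJOR:
            raise RuntimeError(f"dump format {version} signals breaking changes; "
                               f"this tool only knows major version {KNOWN_MAJOR}")
        # Same major version: by contract, additions are non-breaking,
        # so unknown fields may safely be ignored.

    check_dump_version({"formatversion": "1.3"})  # fine: additive change only
    try:
        check_dump_version({"formatversion": "2.0"})  # breaking change
    except RuntimeError as err:
        print(err)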
This does make consuming tools sensitive to "non-breaking" changes to the Wikidata JSON dump format. To overcome this problem, there should be a way for tools to distinguish changes that do not alter the meaning of existing constructs in the dump from those that do. Consuming tools could then continue to function without problems across the former kind of change.
As I said: format versioning. Maybe even semver, or some suitable modification of it. The RDF exports, by the way, already carry a version; maybe the JSON exports should too.
Right. I'm all for version information being added to the Wikidata JSON dump format. It would make the production use of these dumps much safer.
Until suitable versioning is part of the Wikidata JSON dump format and contract, however, I don't think that consumers of the dumps should just ignore new fields.
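Until then, the cautious behaviour could look like the following sketch: the tool carries a whitelist of the field names it has been reviewed against (abbreviated here; "somenewfield" is invented) and refuses to emit output when the dump contains anything else:

    # Abbreviated whitelist; a real tool would enumerate every field
    # named in the JSON spec it was reviewed against.
    KNOWN_FIELDS = {"id", "type", "labels", "descriptions", "aliases",
                    "claims", "sitelinks", "lastrevid", "modified"}

    def check_entity(record: dict) -> None:
        """Fail loudly rather than silently process unvetted structures."""
        unknown = set(record) - KNOWN_FIELDS
        if unknown:
            raise RuntimeError(
                f"unreviewed field(s) {sorted(unknown)} in {record.get('id')}; "
                "refusing to produce possibly misleading output")

    check_entity({"id": "Q42", "type": "item", "claims": {}})  # passes
    try:
        check_entity({"id": "Q42", "type": "item", "somenewfield": []})
    except RuntimeError as err:
        print(err)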
Peter F. Patel-Schneider
Nuance Communications
Am 11.08.2016 um 23:12 schrieb Peter F. Patel-Schneider:
Until suitable versioning is part of the Wikidata JSON dump format and contract, however, I don't think that consumers of the dumps should just ignore new fields.
Full versioning is still in the future, but I'm happy that we are in the process of finalizing a policy on stable interfaces, including a contract regarding adding fields: https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy. Please comment on the talk page.
On 08/16/2016 07:57 AM, Daniel Kinzler wrote:
Am 11.08.2016 um 23:12 schrieb Peter F. Patel-Schneider:
Until suitable versioning is part of the Wikidata JSON dump format and contract, however, I don't think that consumers of the dumps should just ignore new fields.
Full versioning is still in the future, but I'm happy that we are in the process of finalizing a policy on stable interfaces, including a contract regarding adding fields: https://www.wikidata.org/wiki/Wikidata:Stable_Interface_Policy. Please comment on the talk page.
Looks quite good. I put in a few comments, in particular to argue that this would be an ideal time to add versioning.
peter
Dear all,
There have been some interesting discussions about breaking changes here, but before we continue in this direction, let me repeat that I did not start this thread to define what is a "breaking change" in JSON. There are JSON libraries that define this in a strict way (siding with Peter) and browsers that are more tolerant (siding with Daniel). I don't think we can come to definite conclusions here. Format versioning, as Stas suggests, can't be a bad thing.
However, all I was asking for was a short email whenever the JSON format changes. There is no need to debate whether this is strictly required by some higher principle. Even if my software tolerates the change, I should *always* know when new information becomes available. It is usually there for a purpose, so my software should do better than merely "not breaking".
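There is a middle ground between refusing to run and silently ignoring additions: keep processing, but report each new field once, so the maintainer learns that new information is available. A sketch, with an invented record and field name:

    import logging

    KNOWN_FIELDS = {"id", "type", "labels", "claims"}  # abbreviated
    _reported = set()

    def extract(record: dict) -> dict:
        """Tolerate unknown fields, but report each new field name once."""
        for field in set(record) - KNOWN_FIELDS:
            if field not in _reported:
                _reported.add(field)
                logging.warning("new field %r in dump; new information may be "
                                "available that this tool does not yet use", field)
        return {k: record[k] for k in KNOWN_FIELDS if k in record}

    print(extract({"id": "Q42", "type": "item", "newfield": 1}))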
Lydia has already confirmed early on that suitable notification emails should be sent in the future, so I don't see a need to continue this particular discussion. Daniel's position seemed to be a mix of "I told you so" and "you volunteers should write better code", which is of little help to me or my users. It would be good to rethink how to approach the community in such cases, to make sure that a coherent and welcoming message is sent to contributors. (On that note, all the best on your new job, Léa! -- communicating with this crowd can be a challenge at times ;-).
Markus