My view is that any tool that imports external data has to be very cautious about additions to the format of that data absent strong guarantees about the effects of these additions.
Consider a tool that imports the Wikidata JSON dump, extracts base facts from it, and outputs those facts in some other format (perhaps RDF, though the exact output format does not matter). This tool fits the description below of "importing data from [an] external system using a generic data exchange format".
My view is that this tool should be extremely cautious when it sees new data structures or fields. The tool should certainly not continue to output facts without some indication that something is suspect, and preferably should refuse to produce output under these circumstances.
What can happen if the tool instead operates without complaint when it sees new data structures? Consider a version of the tool written before Wikidata had ranks, i.e., before claim objects had a rank name/value pair. When ranks were added, the tool would silently drop them, and consumers of its output would have no way of distinguishing deprecated statements from any other statements.
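A minimal sketch of the strict behaviour argued for here. The key names mirror the Wikidata claim structure, but the exact set of known keys is illustrative, not authoritative:

```python
# Sketch of a strict consumer: any key it does not recognize is
# treated as a signal that the format may have changed in a
# meaning-altering way.

KNOWN_CLAIM_KEYS = {"mainsnak", "type", "id", "qualifiers", "references"}
# "rank" is deliberately absent, modelling a tool written before
# ranks were added to the dump format.

def extract_fact(claim):
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown:
        # Refuse rather than silently drop information (e.g. a
        # deprecated rank) that may change the meaning of the fact.
        raise ValueError("unrecognized claim fields: %s" % sorted(unknown))
    return claim["mainsnak"]

old_claim = {"mainsnak": {"property": "P569"}, "type": "statement"}
new_claim = dict(old_claim, rank="deprecated")

extract_fact(old_claim)           # succeeds
try:
    extract_fact(new_claim)       # refuses: rank is unknown
except ValueError as err:
    print(err)
```

A lenient tool would return the same fact for both claims, hiding the fact that one of them is deprecated.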
Of course this is an extreme case. Most changes to the Wikidata JSON dump format will not cause such severe problems. However, given the current situation with how the Wikidata JSON dump format can change, the tool cannot determine whether any particular change will affect the meaning of what it produces. Under these circumstances it is dangerous for a tool that extracts information from the Wikidata JSON dump to continue to produce output when it sees new data structures.
This does make consuming tools sensitive to changes to the Wikidata JSON dump format that are "non-breaking". To overcome this problem there should be a way for tools to distinguish changes to the Wikidata JSON dump format that do not change the meaning of existing constructs in the dump from those that can. Consuming tools can then continue to function without problems for the former kind of change.
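One possible shape for such a mechanism, sketched below. The header fields are hypothetical; nothing like them exists in the real dump today, and the names are made up for illustration:

```python
# Hypothetical sketch: suppose each dump header carried two numbers,
# "formatVersion" (bumped on any change) and "semanticsVersion"
# (bumped only when the meaning of existing constructs changes).
# A consumer could then keep working across purely additive changes.

UNDERSTOOD_SEMANTICS = 2  # highest semantics version this tool implements

def check_header(header):
    if header["semanticsVersion"] > UNDERSTOOD_SEMANTICS:
        # Meaning may have changed: refuse rather than emit wrong facts.
        raise RuntimeError("dump semantics newer than this tool understands")
    # A newer formatVersion alone signals purely additive changes,
    # which this tool can safely ignore.
    return True

check_header({"formatVersion": 7, "semanticsVersion": 2})  # proceeds
```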
Human-only signalling, e.g., an announcement on some web page, is not adequate because there is no guarantee that consuming tools will be changed in response.
Peter F. Patel-Schneider
Nuance Communications
On 08/05/2016 11:56 AM, Stas Malyshev wrote:
Hi!
Consumers of data generally cannot tell whether the addition of a new field to a data encoding is a breaking change or not. Given this, code that consumes encoded data should at least produce warnings when it encounters encodings that it is not expecting and preferably should refuse to produce output in such circumstances. Producers of data thus should signal in advance any changes to the encoding, even if they know that the changes can be safely ignored.
I don't think this approach is always warranted. In some cases, yes, but in cases where you are importing data from an external system using a generic data exchange format like JSON, I don't think it is. It would only make the software more brittle without any additional benefit to the user. Formats like JSON make it easy to accommodate backwards-compatible incremental change, so there's no reason not to take advantage of that.
I would view software that consumes Wikidata information and silently ignores fields that it is not expecting as deficient and would counsel against using such software.
I think this approach is way too restrictive. Wikidata is a database that does not have a fixed schema, and even its underlying data representations are not yet fixed and probably won't be completely fixed for a long time. Having software break each time a field is added would lead to software that breaks often and does not serve its users well. You also need to consider that Wikidata is a huge database with a very wide mission; many users may not be interested in all the details of the data representation, but only in some aspects of it. Having the software refuse to operate on the data that is relevant to the user because some part that is not relevant to them changed does not look like the best approach to me.
For Wikidata specifically, I think a better approach would be to ignore fields, types, and other structures that are not known to the software, provided that the ones that are known do not change their semantics when additions are made, and I understand that is the promise from Wikidata (at least excepting specially announced BC-breaking changes). Maybe inform the user that some information is not understood and thus may not be available, but don't refuse to function completely.