My view is that any tool that imports external data has to be very cautious
about additions to the format of that data absent strong guarantees about
the effects of these additions.
Consider a tool that imports the Wikidata JSON dump, extracts base facts
from the dump, and outputs these facts in some other format (perhaps in RDF,
but the output format doesn't really matter). This tool fits the
description of "importing data from [an] external system using a generic
data exchange format".
My view is that this tool should be extremely cautious when it sees new data
structures or fields. The tool should certainly not continue to output
facts without some indication that something is suspect, and preferably
should refuse to produce output under these circumstances.
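A strict extractor of this kind might look like the following sketch (the
key set and function name are illustrative, not the actual Wikidata claim
schema):

```python
# Keys this tool was written to understand in a Wikidata claim object.
# This set is illustrative, not the complete claim schema.
KNOWN_CLAIM_KEYS = {"mainsnak", "type", "id", "rank", "qualifiers", "references"}

def extract_claim(claim):
    """Extract a claim, refusing to produce output on unrecognized structure."""
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown:
        # An unknown field may change the meaning of the fields we do
        # understand, so refuse rather than risk emitting bad facts.
        raise ValueError(f"unrecognized claim fields: {sorted(unknown)}")
    return claim["mainsnak"]
```

A tool built this way fails loudly the moment the dump format grows a new
name/value pair, which is exactly the behavior argued for here.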
What can happen if the tool instead continues to operate without complaint
when new data structures are seen? Consider what would happen if the tool
had been written for a version of Wikidata that didn't have rank, i.e., a
version in which claim objects did not have a rank name/value pair. If
ranks were then added,
consumers of the output of the tool would have no way of distinguishing
deprecated information from other information.
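A sketch of the hazard (the function name, property id, and values are
illustrative): an extractor written before ranks existed never consults a
rank pair, so a deprecated claim is emitted exactly like any other fact.

```python
# An extractor written before "rank" existed: it silently ignores any
# name/value pairs it does not know about.
def extract_facts_pre_rank(entity):
    facts = []
    for claims in entity.get("claims", {}).values():
        for claim in claims:
            facts.append(claim["mainsnak"])  # "rank" is never consulted
    return facts

entity = {
    "claims": {
        "P1082": [  # property id is illustrative
            {"mainsnak": {"value": "superseded figure"},
             "rank": "deprecated"}  # pair added after the tool was written
        ]
    }
}
# The deprecated value comes out indistinguishable from a normal fact.
```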
Of course this is an extreme case. Most changes to the Wikidata JSON dump
format will not cause such severe problems. However, given the current
situation with how the Wikidata JSON dump format can change, the tool cannot
determine whether any particular change will affect the meaning of what it
produces. Under these circumstances it is dangerous for a tool that
extracts information from the Wikidata JSON dump to continue to produce
output when it sees new data structures.
This does make consuming tools sensitive to changes to the Wikidata JSON
dump format that are "non-breaking". To overcome this problem there should
be a way for tools to distinguish changes to the Wikidata JSON dump format
that do not change the meaning of existing constructs in the dump from those
that can. Consuming tools can then continue to function without problems
for the former kind of change.
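One hypothetical way to provide such a signal (Wikidata does not currently
do this; the formatversion field and its numbering rules are invented here
for illustration) is a version pair in which a major bump marks changes
that can alter the meaning of existing constructs and a minor bump marks
purely additive ones:

```python
# Hypothetical scheme: the dump carries a "formatversion" string such as
# "1.7". A major-number change can alter the meaning of existing
# constructs; a minor-number change is additive only.
SUPPORTED_MAJOR = 1

def check_format_version(version):
    """Return True if it is safe to consume the dump; raise otherwise."""
    major, _, minor = version.partition(".")
    if int(major) != SUPPORTED_MAJOR:
        raise RuntimeError(
            f"format {version} may change the meaning of known constructs")
    # A newer minor version is additive only, so it is safe to continue.
    return True
```

With such a convention, consuming tools could keep working across additive
changes while still halting on meaning-changing ones.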
Human-only signalling, e.g., an announcement on some web page, is not
adequate because there is no guarantee that consuming tools will be changed
in response.
Peter F. Patel-Schneider
Nuance Communications
On 08/05/2016 11:56 AM, Stas Malyshev wrote:
Hi!
> Consumers of data generally cannot tell whether
> the addition of a new field to
> a data encoding is a breaking change or not. Given this, code that consumes
> encoded data should at least produce warnings when it encounters encodings
> that it is not expecting and preferably should refuse to produce output in
> such circumstances. Producers of data thus should signal in advance any
> changes to the encoding, even if they know that the changes can be safely ignored.
I don't think this approach is always warranted. In some cases, yes, but
in cases where you are importing data from an external system using a
generic data exchange format like JSON, I don't think it is. This will
only make the software more brittle without any additional benefit to
the user. Formats like JSON make it easy to accommodate
backwards-compatible incremental changes, so there's no reason not to
use that flexibility.
> I would view software that consumes Wikidata
> information and silently ignores
> fields that it is not expecting as deficient and would counsel against using
> such software.
I think this approach is way too restrictive. Wikidata is a database
that does not have a fixed schema, and even its underlying data
representations are not yet fixed, and probably won't be completely
fixed for a long time. Having software break each time a field is added
would lead to software that breaks often and does not serve its users
well. You also need to consider that Wikidata is a huge database with a
very wide mission, and many users may not be interested in all the
details of the data representation, but only in some aspects of it.
Having the software refuse to operate on the data that is relevant to
the user because some part that is not relevant to the user changed does
not look like the best approach to me.
For Wikidata specifically I think a better approach would be to ignore
fields, types, and other structures that are not known to the software,
provided that the ones that are known do not change their semantics with
additions - and I understand that's the promise from Wikidata (at least
excepting cases of specially announced BC-breaking changes). Maybe
inform the user that some information is not understood and thus may not
be available, but don't refuse to function completely.
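The lenient behavior described here could be sketched as follows (the key
set is illustrative, not the actual Wikidata claim schema):

```python
import warnings

# Illustrative, not the complete Wikidata claim schema.
KNOWN_CLAIM_KEYS = {"mainsnak", "type", "id", "rank"}

def extract_claim_lenient(claim):
    """Ignore unknown fields, but tell the user that something was skipped."""
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown:
        warnings.warn(
            f"ignoring unrecognized claim fields: {sorted(unknown)}; "
            "some information may not be available")
    return claim["mainsnak"]
```

The user is informed that something was not understood, but extraction of
the parts the tool does understand continues.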