My view is that any tool that imports external data has to be very cautious
about additions to the format of that data absent strong guarantees about
the effects of these additions.
Consider a tool that imports the Wikidata JSON dump, extracts base facts
from the dump, and outputs these facts in some other format (perhaps in RDF,
but the output format doesn't really matter). This tool fits the
description of "importing data from [an] external system using a generic
data exchange format".
My view is that this tool should be extremely cautious when it sees new data
structures or fields. The tool should certainly not continue to output
facts without some indication that something is suspect, and preferably
should refuse to produce output under these circumstances.
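A strict extractor of this kind might look like the following sketch (the
key set and function name are illustrative, not the actual Wikidata claim
schema):

```python
# Keys this tool was written to understand in a Wikidata claim object.
# This set is illustrative, not the complete claim schema.
KNOWN_CLAIM_KEYS = {"mainsnak", "type", "id", "rank", "qualifiers", "references"}

def extract_claim(claim):
    """Extract a claim, refusing to produce output on unrecognized structure."""
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown:
        # An unknown field may change the meaning of the fields we do
        # understand, so refuse rather than risk emitting bad facts.
        raise ValueError(f"unrecognized claim fields: {sorted(unknown)}")
    return claim["mainsnak"]
```

A tool built this way fails loudly the moment the dump format grows a new
name/value pair, which is exactly the behavior argued for here.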
What can happen if the tool instead continues to operate without complaint
when new data structures are seen? Consider what would happen if the tool
had been written for a version of Wikidata that didn't have rank, i.e., a
version in which claim objects did not have a rank name/value pair. If
ranks were then added,
consumers of the output of the tool would have no way of distinguishing
deprecated information from other information.
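A sketch of the hazard (the function name, property id, and values are
illustrative): an extractor written before ranks existed never consults a
rank pair, so a deprecated claim is emitted exactly like any other fact.

```python
# An extractor written before "rank" existed: it silently ignores any
# name/value pairs it does not know about.
def extract_facts_pre_rank(entity):
    facts = []
    for claims in entity.get("claims", {}).values():
        for claim in claims:
            facts.append(claim["mainsnak"])  # "rank" is never consulted
    return facts

entity = {
    "claims": {
        "P1082": [  # property id is illustrative
            {"mainsnak": {"value": "superseded figure"},
             "rank": "deprecated"}  # pair added after the tool was written
        ]
    }
}
# The deprecated value comes out indistinguishable from a normal fact.
```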
Of course this is an extreme case. Most changes to the Wikidata JSON dump
format will not cause such severe problems. However, given the current
situation with how the Wikidata JSON dump format can change, the tool cannot
determine whether any particular change will affect the meaning of what it
produces. Under these circumstances it is dangerous for a tool that
extracts information from the Wikidata JSON dump to continue to produce
output when it sees new data structures.
This does make consuming tools sensitive to changes to the Wikidata JSON
dump format that are "non-breaking". To overcome this problem there should
be a way for tools to distinguish changes to the Wikidata JSON dump format
that do not change the meaning of existing constructs in the dump from those
that can. Consuming tools can then continue to function without problems
for the former kind of change.
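One hypothetical way to provide such a signal (Wikidata does not currently
do this; the formatversion field and its numbering rules are invented here
for illustration) is a version pair in which a major bump marks changes
that can alter the meaning of existing constructs and a minor bump marks
purely additive ones:

```python
# Hypothetical scheme: the dump carries a "formatversion" string such as
# "1.7". A major-number change can alter the meaning of existing
# constructs; a minor-number change is additive only.
SUPPORTED_MAJOR = 1

def check_format_version(version):
    """Return True if it is safe to consume the dump; raise otherwise."""
    major, _, minor = version.partition(".")
    if int(major) != SUPPORTED_MAJOR:
        raise RuntimeError(
            f"format {version} may change the meaning of known constructs")
    # A newer minor version is additive only, so it is safe to continue.
    return True
```

With such a convention, consuming tools could keep working across additive
changes while still halting on meaning-changing ones.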
Human-only signalling, e.g., an announcement on some web page, is not
adequate because there is no guarantee that consuming tools will be changed
in response.
Peter F. Patel-Schneider
Nuance Communications
On 08/05/2016 11:56 AM, Stas Malyshev wrote:
Hi!
> Consumers of data generally cannot tell whether
> the addition of a new field to
> a data encoding is a breaking change or not. Given this, code that consumes
> encoded data should at least produce warnings when it encounters encodings
> that it is not expecting and preferably should refuse to produce output in
> such circumstances. Producers of data thus should signal in advance any
> changes to the encoding, even if they know that the changes can be safely ignored.
I don't think this approach is always warranted. In some cases, yes, but
in cases where you are importing data from an external system using a
generic data exchange format like JSON, I don't think it is. This will
only make the software more brittle without any additional benefit to
the user. Formats like JSON make it easy to accommodate
backwards-compatible incremental changes, so there's no reason not to
use that flexibility.
> I would view software that consumes Wikidata
> information and silently ignores
> fields that it is not expecting as deficient and would counsel against using
> such software.
I think this approach is way too restrictive. Wikidata is a database
that does not have a fixed schema, and even its underlying data
representations are not yet fixed, and probably won't be completely
fixed for a long time. Having software break each time a field is added
would lead to software that breaks often and does not serve its users
well. You also need to consider that Wikidata is a huge database with a
very wide mission, and many users may not be interested in all the
details of the data representation, but only in some aspects of it.
Having the software refuse to operate on the data that is relevant to
the user because some part that is not relevant to the user changed does
not look like the best approach to me.
For Wikidata specifically I think a better approach would be to ignore
fields, types, and other structures that are not known to the software,
provided that the ones that are known do not change their semantics with
additions - and I understand that's the promise from Wikidata (at least
excepting cases of specially announced BC-breaking changes). Maybe
inform the user that some information is not understood and thus may not
be available, but don't refuse to function completely.
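The lenient behavior described here could be sketched as follows (the key
set is illustrative, not the actual Wikidata claim schema):

```python
import warnings

# Illustrative, not the complete Wikidata claim schema.
KNOWN_CLAIM_KEYS = {"mainsnak", "type", "id", "rank"}

def extract_claim_lenient(claim):
    """Ignore unknown fields, but tell the user that something was skipped."""
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown:
        warnings.warn(
            f"ignoring unrecognized claim fields: {sorted(unknown)}; "
            "some information may not be available")
    return claim["mainsnak"]
```

The user is informed that something was not understood, but extraction of
the parts the tool does understand continues.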