On 08/11/2016 01:35 PM, Stas Malyshev wrote:
Hi!
My view is that this tool should be extremely cautious when it sees new data structures or fields. The tool should certainly not continue to output facts without some indication that something is suspect, and preferably should refuse to produce output under these circumstances.
I don't think I agree. I find tools that are too picky about details that are not important to me hard to use, and I'd very much prefer a tool where I am in control of which information I need and which I don't need.
My point is that the tool has no way of determining what is important and what is not important, at least under the current state of affairs with respect to the Wikidata JSON dump. Given this, a tool that ignores what could easily be an important change is a dangerous tool.
What can happen if the tool instead continues to operate without complaint when new data structures are seen? Consider what would happen if the tool was written for a version of Wikidata that didn't have rank, i.e., claim objects did not have a rank name/value pair. If ranks were then added, consumers of the output of the tool would have no way of distinguishing deprecated information from other information.
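To make the rank scenario concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the set of known claim keys, the function name extract_fact, and the simplified claim layout are all hypothetical and not taken from any existing tool. In strict mode the tool refuses to emit a fact from a claim that carries a key it was not written for; in lenient mode it emits the fact and the deprecated rank is silently lost.

```python
# Keys a hypothetical pre-rank tool was written to understand (an assumption).
KNOWN_CLAIM_KEYS = {"id", "type", "mainsnak", "qualifiers", "references"}


class UnknownFieldError(ValueError):
    """Raised when a claim carries a key the tool was not written for."""


def extract_fact(claim, strict=True):
    """Return (property, value) for a claim; refuse unknown keys when strict."""
    unknown = set(claim) - KNOWN_CLAIM_KEYS
    if unknown and strict:
        raise UnknownFieldError(f"unrecognised claim keys: {sorted(unknown)}")
    snak = claim["mainsnak"]
    return (snak["property"], snak["datavalue"]["value"])


# Simplified claim objects, not the full Wikidata JSON layout.
old_claim = {
    "id": "Q42$1",
    "type": "statement",
    "mainsnak": {"property": "P31", "datavalue": {"value": "Q5"}},
}
new_claim = dict(old_claim, rank="deprecated")

# Lenient mode emits the same fact for both claims, so a consumer of the
# output cannot tell that the second one is deprecated.
lenient_old = extract_fact(old_claim, strict=False)
lenient_new = extract_fact(new_claim, strict=False)
```

Calling extract_fact(new_claim) with the default strict=True raises UnknownFieldError instead of producing output that looks trustworthy.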
Ranks are a bit unusual because adding them is not just an informational change, it's a semantic change. It introduces the concept of a statement that has different semantics than the rest. Of course, such a change needs to be communicated - it's as if I made a format change like "each string beginning with the letter X needs to be read backwards" but didn't tell the clients. Of course this is a breaking change if it changes semantics.
What I was talking about are changes that don't break semantics, and the majority of additions are just that.
Yes, the majority of changes are not of this sort, but tools currently can't determine which changes are of this sort and which are not.
Of course this is an extreme case. Most changes to the Wikidata JSON dump format will not cause such severe problems. However, given the current situation with how the Wikidata JSON dump format can change, the tool cannot determine whether any particular change will affect the meaning of what it produces. Under these circumstances it is dangerous for a tool that extracts information from the Wikidata JSON dump to continue to produce output when it sees new data structures.
The tool cannot. It's not possible to write a tool that would derive semantics just from the JSON dump, or even detect semantic changes. Semantic changes can be anywhere; a change doesn't have to be an additional field - it can take the form of changing the meaning of a field, or its format, or datatype, etc. Of course the tool cannot know that - people should know that and communicate it. Again, that's why I think we need to distinguish changes that break semantics from changes that don't, and make the tools robust against the latter - but not the former, because that's impossible. For dealing with the former, there is a known and widely used solution - format versioning.
Yes, if a suitable sort of versioning contract were implemented then things would change dramatically. Tools could depend on "breaking" changes always being accompanied by a version bump, and then they might be able to ignore new fields if the version does not change. However, this is not the current state of affairs with the Wikidata JSON dump format.
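As a rough sketch of how a tool could rely on such a contract, assume the dump carried a semver-style version string. That is purely hypothetical: as noted, the JSON dump has no such field today, and the names SUPPORTED_MAJOR and may_ignore_unknown_fields are invented for illustration. Under that assumption, the tool tolerates new fields only while the major version is the one it was written for, and refuses to run otherwise.

```python
# Major version this hypothetical tool was written against (an assumption).
SUPPORTED_MAJOR = 1


def may_ignore_unknown_fields(dump_version: str) -> bool:
    """Decide whether unknown fields are safe to skip under a semver contract.

    Assumes the contract guarantees that any change to the meaning of
    existing constructs comes with a major version bump.
    """
    major = int(dump_version.split(".")[0])
    if major != SUPPORTED_MAJOR:
        # A different major version signals a possibly breaking change,
        # so the tool stops rather than guess.
        raise RuntimeError(
            f"dump format {dump_version} is not supported "
            f"(expected major version {SUPPORTED_MAJOR})"
        )
    # Same major version: additions are non-breaking by contract.
    return True
```

For example, may_ignore_unknown_fields("1.4.0") returns True, while "2.0.0" makes the tool stop with an error. Without the contract, no such decision is possible, which is the situation described above.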
This does make consuming tools sensitive to changes to the Wikidata JSON dump format that are "non-breaking". To overcome this problem there should be a way for tools to distinguish changes to the Wikidata JSON dump format that do not change the meaning of existing constructs in the dump from those that can. Consuming tools can then continue to function without problems for the former kind of change.
As I said, format versioning. Maybe even semver or some suitable modification of it. RDF exports BTW already carry a version. Maybe JSON exports should too.
Right. I'm all for version information being added to the Wikidata JSON dump format. It would make the production use of these dumps much safer.
Until suitable versioning is part of the Wikidata JSON dump format and contract, however, I don't think that consumers of the dumps should just ignore new fields.
Peter F. Patel-Schneider Nuance Communications