Re: [Wikidata-tech] Representing invalid Snaks

25 Jun 2013

I have tried to detail my reasoning behind introducing the PropertyBadValueSnak
in the below document. I would propose to include it in the docs folder.

-- daniel

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This document describes the mechanism and rationale behind the
PropertyBadValueSnak(*) class.

(*) Should perhaps just be "BadSnak"

== Motivation ==

The native (JSON) structure representing a snak can be invalid in several ways:
* The data representing the value of a PropertyValueSnak may be structurally or
otherwise invalid.
* The value given for a PropertyValueSnak may have a type incompatible with the
property's data type.
* The Snak uses a property that does not exist in the system (e.g. because it
was deleted). (**)

Such a situation may be encountered whenever a Snak is unserialized (resp.
converted from native representation to object model). In particular this is the
case when:
1) receiving entity/snak data from a client in an API request
2) loading an entity from the database
3) importing entity data from a backup or transfer file

In case (1), the appropriate reaction would be to reject the request and send an
error message to the client. However, in the other cases, it is not desirable to
"fail fast": even if some part of an entity is invalid/corrupt/malformed, we
still want to be able to use the other parts. In addition, we want to be able to
inspect and perhaps repair the broken part.

This requires us to be able to construct an object model for an entity (resp. a
snak) from partially invalid data.

(**) It's somewhat unclear whether this constitutes "brokenness" on the same
level as the other cases, and whether it should be handled the same way.

== Approaches ==

There seem to be two ways to allow interaction with most of an entity even when
part of it is broken:

1) Provide an explicit representation for broken snaks in the data model. Such a
BadSnak (currently: PropertyBadValueSnak) class would contain all the
information from the original invalid native data structure, so that structure
can be restored. All code that is specific for certain snak types needs to be
aware of the possibility of encountering broken snaks.

2) Implement just-in-time unstubbing of snaks (or even parts of snaks), and
throw an exception when trying to unstub a snak based on broken data. That way,
the object structure of an entity can be traversed, and failure only occurs when
trying to access a broken part.
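
To make approach (1) more concrete, here is a minimal sketch of a snak factory
that never throws on invalid input but returns a BadSnak carrying the original
native data. It is written in Python for brevity, not in Wikibase's PHP, and
every name in it (snak_from_native, KNOWN_PROPERTIES, the field names of the
native structure, the reason strings) is an assumption made for illustration,
not the actual Wikibase API.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical registry: property id -> expected data value type.
KNOWN_PROPERTIES = {"P1": "string", "P2": "number"}


@dataclass
class PropertyValueSnak:
    property_id: str
    value_type: str
    value: Any


@dataclass
class BadSnak:
    native: Any  # the original native structure, kept untouched
    reason: str  # why the structure could not be turned into a valid snak


def snak_from_native(native: Any):
    """Build a snak object; invalid input yields a BadSnak instead of an error."""
    if not isinstance(native, dict) or "property" not in native or "datavalue" not in native:
        return BadSnak(native, "structurally invalid snak")
    prop = native["property"]
    datavalue = native["datavalue"]
    if not isinstance(prop, str) or prop not in KNOWN_PROPERTIES:
        return BadSnak(native, "unknown property")
    if not isinstance(datavalue, dict) or "type" not in datavalue or "value" not in datavalue:
        return BadSnak(native, "structurally invalid value")
    if datavalue["type"] != KNOWN_PROPERTIES[prop]:
        return BadSnak(native, "value type incompatible with property data type")
    return PropertyValueSnak(prop, datavalue["type"], datavalue["value"])
```

The three checks correspond to the three kinds of invalidity listed under
Motivation. Whether an unknown property should really be handled this way is
the open question noted in (**).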

At present, Wikibase always fully initializes snaks; snak objects do not have
access to the original native data structure. Entity objects, however, are just
wrappers around the native data structure; links, snaks, etc. are unstubbed when
first accessed. All Snaks are unstubbed when the first attempt is made to access
any snak.

Note that by requirement, both approaches have to be implemented in a way that
allows for "round tripping": we can load broken snaks as part of an entity,
change some other part of the entity, and save them again, without changing or
removing the "bad" data.
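
The round-trip requirement can be sketched as follows for the BadSnak approach.
This is an illustration in Python under simplifying assumptions: the entity is a
plain dict, and GoodSnak, load_snak, load_entity, save_entity and the validity
rule are all made up for this example.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class GoodSnak:
    property_id: str
    value: str

    def to_native(self):
        return {"property": self.property_id, "value": self.value}


@dataclass
class BadSnak:
    native: Any  # original data, stored verbatim

    def to_native(self):
        # Round trip: hand back exactly what was loaded.
        return copy.deepcopy(self.native)


def load_snak(native):
    # Hypothetical validity rule, for this example only:
    # a snak is valid if both fields are strings.
    if (isinstance(native, dict)
            and isinstance(native.get("property"), str)
            and isinstance(native.get("value"), str)):
        return GoodSnak(native["property"], native["value"])
    return BadSnak(copy.deepcopy(native))


def load_entity(native):
    return {"label": native["label"],
            "snaks": [load_snak(s) for s in native["snaks"]]}


def save_entity(entity):
    return {"label": entity["label"],
            "snaks": [s.to_native() for s in entity["snaks"]]}
```

Loading an entity with one broken snak, changing the label, and saving it again
then leaves the broken snak's native data exactly as it was stored.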

== Justification ==

It seems that the BadSnak approach is more practical than the unstubbing
approach, if not conceptually superior to it:

* It avoids "hidden danger": with the unstubbing approach, the simple task of
iterating over the list of snaks may "suddenly" fail at the 4th snak, even
though the code does not seem to be "doing" anything. Having to handle
exceptions from simple getters and iterators is generally surprising.

* It avoids unclear behavior of bulk operations. If we want a list of all
statements about a given property (that is, all statements whose main snak uses
that property), we need to iterate over the list of all statements (effectively,
over the list of all main snaks). If we encounter an exception while accessing
one of them, what shall be done? The exception could be caught and ignored, but
that way we would lose the information that something went wrong at all. It
seems useful to be able to report "we have 3 statements about this, but one of
them has a broken value".

* Existing code is easier to adapt: the unstub/exception approach requires all
code that accesses any part of an entity at any time in any way to be wary of
unstubbing exceptions. The BadSnak approach just requires all code that
explicitly varies on the different types of snaks to be aware of the additional
snak type. These places should be few (ideally: only the factory) and easily
identified by looking for explicit mention of the concrete snak classes.
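
The difference in behavior during iteration and bulk operations can be
illustrated with a small sketch. It is in Python, and all names (UnstubError,
lazy_snaks, eager_snaks, summarize, is_valid) are hypothetical. It also assumes
that the property id of a broken snak is still readable, which will not hold
for every kind of breakage.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List


class UnstubError(Exception):
    """Raised by the lazy variant when a snak cannot be unstubbed."""


@dataclass
class Snak:
    property_id: str
    value: Any


@dataclass
class BadSnak:
    property_id: str
    native: Any


def is_valid(native) -> bool:
    # Hypothetical rule for this example: the value must be a string.
    return isinstance(native.get("value"), str)


def lazy_snaks(natives) -> Iterator[Snak]:
    """Approach 2: unstub on access; a broken entry raises mid-iteration."""
    for native in natives:
        if not is_valid(native):
            raise UnstubError(f"cannot unstub {native!r}")
        yield Snak(native["property"], native["value"])


def eager_snaks(natives) -> List:
    """Approach 1: broken entries become BadSnak objects."""
    return [Snak(n["property"], n["value"]) if is_valid(n)
            else BadSnak(n["property"], n)
            for n in natives]


def summarize(snaks, property_id):
    """Return (number of snaks using the property, how many of them are broken)."""
    matching = [s for s in snaks if s.property_id == property_id]
    broken = sum(isinstance(s, BadSnak) for s in matching)
    return len(matching), broken
```

With eager_snaks, a caller can report both totals from one pass over the list.
With lazy_snaks, a plain for loop stops at the first broken entry unless every
caller wraps the iteration in exception handling.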

There is also the question of why we should model bad snaks, and not, say, bad
data values or bad statements. Here are some reasons:

* A valid data value of the wrong type would still make the snak invalid. We
would still have to handle that somehow.
* A snak with a broken data value is useless.
* A statement that contains a broken snak can still be useful (especially if
it's not the main snak that is broken) and may contain sufficient information
for the user to fix the problem.

Thus, it seems appropriate to model the "brokenness" on the snak level.

== Implications ==

The implications of introducing a BadSnak (or PropertyBadValueSnak) are the same
as introducing any other kind of snak: all code that varies on the type of
snak, or is specific to a type of snak, must be made aware of the additional
snak type. That is, there need to be formatter/renderer, serializer, etc. for
"bad" snaks.

Serialization (resp. native representation as an array structure) for bad snaks
raises an important issue: should the snak be serialized as it was originally
loaded, as a (failed) attempt to express some other kind of snak, or should it
explicitly be modeled as a broken snak?

* When saving an entity to the database or to a dump, "round tripping" behavior
would be best, since the condition that made the snak bad may change (a property
may be restored, validation code may be fixed, the snak may be valid on another
wikibase instance, etc).

* When sending entity data to a client for processing (especially to the JS
based user interface, but also in an API response to a bot), bad snaks should
probably be marked as such. This way, the client does not have to re-implement
the validation logic, but can rely on the data returned from the API being
either valid or marked as invalid. This is useful for display in the UI, but also
for avoiding processing invalid data in a bot.
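
The two serialization modes could look roughly like this. The sketch is in
Python, and the function names as well as the "snaktype": "bad" marker and the
"error"/"original" keys are invented for this example; no such output format is
specified in this document.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass
class BadSnak:
    native: Any  # original native structure
    reason: str  # why it was considered invalid


def serialize_for_storage(snak: BadSnak):
    """Database/dump: emit the data exactly as loaded (round trip)."""
    return copy.deepcopy(snak.native)


def serialize_for_client(snak: BadSnak):
    """API/UI: mark the snak as bad so clients need not re-validate."""
    return {
        "snaktype": "bad",          # hypothetical marker value
        "error": snak.reason,       # hypothetical key
        "original": copy.deepcopy(snak.native),
    }
```

A client receiving the second form can skip or flag the snak based on the marker
alone, while the stored form stays open to re-validation later.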
