Representing invalid Snaks

List overview All Threads
Download

newer

older

"Client to Repo move change...

Removing LBFactory_Multi...

Daniel Kinzler

18 Jun 2013 18 Jun '13

4:47 p.m.

I'm currently working on validation, and would like some input on the following issue:

Data we already have in the database may become "invalid" by some definition. This may happen because a property is deleted, the type of a property changes, or rules for validation become stricter, etc.

It would be bad if an invalid Snak somewhere in an entity would cause all processing of that claim or entity to fail - which would be the case if we just triggered an exception when we encountered such a snak. Just skipping over such Snaks would bean silently removing them with the next edit - also not a great option.

We really want to be able to show the snak as invalid in the UI, and allow users to replace it with something valid or remove it explicitly. So I came up with the notion of a PropertyBadValueSnak to represent invalid snaks. Relevant patches on gerrit:

https://gerrit.wikimedia.org/r/#/c/68952/ https://gerrit.wikimedia.org/r/#/c/68002/

(if you can't see these, it'S because they are tagged as "draft" and gerrit makes drafts visible only to invited reviewers... let me know and I'll add you).

So, to you think this is a good approach? Is the Snak level the right place for handling this case?

-- daniel

Show replies by date

Jeroen De Dauw

24 Jun 24 Jun

5:37 p.m.

Hey,

Ran into this thread shortly after posting some comment here: https://gerrit.wikimedia.org/r/#/c/68952/

Is it correct that the idea here is to only have this PropertyBadValueSnak to represent invalid PropertyValueSnaks? In other words , this does not cover the other types of snaks?

...

This may happen because a property is deleted

That also happens for PropertySomeValueSnak and PropertyNoValueSnak.

...

Data we already have in the database may become "invalid" by some

definition.

Indeed, by *some definition*. This is what makes me uncomfortable with this proposed change. It is not clear to me how it is intended to be used and what follow up changes will be made. We should be very careful to not force code that does not care about this to hold it into account. For instance, if I got some tool that keeps a mirror of certain entities up to date, it needs to be able to fetch entities via the API, and then represent them as entity objects, and be able to write those to persistence. It would not care at all about what is considered invalid, and would not even have the info required to determine this validity at hand.

In case the PropertyBadValueSnak class is only useful for what we are currently doing in Wikibase Repo, then we ought to just move it there. Or to lib, if client also needs it. Since I'm not sure how this is intended to be used, it's hard to say if it would be appropriate or not to put it in DataModel. It is however something to keep in mind. If it is not needed there, not putting it in decreases the hassle we'll have if it turns out we made a bad choice. Right now I only see PropertyBadValueSnak used in SnakFactory, which only contains code that does not belong in DataModel and should therefore be moved to lib. (And this should happen ASAP as it currently causes a violation of the acyclic dependency principle.)

...

Is the Snak level the right place for handling this case?

I'm guessing you considered doing this on the DataValue level as well - why did you not go with that approach?

Cheers

-- Jeroen De Dauw http://www.bn2vs.com Don't panic. Don't be evil. ~=[,,_,,]:3 --

Daniel Kinzler

25 Jun 25 Jun

10:52 a.m.

I have tried to detail my reasoning behind introducing the PropertyBadValueSnak in the below document. I would propose to include it in the docs folder.

-- daniel

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This document describes the mechanism and rationale behind the PropertyBadValueSnak(*) class

(*) Should perhaps just be "BadSnak"

== Motivation ==

The native (JSON) structure representing a snak can be invalid in several ways: * The data representing the value of a PropertyValueSnak may be structurally or otherwise invalid. * The value given for a PropertyValueSnak may have a type incompatible with the property's data type * The Snak uses a property that does not exist in the system (e.g. because it was deleted). (**)

Such a situation may be encountered whenever a Snak is unserialized (resp. converted from native representation to object model). In particular this is the case when: 1) receiving entity/snak data from a client in an API request 2) loading an entity from the database 3) importing entity data from a backup or transfer file

In case (1), the appropriate reaction would be to reject the request and send an error message to the client. However, in the other cases, it is not desirable to "fail fast": even is some part of an entity is invalid/corrupt/malformed, we still want to be able to use the other parts. In addition, we want to be able to inspect and perhaps repair the broken part.

This requires us to be able to construct an object model for an entity (resp. a snak) from partially invalid data.

(**) It's somewhat unclear whether this consititutes "brokennes" on the same level as the other cases, and whether it should be handled the same way.

== Approaches ==

There seem to be two ways to allow interation with most of an entity even when part of it is broken:

1) Provide an explicit representation for broken snaks in the data model. Such a BadSnak (currently: PropertyBadValueSnak) class would contain all the information from the original invalid native data structure, so that structure can be restored. All code that is specific for certain snak types needs to be aware of the possibility of encountering broken snaks.

2) Implement just-in-time unstubbing of snaks (or even parts of snaks), and throw an exception when trying to unstub a snak based on broken data. That way, the object structure or an entity can be traversed, and failure only ocurrs when trying to access a broken part.

At present, Wikibase always fully initializes snaks, snak objects doe not have access to the original native data structure. Entity objects, however, are just wrappers around the native data structure; links, snaks, etc. are unstubbed when first accessed. All Snaks are unstubbed when the first attempt is made to access any snak.

Note that by requirement, both approaches have to be implemented in a way to allows for "round tripping": we can load them as part of an entity, change some other part of the entity, and save them again, without changing or removing the "bad" data.

== Justification ==

It seems that the BadValue approach is more practical, if not conceptually superior, to the unstubbing approach:

* It avoids "hidden danger": with the unstubbing approach, the simple task of iterating over the list of snaks may "suddenly" fail at the 4th snak, even though the code does not seem to be "doing" anything. Having to handle exceptions from simple getters and iterators is generally surprising.

* It avoids unclear behavior of bulk operations. If we want a list of all statements about a given property (that is, all statements whose main snak uses that property), we need to iterate over the list of all statements (effectively, over the list of all main snaks). If we encounter an exception while accessing one of them, what shall be done? The exception could be caught and ignored, but that way we would lose the information that something went wrong at all. It seems useful to be able to report "we have 3 statements about this, but one of them has a broken value".

* Existing code is easier to adapt: the unstub/exception approach requires all code that accdess any part of an entity at any time in any way to be wary of unstubbing exceptions. The BadSnak approach just requires all code that explicitly varies on the different types of snaks to be aware of the additional snak type. These plases should be few (ideally: only the factory) and easily identified by looking for explicite mention of the concrete snak classes.

There is also the question of why we should use to model bad snaks, and not, say, bad data values or bad statements. Here are some reasons:

* A valid data valid of the wrong type would still make the snak invalid. We would still have to handle that somehow. * A snak with a broken data value is useless. * A statement that contains a broken snak can still be useful (especially if it's not the main snak that is broken) and may contain sufficient information for the user to fix the problem.

Thus, it seems appropriate to model the "brokenness" on the snak level.

== Implications ==

The implications of introducing a BadSnak (or PropertyBadValueSnak) are the same as introducing any other cond of snak: call code that varies on the type of snak, or is specific to a type of snak, must be made aware of the additional snak type. That is, there need to be formatter/renderer, serializer, etc. for "bad" snaks.

Serialization (resp. native representation as an arary structure) for bad snaks raises an important issue: should the snak be serialized as it was originally loaded, as a (failed) attempt to express some other kind of snak, or should it explicitly be modeled as a broken snak?

* When saving an entity to the database or to a dump, "round tripping" bahavior would be best, since the condition that made the snak bad may change (a property may be restored, validation code may be fixed, the snak may be valid on another wikibase instance, etc).

* When sending entity data to a client for processing (especially to the JS based user interface, but also in an API response to a bot), bad snaks should probably be makred as such. This way, the client does not have to re-implement the validation logic, but can rely on the data returned from the API being either valid or marked as invalid. This is useful for display in the UI, but also for avoding processign invalid data in a bot.

Daniel Kinzler

3:27 p.m.

A quick follow up to this morning's mail:

I discussed this issue with Denny for a while, and we came up with this:

* I'll explore the possibility of using a BadValue object instead of a BadSnak, that is, model the error on the DataValue level. My initial impression was that this would be more work, but I'm no longer sure, and will just try and see.

* We will represent the error as a string inside the BadValue/BadSnak object. There seem to be no immediate benefits or obvious use cases for wrapping that in an Error object. (This in reply to an earlier discussion on Gerrit).

-- daniel

4187

Age (days ago)

4194

Last active (days ago)

wikidata-tech@lists.wikimedia.org

3 comments

2 participants

tags (0)

participants (2)

Daniel Kinzler
Jeroen De Dauw