[Splitting the general (Wikidata reasoning; this thread) from the specific (Wikidata family relationships for horses; original thread).]
Many issues have been brought up, and we cannot solve them all with one big hammer. I have now started a WikiProject (see below) to address one of the key points raised by Peter:
''' Nobody has ever defined which inferences can/should be drawn from the content of Wikidata. '''
We do in fact use several properties that seem to ask for inferencing. Probably the clearest is "subclass of" (P279). It has been related to rdfs:subClassOf in many community discussions, so it seems clear that a similar meaning is intended. This would lead to the following rule:
''' If an item A has a "subclass of" statement with value B, and if item B has a "subclass of" statement with value C, then it should follow that item A has a "subclass of" statement with value C. '''
I think there is wide agreement on this idea. Constraints rely on it (constraint checking travels the P279 hierarchy), and it's a main motivation for why Wikidata Query has its "tree" feature. There are similarly clear intentions for the properties "instance of" (P31) and "subproperty of" (P1647). I am not spelling them out here.
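To make the intended reading concrete, here is a minimal sketch (not a real Wikidata API; the in-memory dict and the item names are invented for illustration) of computing what the transitivity rule above entails:

```python
# Follow "subclass of" (P279) statements upwards and collect everything
# reachable. The dict stands in for actual statement data.

def subclass_closure(direct_superclasses, item):
    """Return all classes reachable from `item` via chains of P279."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in direct_superclasses.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Toy data matching the rule: A subclass-of B, B subclass-of C.
p279 = {"A": ["B"], "B": ["C"]}
print(sorted(subclass_closure(p279, "A")))  # ['B', 'C'] -- A is also a subclass of C
```

This is exactly the traversal that constraint checking and the Wikidata Query "tree" feature rely on.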
Nevertheless, Peter is right that even in these cases, the intention is not fully clear, because of two reasons:
(1) There is no machine-readable specification of the intended behaviour. It's part of user discussions, not of the data or templates. Even the user discussions are distributed over several pages, so a lot of wiki archaeology is needed to get a full picture of what we, the community, might have intended. (2) The informal discussions on the intended semantics are not precise about all relevant cases. Many questions remain open, such as what to do if qualifiers are used on a statement (rarely the case for "subclass of", but not so uncommon for "instance of").
To address these issues, I propose to come up with a format that allows us to clearly specify inference rules such as the one for "subclass of" above. Each rule should have one page where it is specified (for humans and machines), explained (to humans), and discussed. It is not possible to encode such rules as property values on data pages (for a start, it would not be clear which page this should be on, because rules typically refer to several properties and items). Therefore, the best we can do now seems to be to have standard wiki pages for this. They could be linked from all relevant properties/items (talk pages), though.
Even if we do not have any reasoner to compute all the results, writing down the intended rules would be useful documentation for other users to clarify what we expect (see the original family relationship discussion).
I propose to start by gathering use cases, that is, examples of rules that we might want to express. From this, we can then extract a suitable template structure. I have created a WikiProject for getting us started:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Reasoning
Feel free to contribute.
Best regards,
Markus
On 27.08.2015 06:26, Peter F. Patel-Schneider wrote:
On 08/26/2015 06:01 PM, Svavar Kjarrval wrote:
On mið 26.ágú 2015 23:05, James Heald wrote:
There are a *lot* of problems with P279 (subclass), right across Wikidata.
These will only be corrected once people start doing searches in a systematic way and addressing the anomalies they find.
In this case, politician (Q82955) should *not* be a subclass of human (Q5), instead it should be a subclass of something like occupation (Q13516667), or alternatively perhaps profession (Q28640).
My understanding is that currently there are a vast number of incorrect subclass relationships in the project, messing up tree searches, and so far it is something that has simply not yet been systematically addressed.
-- James.
For now, what's the best way to find (and perhaps correct) incorrect declarations like these?
If I were to just change items for commonly used items like politician (Q82955) it might be construed as vandalism or someone who doesn't care about or understand the Stubbs-declared-as-a-human problem might just add that declaration back later.
When it comes to the gender property (P21), the human-readable description indicates that it's meant to define genders in general, yet it's declared as an instance of an item (Q18608871) which only applies to humans. This of course has consequences further up in the hierarchy, since the maintainers of item Q18608871 faithfully assume it only applies to humans.
Well, the situation with respect to Wikidata property for items about people (Q18608871) is very difficult. There is absolutely no machine-interpretable information associated with this class that can be used to determine that instances of it are only supposed to be used for people. So, at the bare minimum, such machine-interpretable information needs to be added.
Then there is the issue that there is no theory of how the machine-interpretable information that is associated with entities in Wikidata is to be processed. All the processing is currently done using uninterpretable procedures. For example, on https://www.wikidata.org/wiki/Property_talk:P22 there is information that is used to control some piece of code that checks to see that the subject of https://www.wikidata.org/wiki/Property:P21 belongs to person (Q215627) or fictional character (Q95074). However, there is no theory showing how this interacts with other parts of Wikidata, even such inherent parts of Wikidata as https://www.wikidata.org/wiki/Property:P31
In fact, there is even difficulty in determining simple truth in Wikidata. Two sources can conflict, and Wikidata is not in the position of being an arbiter for such conflicts, certainly not in general. To make the situation even more complex, Wikidata has a temporal aspect as well and has a need to admit exceptions to general statements.
So what can be done? Any solution is going to be tricky. That is not to say that some solutions cannot be found by looking at systems and standards that are already being used for storing large amounts of complex information. However, any solution is going to have to be carefully tailored to meet the requirements of Wikidata and Wikidatans. (Is there an official term for the people who are putting Wikidata and Wikidata information together?)
There is also a big chicken-and-egg problem here: a good solution to reliable machine interpretation of Wikidata information requires, for example, consistent use of instance of, subclass, and subproperty; but what counts as a consistent use of these fundamental properties depends on a formal theory of what they mean.
I, for one, would find even just the attempt to solve this problem vastly interesting, and I have been doing some exploration as to what might be needed. My company is interested in using Wikidata as a source of background information, but finds that the lack of a good theory of Wikidata information is problematic, so I have some cover for spending time on this problem.
Anyway, if there is interest in machine interpretation of Wikidata information, if only to detect potential anomalies, I, and probably others, would be motivated to spend more time on trying to come up with potential solutions, hopefully in a collaborative effort that includes not just theoreticians but also Wikidatans.
In the case of the hierarchy Stubbs is associated with, the maintainers have either assumed that all mayors are, without exception, humans, or they somehow thought that if there were exceptions to this, the machines could somehow detect and apply them in each case. Both of those approaches are, I think we agree, wrong, and we should find out why this is happening.
Is there a tool where one can put in a Wikidata item and it extracts declarations based on "higher" properties like subclass or instance of? Like if I were to input the item for Stubbs, it would travel the hierarchy and tell me what would be assumed about Stubbs based on the declarations further up in the tree.
Yes, it is called a reasoner. The design of a reasoner would very likely be one result of the sort of work described above, but without such work it is very hard to figure out just what is supposed to be done in any except the simple cases.
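As an illustration of what such a traversal might look like, here is a toy sketch (all data and labels are invented; a real reasoner would read actual statements): start from an item, follow one "instance of" (P31) step, then the "subclass of" (P279) chain upwards.

```python
# Collect everything an item would be "assumed to be" by walking P31
# once and then P279 transitively. Toy in-memory data, not a real API.

def inferred_classes(p31, p279, item):
    """Return the classes of `item` plus all their superclasses."""
    classes = set(p31.get(item, ()))
    frontier = list(classes)
    while frontier:
        cls = frontier.pop()
        for parent in p279.get(cls, ()):
            if parent not in classes:
                classes.add(parent)
                frontier.append(parent)
    return classes

p31 = {"Stubbs": ["mayor"]}
p279 = {"mayor": ["public official"], "public official": ["person"]}
print(sorted(inferred_classes(p31, p279, "Stubbs")))
# ['mayor', 'person', 'public official']
```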
- Svavar Kjarrval
Peter F. Patel-Schneider
Nuance Communications
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
So far from the other thread, the current need seems to be for two types of definitions:
1. How to interpret declarations depending on associated properties.
2. Constraints (or suggestions) when interpreting multiple items.
The first definition is used so the machine can know *if* the declaration is up in the hierarchy or sideways. When interpreting the item, the machine needs to know if the property implies that all declarations of that item are inherited. Take, as an example, some currently living human who has a Wikidata item and is connected to an occupation via a property. The machine should know whether it should process the declarations of the occupation and apply them to the human, in whole or in part. Then there are properties which don't inherit: if the human has a declared family member, the human doesn't inherit the other family member's name or birth date.
The other definition has the purpose of resolving contradictions like in my example of Stubbs. If we are realistic, it's not likely that a tree structure with that much data is totally free of contradictions. So we need some way of telling the machine that there are, or could be, contradictions. One example of this is to define that a certain property can't be more than one of something (at any given time). As a simplification (not referring to the current data structure), say that a human is part of a certain species. If we were to define, in this case, that no item can be part of more than one species, then the machine would detect a contradiction. In the specific example of Stubbs, the machine would determine that cats and humans are two separate species and there can be only one[1]. If we had a definition that the declaration closer to the item has precedence, then the machine would resolve it by determining that mayors are generally humans but Stubbs, being a cat, is an exception to that rule.
[1] Didn't see the Highlander reference until I had written it.
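The at-most-one-species rule with nearest-declaration precedence can be sketched like this (the function, the data format, and the labels are all hypothetical simplifications, not the current data structure):

```python
# Detect a species contradiction and resolve it by giving precedence to
# the declaration closest to the item. Toy data, invented for illustration.

def resolve_species(declarations):
    """`declarations` is a list of (source, species) pairs, ordered from
    the item itself outwards along the hierarchy (nearest first)."""
    species = {s for _, s in declarations}
    if len(species) > 1:
        nearest = declarations[0]  # nearest declaration has precedence
        return {"contradiction": True, "resolved": nearest[1]}
    return {"contradiction": False, "resolved": declarations[0][1]}

# Stubbs is directly declared a cat, but "mayor" implies human further up.
print(resolve_species([("Stubbs", "cat"), ("mayor", "human")]))
# {'contradiction': True, 'resolved': 'cat'}
```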
- Svavar Kjarrval
A human is not a part of a species, it is an instance of a species :)
Contradiction management is a very interesting topic, and contradictions are inherent to the Wikidata model. We can't expect everything to be consistent, considering that Wikidata only reflects sources, and that two sources can disagree in an essentially inconsistent way.
We could expect, however, that several statements extracted from the same source should be consistent among themselves, but it might be rare that we will have enough sourced statements to draw useful inferences. This leads to subproblems like computing the maximum consistent set of sources on a part of the graph, or finding the sources that lead to a contradiction when taken together.
However, we already have a qualifier that marks a source as being in contradiction with another: "statement disputed by". We could assume that the sources involved are probably inconsistent with each other.
Or we could simply drop the consistency checks from the inference process :) and leave them to the constraint system: if an inference draws a path that leads to a constraint violation, then the community will be notified. To avoid explosion, the scope of inferences could be limited (not trying to compute the transitive closure of the inference rules' application). We could use some sort of "partial consistency" notion, such as those used in constraint programming.
Thinking about it, I can imagine constraint problems such as: "considering an inference I deduced some way, is it fully consistent with the set of sources we have, or is there a set of sources that implies the inference is not true?" That is: is the inference a tautology, or is it only satisfiable, in a problem where each statement maps to a variable, the different sources are values in the domains of the variables, and the sources must be consistent with respect to what we know they say on Wikidata?
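A brute-force sketch of this constraint-problem framing (every name and all data are invented; a real system would enumerate far more cleverly): each statement is a variable, the sources that support it are its domain, and an assignment is consistent if no chosen pair of sources is known to dispute each other.

```python
from itertools import product

def satisfiable(domains, disputed):
    """domains: {statement: [sources]}; disputed: pairs of mutually
    inconsistent sources. Returns True if some assignment of one source
    per statement avoids every disputed pair."""
    disputed_sets = [frozenset(p) for p in disputed]
    statements = list(domains)
    for choice in product(*(domains[s] for s in statements)):
        chosen = set(choice)
        if not any(pair <= chosen for pair in disputed_sets):
            return True
    return False

# Two statements; srcA and srcC dispute each other, but srcB supports both.
domains = {"s1": ["srcA", "srcB"], "s2": ["srcB", "srcC"]}
disputed = [("srcA", "srcC")]
print(satisfiable(domains, disputed))  # True: pick srcB for both statements
```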
On 27.08.2015 14:43, Svavar Kjarrval wrote:
So far from the other thread, the current need seems to be for two types of definitions:
- How to interpret declarations depending on associated properties.
If I understand your explanations correctly, the first point is a very specific case of inference, which is already thinking in terms of "hierarchies" (of some property). I am asking: how do we even know that some properties are supposed to be read as forming a "hierarchy"? This is one special case of a rule of inference that one might formulate. Have a look at
https://www.wikidata.org/wiki/Wikidata:WikiProject_Reasoning/Use_cases
for some more examples of what could be relevant inferences. As you can see, only a few of these cases have anything to do with hierarchies ("subclass of" in particular), but one could easily come up with similar rules to express that something should be propagated along a hierarchy (in some cases).
- Constraints (or suggestions) when interpreting multiple items.
For me, a constraint is a rule that infers a warning. It can follow a similar pattern to the examples I gave, but instead of deriving a new statement, it will derive that a human should take a closer look at a particular piece of our data to check if it is meaningful.
There is no huge theoretical challenge involved here, but a big practical one. I expect that we will refine our rules once we encounter cases where they do not yield the right result. If you look at the examples I gave, they are all mostly based on how we choose to define the meaning of our properties. This is different from our current constraints, which specify how things *usually* are in the world. We can have both (constraints that warn us of unusual situations and rules that derive statements) based on similar technology, but different considerations are relevant when defining these two types of things.
As for Stubbs, there is a strong and a weak rule involved:
- Strong: all mayors are persons (I assume now that this class encompasses named animals, as suggested in earlier messages; if not, then replace "person" by a suitable generalisation that does).
- Weak: most mayors are humans.
The strong version could probably be applied to derive new information, without danger of "exceptions" -- it would be part of our characterisation of what makes something a "person" in our view (or whatever other class we pick there). The weak version should only be used to find potential problems that humans might want to check.
Similar rules exist in many domains:
- Strong: All birds are animals (it's part of how we define "bird").
- Weak: All birds can fly (it's something we observe for actual birds, but not part of the definition of what it means to be a bird).
I suggest we start by focussing on strong rules, since they make a big contribution to documenting what we mean (by "person", by "bird", etc.), even before we have any tool support for acting on this information.
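The strong/weak distinction could be sketched as follows (the rule encoding and all labels are invented; strong rules derive statements, weak rules only derive warnings for human review):

```python
# Apply one rule of the form "if class X then class Y" to an item's
# known classes. Strong rules produce new statements; weak rules only
# flag the item for a human to check. Toy encoding, not a proposal.

def apply_rule(rule, item_classes):
    if rule["if"] in item_classes:
        if rule["strength"] == "strong":
            return ("statement", rule["then"])
        return ("warning", f"check whether this item is really a {rule['then']}")
    return None  # rule does not fire

strong = {"if": "mayor", "then": "person", "strength": "strong"}
weak = {"if": "mayor", "then": "human", "strength": "weak"}

print(apply_rule(strong, {"mayor", "cat"}))  # ('statement', 'person')
print(apply_rule(weak, {"mayor", "cat"})[0])  # 'warning' -- Stubbs gets flagged, not reclassified
```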
Cheers,
Markus
On fim 27.ágú 2015 15:52, Markus Krötzsch wrote:
I'm a big advocate of strong versions. My suggestion for "exceptions" was practical, since we can't reasonably expect all data to be consistent with strong definitions. Personally, I wouldn't support weak versions when a feasible strong alternative is available. The constraints I had in mind are only suggestive and would only serve as warnings, so I think we agree there. The constraints wouldn't be enforced but rather used to detect potential mistakes in the data. They wouldn't prevent someone from adding the information that Stubbs is a mayor, even when it would lead to the contradiction of him being both a human and a cat.
Regarding your question about my first definition, the point is to serve as a classification of what can reasonably be inferred from the relationship of two items, depending on the property used to connect them. Like in the case of Stubbs: Stubbs is a mayor, and from that connection we can (or should be able to) assume Stubbs is also a public official, a head of government and a politician. However, we shouldn't reasonably be able to assume Stubbs's Freebase identifier is the same as the town's. The purpose is to enable machines to retrieve an item and extract all the relevant facts which can reasonably be inferred from the relationship of that item with other items, recursively, until all the branches are exhausted.
- Svavar Kjarrval