Hi Andrew,
I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can be used for queries and maybe also in the UI.
This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.
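As a rough illustration of this kind of inference, here is a minimal Python sketch. It assumes child (P40) statements are available as plain (parent, child) pairs and that gender is a simple string per person; the "son" and "daughter" labels and the plain-string gender values are hypothetical placeholders, not existing Wikidata identifiers.

```python
# Hypothetical sketch: derive "son"/"daughter" values from child (P40)
# statements plus gender data. Labels and gender strings are illustrative
# assumptions, not Wikidata property or item IDs.
CHILD_LABEL = {"male": "son", "female": "daughter"}


def infer_gendered_children(child_statements, gender_of):
    """Return (parent, label, child) triples for children with a mapped gender."""
    inferred = []
    for parent, child in child_statements:
        label = CHILD_LABEL.get(gender_of.get(child))
        # Nothing is inferred when gender is unknown or not covered by the mapping.
        if label is not None:
            inferred.append((parent, label, child))
    return inferred


statements = [("A", "C"), ("A", "D"), ("A", "E")]
genders = {"C": "male", "D": "female"}  # no mapped gender recorded for E
result = infer_gendered_children(statements, genders)
print(result)
```

Note that nothing is inferred for E: where the gender data does not fit the mapping, the sketch simply stays silent rather than guessing.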
A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.
For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.
Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).
One also has to be careful with qualifiers etc. For example, the start and end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.
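To make the qualifier question concrete, here is a minimal Python sketch of one possible rule: a whitelist of qualifiers that are copied to the symmetric statement. The Statement class is a hypothetical in-memory model (not the Wikibase data model), the whitelist is only an assumption for illustration (P580 is "start time", P582 is "end time"), and "P_other" is a made-up stand-in for any qualifier that should not be copied.

```python
from dataclasses import dataclass, field

# Assumption: only these qualifiers are safe to copy to the symmetric statement.
COPYABLE_QUALIFIERS = {"P580", "P582"}  # start time, end time


@dataclass
class Statement:
    """Hypothetical simplified statement: subject, property, value, qualifiers."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


def mirror(stmt):
    """Build the symmetric statement, keeping only whitelisted qualifiers."""
    kept = {p: v for p, v in stmt.qualifiers.items() if p in COPYABLE_QUALIFIERS}
    return Statement(stmt.value, stmt.prop, stmt.subject, kept)


original = Statement(
    "A", "P26", "B",
    {"P580": "1990", "P582": "2001", "P_other": "not symmetric"},
)
mirrored = mirror(original)
print(mirrored)
```

In a real proposal the whitelist would presumably be specified per property and maintained on-wiki, in line with the point above that the community has to control what is inferred.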
A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.
Best regards,
Markus
On 17.08.2015 13:29, Andrew Gray wrote:
Hi all,
I've recently been thinking about how we handle family/genealogical relationships in Wikidata - this is, potentially, a really valuable source of information for researchers to have available in a structured form, especially now we're bringing together so many biographical databases.
We currently have the following properties to link people together:
- spouses (P26) and cohabitants (P451) - not gendered
- parents (P22/P25) and step-parents (P43/P44) - gendered
- siblings (P7/P9) - gendered
- children (P40) - not gendered (and oddly no step-children?)
- a generic "related to" (P1038) for more distant relationships
There are two big things that jump out here.
** First, gender. Parents are split by gender while children are not (we have mother/father not son/daughter). Siblings are likewise gendered, and spouses are not. These are all very early properties - does anyone remember how we got this way?
This makes for some odd results. For example, if we want to use our data to identify all the male-line *descendants* of a person, we have to do some complicated inference from [P40 + target is male]. However, to identify all the male-line *ancestors*, we can just run back up the P22 chain. It feels quite strange to have this difference, and I wonder if we should standardise one way or the other - split P40 or merge the others.
In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".
** Secondly, symmetry. Siblings, spouses, and parent-child pairs are by definition symmetric. If A has P26:B, then B should also have P26:A. The gendered cases are a little more complicated, as if A has P40:B, then B has P22:A or P25:A, but there is still a degree of symmetry - one of those must be true.
However, Wikidata doesn't really help us make use of this symmetry. If I list A as spouse of B, I need to add (separately) that B is spouse of A. If they have four children C, D, E, and F, this gets very complicated - we have six articles with *30* links between them, all of which need to be made manually. It feels like automatically making symmetric links for these properties would save a lot of work, and produce a much more reliable dataset.
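The figure of 30 can be checked by enumerating the links. The short Python sketch below is only an illustration: the function name is made up, and the gendered properties are written as "P22/P25" and "P7/P9" because the right one depends on each person's gender.

```python
from itertools import permutations


def family_links(parents, children):
    """List every directed link needed for a fully cross-linked family."""
    links = []
    # Spouse links (P26), one in each direction.
    links += [(a, "P26", b) for a, b in permutations(parents, 2)]
    for parent in parents:
        for child in children:
            links.append((parent, "P40", child))      # parent -> child
            links.append((child, "P22/P25", parent))  # child -> father or mother
    # Sibling links (P7/P9), one in each direction.
    links += [(a, "P7/P9", b) for a, b in permutations(children, 2)]
    return links


links = family_links(["A", "B"], ["C", "D", "E", "F"])
print(len(links))  # 2 spouse + 16 parent/child + 12 sibling = 30
```

More generally, n people who are all linked to each other need n * (n - 1) directed statements, which is where 6 * 5 = 30 comes from.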
I believe we decided early on not to do symmetric links because it would swamp commonly linked articles (imagine what Q5 would look like by now!). On the other hand, these are properties with a very narrowly defined scope, and we actively *want* them to be comprehensively symmetric - every parent article should list all their children on Wikidata, and every child article should list their parent and all their siblings.
Perhaps it's worth reconsidering whether to allow symmetry for a specifically defined class of properties - would an automatically symmetric P26 really swamp the system? It would be great if the system could match up relationships and fill in missing parent/child, sibling, and spouse links. I can't be the only one who regularly adds one half of the relationship and forgets to include the other!
A bot looking at all of these and filling in the gaps might be a useful approach... but it would break down if someone tries to remove one of the symmetric entries without also removing the other, as the bot would probably (eventually) fill it back in. Ultimately, an automatic symmetry would seem best.
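A minimal sketch of the gap-finding step such a bot might run is shown below, in Python. It assumes statements are available as plain (subject, property, value) triples and treats only P26 and P451 as symmetric; both the data layout and the function name are assumptions for illustration, not an existing bot or API.

```python
# Assumption: the properties to be treated as symmetric (spouse, cohabitant).
SYMMETRIC = {"P26", "P451"}


def missing_symmetric(statements):
    """Return the reverse statements that are absent for symmetric properties."""
    present = set(statements)
    return sorted(
        (value, prop, subject)
        for (subject, prop, value) in present
        if prop in SYMMETRIC and (value, prop, subject) not in present
    )


statements = [
    ("A", "P26", "B"),
    ("B", "P26", "A"),  # already symmetric, nothing to add
    ("C", "P26", "D"),  # reverse link is missing
    ("A", "P40", "E"),  # not a symmetric property, ignored here
]
gaps = missing_symmetric(statements)
print(gaps)
```

The sketch also shows the weakness described above: it cannot tell a reverse link that was never added from one that an editor removed on purpose, so it would report both as gaps.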
Thoughts on either of these? If there is interest I will write up a formal proposal on-wiki.