Properties for family relationships in Wikidata

Andrew Gray

17 Aug 2015

4:59 p.m.

Hi all,

I've recently been thinking about how we handle family/genealogical relationships in Wikidata - this is, potentially, a really valuable source of information for researchers to have available in a structured form, especially now we're bringing together so many biographical databases.

We currently have the following properties to link people together:

* spouses (P26) and cohabitants (P451) - not gendered
* parents (P22/P25) and step-parents (P43/P44) - gendered
* siblings (P7/P9) - gendered
* children (P40) - not gendered (and oddly no step-children?)
* a generic "related to" (P1038) for more distant relationships

There are two big things that jump out here.

** First, gender. Parents are split by gender while children are not (we have mother/father not son/daughter). Siblings are likewise gendered, and spouses are not. These are all very early properties - does anyone remember how we got this way?

This makes for some odd results. For example, if we want to use our data to identify all the male-line *descendants* of a person, we have to do some complicated inference from [P40 + target is male]. However, to identify all the male-line *ancestors*, we can just run back up the P22 chain. It feels quite strange to have this difference, and I wonder if we should standardise one way or the other - split P40 or merge the others.
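As a rough illustration of this asymmetry, here is a minimal sketch over invented toy data. The item IDs (Q_father and so on) are hypothetical placeholders, not real Wikidata items; only P21, P22, P40 and the value Q6581097 (male) are real identifiers. Walking up the male line needs a single property, while walking down needs P40 plus a gender check on every child.

```python
# Toy data: item -> {property: [values]}. All Q_... item IDs are invented.
MALE = "Q6581097"
FEMALE = "Q6581072"

ITEMS = {
    "Q_grandfather": {"P21": [MALE], "P40": ["Q_father", "Q_aunt"]},
    "Q_father": {"P21": [MALE], "P22": ["Q_grandfather"], "P40": ["Q_son", "Q_daughter"]},
    "Q_aunt": {"P21": [FEMALE], "P22": ["Q_grandfather"], "P40": ["Q_cousin"]},
    "Q_son": {"P21": [MALE], "P22": ["Q_father"]},
    "Q_daughter": {"P21": [FEMALE], "P22": ["Q_father"]},
    "Q_cousin": {"P21": [MALE], "P25": ["Q_aunt"]},
}


def male_line_ancestors(item):
    """Follow the father (P22) chain upwards; no gender lookup is needed."""
    chain = []
    seen = {item}
    fathers = ITEMS.get(item, {}).get("P22", [])
    while fathers:
        father = fathers[0]
        if father in seen:  # guard against cycles in bad data
            break
        chain.append(father)
        seen.add(father)
        fathers = ITEMS.get(father, {}).get("P22", [])
    return chain


def male_line_descendants(item):
    """Follow child (P40) links, keeping only children whose P21 is male."""
    result = []
    seen = {item}
    stack = [item]
    while stack:
        current = stack.pop()
        for child in ITEMS.get(current, {}).get("P40", []):
            if child in seen:
                continue
            if MALE in ITEMS.get(child, {}).get("P21", []):
                seen.add(child)
                result.append(child)
                stack.append(child)
    return result
```

In this toy data the cousin is male but descends through the aunt, so the descendant walk correctly leaves him out; that is the extra inference step the ancestor walk never has to make.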

In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".

** Secondly, symmetry. Siblings, spouses, and parent-child pairs are by definition symmetric. If A has P26:B, then B should also have P26:A. The gendered cases are a little more complicated, as if A has P40:B, then B has P22:A or P25:A, but there is still a degree of symmetry - one of those must be true.

However, Wikidata doesn't really help us make use of this symmetry. If I list A as spouse of B, I need to add (separately) that B is spouse of A. If they have four children C, D, E, and F, this gets very complicated - we have six articles with *30* links between them, all of which need to be made manually. It feels like automatically making symmetric links for these properties would save a lot of work, and produce a much more reliable dataset.
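The figure of 30 can be checked with a short sketch that enumerates every link a fully cross-linked family of six needs. The relation names here are generic labels chosen for the example; in the data model described above, "parent" would be P22 or P25 and "sibling" would be P7 or P9 depending on gender.

```python
def required_statements(parents, children):
    """List every (subject, relation, object) link a fully cross-linked family needs."""
    a, b = parents
    links = [(a, "spouse", b), (b, "spouse", a)]
    for parent in parents:
        for child in children:
            links.append((parent, "child", child))
            links.append((child, "parent", parent))
    for child in children:
        for other in children:
            if other != child:
                links.append((child, "sibling", other))
    return links


links = required_statements(("A", "B"), ["C", "D", "E", "F"])
print(len(links))  # 2 spouse + 16 parent/child + 12 sibling links
```

Each of the six items ends up carrying five statements, one for every other family member.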

I believe we decided early on not to do symmetric links because it would swamp commonly linked articles (imagine what Q5 would look like by now!). On the other hand, these are properties with a very narrowly defined scope, and we actively *want* them to be comprehensively symmetric - every parent article should list all their children on Wikidata, and every child article should list their parents and all their siblings.

Perhaps it's worth reconsidering whether to allow symmetry for a specifically defined class of properties - would an automatically symmetric P26 really swamp the system? It would be great if the system could match up relationships and fill in missing parent/child, sibling, and spouse links. I can't be the only one who regularly adds one half of the relationship and forgets to include the other!

A bot looking at all of these and filling in the gaps might be a useful approach... but it would break down if someone tries to remove one of the symmetric entries without also removing the other, as the bot would probably (eventually) fill it back in. Ultimately, an automatic symmetry would seem best.
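A minimal sketch of what such a gap-filling pass might look like, and of why it clashes with manual removals. Statements are modelled as plain (subject, property, value) tuples, the item IDs are invented, and treating only P26 as symmetric is an assumption made for the example; this is not how any existing bot works.

```python
SYMMETRIC = {"P26"}  # spouse; assumption: only this property is handled here


def missing_symmetric(statements):
    """Return the reverse statements that are absent from the data set."""
    present = set(statements)
    return sorted(
        (obj, prop, subj)
        for (subj, prop, obj) in present
        if prop in SYMMETRIC and (obj, prop, subj) not in present
    )


statements = [
    ("Q_A", "P26", "Q_B"),
    ("Q_B", "P26", "Q_A"),
    ("Q_C", "P26", "Q_D"),  # reverse link was never added
]
gaps = missing_symmetric(statements)

# An editor removes A -> B but forgets B -> A. The bot cannot tell a
# deliberate removal from a forgotten link, so it proposes re-adding it.
after_removal = [s for s in statements if s != ("Q_A", "P26", "Q_B")]
readded = missing_symmetric(after_removal)
```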

Thoughts on either of these? If there is interest I will write up a formal proposal on-wiki.

-- - Andrew Gray andrew.gray@dunelm.org.uk

Markus Kroetzsch

17 Aug

6:17 p.m.

Hi Andrew,

I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can be used for queries and maybe also in the UI.

This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.

A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.
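One way to picture this is a sketch in which every inferred statement records the statement it was derived from. The `inferred_from` field is a hypothetical stand-in for the special reference; no such reference property is defined in this thread. Because inferred statements are recomputed from the human-entered ones on every pass, removing the source also removes the inference, so editors and bots do not end up fighting.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    inferred_from: Optional[Triple] = None  # hypothetical "inferred from" marker


def sync_spouse(statements):
    """Keep human-entered statements and rebuild the inferred P26 reverses."""
    asserted = [s for s in statements if s.inferred_from is None]
    asserted_keys = {(s.subject, s.prop, s.value) for s in asserted}
    result = list(asserted)
    for s in asserted:
        if s.prop != "P26":
            continue
        reverse = (s.value, "P26", s.subject)
        if reverse not in asserted_keys:
            source = (s.subject, s.prop, s.value)
            result.append(Statement(*reverse, inferred_from=source))
    return result


first_pass = sync_spouse([Statement("Q_A", "P26", "Q_B")])

# An editor deletes the human-entered statement; only the inferred one is left.
after_removal = [s for s in first_pass if s.inferred_from is not None]
second_pass = sync_spouse(after_removal)  # the orphaned inference is dropped
```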

For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.

Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start and end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.
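A small sketch of the qualifier question, assuming a per-property whitelist of copyable qualifiers. P580 (start time) and P582 (end time) are the qualifiers named above; P1545 (series ordinal) is used here only as an illustrative example of a qualifier that may not carry over, since someone's second spouse need not have that person as their own second spouse. The dictionary layout and item IDs are invented for the example.

```python
COPYABLE_QUALIFIERS = {"P580", "P582"}  # start time, end time


def symmetric_spouse_statement(statement):
    """Build the reverse P26 statement, copying only whitelisted qualifiers."""
    if statement["property"] != "P26":
        raise ValueError("only spouse (P26) statements are handled in this sketch")
    return {
        "subject": statement["value"],
        "property": "P26",
        "value": statement["subject"],
        "qualifiers": {
            key: val
            for key, val in statement["qualifiers"].items()
            if key in COPYABLE_QUALIFIERS
        },
    }


original = {
    "subject": "Q_A",
    "property": "P26",
    "value": "Q_B",
    "qualifiers": {"P580": "1950", "P582": "1960", "P1545": "2"},
}
mirrored = symmetric_spouse_statement(original)
```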

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.

Best regards,

Markus

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Gerard Meijssen

10:28 p.m.

Hoi, When you make these inferences, you have to appreciate how English-oriented they are. In many cultures there are specific names for older sisters and brothers and for younger sisters and brothers. There are names for uncles and aunts on the mother's side that differ from those on the father's side.

Inferences are language-specific. They may have a place, but they are not obvious when you look at the scale of Wikidata. Thanks, GerardM

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Joe Filceolaire

19 Aug

6:49 a.m.

Rather than automatically filling out symmetric property statements, we should just display statements that link to the current item as well as the statements that link from the item, i.e. the statement "John has mother Mary" should appear on the item/page for Mary as well as on the item/page for John. Then we could think about getting rid of many of the symmetric properties.
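A minimal sketch of this display idea: store each statement once and build both an outgoing and an incoming view for rendering. The tuple layout and the item IDs Q_John and Q_Mary are invented for the example; P25 is the existing "mother" property.

```python
from collections import defaultdict


def build_views(statements):
    """Index statements by subject (outgoing) and by value (incoming)."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subj, prop, obj in statements:
        outgoing[subj].append((prop, obj))
        incoming[obj].append((prop, subj))
    return outgoing, incoming


# Stored once, on John's item only.
statements = [("Q_John", "P25", "Q_Mary")]
outgoing, incoming = build_views(statements)

# Mary's page can show the incoming view without any statement of her own.
marys_backlinks = incoming["Q_Mary"]
```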

Joe

Andrew Gray

20 Aug

6:11 p.m.

On 19 August 2015 at 02:19, Joe Filceolaire filceolaire@gmail.com wrote:

...

Rather than automatically filling out symmetric property statements, we should just display statements that link to the current item as well as the statements that link from the item, i.e. the statement "John has mother Mary" should appear on the item/page for Mary as well as on the item/page for John. Then we could think about getting rid of many of the symmetric properties.

In the long run, I think this would be ideal. However, we would need to be very careful about this - it would be easy to make the items "human" or "city" completely unworkable due to all the "instance of" backlinks!

Some kind of way of defining properties as being suitable for crosslinking would be key here - e.g. P26, definitely suitable; P31, definitely not.
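A sketch of how such a per-property switch could work when deciding which backlinks to display. The whitelist contents beyond P26 are an assumption for the example, and the item IDs other than Q5 (human) are invented.

```python
# Assumption: the community maintains a list like this; only P26 is named above.
CROSSLINK_PROPERTIES = {"P26", "P22", "P25", "P40"}


def backlinks_to_show(item, statements, allowed=CROSSLINK_PROPERTIES):
    """Return (subject, property) pairs pointing at item via an allowed property."""
    return [
        (subj, prop)
        for subj, prop, obj in statements
        if obj == item and prop in allowed
    ]


statements = [
    ("Q_A", "P31", "Q5"),   # instance of human: would swamp Q5 if shown
    ("Q_B", "P31", "Q5"),
    ("Q_A", "P26", "Q_B"),  # spouse: narrow scope, safe to show on Q_B
]
on_human = backlinks_to_show("Q5", statements)
on_b = backlinks_to_show("Q_B", statements)
```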

-- - Andrew Gray andrew.gray@dunelm.org.uk

Andrew Gray

6:21 p.m.

As someone with an extensive collection of Hindi-speaking relatives, I agree entirely with the complexity here. Never did a language have such specialised ways of identifying your relations :-)

However, we already seem to manage fine with "simple" relation properties like spouse or child, without significant language complications, and as long as all we're doing is putting these on more items rather than inferring more complex relationships, I think we should be okay.

Andrew.

On 17 August 2015 at 17:58, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...

Hoi, When you make these inferences, you have to appreciate how English oriented they are. In many cultures there are specific names for older sisters, brothers and younger sisters and brothers. There are names for uncles aunts from mother's side that differ from those of father's side.

Inferences are language specific. They may have a place but they are not obvious when you look at a scale of Wikidata. Thanks, GerardM

On 17 August 2015 at 14:47, Markus Kroetzsch markus.kroetzsch@tu-dresden.de wrote:

...
Hi Andrew,

I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can used for queries and maybe also in the UI.

This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.

A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.

For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.

Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start end end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.

Best regards,

Markus

On 17.08.2015 13:29, Andrew Gray wrote:

...
Hi all,

I've recently been thinking about how we handle family/genealogical relationships in Wikidata - this is, potentially, a really valuable source of information for researchers to have available in a structured form, especially now we're bringing together so many biographical databases.

We currently have the following properties to link people together:

spouses (P26) and cohabitants (P451) - not gendered

parents (P22/P25) and step-parents (P43/P44) - gendered

siblings (P7/P9) - gendered

children (P40) - not gendered (and oddly no step-children?)

a generic "related to" (P1038) for more distant relationships

There's two big things that jump out here.

** First, gender. Parents are split by gender while children are not (we have mother/father not son/daughter). Siblings are likewise gendered, and spouses are not. These are all very early properties - does anyone remember how we got this way?

This makes for some odd results. For example, if we want to using our data to identify all the male-line *descendants* of a person, we have to do some complicated inference from [P40 + target is male]. However, to identify all the male-line *ancestors*, we can just run back up the P22 chain. It feels quite strange to have this difference, and I wonder if we should standardise one way or the other - split P40 or merge the others.

In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".

** Secondly, symmetry. Siblings, spouses, and parent-child pairs are by definition symmetric. If A has P26:B, then B should also have P26:A. The gendered cases are a little more complicated, as if A has P40:B, then B has P22:A or P25:A, but there is still a degree of symmetry - one of those must be true.

However, Wikidata doesn't really help us make use of this symmetry. If I list A as spouse of B, I need to add (separately) that B is spouse of A. If they have four children C, D, E, and F, this gets very complicated - we have six articles with *30* links between them, all of which need to be made manually. It feels like automatically making symmetric links for these properties would save a lot of work, and produce a much more reliable dataset.

I believe we decided early on not to do symmetric links because it would swamp commonly linked articles (imagine what Q5 would look like by now!). On the other hand, these are properties with a very narrowly defined scope, and we actively *want* them to be comprehensively symmetric - every parent article should list all their children on Wikidata, and every child article should list their parent and all their siblings.

Perhaps it's worth reconsidering whether to allow symmetry for a specifically defined class of properties - would an automatically symmetric P26 really swamp the system? It would be great if the system could match up relationships and fill in missing parent/child, sibling, and spouse links. I can't be the only one who regularly adds one half of the relationship and forgets to include the other!

A bot looking at all of these and filling in the gaps might be a useful approach... but it would break down if someone tries to remove one of the symmetric entries without also removing the other, as the bot would probably (eventually) fill it back in. Ultimately, an automatic symmetry would seem best.
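As a rough sketch of what such a gap-filling bot would compute, and of why it fights with editors: the function below (a hypothetical helper, not an existing bot) lists the reciprocal statements that are missing for a symmetric property. If an editor deletes only one side of a pair, the deleted statement shows up as "missing" again on the next run.

```python
def missing_reciprocals(statements, prop="P26"):
    """List (subject, prop, value) triples whose mirror image is absent."""
    existing = {(s, o) for s, p, o in statements if p == prop}
    return sorted((o, prop, s) for s, o in existing if (o, s) not in existing)


statements = [
    ("Q1", "P26", "Q2"),
    ("Q2", "P26", "Q1"),
    ("Q3", "P26", "Q4"),  # only one half was entered
]
print(missing_reciprocals(statements))  # [('Q4', 'P26', 'Q3')]

# An editor removes one side only; a naive bot would now propose it again.
statements.remove(("Q2", "P26", "Q1"))
print(missing_reciprocals(statements))
# [('Q2', 'P26', 'Q1'), ('Q4', 'P26', 'Q3')]
```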

Thoughts on either of these? If there is interest I will write up a formal proposal on-wiki.

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata


-- - Andrew Gray andrew.gray@dunelm.org.uk

Markus Krötzsch

7:46 p.m.

On 20.08.2015 14:51, Andrew Gray wrote:

...

As someone with an extensive collection of Hindi-speaking relatives, I agree entirely with the complexity here. Never did a language have such specialised ways of identifying your relations :-)

This is in fact exactly where inferred relations can make life easier. Instead of storing many different culture-specific properties on Wikidata (which would lead to a lengthy page with a lot of culture-specific relations), one can infer their values from existing data on the fly. It is not necessary to show these inferences to all users in all contexts, but one can offer them to users who are interested in this (e.g., in Reasonator, based on the language setting).
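A small sketch of the idea, assuming a toy dataset with made-up item names: only the basic properties are stored (father P22, mother P25, brother P7), and the side-specific uncle relations are derived when asked for. How the two result lists would be labelled in a given language is left to the display layer.

```python
# Toy data: item -> {property: [values]}. All item names are hypothetical.
ITEMS = {
    "child": {"P22": ["dad"], "P25": ["mum"]},
    "dad": {"P7": ["dads_brother"]},
    "mum": {"P7": ["mums_brother"]},
}


def uncles_by_side(person, items):
    """Derive father's brothers and mother's brothers from stored statements."""
    result = {"paternal": [], "maternal": []}
    for side, parent_prop in (("paternal", "P22"), ("maternal", "P25")):
        for parent in items.get(person, {}).get(parent_prop, []):
            result[side].extend(items.get(parent, {}).get("P7", []))
    return result


print(uncles_by_side("child", ITEMS))
# {'paternal': ['dads_brother'], 'maternal': ['mums_brother']}
```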

There are still some steps needed until we can have this, but I can see a great chance there to make Wikidata more adapted to the cultural diversity of its users while keeping the underlying data simple.

Markus

...

However, we already seem to manage fine with "simple" relation properties like spouse or child, without significant language complications, and as long as all we're doing is putting these on more items rather than inferring more complex relationships, I think we should be okay.

Andrew.

On 17 August 2015 at 17:58, Gerard Meijssen gerard.meijssen@gmail.com wrote:

...
Hoi, When you make these inferences, you have to appreciate how English oriented they are. In many cultures there are specific names for older sisters, brothers and younger sisters and brothers. There are names for uncles and aunts from mother's side that differ from those of father's side.

Inferences are language specific. They may have a place but they are not obvious when you look at the scale of Wikidata. Thanks, GerardM

On 17 August 2015 at 14:47, Markus Kroetzsch markus.kroetzsch@tu-dresden.de wrote:

...
Hi Andrew,

I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can be used for queries and maybe also in the UI.

This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.

A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.

For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.

Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start and end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.

Best regards,

Markus


Gerard Meijssen

11:13 p.m.

Hoi, I am surprised at your argument. In London you argued against automated descriptions. Automated descriptions are inferred. They add serious value and they provide a solution we do not really have in any other way.

Your argument was that we need static values. For the life of me, I did not understand why you said it then, and I do not understand why you say it now. Thanks, GerardM


Stas Malyshev

19 Aug

10:37 a.m.

Hi!

...

One also has to be careful with qualifiers etc. For example, the start and end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.

By coincidence, just today I was looking into the qualifiers of the "spouse" property, specifically start/end times, and we do have a sizeable number of entries where the qualifiers on the two ends do not match. Usually one of them is simply missing, but they can also have different values, since different sources (especially imports from different language wikis) may contain different information. We will probably need some way of handling such cases manually.
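A minimal sketch of such a check, assuming each spouse statement's qualifiers are available as a plain dict. P580 (start time) and P582 (end time) are the Wikidata qualifier IDs; the function name, the data layout, and the year values are illustrative assumptions. It separates the common "missing on one side" case from genuine conflicts that would need manual review.

```python
def compare_spouse_qualifiers(a_to_b, b_to_a, keys=("P580", "P582")):
    """Report start/end qualifiers that differ between the two directions."""
    report = {}
    for key in keys:
        left, right = a_to_b.get(key), b_to_a.get(key)
        if left == right:
            continue  # same value on both sides, or absent on both
        if left is None or right is None:
            report[key] = "missing on one side"
        else:
            report[key] = "conflict"
    return report


a_to_b = {"P580": "1950", "P582": "1970"}
b_to_a = {"P580": "1951"}
print(compare_spouse_qualifiers(a_to_b, b_to_a))
# {'P580': 'conflict', 'P582': 'missing on one side'}
```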

...

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This

Due to the above, automatic inference may not be trivial, especially for exports, which are now processed in a context-free fashion, i.e. each entity is processed more or less independently of the others. There are also performance concerns: if, when generating data about a certain entity, we also need to load data about all the entities it relates to, the workload per item increases greatly, so we need to see whether we can do this efficiently when doing dumps, etc.

-- Stas Malyshev smalyshev@wikimedia.org

Gerard Meijssen

11:26 a.m.

Hoi, There is always caching.. Thanks, GerardM


Stas Malyshev

12:08 p.m.

Hi!

...

There is always caching..

We have 14M+ entries, so keeping them all in memory won't be realistic, and repetitiveness of access is pretty low - each one would be accessed only once for each inferred relationship. The cache would therefore work well only if we are somehow lucky enough to process related entities in clusters, so that both ends of a relationship are processed within a short time of each other, and I don't see why we can count on such luck. Besides that, caching would only save the time required to load the data from the database, not the time to actually process all the inferences. If we talk about 15M entities, every 1ms of extra processing time per entity adds about 4 hours to dump processing. Granted, with modern CPUs you can do a lot in 1ms, but we should keep in mind the costs.
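The arithmetic behind that estimate, as a short sketch (the function name is made up; it assumes strictly sequential processing with no parallelism):

```python
def extra_dump_hours(entities, extra_ms_per_entity):
    """Extra wall-clock hours added to a sequential dump run."""
    seconds = entities * extra_ms_per_entity / 1000
    return seconds / 3600


# 15 million entities at 1 ms each is 15,000 seconds.
print(round(extra_dump_hours(15_000_000, 1), 2))  # 4.17
```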

Also, there's another thing. Suppose we have Q345 -> spouse -> Q123, but not Q123 -> spouse -> Q345, and we process entities, without loss of generality, in order of ascending IDs. When we generate data for Q123, we don't yet know that Q345 is linked to it. So in order to infer Q123 -> spouse -> Q345, we can't just load Q345 (we'd need to load it later anyway to get the qualifiers, etc.), since we don't know we'd need it; we'd probably have to query the database somehow (if we have a suitable links table?) for every entry that has Q123 on the other end of "spouse". I'm not even sure that's currently possible on Wikidata (the query service can easily do it, but not within 1ms), and even if it is, I don't see how it is cacheable, and doing this for every entity for multiple relationships may be quite expensive.
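To illustrate the lookup being described: the sketch below builds a reverse index in a separate pass over a toy dataset, then uses it while processing each entity. This is only an illustration of the "who points at me?" query under the assumption that such an index fits in memory, which is precisely the cost in question for millions of entities. Function names and the data layout are hypothetical.

```python
from collections import defaultdict


def build_reverse_index(entities, prop="P26"):
    """Map each value of `prop` to the set of subjects that point at it."""
    index = defaultdict(set)
    for subject, claims in entities.items():
        for value in claims.get(prop, []):
            index[value].add(subject)
    return index


def inferred_for(entity_id, entities, reverse_index, prop="P26"):
    """Values that should be added to entity_id to make `prop` symmetric."""
    stated = set(entities.get(entity_id, {}).get(prop, []))
    return sorted(reverse_index.get(entity_id, set()) - stated)


entities = {
    "Q123": {},                   # no spouse statement on this side
    "Q345": {"P26": ["Q123"]},
}
index = build_reverse_index(entities)
print(inferred_for("Q123", entities, index))  # ['Q345']
print(inferred_for("Q345", entities, index))  # []
```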

-- Stas Malyshev smalyshev@wikimedia.org

Markus Krötzsch

2:20 p.m.

On 19.08.2015 08:38, Stas Malyshev wrote: ...

...

Also, there's another thing. Suppose we have Q345 -> spouse -> Q123, but not Q123 -> spouse -> Q345, and we process entities, without loss of generality, in order of ascending IDs. When we generate data for Q123, we don't yet know that Q345 is linked to it. So in order to infer Q123 -> spouse -> Q345, we can't just load Q345 (we'd need to load it later anyway to get the qualifiers, etc.), since we don't know we'd need it; we'd probably have to query the database somehow (if we have a suitable links table?) for every entry that has Q123 on the other end of "spouse". I'm not even sure that's currently possible on Wikidata (the query service can easily do it, but not within 1ms), and even if it is, I don't see how it is cacheable, and doing this for every entity for multiple relationships may be quite expensive.

That's an important concern for generating the live exports, but it does not actually matter for the dumps. RDF does not care about the order, so you can generate triples about Q123 when processing Q345. There are also other methods of taking advantage of inferences during query answering without having to precompute them first (based on query rewriting, which could be done by a service on top of the main SPARQL endpoint). Anyway, this really needs a bit more thought before it should be part of the main SPARQL endpoint. I will write another email on this ...
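A sketch of the order-independence point, under simplifying assumptions (triples as plain tuples, no qualifiers, a hypothetical `export_triples` generator): the mirrored triple about Q123 is emitted while Q345 is being processed, so no reverse lookup is needed. Since an RDF graph is a set of triples, a mirrored triple that was also stated explicitly is harmless.

```python
def export_triples(entities, symmetric_props=("P26",)):
    """Yield stated triples plus mirrored ones for symmetric properties."""
    for subject, claims in entities.items():
        for prop, values in claims.items():
            for value in values:
                yield (subject, prop, value)
                if prop in symmetric_props:
                    # Emitted "out of order": this triple is about `value`.
                    yield (value, prop, subject)


entities = {
    "Q123": {},
    "Q345": {"P26": ["Q123"]},
}
graph = set(export_triples(entities))
print(sorted(graph))
# [('Q123', 'P26', 'Q345'), ('Q345', 'P26', 'Q123')]
```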

Markus

Markus Krötzsch

2:39 p.m.

Hi all,

There have been some discussions here already on what to do with the inferences (add them to Wikidata, just display them, add them only to the query service, etc.). That's great, but this is already the second step from where we are now.

Right now, we don't have any way yet for people to write down what should be inferred. If we could describe this, we could easily add information on what to do with the inference (add, display, make queryable, use for quality control, etc.). This could be discussed on a case-by-case basis (similar to bot requests).

Even the (very simple) case of symmetry shows that we are not there yet: we have no information anywhere on Wikidata that tells us that start and end qualifiers for spouse should also be symmetric. It is not automatically the case that all qualifiers of a symmetric property are symmetric! For example, "diplomatic relation" (P530) is symmetric and uses qualifier "diplomatic mission sent" (P531) that points to the embassy of the subject country in the value country. Clearly this qualifier should not be copied when inferring symmetric statements. Symmetry is only the simplest case; already "inverse of" requires more information ...
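One way to picture the missing information is a per-property list of qualifiers that may be copied to the mirrored statement. The table below is purely an assumption for illustration (nothing like it exists on Wikidata in this form); the property and qualifier IDs are the real ones (P26 spouse, P580 start time, P582 end time, P530 diplomatic relation, P531 diplomatic mission sent), while the values in the example are placeholders.

```python
# Hypothetical policy table: property -> qualifiers safe to copy when mirroring.
COPY_QUALIFIERS = {
    "P26": {"P580", "P582"},  # spouse: start time and end time are shared
    "P530": set(),            # diplomatic relation: P531 is direction-specific
}


def mirror_statement(subject, prop, value, qualifiers):
    """Build the symmetric statement, copying only whitelisted qualifiers."""
    allowed = COPY_QUALIFIERS.get(prop, set())
    copied = {k: v for k, v in qualifiers.items() if k in allowed}
    return (value, prop, subject, copied)


print(mirror_statement("Q1", "P26", "Q2", {"P580": "1950", "P582": "1970"}))
# ('Q2', 'P26', 'Q1', {'P580': '1950', 'P582': '1970'})
print(mirror_statement("Q10", "P530", "Q20", {"P531": "Q30"}))
# ('Q20', 'P530', 'Q10', {})
```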

We therefore first need to come up with a good way of describing the intended inferences in the wiki. Then we can think about how to best act on this information, step by step. The current constraints such as "this property is symmetric" are obviously too limited for really describing what should be inferred. On the other hand, one needs to take care that descriptions are not too general, to make sure that they can still be implemented and that they remain meaningful when considering many of them together (just consider what happens when an inferred relation triggers another inference ...). Luckily, there is a lot of experience in this area today, so it's not rocket science to come up with a workable description language that is not a collection of special cases and still is not too general or too complicated.

So what's the best way to move forward? I have some ideas on how to do this, but I would like to also have user feedback to make sure that the result is easy to use and covers many important use cases. The basic idea would be to come up with a template-based format for describing rules of inference of the form "If there is a statement that looks like X, then infer a statement that looks like Y". However, there must also be a way to say how the qualifiers should be formed for Y. I have some ideas on how to do this in a (hopefully) sane way.
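Purely as an illustration of the "if X then infer Y" shape, and not the format that would be proposed, a rule could be held as a small record and applied to statements. The `Rule` fields, the `apply_rule` helper, and the toy items are all hypothetical. The example rule infers "father" (P22) from "child" (P40) when the subject's gender (P21) is male.

```python
from dataclasses import dataclass
from typing import Optional

MALE = "Q6581097"    # Wikidata item for "male"
FEMALE = "Q6581072"  # Wikidata item for "female"


@dataclass(frozen=True)
class Rule:
    if_property: str                      # X: a statement "s if_property o"
    then_property: str                    # Y: infer "o then_property s"
    subject_gender: Optional[str] = None  # optional extra condition on s
    copy_qualifiers: frozenset = frozenset()  # qualifiers carried over to Y


def apply_rule(rule, statements, gender_of):
    """Apply one rule to (subject, property, value, qualifiers) tuples."""
    inferred = []
    for s, p, o, quals in statements:
        if p != rule.if_property:
            continue
        if rule.subject_gender is not None and gender_of.get(s) != rule.subject_gender:
            continue
        copied = {k: v for k, v in quals.items() if k in rule.copy_qualifiers}
        inferred.append((o, rule.then_property, s, copied))
    return inferred


father_rule = Rule(if_property="P40", then_property="P22", subject_gender=MALE)
statements = [("A", "P40", "C", {}), ("B", "P40", "C", {})]
gender_of = {"A": MALE, "B": FEMALE}
print(apply_rule(father_rule, statements, gender_of))
# [('C', 'P22', 'A', {})]
```

Keeping the qualifier handling inside the rule itself is one possible design; it means each inference states explicitly what is carried over rather than relying on a global default.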

If other people are interested, we could form some kind of interest group to work this out together. Alternatively, I can start by making a proposal on the wiki.

Markus

On 17.08.2015 14:47, Markus Kroetzsch wrote:

...

Hi Andrew,

I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can used for queries and maybe also in the UI.

This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.

A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.

For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.

Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start end end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.

Best regards,

Markus

On 17.08.2015 13:29, Andrew Gray wrote:

...
Hi all,

I've recently been thinking about how we handle family/genealogical relationships in Wikidata - this is, potentially, a really valuable source of information for researchers to have available in a structured form, especially now we're bringing together so many biographical databases.

We currently have the following properties to link people together:

spouses (P26) and cohabitants (P451) - not gendered

parents (P22/P25) and step-parents (P43/P44) - gendered

siblings (P7/P9) - gendered

children (P40) - not gendered (and oddly no step-children?)

a generic "related to" (P1038) for more distant relationships

There's two big things that jump out here.

** First, gender. Parents are split by gender while children are not (we have mother/father not son/daughter). Siblings are likewise gendered, and spouses are not. These are all very early properties - does anyone remember how we got this way?

This makes for some odd results. For example, if we want to using our data to identify all the male-line *descendants* of a person, we have to do some complicated inference from [P40 + target is male]. However, to identify all the male-line *ancestors*, we can just run back up the P22 chain. It feels quite strange to have this difference, and I wonder if we should standardise one way or the other - split P40 or merge the others.

In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".

** Secondly, symmetry. Siblings, spouses, and parent-child pairs are by definition symmetric. If A has P26:B, then B should also have P26:A. The gendered cases are a little more complicated, as if A has P40:B, then B has P22:A or P25:A, but there is still a degree of symmetry - one of those must be true.

However, Wikidata doesn't really help us make use of this symmetry. If I list A as spouse of B, I need to add (separately) that B is spouse of A. If they have four children C, D, E, and F, this gets very complicated - we have six articles with *30* links between them, all of which need to be made manually. It feels like automatically making symmetric links for these properties would save a lot of work, and produce a much more reliable dataset.

I believe we decided early on not to do symmetric links because it would swamp commonly linked articles (imagine what Q5 would look like by now!). On the other hand, these are properties with a very narrowly defined scope, and we actively *want* them to be comprehensively symmetric - every parent article should list all their children on Wikidata, and every child article should list their parent and all their siblings.

Perhaps it's worth reconsidering whether to allow symmetry for a specifically defined class of properties - would an automatically symmetric P26 really swamp the system? It would be great if the system could match up relationships and fill in missing parent/child, sibling, and spouse links. I can't be the only one who regularly adds one half of the relationship and forgets to include the other!

A bot looking at all of these and filling in the gaps might be a useful approach... but it would break down if someone tries to remove one of the symmetric entries without also removing the other, as the bot would probably (eventually) fill it back in. Ultimately, an automatic symmetry would seem best.
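As a rough sketch of what such a gap-finding pass might look like, the following works over a toy dict of items rather than the live site. The data layout and function names are assumptions; a real bot would read and write through the Wikidata API, which is not shown here.

```python
# Toy data: P26 = spouse, P40 = child, P22 = father, P25 = mother.
ITEMS = {
    "A": {"P26": ["B"], "P40": ["C", "D"]},
    "B": {"P40": ["C"]},
    "C": {"P22": ["B"], "P25": ["A"]},
    "D": {},
}


def missing_spouse_links(items):
    """Return the (subject, property, value) statements needed to mirror P26."""
    missing = []
    for subject, claims in items.items():
        for spouse in claims.get("P26", []):
            if subject not in items.get(spouse, {}).get("P26", []):
                missing.append((spouse, "P26", subject))
    return missing


def children_missing_parent_link(items):
    """Return (child, parent) pairs where P40 has no matching P22 or P25."""
    missing = []
    for parent, claims in items.items():
        for child in claims.get("P40", []):
            child_claims = items.get(child, {})
            parents = child_claims.get("P22", []) + child_claims.get("P25", [])
            if parent not in parents:
                missing.append((child, parent))
    return missing
```

Note that a pass like this cannot tell a statement that was never added from one that an editor removed on purpose, which is exactly the refill problem described above.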

Thoughts on either of these? If there is interest I will write up a formal proposal on-wiki.

Gerard Meijssen

2:43 p.m.

Hoi, I often forget that Wikidata does not have the same power that Reasonator has. When you have parents with children, it will nicely show all the siblings and the complete grandparents and all.

Obviously it can be done, it has been done. It is sad that Wikidata is not as aware as the tooling that has been around for years now. Thanks, GerardM

On 19 August 2015 at 11:09, Markus Krötzsch markus@semantic-mediawiki.org wrote:

...

Hi all,

There have been some discussions here already on what to do with the inferences (add them to Wikidata, just display them, add them only to the query service, etc.). That's great, but this is already the second step from where we are now.

Right now, we don't have any way yet for people to write down what should be inferred. If we could describe this, we could easily add information on what to do with the inference (add, display, make queryable, use for quality control, etc.). This could be discussed on a case-by-case basis (similar to bot requests).

Even the (very simple) case of symmetry shows that we are not there yet: we have no information anywhere on Wikidata that tells us that start and end qualifiers for spouse should also be symmetric. It is not automatically the case that all qualifiers of a symmetric property are symmetric! For example, "diplomatic relation" (P530) is symmetric and uses qualifier "diplomatic mission sent" (P531) that points to the embassy of the subject country in the value country. Clearly this qualifier should not be copied when inferring symmetric statements. Symmetry is only the simplest case; already "inverse of" requires more information ...

We therefore first need to come up with a good way of describing the intended inferences in the wiki. Then we can think about how to best act on this information, step by step. The current constraints such as "this property is symmetric" are obviously too limited for really describing what should be inferred. On the other hand, one needs to take care that descriptions are not too general, to make sure that they can still be implemented and that they remain meaningful when considering many of them together (just consider what happens when an inferred relation triggers another inference ...). Luckily, there is a lot of experience in this area today, so it's not rocket science to come up with a workable description language that is not a collection of special cases and still is not too general or too complicated.

So what's the best way to move forward? I have some ideas on how to do this, but I would like to also have user feedback to make sure that the result is easy to use and covers many important use cases. The basic idea would be to come up with a template-based format for describing rules of inference of the form "If there is a statement that looks like X, then infer a statement that looks like Y". However, there must also be a way to say how the qualifiers should be formed for Y. I have some ideas on how to do this in a (hopefully) sane way.
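One way to picture a rule of this "if X, then infer Y" form, including the qualifier question, is the sketch below. The `Statement` and `SymmetryRule` classes and the `infer` function are hypothetical, and the choice of P580 (start time) and P582 (end time) as the qualifiers to copy for spouse is an assumption about what such a rule would list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    qualifiers: tuple = ()  # (qualifier property, value) pairs


@dataclass(frozen=True)
class SymmetryRule:
    prop: str                   # property the rule applies to
    copy_qualifiers: frozenset  # qualifiers that are themselves symmetric


def infer(rule, stmt):
    """Return the mirrored statement, or None if the rule does not apply."""
    if stmt.prop != rule.prop:
        return None
    kept = tuple(q for q in stmt.qualifiers if q[0] in rule.copy_qualifiers)
    return Statement(stmt.value, stmt.prop, stmt.subject, kept)


# Spouse: start and end dates hold in both directions, so they are copied.
SPOUSE_RULE = SymmetryRule("P26", frozenset({"P580", "P582"}))
# Diplomatic relation: "diplomatic mission sent" (P531) is one-directional.
DIPLOMATIC_RULE = SymmetryRule("P530", frozenset())
```

Keeping the qualifier list as part of each rule, rather than as a global setting, matches the point that symmetry of a property says nothing about its qualifiers.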

If other people are interested, we could form some kind of interest group to work this out together. Alternatively, I can start by making a proposal on the wiki.

Markus

On 17.08.2015 14:47, Markus Kroetzsch wrote:

...
Hi Andrew,

I am very interested in this, especially in the second aspect (how to handle symmetry). There are many cases where we have two or more ways to say the same thing on Wikidata (symmetric properties are only one case). It would be useful to draw these inferences so that they can be used for queries and maybe also in the UI.

This can also help to solve some of the other problems you mention: for those who would like to have properties "son" and "daughter", one could infer their values automatically from other statements, without editors having to maintain this data at all.
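A minimal sketch of deriving "son" and "daughter" views from the ungendered child property (P40) plus gender (P21) might look like this. The toy data, the label table, and the fallback to "child" are all assumptions made for the example.

```python
# Only unambiguous single values are mapped; anything else stays "child".
GENDERED_CHILD_LABEL = {"male": "son", "female": "daughter"}

ITEMS = {
    "parent": {"P40": ["x", "y", "z", "w"]},
    "x": {"P21": ["male"]},
    "y": {"P21": ["female"]},
    "z": {"P21": ["non-binary"]},
    "w": {},  # no gender recorded
}


def labelled_children(items, parent):
    """Pair each child with a derived label, without storing new statements."""
    labelled = []
    for child in items.get(parent, {}).get("P40", []):
        genders = items.get(child, {}).get("P21", [])
        if len(genders) == 1:
            label = GENDERED_CHILD_LABEL.get(genders[0], "child")
        else:
            label = "child"
        labelled.append((child, label))
    return labelled
```

Because the label is computed at read time, editors only maintain P40 and P21, and items with no gender or a non-binary gender simply fall back to the neutral term.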

A possible way to maintain these statements on wiki would be to use a special reference to encode that they have been inferred (and from what). This would make it possible to maintain them automatically without the problem of human editors ending up wrestling with bots ;-) Moreover, it would not require any change in the software on which Wikidata is running.
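The idea of marking inferred statements with their source can be sketched as follows. Here `stored` is a hypothetical stand-in for the statements on the wiki, mapping each statement to the statement it was inferred from (or to `None` when an editor entered it directly); how that marker would actually be encoded as a reference is left open.

```python
def symmetric_of(stmt):
    """Mirror a (subject, property, value) statement."""
    subject, prop, value = stmt
    return (value, prop, subject)


def sync_inferred(stored, symmetric_props):
    """Return (to_add, to_remove) so inferred statements track their sources.

    Human-entered statements are never removed; only statements carrying an
    inference marker are added or withdrawn.
    """
    human = {s for s, source in stored.items() if source is None}
    wanted = {}
    for stmt in human:
        if stmt[1] in symmetric_props:
            mirror = symmetric_of(stmt)
            if mirror not in human:
                wanted[mirror] = stmt
    to_add = {s: src for s, src in wanted.items() if s not in stored}
    to_remove = [s for s, src in stored.items()
                 if src is not None and s not in wanted]
    return to_add, to_remove


stored = {
    ("A", "P26", "B"): None,              # entered by an editor
    ("C", "P26", "D"): ("D", "P26", "C"),  # inferred; its source is gone
}
to_add, to_remove = sync_inferred(stored, {"P26"})
```

In this sketch, deleting the human-entered statement causes its inferred mirror to be withdrawn on the next pass, so an editor only has to remove one side.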

For the cases you mentioned, I don't think that there is a problem with too many inferred statements. There are surely cases where it would not be practical (in the current system) to store inferred data, but family relationships are usually not problematic. In fact, they are very useful to human readers.

Of course, the community needs to fully control what is inferred, and this has to be done in-wiki. We already have symmetry information in constraints, but for useful inference we might have to be stricter. The current constraints also cover some not-so-strict cases where exceptions are likely (e.g., most people have only one gender, but this is not a strong rule; on the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start and end of a "spouse" statement should be copied to its symmetric version, but there might also be qualifiers that should not be copied like this. I would like to work on a proposal for how to specify such things. It would be good to coordinate there.

A first step (even before adding any statement to Wikidata) could be to add inferred information to the query services and RDF exports. This will make it easier to solve part of the problem first without having too many discussions in parallel.

Best regards,

Markus

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

Markus Krötzsch

3:09 p.m.

On 19.08.2015 11:13, Gerard Meijssen wrote:

...

Hoi, I often forget that Wikidata does not have the same power that Reasonator has. When you have parents with children, it will nicely show all the siblings and the complete grandparents and all.

Obviously it can be done, it has been done. It is sad that Wikidata is not as aware as the tooling that has been around for years now.

Yes, Reasonator shows that you can draw useful inferences from the data. As it is now, the inferences are hard-coded in Reasonator, and therefore cannot be extended by the community. The next step for us is to create a way for the community to describe what inferences should be drawn. Reasonator (and other tools) could then enrich their view with further inferences without Magnus having to hand-code all relevant cases on his own ;-).

Markus

Thomas Douillard

22 Aug

4:29 p.m.

Another example where things might get complicated is the common "office - head of office" problem.

For example, in French governments the scope of each ministry changes all the time, depending on the president and the prime minister.

We can have a "Minister of veterans and of slippers", with its corresponding minister, that becomes the "Minister of slippers and panties" in the next government.

By contrast, it's fairly certain that there will always be a minister of foreign affairs, so it's clear that there will be an item for the corresponding minister.

So a contributor can face questions like "which construction do I use?":

* Michel Michu <office held>: <minister> <of>: <slippers>
* Michel Michu <office held>: <minister of slipper>
* Michel Michu <office held>: <minister> <of>: <ministry of slipper>

Do I create the item for the ministry of slippers? For the minister as an office?

I think inferences could help the contributor by stating that all of these are more or less ways to say the same thing, and help with query writing at the same time.

Federico Leva (Nemo)

20 Aug

6:45 p.m.

Andrew Gray, 17/08/2015 13:29:

...

In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".

I think this is quite important. I think properties should focus on one thing at a time and there is no need to state both gender and family relationship in the same statement.

Also, are we really sure we don't currently have linguistic issues? I bet there is at least one language in the world where "sister" and "brother" are not two distinct words.

Nemo

Gerard Meijssen

6:51 p.m.

Hoi, <grin> the English word sibling is good enough. </grin> Because of the gender of a person we would know if it is a brother or a sister. When we know a parent, we implicitly know this already through child. Consequently, on many occasions we do not need to register brother or sister at all. Thanks, GerardM
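The inference described here, siblings derived from parent and child statements, can be sketched over toy data as follows. The dict layout and function name are assumptions, and the result includes half-siblings because any shared parent counts.

```python
# Toy data: P22 = father, P25 = mother, P40 = child.
ITEMS = {
    "mother": {"P40": ["a", "b"]},
    "father": {"P40": ["a", "c"]},
    "a": {"P22": ["father"], "P25": ["mother"]},
    "b": {"P25": ["mother"]},
    "c": {"P22": ["father"]},
}


def inferred_siblings(items, person):
    """Collect the other children of each recorded parent of `person`."""
    claims = items.get(person, {})
    siblings = set()
    for parent in claims.get("P22", []) + claims.get("P25", []):
        for child in items.get(parent, {}).get("P40", []):
            if child != person:
                siblings.add(child)
    return sorted(siblings)
```

The inference is only as complete as the parent and child statements it reads: a sibling missing from a parent's P40 list will not be found.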

Jane Darnell

9:05 p.m.

Except when people forget to add the women to the family. I have added lots of women to Wikidata and then I have to go through all of the family relationships - I wish I could just say someone is a sister of someone and be done with it, but no, if they are a child, mother, aunt or cousin as well then I have to go track down all of those relationships separately.

Andrew Gray

24 Aug

6:19 p.m.

Hi all,

Thanks again for your comments. It looks like:

a) there's interest in simplifying this;

b) creating automatic inferences is possibly desirable but will need a lot of work and thought.

I'll put together an RFC onwiki about merging the "gendered" relationship properties, which will address the first part of the issue, and we can continue to think about how best to approach the second.

Andrew.

Lukas Benedix

6:32 p.m.

+1 for genderless family relationship properties.

Lukas

Andrew Gray

25 Aug

2:25 a.m.

Having gone and written the RFC (https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Merging_relation...) I've just discovered that we *did* have this discussion in 2013:

https://www.wikidata.org/w/index.php?title=Wikidata%3AProperties_for_deletio...

- and it was suggested we come back to it "after Phase III". I think the existing state of arbitrary access should be able to solve this problem, so I've added some notes about this.

Comments welcome; I'll circulate notifications onwiki tonight.

Andrew.

On 24 August 2015 at 14:02, Lukas Benedix benedix@zedat.fu-berlin.de wrote:

...

+1 for genderless family relationship properties.

Lukas

...
Hi all,

Thanks again for your comments. It looks like:

a) there's interest in simplifying this;

b) creating automatic inferences is possibly desirable but will need a lot of work and thought.

I'll put together an RFC onwiki about merging the "gendered" relationship properties, which will address the first part of the issue, and we can continue to think about how best to approach the second.

Andrew.

-- - Andrew Gray andrew.gray@dunelm.org.uk

Ole Palnatoke Andersen

26 Aug

5:15 p.m.

I've just completed #100wikidays, and my 100th article was about a horse: https://www.wikidata.org/wiki/Q12003911 That horse is the grandfather of https://www.wikidata.org/wiki/Q20872428, but should I use the same properties as for humans?

We also have https://www.wikidata.org/wiki/Q12331109 and https://www.wikidata.org/wiki/Q12338810, who were father and son. Again: Do we have animal properties, or do we use the same as for humans?

Regards, Ole

-- http://palnatoke.org * @palnatoke * +4522934588

Svavar Kjarrval

6:46 p.m.

P21 is a subclass of P31 with Q18608871 which indicates in machine readable interpretation that it is about the gender of people, yet the descriptions assume items can be associated with P21 to include gender of animals. Yeah, I can understand the confusion. :/

- Svavar Kjarrval

Peter F. Patel-Schneider

7:28 p.m.

I don't think that P21 (https://www.wikidata.org/wiki/Property:P21, sex or gender) is a subclass of P31 (https://www.wikidata.org/wiki/Property:P31, instance of). Properties aren't subclasses in general.

Perhaps you meant to talk about https://www.wikidata.org/wiki/Property:P21 (sex or gender) being related via https://www.wikidata.org/wiki/Property:P31 (instance of) to https://www.wikidata.org/wiki/Q18608871 (Wikidata property for items about people). This indicates that the property should only be used on people, even though the description of the property itself talks about its use on animals.

It appears that Wikidata is not very consistent internally.

peter

Svavar Kjarrval

11:13 p.m.

Sorry, I'm not used to the Wikidata lingo.

To explain my point further (I think you have already agreed with it): if I were to write code that makes assumptions based on such relations and travels recursively through the hierarchy of declarations, it would reach the contradictory conclusion that a non-human with a P21 statement is a human. P21 is declared with P31->Q18608871, and Q18608871 is in turn declared with P1269->Q5. Unless special precautions are taken, anyone trying to generate an exhaustive list of all humans on Wikidata (without relying solely on the direct declaration on each item) might find non-humans on that list as a result of travelling backwards through such relations.

In essence, it seems that either P21 wrongly allows gender statements on non-humans, or the property is too broad for a declaration of P31->Q18608871.
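
The chain of reasoning above can be sketched in a few lines. This is a toy reasoner written for illustration, not anything Wikidata runs: the two declarations are the ones quoted in this message, while the rule "a property's P31 class, followed through P1269, tells you what its subjects are" is the naive assumption under discussion. The item `Q_HORSE` and the value `Q_MALE_CREATURE` are placeholders.

```python
# Declarations as quoted in this thread (state of August 2015).
PROPERTY_DECLARATIONS = {"P21": {"P31": "Q18608871"}}   # sex or gender
ITEM_DECLARATIONS = {"Q18608871": {"P1269": "Q5"}}      # ... -> human

def naive_subject_class(prop):
    """Follow property -> P31 -> P1269 to guess what subjects of `prop` are."""
    meta_class = PROPERTY_DECLARATIONS.get(prop, {}).get("P31")
    return ITEM_DECLARATIONS.get(meta_class, {}).get("P1269")

def inferred_classes(statements):
    """Classes a naive reasoner would assign to an item from its statements."""
    classes = set()
    for prop, value in statements:
        if prop == "P31":
            classes.add(value)          # the direct declaration
        guess = naive_subject_class(prop)
        if guess is not None:
            classes.add(guess)          # the indirect, possibly wrong, one
    return classes

# Placeholder item: a horse that carries a P21 statement.
horse = [("P31", "Q_HORSE"), ("P21", "Q_MALE_CREATURE")]
result = inferred_classes(horse)
```

With these inputs the reasoner classifies the horse both as a horse and as Q5 (human), which is the contradiction described above.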

- Svavar Kjarrval

Joe Filceolaire

27 Aug

12:54 a.m.

Every other ontology mixes humans with fictional characters and with groups of humans and possibly fictional humans (biblical characters for instance). Wikidata has gone to a lot of trouble to try to untangle these into separate classes. Anyone trying to get an exhaustive list of humans and not using <instance of:human> deserves everything he gets.

P21 (sex or gender) is very explicitly specified as being usable for humans and for other creatures. At the request of some languages we have separate items for 'female human' and for 'female creature' (and the same for male); 'female human' is 'subclass of:female creature'. Relying on P21 to tell whether something is or is not human is not recommended, as it will probably miss all the humans who are neither male nor female - Wikidata has about a dozen other values that can be used with this property.
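
As a small illustration of the difference between the two approaches, the sketch below lists humans once by the direct <instance of:human> statement and once by guessing from P21. All four items and the values `Q_OTHER_GENDER_VALUE`, `Q_HORSE` and `Q_MALE_CREATURE` are placeholders made up for the example.

```python
HUMAN = "Q5"
MALE, FEMALE = "Q6581097", "Q6581072"   # the 'male' and 'female' P21 values

# Placeholder items; none of these are real Wikidata entries.
items = {
    "Q_PERSON_1": {"P31": {HUMAN}, "P21": {FEMALE}},
    "Q_PERSON_2": {"P31": {HUMAN}, "P21": {"Q_OTHER_GENDER_VALUE"}},
    "Q_PERSON_3": {"P31": {HUMAN}},                      # no P21 statement
    "Q_RACEHORSE": {"P31": {"Q_HORSE"}, "P21": {"Q_MALE_CREATURE"}},
}

def humans_by_instance_of(items):
    """Items that directly state <instance of:human>."""
    return sorted(q for q, s in items.items() if HUMAN in s.get("P31", set()))

def humans_by_gender_guess(items):
    """Items guessed to be human because P21 is 'male' or 'female'."""
    return sorted(
        q for q, s in items.items() if s.get("P21", set()) & {MALE, FEMALE}
    )
```

In this toy data the direct lookup finds all three people, while the P21 guess finds only the first one.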

Father (P22) and mother (P25) can perfectly well be used for non-humans and if the current constraints on these properties flag this as a problem then the constraints will have to be updated. I expect to see extensive pedigrees for racehorses entered in Wikidata. Note that there is a proposal under consideration to replace P22 and P25 with a single 'parent' property.

Hope this helps

Joe

Svavar Kjarrval

4:05 a.m.

For me, it doesn't help. One of the purposes of Wikidata is that it should also be machine readable. If I were trying to, for example, travel recursively through the declarations to find deep common facts about some group of items, it would take much more work than necessary if I have to hunt down and code around a lot of wrongly categorised trees and special cases in the data structure.

Another example is Stubbs (Q7627362), the current mayor of Talkeetna, who happens to be a cat. The Wikidata item for Stubbs has the declaration P31->Q146 (cat). However, it also has the declaration P31->Q30185 (mayor), a subclass of Q2285706 (head of government), which is a subclass of Q82955 (politician), which is finally a subclass of Q5 (human). One might suggest that since the item for Stubbs is specifically declared as a cat, that declaration has priority (or some variation of that logic). The problem is that a machine cannot automatically understand that. Without special programming and/or a way to define contradictions like that in Wikidata, both facts are assumed to be correct. The machine might not even know that there is a contradiction at all, so in its inferences it will assume Stubbs is both a human and a cat.
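
A short sketch of the traversal described above, using only the statements quoted in this message (as they stood in August 2015). The function name `all_classes` is made up; it simply follows P31 once and then P279 as many times as it can.

```python
# Statements exactly as quoted in this message.
INSTANCE_OF = {"Q7627362": {"Q146", "Q30185"}}   # Stubbs: cat, mayor
SUBCLASS_OF = {
    "Q30185": {"Q2285706"},    # mayor -> head of government
    "Q2285706": {"Q82955"},    # head of government -> politician
    "Q82955": {"Q5"},          # politician -> human
}

def all_classes(item):
    """Every class reachable via P31 followed by any number of P279 steps."""
    seen = set()
    stack = list(INSTANCE_OF.get(item, ()))
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(SUBCLASS_OF.get(cls, ()))
    return seen

stubbs = all_classes("Q7627362")
```

The resulting set contains both Q146 (cat) and Q5 (human), and nothing in the data tells the code that this is a contradiction.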

- Svavar Kjarrval

James Heald

4:35 a.m.

There are a *lot* of problems with P279 (subclass), right across Wikidata.

These will only be corrected once people start doing searches in a systematic way and addressing the anomalies they find.

In this case, politician (Q82955) should *not* be a subclass of human (Q5); instead it should be a subclass of something like occupation (Q13516667), or alternatively perhaps profession (Q28640).

My understanding is that currently there are a vast number of incorrect subclass relationships in the project, messing up tree searches, and so far it is something that has simply not yet been systematically addressed.

-- James.

Joe Filceolaire

5:08 a.m.

And the class of items that can have occupation:politician should be 'person', not 'human'. 'Person' as defined by VIAF could include Stubbs the cat-with-a-name, as well as fictional humans etc. So Stubbs would be <instance of:cat with a name>, which would be <subclass of:person> and <subclass of:cat>; or maybe Stubbs is <instance of:cat> and <instance of:non-human politician>; or maybe the constraint should be that the property 'occupation' has domain:person and domain:Stubbs. Yeah, I like that last one best.

I think the difference between 'occupation' and 'profession' is that if a cat can do it then it's an 'occupation'.

Yes the class tree needs work, especially the higher levels, and copying someone else's High Level Ontology seems to be impractical as they all seem to be copyrighted. I hope that this is the kind of thing that will get better with time. We will see.

Joe

Thad Guidry

7:32 a.m.

[snip]

Joe Filceolaire wrote:

Yes the class tree needs work, especially the higher levels, and copying someone else's High Level Ontology seems to be impractical as they all seem to be copyrighted. I hope that this is the kind of thing that will get better with time. We will see.

Joe

Freebase's is not copyrighted! :)

In Freebase, we had the notion of a community-curated mutex that applied simple rules and would visually warn a user when they applied an instance of (Freebase Type or Class) to an entity that was also typed by a mutexed Class.

For instance, Pets ARE NOT Humans, and other simple things like that.

Wikidata might want to begin creating something like that. Q82955 NOT ALLOWED AS Q5 or whatever works for you guys.

The Big Mama Mutex in Freebase was the primary one, and there were others, all of which are still in the Freebase graph data itself. Use it as a starting point if you want.
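
A minimal sketch of how such a mutex check might look. This is not Freebase's actual data format or code, which is not described here; the table `MUTEX` and the function `mutex_violations` are assumptions for illustration, with a single "cat vs. human" pair standing in for "Pets ARE NOT Humans".

```python
# Hypothetical mutex table: each frozenset is a pair of classes that
# must never both apply to the same entity.
MUTEX = {frozenset({"Q146", "Q5"})}   # cat vs. human, chosen for illustration

def mutex_violations(types):
    """Return the mutex pairs that are both present in `types`."""
    types = set(types)
    return sorted(tuple(sorted(pair)) for pair in MUTEX if pair <= types)

# Classes that end up attached to Stubbs after following the subclass chain.
warnings = mutex_violations({"Q146", "Q30185", "Q2285706", "Q82955", "Q5"})
```

A non-empty result is where an editing interface could show its warning; an entity typed only as a cat produces an empty list.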

Thad +ThadGuidry https://www.google.com/+ThadGuidry

Svavar Kjarrval

6:31 a.m.

For now, what's the best way to find (and perhaps correct) incorrect declarations like these?

If I were to just change commonly used items like politician (Q82955), it might be construed as vandalism, or someone who doesn't care about or understand the Stubbs-declared-as-a-human problem might just add that declaration back later.

When it comes to the gender property (P21), the human-readable description indicates that it defines genders in general, yet it is declared as an instance of an item (Q18608871) which only applies to humans. That of course has consequences further up the hierarchy, since the maintainers of item Q18608871 assume in good faith that it only applies to humans.

In the case of the hierarchy Stubbs is associated with, the maintainers have either assumed that all mayors are, without exception, humans, or somehow thought that if there were exceptions, machines could detect and apply them in each case. Both of those assumptions are, I think we agree, wrong, and we should find out why this is happening.

Is there a tool where one can put in a Wikidata item and it extracts declarations based on "higher" properties like subclass or instance of? Like if I were to input the item for Stubbs, it would travel the hierarchy and tell me what would be assumed about Stubbs based on the declarations further up in the tree.
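
For illustration, the core of such a tool could look like the sketch below. It is an assumption-laden toy, not an existing tool: it runs on a hand-copied dictionary of the Stubbs statements quoted earlier in this thread, and the function name `explain` is invented. Unlike a plain traversal, it records the chain of classes that leads to each conclusion, so a wrong link can be located.

```python
from collections import deque

# Statements as quoted earlier in this thread (August 2015).
INSTANCE_OF = {"Q7627362": ["Q146", "Q30185"]}          # Stubbs: cat, mayor
SUBCLASS_OF = {
    "Q30185": ["Q2285706"],    # mayor -> head of government
    "Q2285706": ["Q82955"],    # head of government -> politician
    "Q82955": ["Q5"],          # politician -> human
}

def explain(item):
    """Map each inferred class to the shortest chain of classes leading to it."""
    paths = {}
    queue = deque([cls] for cls in INSTANCE_OF.get(item, []))
    while queue:
        path = queue.popleft()
        cls = path[-1]
        if cls in paths:
            continue                      # already reached by a shorter chain
        paths[cls] = path
        for parent in SUBCLASS_OF.get(cls, []):
            queue.append(path + [parent])
    return paths

trace = explain("Q7627362")
```

Printing `" -> ".join(trace["Q5"])` shows the chain mayor, head of government, politician, human, which points at the subclass statement on Q82955 as the link to question.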

- Svavar Kjarrval

Peter F. Patel-Schneider

9:56 a.m.

Well, the situation with respect to Wikidata property for items about people (Q18608871) is very difficult. There is absolutely no machine-interpretable information associated with this class that can be used to determine that instances of it are only supposed to be used for people. So, at the bare minimum, such machine-interpretable information needs to be added.

Then there is the issue that there is no theory of how the machine-interpretable information that is associated with entities in Wikidata is to be processed. All the processing is currently done using uninterpretable procedures. For example, on https://www.wikidata.org/wiki/Property_talk:P22 there is information that is used to control some piece of code that checks to see that the subject of https://www.wikidata.org/wiki/Property:P21 belongs to person (Q215627) or fictional character (Q95074). However, there is no theory showing how this interacts with other parts of Wikidata, even such inherent parts of Wikidata as https://www.wikidata.org/wiki/Property:P31

In fact, it is difficult even to determine simple truth in Wikidata. Two sources can conflict, and Wikidata is not in the position of being an arbiter for such conflicts, certainly not in general. To make the situation even more complex, Wikidata has a temporal aspect as well and has a need to admit exceptions to general statements.

So what can be done? Any solution is going to be tricky. That is not to say that some solutions cannot be found by looking at systems and standards that are already being used for storing large amounts of complex information. However, any solution is going to have to be carefully tailored to meet the requirements of Wikidata and Wikidatans. (Is there an official term for the people who are putting Wikidata and Wikidata information together?)

There is also a big chicken-and-egg problem here - a good solution to reliable machine-interpretation of Wikidata information requires, for example, consistent use of instance of, subclass, and subproperty; but what counts as a consistent use of these fundamental properties depends on a formal theory of what they mean.

I, for one, would find even just the attempt to solve this problem vastly interesting, and I have been doing some exploration as to what might be needed. My company is interested in using Wikidata as a source of background information, but finds that the lack of a good theory of Wikidata information is problematic, so I have some cover for spending time on this problem.

Anyway, if there is interest in machine interpretation of Wikidata information, if only to detect potential anomalies, I, and probably others, would be motivated to spend more time on trying to come up with potential solutions, hopefully in a collaborative effort that includes not just theoreticians but also Wikidatans.

...

In the case of the hierarchy Stubbs is associated with, the maintainers have assumed that all mayors are, without exception, humans, or they somehow thought that if there were exceptions, the machines could detect and apply them in each case. Both of those approaches are, I think we agree, wrong, and we should find out why it's happening.

Is there a tool where one can put in a Wikidata item and it extracts declarations based on "higher" properties like subclass or instance of? Like if I were to input the item for Stubbs, it would travel the hierarchy and tell me what would be assumed about Stubbs based on the declarations further up in the tree.

Yes, it is called a reasoner. The design of a reasoner would very likely be one result of the sort of work described above, but without such work it is very hard to figure out just what is supposed to be done in any except the simple cases.
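To make the simple case concrete, here is a minimal sketch of what such a tool might do: follow one "instance of" step and then any number of "subclass of" steps, keeping the path that led to each inferred class. The statements below are toy data with labels standing in for item IDs, not a snapshot of Wikidata, and the way Stubbs is attached to "mayor" is an assumption made purely for illustration. The function name `inferred_classes` is likewise hypothetical.

```python
from collections import deque

# Toy statements, NOT a snapshot of Wikidata. Labels stand in for item IDs.
INSTANCE_OF = {                 # plays the role of P31
    "Stubbs": ["cat", "mayor"],  # assumed for illustration only
}
SUBCLASS_OF = {                 # plays the role of P279
    "mayor": ["politician"],
    "politician": ["human"],     # the questionable link discussed in this thread
    "cat": ["animal"],
}


def inferred_classes(item):
    """Return {class: path} for every class reachable from `item`
    via one instance-of step followed by any number of subclass-of steps."""
    paths = {}
    queue = deque()
    for cls in INSTANCE_OF.get(item, []):
        if cls not in paths:
            paths[cls] = [item, cls]
            queue.append(cls)
    while queue:
        cls = queue.popleft()
        for parent in SUBCLASS_OF.get(cls, []):
            if parent not in paths:  # skip classes already seen (also stops cycles)
                paths[parent] = paths[cls] + [parent]
                queue.append(parent)
    return paths


# Show each inferred class together with the chain of statements behind it.
for cls, path in inferred_classes("Stubbs").items():
    print(f"{cls}: {' -> '.join(path)}")
```

Printing the path next to each conclusion is the useful part for finding anomalies: it shows which single subclass statement is responsible for an odd result. It does nothing about the harder questions above, such as exceptions, conflicting sources, or time.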

...

Svavar Kjarrval

Peter F. Patel-Schneider Nuance Communications

Ole Palnatoke Andersen

1:47 p.m.

Well, I decided to be bold (that is often the road to reversion, but let's get the ball rolling):

Tarok[1] now has Pay Dirt[2] as his father. B.B.S. Sugarlight[3] now has Sugarsweet Sid[4] as his mother, and she has Sugarcane Hanover[5] as her father.

[1] https://www.wikidata.org/wiki/Q12338810
[2] https://www.wikidata.org/wiki/Q12331109
[3] https://www.wikidata.org/wiki/Q20872428
[4] https://www.wikidata.org/wiki/Q20873813
[5] https://www.wikidata.org/wiki/Q12003911

When I asked about this on Facebook, the first answer was "Random guess: Check out Secretariat. My guess is that it has been registered thoroughly." Now the quest is to connect Secretariat, Tarok and Sugarcane Hanover.. :-)

On Wed, Aug 26, 2015 at 9:24 PM, Joe Filceolaire filceolaire@gmail.com wrote:

...

Every other ontology mixes humans with fictional characters and with groups of humans and possibly fictional humans (biblical characters for instance). Wikidata has gone to a lot of trouble to try to untangle these into separate classes. Anyone trying to get an exhaustive list of humans and not using <instance of:human> deserves everything he gets.

P21 (sex or gender) is very explicitly specified as being usable for humans and for other creatures. At the request of some languages we have separate items for 'female human' and for 'female creature' (we have the same for male). 'Female human' is 'subclass of:female creature'. Relying on P21 to tell if something is or is not human is not recommended, as it will probably miss all the humans who are neither male nor female - Wikidata has about a dozen other values that can be used with this property.

Father (P22) and mother (P25) can perfectly well be used for non-humans and if the current constraints on these properties flag this as a problem then the constraints will have to be updated. I expect to see extensive pedigrees for racehorses entered in Wikidata. Note that there is a proposal under consideration to replace P22 and P25 with a single 'parent' property.

Hope this helps

Joe

On Wed, 26 Aug 2015 18:44 Svavar Kjarrval svavar@kjarrval.is wrote:

...
On Wed 26 Aug 2015 13:58, Peter F. Patel-Schneider wrote:

...
I don't think that P21 (https://www.wikidata.org/wiki/Property:P21, sex or gender) is a subclass of P31 (https://www.wikidata.org/wiki/Property:P31, instance of). Properties aren't subclasses in general.

Perhaps you meant to talk about https://www.wikidata.org/wiki/Property:P21 (sex or gender) being related via https://www.wikidata.org/wiki/Property:P31 (instance of) to https://www.wikidata.org/wiki/Q18608871 (Wikidata property for items about people). This indicates that the property should only be used on people, even though the description of the property itself talks about its use on animals.

It appears that Wikidata is not very consistent internally.

peter

Sorry, I'm not used to the Wikidata lingo.

To further explain my point (with which I think you have already agreed): If I were to write code that makes assumptions based on such relations, the code would reach the contradiction that a non-human with a P21 statement is a human, if it were to recursively traverse the hierarchy of declarations. P21 is declared with P31->Q18608871, and Q18608871 is in turn declared with P1269->Q5. Unless special precautions are taken, anyone trying to generate an exhaustive list of all humans on Wikidata (without relying solely on the direct declaration on each item) might find non-humans on that list due to travelling backwards via such relations.

In essence, it seems like P21 either wrongfully allows definitions of genders of non-humans or that the property is too broad for a declaration of P31->Q18608871.

Svavar Kjarrval

Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata

-- http://palnatoke.org * @palnatoke * +4522934588

Marielle Volz

2:38 p.m.

If you want to find all humans on Wikidata, find all items with the property "instance of" (P31) equal to "human" (Q5). There is no need to infer this from things like having the parent property; that's a terrible way to do things. Items that are instances of different items use the same properties all the time, so you shouldn't be inferring anything about the class of an item based on the properties it has.
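As a rough illustration of the difference, the sketch below compares the direct check on "instance of" with a guess based on which properties an item happens to carry. The items are invented toy data, not real Wikidata content, and both function names are hypothetical.

```python
# Toy items, NOT real Wikidata content. Each item maps property IDs to values.
ITEMS = {
    "some person": {"P31": ["Q5"], "P21": ["male"]},
    "some racehorse": {
        "P31": ["horse"],
        "P21": ["male creature"],
        "P22": ["another racehorse"],
    },
    "another racehorse": {"P31": ["horse"], "P21": ["male creature"]},
}


def humans_by_declaration(items):
    """Items whose 'instance of' (P31) statement names human (Q5) directly."""
    return sorted(
        name for name, claims in items.items() if "Q5" in claims.get("P31", [])
    )


def humans_by_property_guess(items, person_properties=("P21", "P22", "P25")):
    """The discouraged approach: treat any item using these properties as human."""
    return sorted(
        name
        for name, claims in items.items()
        if any(prop in claims for prop in person_properties)
    )


print(humans_by_declaration(ITEMS))
print(humans_by_property_guess(ITEMS))
```

With this toy data the direct check returns only the person, while the property-based guess also sweeps in both horses.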

If you are worried about horses being put in a genealogical tree with humans, that would require someone to put a horse as a parent of a human or vice versa. That's a problem with an invalid relationship being added, not with the property itself.

Gerard Meijssen

2:47 p.m.

Hoi, Absolutely..

When full genealogy information is available, you do not need special words that indicate whatever. It is only when this is not the case that you need to specify what type of link there is. This can be specific like maternal uncle or paternal aunt. This makes a practical difference in several cultures and is THEREFORE significant. Again, it is only of relevance when it cannot be inferred. Thanks, GerardM

Peter F. Patel-Schneider

26 Aug

7:06 p.m.

I am a relative [sic] outsider to Wikidata and I just tried to answer this question by looking at wikidata.

It turns out that there is information in Wikidata that indicates that https://www.wikidata.org/wiki/Property:P22 (father) is only to be used on people. Look at https://www.wikidata.org/wiki/Property_talk:P22, where both the type and the value type are person (Q215627), fictional character (Q95074). Similar restrictions are in place for https://www.wikidata.org/wiki/Property:P1038 (relative).

So I would say that, no, you should not use these properties on horses.

Whether this is a good thing or not is a separate matter. I do note that there do not appear to be any Wikidata properties that can be used for parent-offspring relationships for horses. Neither https://www.wikidata.org/wiki/Property:P22 (father) nor https://www.wikidata.org/wiki/Property:P1038 (relative) have super-properties.

peter

On 08/26/2015 04:45 AM, Ole Palnatoke Andersen wrote:

...

I've just completed #100wikidays, and my 100th article was about a horse: https://www.wikidata.org/wiki/Q12003911 That horse is the grandfather of https://www.wikidata.org/wiki/Q20872428, but should I use the same properties as for humans?

We also have https://www.wikidata.org/wiki/Q12331109 and https://www.wikidata.org/wiki/Q12338810, who were father and son. Again: Do we have animal properties, or do we use the same as for humans?

Regards, Ole

On Mon, Aug 24, 2015 at 10:55 PM, Andrew Gray andrew.gray@dunelm.org.uk wrote:

...
Having gone and written the RFC (https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Merging_relation...) I've just discovered that we *did* have this discussion in 2013:

https://www.wikidata.org/w/index.php?title=Wikidata%3AProperties_for_deletio...

and it was suggested we come back to it "after Phase III". I think the existing state of arbitrary access should be able to solve this problem, so I've added some notes about this.

Comments welcome; I'll circulate notifications onwiki tonight.

Andrew.

On 24 August 2015 at 14:02, Lukas Benedix benedix@zedat.fu-berlin.de wrote:

...
+1 for genderless family relationship properties.

Lukas

...
Hi all,

Thanks again for your comments. It looks like:

a) there's interest in simplifying this;

b) creating automatic inferences is possibly desirable but will need a lot of work and thought.

I'll put together an RFC onwiki about merging the "gendered" relationship properties, which will address the first part of the issue, and we can continue to think about how best to approach the second.

Andrew.

On 17 August 2015 at 12:29, Andrew Gray andrew.gray@dunelm.org.uk wrote:

...
Hi all,

I've recently been thinking about how we handle family/genealogical relationships in Wikidata - this is, potentially, a really valuable source of information for researchers to have available in a structured form, especially now we're bringing together so many biographical databases.

We currently have the following properties to link people together:

* spouses (P26) and cohabitants (P451) - not gendered
* parents (P22/P25) and step-parents (P43/P44) - gendered
* siblings (P7/P9) - gendered
* children (P40) - not gendered (and oddly no step-children?)
* a generic "related to" (P1038) for more distant relationships

There's two big things that jump out here.

** First, gender. Parents are split by gender while children are not (we have mother/father not son/daughter). Siblings are likewise gendered, and spouses are not. These are all very early properties - does anyone remember how we got this way?

This makes for some odd results. For example, if we want to use our data to identify all the male-line *descendants* of a person, we have to do some complicated inference from [P40 + target is male]. However, to identify all the male-line *ancestors*, we can just run back up the P22 chain. It feels quite strange to have this difference, and I wonder if we should standardise one way or the other - split P40 or merge the others.
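A small sketch of that asymmetry, using an invented family rather than real Wikidata data: the ancestor walk needs only the father statements, while the descendant walk needs the child statements plus a gender lookup on every target. The names and both function names are hypothetical.

```python
# Toy family, NOT real Wikidata data.
FATHER = {"Ben": "Adam", "Dora": "Adam", "Carl": "Ben", "Eve": "Ben"}  # like P22
CHILDREN = {"Adam": ["Ben", "Dora"], "Ben": ["Carl", "Eve"], "Dora": ["Finn"]}  # like P40
GENDER = {  # like P21, reduced to two values for this sketch
    "Adam": "male", "Ben": "male", "Carl": "male",
    "Dora": "female", "Eve": "female", "Finn": "male",
}


def male_line_ancestors(person):
    """Run back up the father chain; no gender lookup is needed."""
    result, seen = [], {person}
    while person in FATHER:
        person = FATHER[person]
        if person in seen:  # guard against a cycle in bad data
            break
        seen.add(person)
        result.append(person)
    return result


def male_line_descendants(person):
    """Follow child links, keeping only targets recorded as male.
    Children with no gender statement are silently dropped."""
    result, seen, stack = [], {person}, [person]
    while stack:
        current = stack.pop()
        for child in CHILDREN.get(current, []):
            if GENDER.get(child) == "male" and child not in seen:
                seen.add(child)
                result.append(child)
                stack.append(child)
    return result


print(male_line_ancestors("Carl"))
print(male_line_descendants("Adam"))
```

Finn is left out of Adam's male-line descendants because the line passes through Dora, which is the intended behaviour here. The second function also shows the cost of the current layout: a missing gender statement quietly removes someone from the result.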

In some ways, merging seems more elegant. We do have fairly good gender metadata (and getting better all the time!), so we can still do gender-specific relationship searches where needed. It also avoids having to force a binary gender approach - we are in the odd position of being able to give a nuanced entry in P21 but can only say if someone is a "sister" or "brother".

** Secondly, symmetry. Siblings, spouses, and parent-child pairs are by definition symmetric. If A has P26:B, then B should also have P26:A. The gendered cases are a little more complicated, as if A has P40:B, then B has P22:A or P25:A, but there is still a degree of symmetry - one of those must be true.

However, Wikidata doesn't really help us make use of this symmetry. If I list A as spouse of B, I need to add (separately) that B is spouse of A. If they have four children C, D, E, and F, this gets very complicated - we have six articles with *30* links between them, all of which need to be made manually. It feels like automatically making symmetric links for these properties would save a lot of work, and produce a much more reliable dataset.
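The figure of 30 can be checked by enumerating the statements for such a family: each of the six people links to the other five, so 6 x 5 = 30. The sketch below uses generic relationship labels ("spouse", "child", "parent", "sibling") rather than the actual property IDs, and the function name is hypothetical.

```python
from itertools import permutations


def family_statements(parents, children):
    """Enumerate every directed (subject, relationship, object) statement
    needed to describe a two-parent family completely."""
    statements = set()
    for a, b in permutations(parents, 2):
        statements.add((a, "spouse", b))
    for parent in parents:
        for child in children:
            statements.add((parent, "child", child))
            statements.add((child, "parent", parent))
    for a, b in permutations(children, 2):
        statements.add((a, "sibling", b))
    return statements


statements = family_statements(["A", "B"], ["C", "D", "E", "F"])
print(len(statements))  # 30
```

The breakdown is 2 spouse statements, 16 parent/child statements and 12 sibling statements.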

I believe we decided early on not to do symmetric links because it would swamp commonly linked articles (imagine what Q5 would look like by now!). On the other hand, these are properties with a very narrowly defined scope, and we actively *want* them to be comprehensively symmetric - every parent article should list all their children on Wikidata, and every child article should list their parent and all their siblings.

Perhaps it's worth reconsidering whether to allow symmetry for a specifically defined class of properties - would an automatically symmetric P26 really swamp the system? It would be great if the system could match up relationships and fill in missing parent/child, sibling, and spouse links. I can't be the only one who regularly adds one half of the relationship and forgets to include the other!

A bot looking at all of these and filling in the gaps might be a useful approach... but it would break down if someone tries to remove one of the symmetric entries without also removing the other, as the bot would probably (eventually) fill it back in. Ultimately, an automatic symmetry would seem best.
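The gap-finding part of such a bot could be sketched roughly as follows. This works on plain tuples with generic relationship labels, not on the real property IDs or any real bot framework, and `missing_inverses` is a hypothetical name.

```python
# Generic labels for illustration; the real properties are gendered in places.
INVERSE = {
    "spouse": "spouse",
    "sibling": "sibling",
    "child": "parent",
    "parent": "child",
}


def missing_inverses(statements):
    """Return the inverse statements that are absent from `statements`."""
    present = set(statements)
    missing = set()
    for subject, relationship, obj in present:
        if relationship not in INVERSE:
            continue  # not a relationship this sketch knows how to mirror
        mirrored = (obj, INVERSE[relationship], subject)
        if mirrored not in present:
            missing.add(mirrored)
    return sorted(missing)


example = [
    ("A", "spouse", "B"),   # B -> A is missing
    ("A", "child", "C"),
    ("C", "parent", "A"),   # already mirrored
]
print(missing_inverses(example))
```

Note that the function cannot tell a statement that was never added from one that an editor removed on purpose, which is exactly the weakness described above.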

Thoughts on either of these? If there is interest I will write up a formal proposal on-wiki.

--

Andrew Gray andrew.gray@dunelm.org.uk

Joe Filceolaire

8:07 p.m.

Use the same properties for family relationships of animals and humans.

Sex/gender is the only property that has values for female/male creatures different from the values for female/male humans.

Joe

wikidata@lists.wikimedia.org

37 comments

participants (16)

Andrew Gray
Federico Leva (Nemo)
Gerard Meijssen
James Heald
Jane Darnell
Joe Filceolaire
Lukas Benedix
Marielle Volz
Markus Kroetzsch
Markus Krötzsch
Ole Palnatoke Andersen
Peter F. Patel-Schneider
Stas Malyshev
Svavar Kjarrval
Thad Guidry
Thomas Douillard