[Wikidata-l] Subclass of/instance of


Markus Kroetzsch

5 May 2014

6:46 p.m.

Hi,

I got interested in subclass of (P279) and instance of (P31) statements recently. I was surprised by two things:

(1) There are quite a lot of subclass of statements: tens of thousands. (2) Many of them make a lot of sense, and (in particular) are not (obvious) copies of Wikipedia categories.

My big question is: who is creating all these statements and how is this done? It seems too much data to be created manually, but I don't see obvious automated approaches either (and there are usually no references given).

I also found some rare issues. "A subclass of B" should be read as "Every A is also a B". For example, we have "Every piano (Q5994) is also a keyboard instrument (Q52954)". Overall, the great majority of cases I looked at had remarkably sane modelling (which reinforces my big question).

But there are still cases where "subclass of" is mixed up with "instance of". For example, Wikidata also says "Every 'House of Staufen' (Q130875) is also a dynasty (Q164950)". This is dubious -- how many instances of 'House of Staufen' are there? I guess we really want to say that "The House of Staufen is a(n instance of) dynasty." Is this a singular error or a systematic issue?
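A minimal Python sketch of the two readings above, assuming a toy in-memory graph. The item "my upright piano", the class "musical instrument", and the helper names are illustrative assumptions, not Wikidata data or any Wikidata API. Subclass of relates a class to a broader class and is followed transitively; instance of attaches a single thing to a class.

```python
# Toy statements, loosely modelled on the examples in this message.
SUBCLASS_OF = {  # P279 reading: "every A is also a B"
    "piano": {"keyboard instrument"},
    "keyboard instrument": {"musical instrument"},  # illustrative extra edge
}
INSTANCE_OF = {  # P31 reading: "this one thing is a B"
    "House of Staufen": {"dynasty"},
    "my upright piano": {"piano"},  # hypothetical individual
}


def superclasses(cls):
    """Return every class reachable from cls via subclass-of edges."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classes_of(item):
    """Return the item's direct classes plus all of their superclasses."""
    result = set()
    for cls in INSTANCE_OF.get(item, ()):
        result.add(cls)
        result |= superclasses(cls)
    return result
```

With this data, classes_of("my upright piano") contains "piano", "keyboard instrument" and "musical instrument", while "House of Staufen" is a single thing that belongs to "dynasty" and has no instances of its own.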

I guess there is already a group of people who deal with such issues -- or it would be a miracle that things are in such good shape already :-) I have read the talk page for subclass of, but that does not seem to explain the origin of all the data we have already. Pointers?

Cheers,

Markus


emw

6 May 2014

4:53 a.m.

Hi Markus,

You asked "who is creating all these [subclass of] statements and how is this done?"

The class hierarchy in http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en shows a few relatively large subclass trees for specialist domains, including molecular biology and mineralogy. The several thousand 'subclass of gene' and 'subclass of protein' claims were created by members of WikiProject Molecular biology (WD:MB), based on discussions in [1] and [2]. The decision to use P279 instead of P31 there was based on the fact that the "is-a" relation in Gene Ontology maps to rdfs:subClassOf, which P279 is based on. The claims were added by a bot [3], with input from WD:MB members. The data ultimately comes from external biological databases.

A glance at the mineralogy class hierarchy indicates it has been constructed by WikiProject Mineralogy [4] members through non-bot edits. I imagine most of the other subclass of claims are done manually or semi-automatically outside specific WikiProject efforts. In other words, I think most of the other P279 claims are added by Wikidata users going into the UI and building usually-reasonable concept hierarchies on domains they're interested in. I've worked on constructing class hierarchies for health problems (e.g. diseases and injuries) [5] and medical procedures [6] based on classifications like ICD-10 [7] and assertions and templates on Wikipedia (e.g. [8]).

It's not incredibly surprising to me that Wikidata has about 36,000 subclass of (P279) claims [9]. The property has been around for over a year and is a regular topic of discussion [10] along with instance of (P31), which has over 6,600,000 claims.

You noted a dubious subclass of claim for 'House of Staufen' (Q130875). I agree that instance of would probably be the better membership property to use there. Such questionable usage of P279 is probably uncommon, but definitely not singular. The dynasty class hierarchy shows 13 dubious cases at the moment [11]. I would guess less than 5% of subclass of claims have that kind of issue, where instance of would make more sense. I think there are probably vastly more cases of the converse: instance of being used where subclass of would make more sense.

As you probably know, P31 and P279 are intended to have the semantics of rdf:type and rdfs:subClassOf per community decision. A while ago I read a bit about the ELK reasoner you were involved with [12], which makes use of the seemingly class-centric OWL EL profile. Do you have any plans to integrate features of ELK with the Wikidata Toolkit [13]? How do you see reasoning engines using P31 and P279 in the future, if at all?

Thanks, Eric

https://www.wikidata.org/wiki/User:Emw

Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

Markus Krötzsch

14 May 2014

3:33 p.m.

Hi Eric,

Thanks for all the information. This was very helpful. I am only answering now because we have been quite busy building RDF exports for Wikidata (and writing a paper about it). I will soon announce this here (we still need to fix a few details).

You were asking about using these properties like rdfs:subClassOf and rdf:type. I think that's entirely possible, since the modelling is very reasonable and would probably yield good results. Our reasoner ELK could easily handle the class hierarchy in terms of size, but you don't really need such a highly optimized tool for this as long as you only have subClassOf. In fact, the page you linked to shows that it is perfectly possible to compute the class hierarchy with Wikidata Query and to display all of it on one page. ELK's main task is to compute class hierarchies for more complicated ontologies, which we do not have yet. OTOH, query answering and data access are different tasks that ELK is not really intended for (although it could do some of this as well).
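As a rough illustration of why a plain subClassOf hierarchy needs no specialised reasoner, the sketch below computes the tree under a root class with an ordinary breadth-first traversal. The (child, parent) pairs and function names are assumptions made for the example; this is not how the linked tree tool or ELK is implemented.

```python
from collections import defaultdict, deque


def build_children(pairs):
    """Index (child, parent) subclass-of pairs by parent."""
    children = defaultdict(set)
    for child, parent in pairs:
        children[parent].add(child)
    return children


def descendants_by_depth(root, children):
    """Breadth-first walk: map each class under root to its depth.

    Each class is visited once, so the walk is linear in the number of
    classes and edges, and it terminates even if the data has a cycle.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in sorted(children.get(node, ())):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth
```

Feeding it a few tens of thousands of pairs is well within reach of such a traversal, which is consistent with the point that a highly optimised tool only pays off for more complicated ontologies.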

Regarding future perspectives: one thing that we have also done is to extract OWL axioms from property constraint templates on Wikidata talk pages (we will publish the result soon, when announcing the rest). This gives you only some specific types of OWL axioms, but it is making things a bit more interesting already. In particular, there are some constraints that tell you that an item should have a certain class, so this is something you could reason with. However, the current property constraint system does not work too well for stating axioms that are not related to a particular property (such as: "Every [instance of] person who appears as an actor in some film should be [instance of] in the class 'actor'" -- which property or item page should this be stated on?). But the constraints show that it makes sense to express such information somehow.

In the end, however, the real use of OWL (and similar ontology languages) is to remove the need for making everything explicit. That is, instead of "constraints" (which say: "if your data looks like X, then your data should also include Y") you have "axioms" (which say: "if your data looks like X, then Y follows automatically"). So this allows you to remove redundancy rather than to detect omissions. This would make more sense with "derived" notions that one does not want to store in the database, but which make sense for queries (like "grandmother").
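The difference between the two readings can be sketched in a few lines of Python. The facts, the property name "appears in film", and both function names are hypothetical; the rule is only the actor example from this message, not an agreed Wikidata rule.

```python
def check_constraint(facts):
    """Constraint reading: report people whose stored data is incomplete."""
    missing = {
        person
        for (person, prop, _value) in facts
        if prop == "appears in film"
        and (person, "instance of", "actor") not in facts
    }
    return sorted(missing)


def apply_axiom(facts):
    """Axiom reading: derive the conclusion instead of demanding it be stored."""
    derived = {
        (person, "instance of", "actor")
        for (person, prop, _value) in facts
        if prop == "appears in film"
    }
    return facts | derived
```

Under the constraint reading the missing statement is flagged for an editor to add; under the axiom reading it never has to be stored, because a query layer can compute it on demand.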

One would need a bit more infrastructure for this; in particular, one would need to define "grandmother" (with labels in many languages) even if one does not want to use it as a property but only in queries. Maybe one could have a separate Wikibase installation for defining such derived notions without needing to change Wikidata? There are no statements on properties yet, but one could also use item pages to define derived properties when using another site ...

Best regards,

Markus

P.S. Thanks for all the work on the "semantic" modelling aspects of Wikidata. I have seen that you have done a lot in the discussions to clarify things there.

On 06/05/14 04:53, emw wrote:

...

[1] https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
[3] https://www.wikidata.org/wiki/User:ProteinBoxBot. Chinmay Nalk (https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this, with input from WD:MB.
[4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
[5] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
[6] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&la...
[7] http://apps.who.int/classifications/icd10/browse/2010/en
[8] https://en.wikipedia.org/wiki/Template:Surgeries
[9] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
[10] Examples include:
https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
[11] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&la...
[12] http://korrekt.org/page/The_Incredible_ELK
[13] https://www.mediawiki.org/wiki/Wikidata_Toolkit


Joe Filceolaire

7:33 p.m.

Except that there are lots of people who have appeared in one movie who don't consider themselves actors and should not have the 'occupation=>actor/actress' claim. There are good reasons for some constraints to be gadgets that can be overridden rather than hard coded semantic limits.

I do think we should be able to have hard coded reverse properties and symmetric properties.

Joe filceolaire


Markus Krötzsch

15 May 2014

9:33 a.m.

On 14/05/14 19:33, Joe Filceolaire wrote:

...

Except that there are lots of people who have appeared in one movie who don't consider themselves actors and should not have the 'occupation=>actor/actress'. There are good reasons for some constraints to be gadgets that can be overridden rather than hard coded semantic limits.

Sure, we completely agree here. It was just an example. But it shows why we need any such feature to be controlled by the community ;-)

...

I do think we should be able to have hard coded reverse properties and symmetric properties.

By "hard coded" do you mean "stored explicitly" (as opposed to: "inferred in some way")? It will always be possible to store anything explicitly in this sense (but I guess you know this; maybe I misunderstood what you said; feel free to clarify).

In general, what I mentioned about inferencing is not supposed to alter the way in which the site works. It would be more like a layer on top that could be useful for asking queries. For example, imagine you want to query for the grandmother of a person: we don't have this property in Wikidata but we have enough information to answer the query. So you would have to research how to get this information by combining existing properties. The idea is that one could have a place to keep this information (= the definition of "grandmother" in terms of Wikidata properties). We would then have a "community approved" way of finding grandmothers in Wikidata, and you would be much faster with your query. At the same time, you could look up the definition to find out how Wikidata really stores this information. None of this would change how the underlying data works, but it could contribute to solving some data modelling problems because it gives you an option to "support" a property without the added maintenance cost on the data management level.
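A small sketch of such a derived notion, assuming only that "mother" and "father" statements are stored. The people, the dictionaries and the function name are invented for illustration; nothing here is an existing Wikidata or Wikibase feature.

```python
# Stored statements: child -> parent. Names are made up for the example.
MOTHER = {"Carol": "Beth", "Beth": "Anna", "Dave": "Emma"}
FATHER = {"Carol": "Dave"}


def grandmothers(person):
    """Derived notion: the mothers of the person's recorded parents."""
    parents = [
        parent
        for parent in (MOTHER.get(person), FATHER.get(person))
        if parent is not None
    ]
    return {MOTHER[parent] for parent in parents if parent in MOTHER}
```

Here grandmothers("Carol") combines two stored properties to find both "Anna" and "Emma". The function body plays the role of the shared definition: it is the one place that records how "grandmother" is expressed in terms of what is actually stored.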

Cheers,

Markus

...

On Wed, May 14, 2014 at 2:33 PM, Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org> wrote:

...

         I guess there is already a group of people who deal w>
Hi Eric,

Thanks for all the information. This was very helpful. I only get to
answer now since we have been quite busy building RDF exports for
Wikidata (and writing a paper about it). I will soon announce this
here (we still need to fix a few details).

You were asking about using these properties like rdfs:subClassOf
and rdf:type. I think that's entirely possible, since the modelling
is very reasonable and would probably yield good results. Our
reasoner ELK could easily handle the class hierarchy in terms of
size, but you don't really need such a highly optimized tool for
this as long as you only have subClassOf. In fact, the page you
linked to shows that it is perfectly possible to compute the class
hierarchy with Wikidata Query and to display all of it on one page.
ELK's main task is to compute class hierarchies for more complicated
ontologies, which we do not have yet. OTOH, query answering and data
access are different tasks that ELK is not really intended for
(although it could do some of this as well).

Regarding future perspectives: one thing that we have also done is
to extract OWL axioms from property constraint templates on Wikidata
talk pages (we will publish the result soon, when announcing the
rest). This gives you only some specific types of OWL axioms, but it
is making things a bit more interesting already. In particular,
there are some constraints that tell you that an item should have a
certain class, so this is something you could reason with. However,
the current property constraint system does not work too well for
stating axioms that are not related to a particular property (such
as: "Every [instance of] person who appears as an actor in some film
should be [instance of] in the class 'actor'" -- which property or
item page should this be stated on?). But the constraints show that
it makes sense to express such information somehow.

In the end, however, the real use of OWL (and similar ontology
languages) is to remove the need for making everything explicit.
That is, instead of "constraints" (which say: "if your data looks
like X, then your data should also include Y") you have "axioms"
(which say: "if your data looks like X, then Y follows
automatically"). So this allows you to remove redundancy rather than
to detect omissions. This would make more sense with "derived"
notions that one does not want to store in the database, but which
make sense for queries (like "grandmother").

One would need a bit more infrastructure for this; in particular,
one would need to define "grandmother" (with labels in many
languages) even if one does not want to use it as a property but
only in queries. Maybe one could have a separate Wikibase
installation for defining such derived notions without needing to
change Wikidata? There are no statements on properties yet, but one
could also use item pages to define derived properties when using
another site ...

Best regards,

Markus

P.S. Thanks for all the work on the "semantic" modelling aspects of
Wikidata. I have seen that you have done a lot in the discussions to
clarify things there.




Thomas Douillard

11:47 a.m.

Hi Markus.

Concerning redundancy, I have a question. Is redundancy, at least to some degree, something we absolutely want to remove from Wikidata? I don't think so. Wikidata is an open project where a lot of changes happen on a large number of "pages" (in the wiki sense). In my opinion, this means that the more control mechanisms there are, the better.

I think redundancy is a powerful mechanism for achieving robustness; it occurs to some extent in a lot of complex systems. For example, think of claim deletion. Assume a reasoner relies on a claim to make a lot of inferences. In a sense, that is a kind of compression of information. Then there is a risk that, if the deletion goes unnoticed, we lose a lot of data because of it.

Now suppose there is a redundant claim that is part of the inference chain that starts from our deleted claim. Could an inference-based mechanism highlight the fact that the graph might be incomplete, or that the deletion introduced an inconsistency, where an inference system working on a minimal set of claims (to compress the stored data) would simply stop making the inference? I guess it could also compare the completed (or partially completed?) graph before and after the change, and hint that there has actually been a mass loss of data. Or it could compute an "inference score", based on the number of inferences a claim is part of, to point patrollers to deletions worth verifying (just random thoughts).

Anyway, any thoughts on redundancy in Wikidata?
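
As a rough illustration of the "inference score" idea above, the following Python sketch counts how many derivable subclass-of pairs disappear when a single claim is deleted. It is only a sketch under stated assumptions: claims are modelled as plain (subclass, superclass) pairs, the only inference rule is transitivity, and the names `closure` and `inference_score` as well as the example labels are hypothetical, not part of any Wikidata tool.

```python
from collections import defaultdict


def closure(claims):
    """Return every (a, b) pair derivable by chaining subclass-of claims."""
    superclasses = defaultdict(set)
    for sub, sup in claims:
        superclasses[sub].add(sup)

    derived = set()
    for start in list(superclasses):
        stack = list(superclasses[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            derived.add((start, node))
            # .get() avoids creating new keys in the defaultdict
            stack.extend(superclasses.get(node, ()))
    return derived


def inference_score(claims, claim):
    """Count derivable pairs lost when `claim` is deleted.

    The deleted claim itself is counted if nothing else still implies it.
    """
    before = closure(claims)
    after = closure(claims - {claim})
    return len(before - after)


# Hypothetical example data (labels instead of Q-ids for readability).
claims = {
    ("grand piano", "piano"),
    ("piano", "keyboard instrument"),
    ("keyboard instrument", "musical instrument"),
}
deleted = ("piano", "keyboard instrument")

# Without redundancy, deleting the middle claim breaks every chain through it.
print(inference_score(claims, deleted))

# With one redundant claim stored explicitly, fewer pairs are lost.
redundant = claims | {("piano", "musical instrument")}
print(inference_score(redundant, deleted))
```

In this toy data the score drops when the redundant claim is present, which is the robustness effect described above; a high score could be one signal for patrollers to double-check a deletion.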


Markus Krötzsch

5:03 p.m.

Hi Thomas,

On 15/05/14 11:47, Thomas Douillard wrote:

Hi Markus.

Concerning redundancy, I have a question. Is redundancy, at least to some degree, something we absolutely want to remove from Wikidata? I don't think so. Wikidata is an open project where a lot of changes happen on a large number of "pages" (in the wiki sense). In my opinion, this means that the more control mechanisms there are, the better.

I think redundancy is a powerful mechanism for achieving robustness; it occurs to some extent in a lot of complex systems. For example, think of claim deletion. Assume a reasoner relies on a claim to make a lot of inferences. In a sense, that is a kind of compression of information. Then there is a risk that, if the deletion goes unnoticed, we lose a lot of data because of it.

Now suppose there is a redundant claim that is part of the inference chain that starts from our deleted claim. Could an inference-based mechanism highlight the fact that the graph might be incomplete, or that the deletion introduced an inconsistency, where an inference system working on a minimal set of claims (to compress the stored data) would simply stop making the inference? I guess it could also compare the completed (or partially completed?) graph before and after the change, and hint that there has actually been a mass loss of data. Or it could compute an "inference score", based on the number of inferences a claim is part of, to point patrollers to deletions worth verifying (just random thoughts).

Anyway, any thoughts on redundancy in Wikidata?

I agree with what you say. It is impossible to build a redundancy-free Wikidata (think of property "spouse" ;-), and there are several reasons for allowing for some kinds of redundancy. At the same time, we could never store *every* fact that implicitly follows from other statements. The community is having discussions about what should be in and what should be out, based on concrete use-cases (for example, we don't have "grandparent" but we do have "sister"). In the end, we have to leave this to the experts in each topic area.
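
To make the "grandparent" case concrete, here is a minimal Python sketch of a derived notion that is computed at query time instead of being stored. The data layout (a dictionary from a person to their stored parents), the function name `grandparents`, and the example names are assumptions made purely for illustration.

```python
def grandparents(parent_of, person):
    """Derive grandparents on demand from stored parent statements."""
    result = set()
    for parent in parent_of.get(person, ()):
        result.update(parent_of.get(parent, ()))
    return result


# Hypothetical stored statements: person -> set of parents.
parent_of = {
    "Carol": {"Bob"},
    "Bob": {"Alice", "Dan"},
}

print(sorted(grandparents(parent_of, "Carol")))  # ['Alice', 'Dan']
```

Nothing about grandparents is stored here; the information follows from the parent statements alone, which is why it does not need to be a property of its own.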

I applaud your comparison of inferencing with a form of decompression. I think this is a nice intuition (in fact, some people have researched "semantic compression" where one tries to reduce the size of a knowledge base by eliminating things that follow from the rest anyway).

You are right that the ramifications of removing one statement might be bigger if inferences are used. On the other hand, there are many other reasons why a single change can have a big impact: soon we will have simple queries, and it is quite possible that thousands of template instances will issue queries that all depend on the same single statement; deleting it would then change a lot of pages. Redundancy cannot protect against this, since most queries (or inference rules) would refer to one form of the data and not check all possible redundant formulations. Moreover, many things in Wikidata are not stored redundantly at all, yet we want robustness in all cases. A mechanism to indicate importance, as you describe, might be a solution, or we could use a form of protection to avoid accidental changes to important statements. But this is yet another discussion, which is probably more important for Wikipedia than for the prototype I had in mind.

Another interesting use of inferencing in a system that has redundancy could be to infer additional support for a statement (Example: if a reference says that X is a child of Y, then the same reference also supports the claim that Y is the parent of X). In other words: even if we don't want to infer statements, we might want to infer references :-)
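
The child/parent example could be sketched in Python roughly as follows. This is only an illustration under assumptions: the `Statement` class, the `INVERSE` table, and the property names "child" and "parent" are hypothetical stand-ins, and a triple (subject, prop, value) is read as "subject is a prop of value".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    references: frozenset = frozenset()


# Hypothetical table of inverse properties.
INVERSE = {"child": "parent", "parent": "child"}


def infer_reference_support(statements):
    """Map each inverse triple to the references that indirectly support it.

    No new statements are asserted; only reference support is derived.
    """
    support = {}
    for s in statements:
        inverse_prop = INVERSE.get(s.prop)
        if inverse_prop is None or not s.references:
            continue
        key = (s.value, inverse_prop, s.subject)
        support.setdefault(key, set()).update(s.references)
    return support


# "X is a child of Y", backed by one (hypothetical) reference.
stored = [Statement("X", "child", "Y", frozenset({"ref-1"}))]
print(infer_reference_support(stored))
```

The result only says which references would also back "Y is a parent of X"; whether that inverse statement is actually stored remains an editorial decision.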

Constraints are a great start. We should now ask how we could improve the management of constraints in the future, and which constraints we will have then.

Cheers,

Markus

...

2014-05-14 15:33 GMT+02:00 Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org>:

Hi Eric,

Thanks for all the information. This was very helpful. I only get to
answer now since we have been quite busy building RDF exports for
Wikidata (and writing a paper about it). I will soon announce this
here (we still need to fix a few details).

You were asking about using these properties like rdfs:subClassOf
and rdf:type. I think that's entirely possible, since the modelling
is very reasonable and would probably yield good results. Our
reasoner ELK could easily handle the class hierarchy in terms of
size, but you don't really need such a highly optimized tool for
this as long as you only have subClassOf. In fact, the page you
linked to shows that it is perfectly possible to compute the class
hierarchy with Wikidata Query and to display all of it on one page.
ELK's main task is to compute class hierarchies for more complicated
ontologies, which we do not have yet. OTOH, query answering and data
access are different tasks that ELK is not really intended for
(although it could do some of this as well).

Regarding future perspectives: one thing that we have also done is
to extract OWL axioms from property constraint templates on Wikidata
talk pages (we will publish the result soon, when announcing the
rest). This gives you only some specific types of OWL axioms, but it
is making things a bit more interesting already. In particular,
there are some constraints that tell you that an item should have a
certain class, so this is something you could reason with. However,
the current property constraint system does not work too well for
stating axioms that are not related to a particular property (such
as: "Every [instance of] person who appears as an actor in some film
should be [instance of] in the class 'actor'" -- which property or
item page should this be stated on?). But the constraints show that
it makes sense to express such information somehow.

In the end, however, the real use of OWL (and similar ontology
languages) is to remove the need for making everything explicit.
That is, instead of "constraints" (which say: "if your data looks
like X, then your data should also include Y") you have "axioms"
(which say: "if your data looks like X, then Y follows
automatically"). So this allows you to remove redundancy rather than
to detect omissions. This would make more sense with "derived"
notions that one does not want to store in the database, but which
make sense for queries (like "grandmother").

One would need a bit more infrastructure for this; in particular,
one would need to define "grandmother" (with labels in many
languages) even if one does not want to use it as a property but
only in queries. Maybe one could have a separate Wikibase
installation for defining such derived notions without needing to
change Wikidata? There are no statements on properties yet, but one
could also use item pages to define derived properties when using
another site ...

Best regards,

Markus

P.S. Thanks for all the work on the "semantic" modelling aspects of
Wikidata. I have seen that you have done a lot in the discussions to
clarify things there.



On 06/05/14 04:53, emw wrote:

    Hi Markus,

    You asked "who is creating all these [subclass of] statements
    and how is
    this done?"

    The class hierarchy in
    http://tools.wmflabs.org/__wikidata-todo/tree.html?q=__Q35120&rp=279&lang=en
    <http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en>
    shows a few relatively large subclass trees for specialist domains,
    including molecular biology and mineralogy.  The several thousand
    subclass of 'gene' and 'protein' subclass claims were created by
    members
    of WikiProject Molecular biology (WD:MB), based on discussions
    in [1]
    and [2].  The decision to use P279 instead of P31 there was
    based on the
    fact that the "is-a" relation in Gene Ontology maps to
    rdfs:subClassOf,
    which P279 is based on.  The claims were added by a bot [3],
    with input
    from WD:MB members.  The data ultimately comes from external
    biological
    databases.

    A glance at the mineralogy class hierarchy indicates it has been
    constructed by WikiProject Mineralogy [4] members through non-bot
    edits.  I imagine most of the other subclass of claims are done
    manually
    or semi-automatically outside specific Wikiproject efforts.  In
    other
    words, I think most of the other P279 claims are added by
    Wikidata users
    going into the UI and building usually-reasonable concept
    hierarchies on
    domains they're interested in.  I've worked on constructing class
    hierarchies for health problems (e.g. diseases and injuries) [5] and
    medical procedures [6] based on classifications like ICD-10 and
    assertions and templates on Wikipedia (e.g. [8]).

    It's not incredibly surprising to me that Wikidata has about 36,000
    subclass of (P279) claims [9].  The property has been around for
    over a
    year and is a regular topic of discussion [10] along with
    instance of
    (P31), which has over 6,600,000 claims.

    You noted a dubious claim subclass of claim for 'House of Staufen'
    (Q130875).  I agree that instance of would probably be the better
    membership property to use there.  Such questionable usage of
    P279 is
    probably uncommon, but definitely not singular.  The dynasty class
    hierarchy shows 13 dubious cases at the moment [11].  I would
    guess less
    than 5% of subclass of claims have that kind of issue, where
    instance of
    would make more sense.  I think there are probably vastly more
    cases of
    the converse: instance of being used where subclass of would
    make more
    sense.

    As you probably know, P31 and P279 are intended to have the
    semantics of
    rdf:type and rdfs:subClassOf per community decision.  A while
    ago I read
    a bit about the ELK reasoner you were involved with [12], which
    makes
    use of the seemingly class-centric OWL EL profile.  Do you have any
    plans to integrate features of ELK with the Wikidata Toolkit
    [13]?  How
    do you see reasoning engines using P31 and P279 in the future,
    if at all?

    Thanks,
    Eric

    https://www.wikidata.org/wiki/__User:Emw
    <https://www.wikidata.org/wiki/User:Emw>

    [1]
    https://www.wikidata.org/wiki/__WT:MB#Distinguishing_between___genes_and_proteins
    <https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins>
    [2] https://www.wikidata.org/wiki/__WT:MB#Human.2Fmouse.2F..._ID
    <https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID>
    [3] https://www.wikidata.org/wiki/__User:ProteinBoxBot
    <https://www.wikidata.org/wiki/User:ProteinBoxBot>.  Chinmay Nalk
    (https://www.wikidata.org/__wiki/User:Chinmay26
    <https://www.wikidata.org/wiki/User:Chinmay26>) did all the work
    on this,
    with input from WD:MB.
    [4]
    https://www.wikidata.org/wiki/__Wikidata:WikiProject___Mineralogy <https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy>
    [5]
    http://tools.wmflabs.org/__wikidata-todo/tree.html?q=__Q15281399&rp=279&lang=en
    <http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en>
    [6]
    http://tools.wmflabs.org/__wikidata-todo/tree.html?q=__Q796194&rp=279&lang=en
    <http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&lang=en>
    [7] http://apps.who.int/__classifications/icd10/browse/__2010/en
    <http://apps.who.int/classifications/icd10/browse/2010/en>
    [8] https://en.wikipedia.org/wiki/__Template:Surgeries
    <https://en.wikipedia.org/wiki/Template:Surgeries>
    [9]
    https://www.wikidata.org/w/__index.php?title=Wikidata:__Database_reports/Popular___properties&oldid=125595374
    <https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374>
    [10] Examples include
    -
    https://www.wikidata.org/wiki/__Wikidata:Project_chat#__chemical_element
    <https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element>
    -
    https://www.wikidata.org/wiki/__Wikidata:Project_chat/Archive/__2013/12#Top_of_the_subclass___tree
    <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree>

    -
    https://www.wikidata.org/wiki/__Wikidata:Project_chat/Archive/__2014/01#Question_about___classes.2C_and_.27instance_of.__27_vs_.27subclass.27
    <https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27>
    [11]
    http://tools.wmflabs.org/__wikidata-todo/tree.html?q=__Q164950&rp=279&lang=en
    <http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&lang=en>
    [12] http://korrekt.org/page/The___Incredible_ELK
    <http://korrekt.org/page/The_Incredible_ELK>
    [13] https://www.mediawiki.org/__wiki/Wikidata_Toolkit
    <https://www.mediawiki.org/wiki/Wikidata_Toolkit>





Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

David Cuenca

27 May 27 May

3:06 p.m.

On Thu, May 15, 2014 at 5:03 PM, Markus Krötzsch <markus@semantic-mediawiki.org> wrote:

...

I applaud your comparison of inferencing with a form of decompression. I think this is a nice intuition (in fact, some people have researched "semantic compression" where one tries to reduce the size of a knowledge base by eliminating things that follow from the rest anyway).

Markus, sorry for the delay in answering this; I had to let the ideas grow for a while.

I also like the idea of decompression; that is what makes your "database of inferred data" even more useful. There is a lot of data that can be inferred, not just by following relationships but also by computing. For instance, "population density" can be calculated from "area" and "population", and the population of a district can be aggregated from the populations of its towns.
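A minimal sketch of what such computed inferences could look like, assuming each place is held as a simple record. The `Place` class, its field names, and the two helper functions are hypothetical illustrations, not part of Wikidata or any existing tool:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Place:
    # Hypothetical record; the fields stand in for "population" and "area" statements.
    name: str
    population: Optional[int] = None   # inhabitants
    area_km2: Optional[float] = None   # square kilometres


def population_density(place: Place) -> Optional[float]:
    """Derive inhabitants per km^2; return None if an input is missing or the area is not positive."""
    if place.population is None or place.area_km2 is None or place.area_km2 <= 0:
        return None
    return place.population / place.area_km2


def district_population(towns: List[Place]) -> int:
    """Aggregate a district's population from the towns whose population is known."""
    return sum(t.population for t in towns if t.population is not None)


# Made-up numbers, for illustration only.
towns = [Place("Town A", 1000, 4.0), Place("Town B", 500, 2.5), Place("Town C")]
density_a = population_density(towns[0])
total = district_population(towns)
```

Returning `None` instead of guessing keeps a derived value from being stored when one of its inputs is absent, which seems the safer default for a store of inferred data.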

Another source of inferred statements is Wikipedia categories. Most of them are easily translatable into statements, and the other way round too. A place to store and process these inferences would be most useful if Wikidata is not the right place.

You also say: "Constraints are a great start. We should now ask how we could improve the management of constraints in the future, and which constraints we will have then." The first step will be having them as statements, then having them as queries, and finally automating their correction, either with semi-automatic tools or with gamification. How to automatically transform a constraint into a game that resolves its outliers might also be an interesting topic. And of course, more far-fetched but nevertheless relevant, is how to connect the property to a perceptual mechanism.
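The step from constraint to query can be sketched as follows, assuming a "single value" style constraint (a property should carry at most one value per item) and statements held as plain triples. The function name, the triple layout, and the item identifiers are hypothetical; P569 (date of birth) is used only as a familiar example of a property that normally has one value:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# A statement as (item, property, value); a simplification of the real data model.
Statement = Tuple[str, str, str]


def single_value_outliers(statements: Iterable[Statement], prop: str) -> Dict[str, List[str]]:
    """Return the items that hold more than one distinct value for `prop`."""
    values: Dict[str, Set[str]] = defaultdict(set)
    for item, p, value in statements:
        if p == prop:
            values[item].add(value)
    # Only items violating the constraint are kept; these are the outliers.
    return {item: sorted(vs) for item, vs in values.items() if len(vs) > 1}


# Toy data with made-up item identifiers.
statements = [
    ("item-a", "P569", "1900-01-01"),
    ("item-b", "P569", "1900-01-01"),
    ("item-b", "P569", "1901-01-01"),
    ("item-b", "P31", "human"),
]
outliers = single_value_outliers(statements, "P569")
```

Each entry in `outliers` is a small, self-contained question ("which of these values is right?"), which is the kind of unit a semi-automatic tool or a game could hand to a contributor.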

About improving the reliability: yes, as Wikidata grows bigger, some statements become more important. There is something to be learnt from how neural nets work, especially the strengthening of the most-used (or traveled, or accepted, or viewed) connections. Another process that is little understood now is the need to forget or, in Wikidata terms, to auto-deprecate information that is no longer current. Not very relevant now, but something to keep in mind for the coming years.

Cheers, Micru
