Re: [Wikidata-l] Subclass of/instance of

15 May 2014

Hi Markus,
Concerning redundancy, I wonder: is redundancy, at least to some degree,
something we absolutely want to remove from Wikidata? I don't think so.
Wikidata is an open project where a lot of changes happen on a large
number of "pages" (in the wiki sense). In my opinion this means that the
more control mechanisms there are, the better.
I think redundancy is a powerful mechanism for achieving robustness; it
occurs to some extent in a lot of complex systems. For example, think of
claim deletion. Assume a reasoner relies on a claim to make a lot of
inferences. In a sense this is a kind of compression of information. Then
there is a risk, if the deletion goes unnoticed, that we lose a lot of data
due to that claim deletion.
Now suppose there is a redundant claim that is part of the inference chain
that comes from our deleted claim. Could a mechanism based on inference
highlight the fact that the graph might be incomplete, or that the deletion
introduced an inconsistency, or whatever, where an inference system with
only a minimal set of claims to compress the stored data would just stop
making the inference? I guess it could also compare the graph (would we
say completed graph, or partially completed?) before and after the change,
and hint that there actually is a mass loss of data. Or compute an
"inference score" based on the number of inferences a claim is a part of,
to hint to patrollers which deletions to verify (just random thoughts).
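To make the "inference score" idea concrete, here is a minimal sketch, assuming claims are plain (subject, object) pairs of a single transitive property such as subclass of. The function names and the toy labels are hypothetical and not part of any Wikidata tool; the score is simply the number of implied pairs that vanish when one claim is deleted.

```python
from collections import defaultdict

def closure(claims):
    """Return all (a, b) pairs implied by transitivity of the given pairs."""
    succ = defaultdict(set)
    for a, b in claims:
        succ[a].add(b)
    implied = set()
    for start in list(succ):
        stack = list(succ[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(succ.get(node, ()))
        implied.update((start, node) for node in seen)
    return implied

def inference_score(claims, claim):
    """Number of implied pairs that disappear if `claim` is deleted."""
    remaining = set(claims) - {claim}
    return len(closure(claims) - closure(remaining))

# Toy subclass-of chain; the labels are illustrative, not real item data.
minimal = {("piano", "keyboard instrument"),
           ("keyboard instrument", "musical instrument"),
           ("musical instrument", "object")}
# Same chain plus one redundant claim that is already implied.
redundant = minimal | {("piano", "musical instrument")}

target = ("keyboard instrument", "musical instrument")
score_minimal = inference_score(minimal, target)      # 4 pairs lost
score_redundant = inference_score(redundant, target)  # 2 pairs lost
```

In this toy case the redundant claim halves the damage of the deletion: the facts about "piano" survive, and only those about "keyboard instrument" are lost. A patrolling tool could rank deletions by such a score, though computing it on the real graph would need something more efficient than recomputing the closure twice.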
Anyway, any thoughts on redundancy in Wikidata?
2014-05-14 15:33 GMT+02:00 Markus Krötzsch <markus@semantic-mediawiki.org>:
...
Hi Eric,
Thanks for all the information. This was very helpful. I only get to
answer now since we have been quite busy building RDF exports for Wikidata
(and writing a paper about it). I will soon announce this here (we still
need to fix a few details).
You were asking about using these properties like rdfs:subClassOf and
rdf:type. I think that's entirely possible, since the modelling is very
reasonable and would probably yield good results. Our reasoner ELK could
easily handle the class hierarchy in terms of size, but you don't really
need such a highly optimized tool for this as long as you only have
subClassOf. In fact, the page you linked to shows that it is perfectly
possible to compute the class hierarchy with Wikidata Query and to display
all of it on one page. ELK's main task is to compute class hierarchies for
more complicated ontologies, which we do not have yet. OTOH, query
answering and data access are different tasks that ELK is not really
intended for (although it could do some of this as well).
Regarding future perspectives: one thing that we have also done is to
extract OWL axioms from property constraint templates on Wikidata talk
pages (we will publish the result soon, when announcing the rest). This
gives you only some specific types of OWL axioms, but it is making things a
bit more interesting already. In particular, there are some constraints
that tell you that an item should have a certain class, so this is
something you could reason with. However, the current property constraint
system does not work too well for stating axioms that are not related to a
particular property (such as: "Every [instance of] person who appears as an
actor in some film should be [instance of] in the class 'actor'" -- which
property or item page should this be stated on?). But the constraints show
that it makes sense to express such information somehow.
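As an illustration of the constraint reading of the actor example, here is a minimal sketch, assuming cast information is available as (film, person) pairs and class membership as a mapping from item to a set of classes. The data layout, the function name and the sample names are hypothetical; they are not how property constraints are actually stored on Wikidata.

```python
def missing_class_claims(cast_member, instance_of, required_class="actor"):
    """Constraint reading: list performers lacking the required class claim."""
    performers = {person for _film, person in cast_member}
    return sorted(p for p in performers
                  if required_class not in instance_of.get(p, set()))

# Hypothetical sample data.
cast_member = [("film A", "Alice"), ("film A", "Bob"), ("film B", "Alice")]
instance_of = {"Alice": {"person", "actor"}, "Bob": {"person"}}

violations = missing_class_claims(cast_member, instance_of)  # ["Bob"]
```

Note that the check involves two properties and one class at once, which is why it is unclear which property or item page it should be stated on.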
In the end, however, the real use of OWL (and similar ontology languages)
is to remove the need for making everything explicit. That is, instead of
"constraints" (which say: "if your data looks like X, then your data should
also include Y") you have "axioms" (which say: "if your data looks like X,
then Y follows automatically"). So this allows you to remove redundancy
rather than to detect omissions. This would make more sense with "derived"
notions that one does not want to store in the database, but which make
sense for queries (like "grandmother").
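A minimal sketch of such a derived notion, assuming the stored data has a "mother" relation (person to mother) and a "parent" relation (person to set of parents). The names and the dictionary layout are hypothetical. The point is that "grandmother" is computed when queried and never stored.

```python
def grandmothers(mother, parent):
    """Derive (grandchild, grandmother) pairs: the mother of any parent."""
    return {(child, mother[p])
            for child, parents in parent.items()
            for p in parents
            if p in mother}

# Hypothetical stored facts.
mother = {"Carol": "Beth", "Dan": "Eve"}   # person -> mother
parent = {"Fay": {"Carol", "Dan"}}         # person -> set of parents

derived = grandmothers(mother, parent)     # {("Fay", "Beth"), ("Fay", "Eve")}
```

Under the axiom reading, nothing has to be added to the database when a new "mother" or "parent" claim appears; the derived pairs simply change the next time the query runs.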
One would need a bit more infrastructure for this; in particular, one
would need to define "grandmother" (with labels in many languages) even if
one does not want to use it as a property but only in queries. Maybe one
could have a separate Wikibase installation for defining such derived
notions without needing to change Wikidata? There are no statements on
properties yet, but one could also use item pages to define derived
properties when using another site ...
Best regards,
Markus
P.S. Thanks for all the work on the "semantic" modelling aspects of
Wikidata. I have seen that you have done a lot in the discussions to
clarify things there.
On 06/05/14 04:53, emw wrote:
...
Hi Markus,
You asked "who is creating all these [subclass of] statements and how is
this done?"
The class hierarchy in
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lan...
shows a few relatively large subclass trees for specialist domains,
including molecular biology and mineralogy.  The several thousand
subclass of claims under 'gene' and 'protein' were created by members
of WikiProject Molecular biology (WD:MB), based on discussions in [1]
and [2].  The decision to use P279 instead of P31 there was based on the
fact that the "is-a" relation in Gene Ontology maps to rdfs:subClassOf,
which P279 is based on.  The claims were added by a bot [3], with input
from WD:MB members.  The data ultimately comes from external biological
databases.
A glance at the mineralogy class hierarchy indicates it has been
constructed by WikiProject Mineralogy [4] members through non-bot
edits.  I imagine most of the other subclass of claims are done manually
or semi-automatically outside specific Wikiproject efforts.  In other
words, I think most of the other P279 claims are added by Wikidata users
going into the UI and building usually-reasonable concept hierarchies on
domains they're interested in.  I've worked on constructing class
hierarchies for health problems (e.g. diseases and injuries) [5] and
medical procedures [6] based on classifications like ICD-10 [7] and
assertions and templates on Wikipedia (e.g. [8]).
It's not incredibly surprising to me that Wikidata has about 36,000
subclass of (P279) claims [9].  The property has been around for over a
year and is a regular topic of discussion [10] along with instance of
(P31), which has over 6,600,000 claims.
You noted a dubious subclass of claim for 'House of Staufen'
(Q130875).  I agree that instance of would probably be the better
membership property to use there.  Such questionable usage of P279 is
probably uncommon, but definitely not singular.  The dynasty class
hierarchy shows 13 dubious cases at the moment [11].  I would guess fewer
than 5% of subclass of claims have that kind of issue, where instance of
would make more sense.  I think there are probably vastly more cases of
the converse: instance of being used where subclass of would make more
sense.
As you probably know, P31 and P279 are intended to have the semantics of
rdf:type and rdfs:subClassOf per community decision.  A while ago I read
a bit about the ELK reasoner you were involved with [12], which makes
use of the seemingly class-centric OWL EL profile.  Do you have any
plans to integrate features of ELK with the Wikidata Toolkit [13]?  How
do you see reasoning engines using P31 and P279 in the future, if at all?
Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw
[1] https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
[3] https://www.wikidata.org/wiki/User:ProteinBoxBot.  Chinmay Nalk
(https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this,
with input from WD:MB.
[4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
[5] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
[6] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&la...
[7] http://apps.who.int/classifications/icd10/browse/2010/en
[8] https://en.wikipedia.org/wiki/Template:Surgeries
[9] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
[10] Examples include
https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
[11] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&la...
[12] http://korrekt.org/page/The_Incredible_ELK
[13] https://www.mediawiki.org/wiki/Wikidata_Toolkit
On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch
<markus.kroetzsch@tu-dresden.de> wrote:
Hi,

I got interested in subclass of (P279) and instance of (P31)
statements recently. I was surprised by two things:

(1) There are quite a lot of subclass of statements: tens of thousands.
(2) Many of them make a lot of sense, and (in particular) are not
(obvious) copies of Wikipedia categories.
My big question is: who is creating all these statements and how is
this done? It seems too much data to be created manually, but I
don't see obvious automated approaches either (and there are usually
no references given).

I also found some rare issues. "A subclass of B" should be read as
"Every A is also a B". For example, we have "Every piano (Q5994) is
also a keyboard instrument (Q52954)". Overall, the great majority of
cases I looked at had remarkably sane modelling (which reinforces my
big question).

But there are still cases where "subclass of" is mixed up with
"instance of". For example, Wikidata also says "Every 'House of
Staufen' (Q130875) is also a dynasty (Q164950)". This is dubious --
how many instances of 'House of Staufen' are there? I guess we
really want to say that "The House of Staufen is a(n instance of)
dynasty." Is this a singular error or a systematic issue?
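The difference between the two readings can be sketched in a few lines, assuming "instance of" and "subclass of" behave like rdf:type and rdfs:subClassOf (an item belongs to every superclass of its classes). The helper names and the item "my_piano" are hypothetical; the Q-ids are the ones mentioned above.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from `cls` via subclass-of, including itself."""
    found, stack = set(), [cls]
    while stack:
        c = stack.pop()
        if c not in found:
            found.add(c)
            stack.extend(subclass_of.get(c, ()))
    return found

def inferred_types(item, instance_of, subclass_of):
    """Classes the item belongs to under the type/subclass reading."""
    types = set()
    for cls in instance_of.get(item, ()):
        types |= superclasses(cls, subclass_of)
    return types

# "Every piano (Q5994) is also a keyboard instrument (Q52954)".
subclass_of = {"Q5994": {"Q52954"}}
instance_of = {"my_piano": {"Q5994"}}   # hypothetical item, not a real ID
piano_types = inferred_types("my_piano", instance_of, subclass_of)

# 'House of Staufen' (Q130875) modelled as a subclass of dynasty (Q164950) ...
as_subclass = inferred_types("Q130875", {}, {"Q130875": {"Q164950"}})
# ... versus modelled as an instance of dynasty.
as_instance = inferred_types("Q130875", {"Q130875": {"Q164950"}}, {})
```

With the subclass modelling, `as_subclass` is empty: nothing says that the House of Staufen itself is a dynasty, so a query for all dynasties would miss it. With the instance modelling it is found.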

I guess there is already a group of people who deal with such issues
-- or it would be a miracle that things are in such a good shape
already :-) I have read the talk page for subclass of, but that does
not seem to explain the origin of all the data we have already.
Pointers?

Cheers,

Markus

_________________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l

