Hello,

It is interesting that what Cparle wants are "is a" relationships based on common sense. For most people, ants are insects, not instances of "taxon". A clarinet is a woodwind instrument, and woodwind instruments are musical instruments, not instances of "first-order metaclass".

One of the best sources of "common sense" hypernymy is probably the first sentence of a Wikipedia page. Whether in English, French, or Italian, a woman is always "a female human being."

For "poodle", this would look like (following the links in the English version of Wikipedia):

- The poodle is a group of formal dog breeds

- Dog breeds are dogs that...

- The domestic dog (...) is a member of the genus Canis (canines)

- Canis is a genus of the Canidae

- The biological family Canidae (...) is a lineage of carnivorans

- Carnivora (...) is a diverse scrotiferan order

- Scrotifera is a clade of placental mammals

- Placentalia ("Placentals") is one of the three extant subdivisions of the class of animals Mammalia...

- Mammals are the vertebrates within the class Mammalia...

From my point of view, this classification looks much better than the current relationships in Wikidata's ontology.
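
To make the chain concrete, here is a minimal Python sketch that walks it. The `HYPERNYMS` dictionary and the `hypernym_chain` helper are my own illustrative names, and the mapping is simply transcribed by hand from the first sentences quoted above; nothing here is extracted automatically or read from Wikidata.

```python
# Hypothetical mapping, transcribed by hand from the first sentences listed above.
HYPERNYMS = {
    "poodle": "dog breed",
    "dog breed": "dog",
    "dog": "Canis",
    "Canis": "Canidae",
    "Canidae": "Carnivora",
    "Carnivora": "Scrotifera",
    "Scrotifera": "Placentalia",
    "Placentalia": "Mammalia",
    "Mammalia": "vertebrate",
}


def hypernym_chain(term, hypernyms, max_depth=50):
    """Follow hypernym links from `term` until none is left or a cycle appears."""
    chain = [term]
    seen = {term}
    while chain[-1] in hypernyms and len(chain) <= max_depth:
        parent = hypernyms[chain[-1]]
        if parent in seen:
            # Stop instead of looping forever on circular definitions.
            break
        chain.append(parent)
        seen.add(parent)
    return chain


if __name__ == "__main__":
    print(" -> ".join(hypernym_chain("poodle", HYPERNYMS)))
```

The cycle check matters in practice, since first sentences on Wikipedia can define two terms in terms of each other.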

The automatic extraction of hypernymic relationships from English texts (especially Wikipedia) has been studied for a long time and gives good results, even with simple methods based on hand-crafted rules. In the case of Wikipedia, the hypernym often has a page itself (and therefore a link to Wikidata), which could simplify the NLP extraction and the mapping with Wikidata items.
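
As a rough illustration of what a hand-crafted rule can look like, here is a toy Python sketch with a single copula pattern ("X is/are a Y"). The names `COPULA_RULE` and `extract_hypernym_phrase` are mine, the rule is far simpler than the methods in the literature, and it only returns the raw predicate phrase; picking the head noun and mapping it to a linked page would be further steps.

```python
import re

# One hand-crafted copula rule: "<subject> is/are [a|an|the] <predicate>".
COPULA_RULE = re.compile(
    r"^(?P<subject>.+?)\s+(?:is|are)\s+(?:(?:a|an|the)\s+)?(?P<predicate>.+?)\.*$",
    re.IGNORECASE,
)


def extract_hypernym_phrase(first_sentence):
    """Return (subject, predicate phrase) for a simple copula sentence, else None."""
    # Drop parenthetical asides such as "(...)" before matching.
    cleaned = re.sub(r"\s*\([^)]*\)", "", first_sentence).strip()
    match = COPULA_RULE.match(cleaned)
    if match is None:
        return None
    return match.group("subject"), match.group("predicate")


if __name__ == "__main__":
    for sentence in (
        "Canis is a genus of the Canidae",
        "The biological family Canidae (...) is a lineage of carnivorans",
    ):
        print(extract_hypernym_phrase(sentence))
```

On Wikipedia, the predicate phrase often contains a link to the hypernym's own page, which is what would make the mapping to a Wikidata item easier than in free text.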

Of course, the extracted relationships will not always be "subclass of" or "instance of". But if someone proposed a new property called "Wikipedia Hypernyms" (and its inverse property "Wikipedia Hyponyms"), I would use it more willingly and with more confidence than the current system. This would also better respect the logic of Wikidata's descriptions.

I mean, if the description of Zoroastrianism (Q9601) says this is an "Ancient Iranian religion founded by Zoroaster", one would expect the class "religion" to appear much earlier in the hierarchy of superclasses of this item. If a "Wikipedia Hypernyms" property existed, we could state it on the same item, since Wikipedia describes Zoroastrianism as "one of the world's oldest religions that remains active." And a SPARQL query looking for 'all items that have "religion" as their "Wikipedia Hypernyms" value' would be much faster.
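
To show the difference in query shape, here is a small Python sketch that only builds the two SPARQL strings; it does not contact any endpoint. The property ID `P_WIKIPEDIA_HYPERNYM` is a placeholder for a property that does not exist today, the function names are mine, and I am assuming Q9174 is the item for "religion" (please check before reusing). I have not benchmarked either query.

```python
# Hypothetical placeholder: no such property exists on Wikidata today.
HYPOTHETICAL_HYPERNYM_PROPERTY = "P_WIKIPEDIA_HYPERNYM"


def direct_hypernym_query(class_qid, prop=HYPOTHETICAL_HYPERNYM_PROPERTY):
    """One triple pattern: items whose (hypothetical) hypernym property points at class_qid."""
    return f"SELECT ?item WHERE {{ ?item wdt:{prop} wd:{class_qid} . }}"


def subclass_path_query(class_qid):
    """Current approach: instance of (P31), then any number of subclass of (P279) steps."""
    return f"SELECT ?item WHERE {{ ?item wdt:P31/wdt:P279* wd:{class_qid} . }}"


if __name__ == "__main__":
    religion = "Q9174"  # assumed to be the item for "religion"
    print(direct_hypernym_query(religion))
    print(subclass_path_query(religion))
```

The first query is a single direct lookup, while the second has to traverse the subclass hierarchy, which is where I would expect most of the cost to come from.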

Note: sorry if this reflection is naive or if it has already been discussed/tested.

Cheers,

Ettore

On Thu, 27 Sep 2018 at 23:35, James Heald <jpm.heald@gmail.com> wrote:

This recent announcement by the Structured Data team perhaps ought to be
quite a heads-up for us:

https://commons.wikimedia.org/wiki/Commons_talk:Structured_data#Searching_Commons_-_how_to_structure_coverage

Essentially the team has given up on the hope of using Wikidata
hierarchies to suggest generalised "depicts" values to store for images
on Commons, to match against terms in incoming search requests.

i.e. if an image is of a German Shepherd dog, and identified as such,
the team has given up on trying to infer in general from Wikidata that
'dog' is also a search term that such an image should score positively with.

Apparently the Wikidata hierarchies were simply too complicated, too
unpredictable, and too arbitrary and inconsistent in their design across
different subject areas to be readily assimilated (before one even
starts on the density of bugs and glitches that then undermine them).

Instead, if that image ought to be considered in a search for 'dog', it
looks as though an explicit 'depicts:dog' statement may need to be
present, in addition to 'depicts:German Shepherd'.

Some of the background behind this assessment can be read in
https://phabricator.wikimedia.org/T199119
in particular the first substantive comment on that ticket, by Cparle on
10 July, giving his quick initial read of some of the issues using
Wikidata would face.

SDC was considered a flagship end-application for Wikidata. If the data
in Wikidata is not usable enough to supply the dogfood that project was
expected to rely on, that should be a serious wake-up call, a red flag
we should not ignore.

If the way data is organised across different subjects is currently too
inconsistent and confusing to be usable by our own SDC project, are
there actions we can take to address that? Are there design principles
to be chosen that then need to be applied consistently? Is this
something the community can do, or is some more active direction going
to need to be applied?

Wikidata's 'ontology' has grown haphazardly, with little oversight, like
an untended bank of weeds. Is some more active gardening now required?

-- James.

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata