Hi Stas,
Thanks for elaborating. I think we could always start with traversing only "subclass of". In spite of its limits, it does work in many areas (e.g. buildings, astronomical objects, vehicles, organisations, etc.), even if by far not in all. Where it doesn't work, one would simply not get enough results, but the alternative (do not even use "subclass of") will just make this problem worse. Any approach of fixing the latter will also help the former.
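To make the "subclass of"-only starting point concrete, here is a minimal sketch in Python. It assumes a tiny in-memory map of "subclass of" (P279) edges; the class labels and the helper names `superclasses` and `matches` are made up for illustration and are not real Wikidata content or an existing API.

```python
from collections import deque

# Toy "subclass of" (P279) edges: class -> its direct superclasses.
# The labels are illustrative assumptions, not actual Wikidata data.
SUBCLASS_OF = {
    "cathedral": ["church building"],
    "church building": ["building"],
    "skyscraper": ["building"],
    "building": ["architectural structure"],
}


def superclasses(cls, edges):
    """Return every class reachable from cls via "subclass of" edges.

    Breadth-first traversal; the seen set also protects against cycles.
    """
    seen = set()
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def matches(search_class, item_class, edges):
    """True if an item of item_class should be found when searching for search_class."""
    return item_class == search_class or search_class in superclasses(item_class, edges)
```

A search for "building" would then also return items tagged "cathedral", while a class that has no "subclass of" path to the search term simply yields no extra results, which is the "not enough results" failure mode mentioned above.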
Now regarding issues such as dog, woman, and many other things, it seems clear that what one would need are inference rules. It should be possible to say somewhere that "if a human is female, then it is also a woman" without having to add the unwanted statement "instance of woman" everywhere. Or "if someone has profession 'programmer' then he/she/they is/are a programmer" -- at least for the purpose of media search. The case of dogs would be complicated (referring to quantifiers) but still doable.
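As a rough illustration of what such rules could look like when kept apart from the data, here is a small Python sketch. Everything in it is an assumption made for the example: the triples, the rule encoding and the function name `inferred_tags` are hypothetical and do not reflect SQID's rule format or any existing Wikidata feature.

```python
# Statements as (subject, property, value) triples; the items are hypothetical.
FACTS = {
    ("alice", "instance of", "human"),
    ("alice", "sex or gender", "female"),
    ("alice", "occupation", "programmer"),
    ("bob", "instance of", "human"),
    ("bob", "sex or gender", "male"),
}

# Each rule: (conditions that must all hold for the subject, tag to infer).
RULES = [
    ([("instance of", "human"), ("sex or gender", "female")], "woman"),
    ([("occupation", "programmer")], "programmer"),
]


def inferred_tags(subject, facts, rules):
    """Return the tags the rules infer for subject, without modifying facts."""
    tags = set()
    for conditions, tag in rules:
        if all((subject, prop, value) in facts for prop, value in conditions):
            tags.add(tag)
    return tags
```

The point of the design is that the "woman" tag is computed on demand (or into a separate index), so no "instance of woman" statement is ever written back to the underlying data.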
Obvious questions arise:
* Would we prefer to maintain such rules somewhere rather than adding the relations they might infer manually? (Probably yes, since one would need far fewer rules than manual statements, which would always add redundancy and cause conflicts -- cf. taxonomy modelling discussion -- that are not necessary when applications can select which inference rules to use without touching the underlying data.)
* How would the rules look to human editors? (We have made some first proposals for this; see the rules supported by SQID [1]; but one can come up with other options.)
* Where would such rules be managed? (Preferably on Wikidata, but the encoding in statements would be a challenge; another challenge is how to associate rules with entities -- usually they make connections between several entities.)
* How would the rules be applied on the live data, especially if there are many updates? (Doable using known algorithms and based on existing tools, but still needs some implementation work; I think for a start one could just reduce the update speed on these "inferred tags" and still get a big improvement over the case where nothing of this type is done at all.)
So would this be a mid-term goal to overcome this issue? I would think so, also because there are enough degrees of freedom here to gradually grow this from simple (only allow rules that effectively add some more traversal hints) to powerful (have rules that can use qualifiers, as needed to get from dog to mammal). The main challenge is to find a good approach for community-editing this part without restricting upfront to a few special cases (as for the case of the constraints).
Inference rules come up as potential solutions in many similar tasks where you want users to access/query the data. Imagine someone looking for the brothers of a person (let's assume we'd built an intelligent search for such things) -- again, Wikidata has no concept of "brother" and we would not have any idea how to answer this, unless somewhere we'd have a rule that defines how you can find brother-relationships from the data that we actually have. This happens a lot when you want users who are not familiar with how we organise data to find things, but the solution cannot be to add every possible view/inferred statement to Wikidata explicitly.
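For the brother example, one conceivable rule is "y is a brother of x if y is a sibling of x and y is male". The Python sketch below applies that single rule to a handful of made-up triples; the data, the property labels used as strings and the function name `brothers_of` are assumptions for illustration only.

```python
# Hypothetical statements as (subject, property, value) triples.
FAMILY_FACTS = {
    ("ann", "sibling", "ben"),
    ("ann", "sibling", "cara"),
    ("ben", "sex or gender", "male"),
    ("cara", "sex or gender", "female"),
}


def brothers_of(person, facts):
    """Apply the rule: brother(person, y) if sibling(person, y) and y is male."""
    siblings = {obj for subj, prop, obj in facts if subj == person and prop == "sibling"}
    return {y for y in siblings if (y, "sex or gender", "male") in facts}
```

With this data, `brothers_of("ann", FAMILY_FACTS)` yields only "ben", although no "brother" statement exists anywhere in the facts.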
Obviously, the rule approach is not something we could deploy anytime soon, but it could be something to work towards ...
Cheers,
Markus
[1] Example rule with explanation of how it was applied to find a grandfather of Ada Lovelace: https://tinyurl.com/y7rgmk7o The qualifier sets (X, Y, Z) are unused here and could be hidden entirely, but this is just a prototype.
On 20/10/2018 00:28, Stas Malyshev wrote:
Hi!
possibility to find more results by letting the search engine traverse the "more-general-than" links stored in Wikidata. People have discovered cases where some of these links are not correct (surprise! it's a wiki ;-), and the suggestion was that such glitches would be fixed with higher priority if there would be an application relying on it. But even
The main problem I see here is not that some links are incorrect - which may have bad effects, but it's not the most important issue. The most important one, IMHO, is that there's no way to figure out in any scalable and scriptable way what "more-general-than" means for any particular case.
It's different for each type of object and often inconsistent within the same class (e.g. see confusion between whether "dog" is an animal, a name of the animal, name of the taxon, etc.) It's not that navigating the hierarchy would lead us astray - we're not even there yet to have this problem, because we don't even have a good way to navigate it.
Using instance-of/subclass-of only seems to not be that useful, because a lot of interesting things are not represented in this way - e.g. finding out that Donna Strickland (Q56855591) is a woman (Q467) is impossible using only this hierarchy. We could special-case a bunch of those but given how diverse Wikidata is, I don't think this will ever cover any significant part of the hierarchy unless we find a non-ad-hoc method of doing this.
This also makes it particularly hard to do something like "let's start using it and fix the issues as we discover them", because the main issue here is that we don't have a way to start with anything useful beyond a tiny subset of classes that we can special-case manually. We can't launch a rocket and figure out how to build the engine later - having a working engine is a prerequisite to launching the rocket!
There are also significant technical challenges in this - indexing a dynamically changing hierarchy is very problematic, and with our approach to ontology anything can be a class, so we'd have to constantly update the hierarchy. But this is more of a technical challenge, which will come after we have some solution for the above.