On 19/10/2018 01:09, James Heald wrote:
On 18/10/2018 22:33, Markus Kroetzsch wrote:
And, on another note, there is also a huge misunderstanding exposed in
the discussion on the search-related tracker item: Cparle there
speaks about "traversing the subclass hierarchy" but is actually
looking at *super*classes of, e.g., "Clarinet", which he mostly finds
irrelevant to users who care about clarinets. But surely that's the
wrong direction! You have to look for *sub*classes to find special
cases of what you are looking for. Looking downwards will often lead
to much saner ontologies than when turning your head towards the dizzy
heights of upper ontology. Yes, the few of us looking for instances of
"logical consequence" will still get clarinets, but those who look for
instances of clarinet will merely see instances of alto clarinet,
piccolo clarinet, basset horn, Saxonette, and so on. So instead of
trying to suggest to Commons editors meaningful "upper concepts", one
could simply enable the use of lower concepts in search. It does not
work in all cases yet, but it does in many.
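
To make the "look downwards" idea concrete, here is a minimal sketch in Python. Everything in it is assumed for illustration: the toy SUBCLASS_OF table, the FILE_TAGS table, and the file names are hypothetical and do not come from Wikidata or Commons. Each file carries only its most specific tag, and the search expands the query term to its transitive subclasses instead of adding superclass tags to files.

```python
from collections import deque

# Hypothetical toy hierarchy: class -> its direct superclasses ("subclass of").
SUBCLASS_OF = {
    "alto clarinet": ["clarinet"],
    "piccolo clarinet": ["clarinet"],
    "basset horn": ["clarinet"],
    "clarinet": ["woodwind instrument"],
    "woodwind instrument": ["musical instrument"],
}

# Hypothetical files, each tagged once with its most specific concept.
FILE_TAGS = {
    "alto_clarinet.jpg": "alto clarinet",
    "basset_horn.jpg": "basset horn",
    "clarinet.jpg": "clarinet",
    "oboe.jpg": "woodwind instrument",
}


def expand_downwards(term, subclass_of):
    """Return the term together with all of its transitive subclasses."""
    # Invert the table so we can walk from a class to its direct subclasses.
    children = {}
    for child, parents in subclass_of.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def search(term, file_tags, subclass_of):
    """Find files whose tag is the term or any subclass of it."""
    wanted = expand_downwards(term, subclass_of)
    return sorted(name for name, tag in file_tags.items() if tag in wanted)
```

With this toy data, searching for "clarinet" returns the alto clarinet, basset horn, and clarinet files but not the oboe one, and no file needed a "clarinet" keyword added by hand. Fixing an entry in the hierarchy table changes the results for every file at once.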
Cparle wants to make sure that people searching for "clarinet" also get
shown images of "piccolo clarinet" etc.
To make this possible, where an image has been tagged "basset horn" he
is therefore looking to add "clarinet" as an additional keyword, so that
if somebody types "clarinet" into the search box, one of the images
retrieved by ElasticSearch will be the basset horn one.
I imagine there are pluses and minuses both ways, whether you try to
make sure one search returns more hits, or try to run multiple searches
each returning fewer hits.
Your suggestion of the latter approach may not involve so much
pre-investigation of the top of the tree, which may be terms that people
are less likely to search for; but on the other hand, the actual
searching may be less efficient than a single indexed search.
True, but with the Wikidata Query Service we already have infrastructure
that completes millions of search requests of this kind (involving path
queries), so that seems doable for Commons as well. WDQS already has
Wikimedia API bindings that allow it to use Lucene-based results in
addition, if needed (though this would only make sense if the search
should use some content that for some reason cannot be imported into a
query service as graph data, mostly free-text search over longer texts).
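
For readers who have not seen such a path query, the following is a rough sketch of how one might be assembled, written as a small Python helper. The helper name build_instances_query and the id Q12345 are placeholders chosen for this illustration (Q12345 is not meant to be the item for clarinet). The properties are the standard Wikidata ones: P31 is "instance of" and P279 is "subclass of"; the wd: and wdt: prefixes are assumed to be predeclared, as they are on WDQS.

```python
def build_instances_query(class_qid, limit=100):
    """Build a SPARQL query for instances of a class or any of its subclasses.

    class_qid is a Wikidata item id such as "Q12345" (placeholder value).
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    return (
        "SELECT ?item WHERE {\n"
        # P31 once, then zero or more P279 steps up to the requested class.
        f"  ?item wdt:P31/wdt:P279* wd:{class_qid} .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(build_instances_query("Q12345", limit=10))
```

The path expression follows the subclass hierarchy at query time, so the data only needs the most specific class on each item.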
I think the approach of completing tags towards the upper classes is not
a good idea in general, since it creates extra work for editors that
requires a million times the resources needed in the other approach: if
the subclass hierarchy is wrong, you only need to fix it once to improve
search for all existing Commons content; if you rely on manual extra
tags, you'd have to add them to every file on Commons and keep them
up to date with changes in the concepts -- an enormous, redundant effort
that will invariably lead to a very non-uniform search experience across
otherwise similar media. This seems like a huge waste of editors' time
even if it would work (i.e., if we lived in a world where the
superclasses of a class would be easy to understand and closely related
to the topic that an editor is working on -- which will never happen for
Wikidata or Commons, since both cover such a breadth of topics that
their upper ontology necessarily has to be very general even if modelled
in a clean and fully correct way).
There are still problems (such as the biological taxonomy being
modelled as a hierarchy of names rather than animal classes, placing
dog far away from mammal), but it is still always much easier to come
up with a sane organisation for the *sub*classes of a concrete class.
For what it's worth, there's currently quite a lively discussion on
Project Chat about issues with the current modelling of biological taxonomy.
People on this thread might like to comment on some of the less
fortunate elements of current practice, and the appropriateness of some
of the thoughts that have been suggested.
But the taxo project has become such a walled garden, answerable only to
itself, that people with comments may need to be quite forceful to get
their message through, if we are to deal, e.g., with some of the
difficulties Cparle describes in the ticket at