@Markus, James:
In my opinion it is better to make the query ask for the most recent
population number. People just need to start using time-qualifiers for
things like census-report numbers.
Unfortunately, this is not sufficient for census number selections,
since the most recent number might be less accurate than another
somewhat-recent number, which is therefore considered "preferred". I
have no idea how to come up with a reasonable SPARQL query to evaluate
this situation.
Similarly, ignoring the historic instance-of statements when other
statements have no times associated at all, and picking the most recent
instance-of statement when all of them have times associated, would
require an amount of computation that you really don't want to encode
in SPARQL. Feel free to prove me wrong by posting the
SPARQL query here, but I think it won't be feasible. SPARQL is not a
programming language to implement arbitrarily complex selection rules
in. The current rank-based system, in spite of its necessary
limitations, is in fact highly effective for solving a huge number of
such issues in a pragmatic way. You may need to use the exact data for
many applications (we completely agree there), but ranks will always be
of great use to keep the rest of your query as simple as possible.
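To make the two-case rule above concrete, here is a minimal sketch in
Python rather than SPARQL. It is one possible reading of the rule, not an
agreed algorithm: the InstanceOf record, its start/end fields (stand-ins
for time qualifiers) and the sample years are all invented for
illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceOf:
    cls: str
    start: Optional[int] = None  # hypothetical start-time qualifier (year)
    end: Optional[int] = None    # hypothetical end-time qualifier (year)


def select_current(statements: List[InstanceOf]) -> List[InstanceOf]:
    """Apply the two-case selection rule to a list of statements."""
    if statements and all(s.start is not None for s in statements):
        # Every statement is dated: keep only the most recently started one(s).
        latest = max(s.start for s in statements)
        return [s for s in statements if s.start == latest]
    # Some statements carry no start time: only drop the historic ones,
    # i.e. those that have an end time.
    return [s for s in statements if s.end is None]


# Invented sample data: one historic class next to two undated ones.
mixed = [
    InstanceOf("former status", start=1400, end=1800),
    InstanceOf("city"),
    InstanceOf("big city"),
]
# Invented sample data: every statement is dated.
all_dated = [
    InstanceOf("town", start=1800, end=1900),
    InstanceOf("city", start=1900),
]

print([s.cls for s in select_current(mixed)])
print([s.cls for s in select_current(all_dated)])
```

The branch on "are all statements dated?" is what makes this awkward to
express declaratively; with ranks, the same decision is recorded once in
the data instead of being recomputed in every query.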
And the other issue is one of standardized vocabulary, and that is
always a sourcing problem in my opinion. A query could say "get the
instance-of statement that has a supporting source from the Spanish
Geographic Society". Then the infobox would only include standardized
vocabulary by that organization. But I acknowledge that large parts of
the world are not covered by standardized vocabulary organizations.
Yes, it seems we need to let the use of references evolve a little more
until such things will be feasible and lead to good coverage.
If that doesn't solve it, we could at least think about
language-specific rank overrides.
Storing ranks per language will not be feasible or desirable. I think
the solutions I gave can go a long way. In the end, any
language-specific way to define the classes you want to display/hide
will do. For example, a SPARQL query for all super classes that have an
article in a given Wikipedia is rather easy (querying for the most
specific such superclasses is another matter of course ...).
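As a rough illustration of that last point, the following Python sketch
works on a toy in-memory class hierarchy instead of a live SPARQL
endpoint. The subclass_of mapping and the has_article set are invented
sample data, and the function names are hypothetical. Collecting all
superclasses with an article is a plain graph traversal; picking the most
specific ones needs an extra pairwise check.

```python
from typing import Dict, List, Set


def superclasses(cls: str, subclass_of: Dict[str, List[str]]) -> Set[str]:
    """All classes reachable from cls via subclass-of, including cls itself."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(subclass_of.get(current, []))
    return seen


def displayable(cls: str, subclass_of: Dict[str, List[str]],
                has_article: Set[str]) -> Set[str]:
    """Superclasses of cls that have an article in the target wiki."""
    return {c for c in superclasses(cls, subclass_of) if c in has_article}


def most_specific(candidates: Set[str],
                  subclass_of: Dict[str, List[str]]) -> Set[str]:
    """Drop every candidate that is a proper superclass of another candidate."""
    return {
        c for c in candidates
        if not any(o != c and c in superclasses(o, subclass_of)
                   for o in candidates)
    }


# Toy hierarchy and article set, invented for this example.
subclass_of = {"big city": ["city"], "city": ["human settlement"]}
has_article = {"city", "human settlement"}

candidates = displayable("big city", subclass_of, has_article)
print(sorted(candidates))
print(sorted(most_specific(candidates, subclass_of)))
```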
Markus
2015-11-27 16:41 GMT+01:00 Markus Krötzsch <markus(a)semantic-mediawiki.org>:
Hi James,
I would immediately agree to the following measures to alleviate
your problem:
(1) If some instance-of statements are historic (i.e., no longer
valid), then one should make the current ones "preferred" and leave
the historic ones "normal", just like for, e.g., population numbers.
This would get rid of the rather inappropriate "Free imperial city"
label for Frankfurt.
(2) If some classes are redundant, they could be removed (e.g., if
we already have "Big city" we do not need "city"). However, the
community might decide to prefer the direct use of a main class
(such as "Human"), even if redundant.
The other issues you mention are more tricky. Especially issues of
translation/cultural specificity. The most specific classes are not
always the ones that all languages would want to see, e.g., if the
concept of the class is not known in that language.
Possible options for solving your problem:
* Make a whitelist of classes you want to show at all in the
template, and default to "city" if none of them occurs.
* Make a blacklist of classes you want to hide.
* Instead of blacklist or whitelist, show only classes that have a
Wikipedia page in your language; default to "city" if there are none.
* Try to generalise overly specific classes (change "big city" to
"city" etc.). I don't know if there is a good programmatic approach
for this, or if you would have to make a substitution list or
something, which would not be very maintainable.
* Do not use instance-of information like this in the infobox. It
might sound radical, but I am not sure if "instance of" is really
working very well for labelling things in the way you expect.
Instance-of can refer to many orthogonal properties of an object, in
essentially random order, while a label should probably focus on
certain aspects only.
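The whitelist, blacklist and default options listed above could be
sketched on the template side roughly as follows. This is a minimal
Python illustration only: infobox_labels is a hypothetical helper, not
an existing template or Lua function, and the class list is sample data.

```python
from typing import Iterable, List, Optional, Set


def infobox_labels(classes: Iterable[str],
                   whitelist: Optional[Set[str]] = None,
                   blacklist: Set[str] = frozenset(),
                   default: str = "city") -> List[str]:
    """Filter class labels for display, falling back to a default label."""
    shown = [
        c for c in classes
        if c not in blacklist and (whitelist is None or c in whitelist)
    ]
    # If nothing survives the filters, show the generic default instead.
    return shown or [default]


# Sample class labels for one item (illustrative, not the real statement list).
classes = ["free imperial city", "big city", "city"]

print(infobox_labels(classes, blacklist={"free imperial city"}))
print(infobox_labels(classes, whitelist={"city"}))
print(infobox_labels(["free imperial city"], whitelist={"town"}))
```

Because the lists live with each wiki's template, every language can
keep its own preferences without touching statement ranks.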
For obvious reasons, ranks of statements cannot be used to record
language-specific preferences.
Cheers,
Markus
On 27.11.2015 15:58, James Heald wrote:
Some items have quite a lot of "instance of" statements, connecting
them to quite a few different classes.
For example, Frankfurt is currently an instance of seven different classes,
https://www.wikidata.org/wiki/Q1794
and Glasgow is currently an instance of five different classes:
https://www.wikidata.org/wiki/Q4093
This can produce quite a pile-up of descriptions in the
description/subtitle section of an infobox -- for example, as on the
Spanish page for Frankfurt at
https://es.wikipedia.org/wiki/Fr%C3%A1ncfort_del_Meno
in the section between the infobox title and the picture.
Question:
Is it an appropriate use of ranking, to choose a few of the values to
display, and set those values to be "preferred rank" ?
It would be useful to have wider input, as to whether it is a good
thing for this to be done widely.
Discussions are open at
https://www.wikidata.org/wiki/Wikidata:Project_chat#Preferred_and_normal_ra…
and
https://www.wikidata.org/wiki/Wikidata:Bistro#Rang_pr.C3.A9f.C3.A9r.C3.A9
-- but these have so far been inconclusive, and have got slightly taken
over by questions such as
* how well terms really do map from one language to another --
near-equivalences that may be near enough for sitelinks may be jarring
or insufficient when presented boldly up-front in an infobox.
(For example, the French translation "ville" is rather unspecific, and
perhaps inadequate in what it conveys, compared to "city" in English or
"ciudad" in Spanish; "town" in English (which might have over 100,000
inhabitants) doesn't necessarily match "bourg" in French or
"Kleinstadt" in German).
* whether different-language wikis may seek different degrees of
generalisation or specificity in such sub-title areas, depending on how
"close" the subject is to that wiki.
(For readers in some languages, some fine distinctions may be highly
relevant and familiar, whereas for other language groups that level of
detail may be undesirably obscure).
There is also the question of the effect of promoting some values to
"preferred rank" for the visibility of other values in SPARQL -- in
particular when queries are written assuming they can get away with
using just the simple "truthy" wdt:... form of properties.
However, making e.g. the value "city" preferred for Glasgow means that
it will no longer be returned in searches for its other values, if these
have been written using "wdt:..." -- so it will now be missed in a
simple-level query for "council areas", the current top-level
administrative subdivisions of Scotland, or for historically-based
"registration counties" -- and this problem will become more pronounced
if the practice becomes more widespread of making some values
"preferred" (and so other values invisible, at least for queries using
wdt:...).
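The effect described here can be reproduced with a small Python model
of the "truthy" rule, under which only statements at the best
non-deprecated rank are exposed. This is a sketch of the behaviour, not
the query service's implementation; the function names are hypothetical
and the item data is a toy example rather than Glasgow's actual
statement list.

```python
RANK_ORDER = {"preferred": 2, "normal": 1}  # deprecated is never truthy


def truthy_values(statements):
    """Values at the best non-deprecated rank, mimicking what wdt: exposes."""
    usable = [(value, rank) for value, rank in statements if rank in RANK_ORDER]
    if not usable:
        return []
    top = max(RANK_ORDER[rank] for _, rank in usable)
    return [value for value, rank in usable if RANK_ORDER[rank] == top]


def items_with_class(items, cls):
    """Simulate a simple wdt:-style lookup of items having class cls."""
    return sorted(
        name for name, statements in items.items()
        if cls in truthy_values(statements)
    )


# Toy data: the same item before and after promoting "city" to preferred.
before = {"Glasgow": [("city", "normal"), ("council area", "normal")]}
after = {"Glasgow": [("city", "preferred"), ("council area", "normal")]}

print(items_with_class(before, "council area"))
print(items_with_class(after, "council area"))
```

In the "after" case the item no longer shows up for "council area",
although that statement is still present and still true; a query that
reads full statements instead of the truthy shortcut would still find it.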
From a SPARQL point of view, what would actually be very helpful would
be to add a (new) fourth rank -- "misleading without qualifier", below
"normal" but above "deprecated" -- for statements that *are* true (with
the qualifiers), but could be misleading without them
* for example, for a town that was the county town of a shire once, but
hasn't been for two centuries
* or for an administrative area that is partly located in one
higher-level division, and partly in another -- this is very valuable
information to be able to note, but it's important to be able to exclude
it from being all included in a recursive search for the places in one
(but not the other) of that higher-level division.
The statements shouldn't be marked "deprecated", because they are true
(unlike a widely-given but incorrect date of birth, for example). At
the moment one can sort of work round the issue, if one can find another
statement to make "preferred", so that the qualified statement becomes
invisible to a simple search without qualifiers. However, if
"preferred" status is going to be used just to select things to show in
infoboxes, it becomes very desirable that "wdt:..." searches should
retrieve things at normal rank as well -- creating a need for a new rank
for statements which are true, but misleading if read without
qualifiers.
What *is* needed though, is a view on whether trying to tailor what is
shown in infoboxes is an appropriate reason to alter statement
rankings.
It would be good to get a view on this.
The Spanish guys who started doing this have temporarily put further
rank-changes on hold, for the issue to be discussed; but so far what
they have done has only just scratched the surface of what could be done
-- there are still a lot more cases of multiple values they would like
to tidy.
So: is this the kind of thing that "preferred rank" is envisaged for ?
Or, should some statements not be marked as less preferred than others,
if this is the only reason ?
-- James.
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata