Re: [Wikidata] Preferred rank -- choices for infoboxes, versus SPARQL

27 Nov 2015

On 27.11.2015 17:05, Tobias Schönberg wrote:
...
  @Markus, James:
 In my opinion it is better to make the query ask for the most recent
 population number. People just need to start using time-qualifiers for
 things like census-report numbers. 
Unfortunately, this is not sufficient for census number selections, 
since the most recent number might be less accurate than another 
somewhat-recent number, which is therefore considered "preferred". I 
have no idea how to come up with a reasonable SPARQL query to evaluate 
this situation.

Similarly, take the rule "ignore the historic instance-of statements if
other statements have no times associated whatsoever, and pick the most
recent instance-of statement if all of them have times associated".
Implementing it would require an amount of computation that you really
don't want to encode in SPARQL. Feel free to prove me wrong by posting the
SPARQL query here, but I think it won't be feasible. SPARQL is not a 
programming language to implement arbitrarily complex selection rules 
in. The current rank-based system, in spite of its necessary 
limitations, is in fact highly effective for solving a huge number of 
such issues in a pragmatic way. You may need to use the exact data for 
many applications (we completely agree there), but ranks will always be 
of great use to keep the rest of your query as simple as possible.
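
For illustration only, the instance-of selection rule described above is easy to state in a general-purpose language. The following is a minimal Python sketch, not anything from the Wikidata tooling: the `Statement` record is a hypothetical simplification, "historic" is assumed to mean "has an end-time qualifier", and "most recent" is assumed to mean "latest start time".

```python
from datetime import date
from typing import Optional, TypedDict


class Statement(TypedDict):
    # Hypothetical, simplified record; not the real Wikidata JSON model.
    value: str
    start: Optional[date]  # start-time qualifier, if any
    end: Optional[date]    # end-time qualifier, if any


def select_current(statements: list[Statement]) -> list[Statement]:
    """Apply the two-branch selection rule to one item's statements."""
    if not statements:
        return []
    all_timed = all(
        s["start"] is not None or s["end"] is not None for s in statements
    )
    if all_timed:
        # Every statement carries a time: keep only the most recent one(s).
        latest = max(s["start"] or date.min for s in statements)
        return [s for s in statements if (s["start"] or date.min) == latest]
    # Some statements are untimed: only drop those that have already ended.
    return [s for s in statements if s["end"] is None]
```

The branching on "do all statements have times?" is the part that is awkward to express declaratively; with ranks, an editor records the outcome once and every consumer reads it back.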

...

 And the other issue is one of standardized vocabulary and that is always
 a sourcing problem in my opinion. A query could say "get the
 instance-of-statement" that has a supporting source from the Spanish
 Geographic Society. Then the infobox would only include standardized
 vocabulary by that organization. But I acknowledge that large parts of
 the world are not covered by standardized vocabulary organizations. 
Yes, it seems we need to let the use of references evolve a little more 
until such things will be feasible and lead to good coverage.

...

 If that doesn't solve it we could at least think about language specific
 rank-overrides. 
Storing ranks per language will not be feasible or desirable. I think 
the solutions I gave can go a long way. In the end, any 
language-specific way to define the classes you want to display/hide 
will do. For example, a SPARQL query for all superclasses that have an 
article in a given Wikipedia is rather easy (querying for the most 
specific such superclasses is another matter of course ...).
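
As a rough sketch of the "easy" half of this, the Python helper below builds such a query as a string. The function name is hypothetical, and the query assumes the prefixes predefined by the Wikidata Query Service (wd:, wdt:, schema:), with P31 for "instance of" and P279 for "subclass of". It does not attempt the harder "most specific superclass" selection.

```python
def superclasses_with_article_query(item_id: str, wiki_url: str) -> str:
    """Build a SPARQL query string (hypothetical helper, for illustration)."""
    return f"""
SELECT DISTINCT ?class ?article WHERE {{
  wd:{item_id} wdt:P31/wdt:P279* ?class .   # classes and all their superclasses
  ?article schema:about ?class ;            # a page about that class ...
           schema:isPartOf <{wiki_url}> .   # ... on the given Wikipedia
}}
"""


# Frankfurt (Q1794), restricted to classes with a Spanish Wikipedia article.
print(superclasses_with_article_query("Q1794", "https://es.wikipedia.org/"))
```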

Markus

...
  2015-11-27 16:41 GMT+01:00 Markus Krötzsch
 <markus(a)semantic-mediawiki.org>:

     Hi James,

     I would immediately agree to the following measures to alleviate
     your problem:

     (1) If some instance-of statements are historic (i.e., no longer
     valid), then one should make the current ones "preferred" and leave
     the historic ones "normal", just like for, e.g., population numbers.
     This would get rid of the rather inappropriate "Free imperial city"
     label for Frankfurt.

     (2) If some classes are redundant, they could be removed (e.g., if
     we already have "Big city" we do not need "city"). However, the
     community might decide to prefer the direct use of a main class
     (such as "Human"), even if redundant.

     The other issues you mention are more tricky. Especially issues of
     translation/cultural specificity. The most specific classes are not
     always the ones that all languages would want to see, e.g., if the
     concept of the class is not known in that language.

     Possible options for solving your problem:

     * Make a whitelist of classes you want to show at all in the
     template, and default to "city" if none of them occurs.
     * Make a blacklist of classes you want to hide.
     * Instead of blacklist or whitelist, show only classes that have a
     Wikipedia page in your language; default to "city" if there are none.
     * Try to generalise overly specific classes (change "big city" to
     "city" etc.). I don't know if there is a good programmatic approach
     for this, or if you would have to make a substitution list or
     something, which would not be very maintainable.
     * Do not use instance-of information like this in the infobox. It
     might sound radical, but I am not sure if "instance of" is really
     working very well for labelling things in the way you expect.
     Instance-of can refer to many orthogonal properties of an object, in
     essentially random order, while a label should probably focus on
     certain aspects only.
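
The whitelist and blacklist options in the list above could be sketched on the template side roughly as follows. This is a minimal Python illustration with hypothetical names; applying the "city" default in the blacklist case as well is an assumption, not something stated in the list.

```python
def pick_labels(classes, whitelist=None, blacklist=frozenset(), default="city"):
    """Filter instance-of class labels for display in an infobox.

    whitelist=None means "no whitelist"; an empty result falls back to
    the default label (assumed here for both filter styles).
    """
    shown = [
        c for c in classes
        if c not in blacklist and (whitelist is None or c in whitelist)
    ]
    return shown or [default]


# Whitelist: only show classes the local wiki has chosen to display.
print(pick_labels(["free imperial city", "big city"],
                  whitelist={"big city", "city"}))
# Blacklist: hide classes that are unwanted in this language.
print(pick_labels(["free imperial city", "big city"],
                  blacklist={"free imperial city"}))
```

Either list would live with the local template, so each language can maintain its own without touching statement ranks.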

     For obvious reasons, ranks of statements cannot be used to record
     language-specific preferences.

     Cheers,

     Markus

     On 27.11.2015 15:58, James Heald wrote:

         Some items have quite a lot of "instance of" statements,
         connecting them
         to quite a few different classes.

         For example, Frankfurt is currently an instance of seven
         different classes,
         https://www.wikidata.org/wiki/Q1794

         and Glasgow is currently an instance of five different classes:
         https://www.wikidata.org/wiki/Q4093

         This can produce quite a pile-up of descriptions in the
         description/subtitle section of an infobox -- for example, as on the
         Spanish page for Frankfurt at
         https://es.wikipedia.org/wiki/Fr%C3%A1ncfort_del_Meno
         in the section between the infobox title and the picture.

         Question:

         Is it an appropriate use of ranking, to choose a few of the
         values to
         display, and set those values to be "preferred rank" ?

         It would be useful to have wider input on whether it is a good
         thing for this to be done widely.

         Discussions are open at
         https://www.wikidata.org/wiki/Wikidata:Project_chat#Preferred_and_normal_ra…

         and
         https://www.wikidata.org/wiki/Wikidata:Bistro#Rang_pr.C3.A9f.C3.A9r.C3.A9

         -- but these have so far been inconclusive, and have got
         slightly taken
         over by questions such as

         * how well terms really do map from one language to another --
         near-equivalences that may be near enough for sitelinks may be
         jarring
         or insufficient when presented boldly up-front in an infobox.

         (For example, the French translation "ville" is rather
         unspecific, and perhaps inadequate in what it conveys, compared
         to "city" in English or "ciudad" in Spanish; "town" in English
         (which might have over 100,000 inhabitants) doesn't necessarily
         match "bourg" in French or "Kleinstadt" in German).

         * whether different-language wikis may seek different degrees of
         generalisation or specificity in such sub-title areas, depending
         on how
         "close" the subject is to that wiki.

         (For readers in some languages, some fine distinctions may be highly
         relevant and familiar, whereas for other language groups that
         level of
         detail may be undesirably obscure).

         There is also the question of the effect of promoting some values to
         "preferred rank" for the visibility of other values in SPARQL -- in
         particular when queries are written assuming they can get away
         with using just the simple "truthy" wdt:... form of properties.

         However, making eg the value "city" preferred for Glasgow means
         that it
         will no longer be returned in searches for its other values, if
         these
         have been written using "wdt:..." -- so it will now be missed in a
         simple-level query for "council areas", the current top-level
         administrative subdivisions of Scotland, or for historically-based
         "registration counties" -- and this problem will become more
         pronounced
         if the practice becomes more widespread of making some values
         "preferred" (and so other values invisible, at least for queries
         using
         wdt:...).
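
The effect described above follows from how "truthy" values are chosen: if any statement for a property has preferred rank, only the preferred ones are exposed via wdt:; otherwise all normal-rank ones are; deprecated ones never are. A small Python sketch of that rule, with illustrative ranks for the Glasgow example (the tuple representation is hypothetical):

```python
def truthy(statements):
    """Return the values a simple wdt: query would see for one property.

    `statements` is a list of (value, rank) pairs, where rank is one of
    "preferred", "normal" or "deprecated".
    """
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        return preferred
    return [value for value, rank in statements if rank == "normal"]


# Illustrative ranks only: "city" promoted, the other classes left at normal.
glasgow = [
    ("city", "preferred"),
    ("council area", "normal"),
    ("registration county", "normal"),
]
print(truthy(glasgow))  # ['city']
```

With these ranks a wdt: search for council areas no longer finds the item, even though the statement is still present and true.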

         From a SPARQL point of view, what would actually be very helpful
         would be to add a (new) fourth rank -- "misleading without
         qualifier", below "normal" but above "deprecated" -- for
         statements that *are* true (with the qualifiers), but could be
         misleading without them:
         * for example, for a town that was the county town of a shire
         once, but
         hasn't been for two centuries
         * or for an administrative area that is partly located in one
         higher-level division, and partly in another -- this is very
         valuable
         information to be able to note, but it's important to be able to
         exclude
         it from being all included in a recursive search for the places
         in one
         (but not the other) of that higher-level division.

         The statements shouldn't be marked "deprecated", because they
         are true
         (unlike a widely-given but incorrect date of birth, for
         example).  At
         the moment one can sort of work round the issue, if one can find
         another
         statement to make "preferred", so that the qualified statement
         becomes
         invisible to a simple search without qualifiers.  However, if
         "preferred" status is going to be used just to select things to
         show in
         infoboxes, it becomes very desirable that "wdt:..." searches should
         retrieve things at normal rank as well -- creating a need for a
         new rank
         for statements which are true, but misleading if read without
         qualifiers.

         What *is* needed though, is a view on whether trying to tailor
         what is
         shown in infoboxes is an appropriate reason to alter statement
         rankings.

         It would be good to get a view on this.

         The Spanish guys who started doing this have temporarily put further
         rank-changes on hold, for the issue to be discussed; but so far what
         they have done has only just scratched the surface of what could
         be done
         -- there are still a lot more cases of multiple values they
         would like
         to tidy.

         So: is this the kind of thing that "preferred rank" is envisaged
         for ?

         Or, should some statements not be marked as less preferred than
         others,
         if this is the only reason ?

              --  James.

         _______________________________________________
         Wikidata mailing list
         Wikidata(a)lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
         https://lists.wikimedia.org/mailman/listinfo/wikidata

