Uwe, I feel this is more and more important: quality and provenance, and also
communicating the quality of our data inside Wikidata.
I have added maybe the best source for biographies in Sweden, P3217, to 7500
person items in Wikidata. Those 7500 items are used in more than 200 different
language versions of Wikipedia; we need a ”layer” explaining that data confirmed
with P3217 (”SBL from Sweden”) has very high trust.
See
https://phabricator.wikimedia.org/T222142
I can also see this quality problem in that Nobelprize.org and Wikidata have
more than 30 differences, and it is sometimes difficult to understand the quality
of the sources in Wikidata; plus, Nobelprize.org has no sources, which makes the
equation difficult.
https://phabricator.wikimedia.org/T200668
Regards
Magnus Sälgö
0046-705937579
Salgo60@msn.com
A blogpost I wrote
https://minancestry.blogspot.com/2018/04/wikidata-has-design-problem.html
On 28 Aug. 2019 at 03:49, Uwe Jung <jung.uwe@gmail.com> wrote:
Hello,
many thanks for the answers to my contribution from 24.8.
I think that all four opinions contain important things to consider.
@David Abián
I have read the article and agree that, in the end, users decide which data is
good for them.
@GerardM
It is true that in a possible implementation of the idea, the aspect of computing load
must be taken into account right from the beginning.
Please note that I have not given up on the idea yet. With regard to the
acceptance of Wikidata, I consider a quality indicator of some kind to be
absolutely necessary. There will be many ordinary users who would like to see
something like this.
At the same time I completely agree with David: (almost) every chosen indicator
is subject to a certain arbitrariness in its selection. There won't be one
easy-to-understand super-indicator.
So, let's approach things from the other side. Instead of a global indicator, a
separate indicator should be developed for each quality dimension to be considered. With
some dimensions this should be relatively easy. For others it could take years until we
have agreed on an algorithm for their calculation.
Furthermore, the indicators should not be discrete values but a continuum. No
traffic-light statements (i.e. good, medium, bad) should be made. Rather, when
displaying the indicators, each value could be put in relation to the values of
all other objects (e.g. the value x for the current data object relative to the
overall average of all objects for this indicator). The advantage is that the
overall average can increase over time, which means the relative position of an
individual object's value can also decrease over time.
Another advantage: users can define the required quality level themselves. If,
for example, you have high demands on accuracy but few demands on the
completeness of the statements, you can filter accordingly.
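A minimal sketch of how such population-relative indicators and user-defined
thresholds could work. Everything here is an illustrative assumption: the
dimension names, the numbers, and the function names are invented for the
example and correspond to nothing that exists in Wikidata today.

```python
# Sketch: per-dimension indicators expressed relative to the population
# average, plus user-defined quality thresholds. All names and values
# are hypothetical.

def relative_indicators(item_scores, population_averages):
    """Express each raw dimension score as a ratio to the population
    average, so an item's relative position can drop as the overall
    average rises over time."""
    return {
        dim: item_scores[dim] / population_averages[dim]
        for dim in item_scores
    }

def meets_requirements(indicators, thresholds):
    """User-defined quality level: every dimension the user requires must
    reach its threshold; dimensions the user does not list are ignored."""
    return all(indicators.get(dim, 0.0) >= t for dim, t in thresholds.items())

# Example: a user with high demands on accuracy but none on completeness.
item = {"accuracy": 0.9, "completeness": 0.3}
averages = {"accuracy": 0.6, "completeness": 0.6}
ind = relative_indicators(item, averages)  # accuracy 1.5x avg, completeness 0.5x
print(meets_requirements(ind, {"accuracy": 1.2}))  # True: completeness not required
```

The point of dividing by the population average is exactly the one made above:
the same raw score can sink below the user's bar later, simply because the
average of all items improved.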
However, it remains important that these indicators (i.e. the evaluation of the
individual item) are stored together with the item and can be queried together
with the data using SPARQL.
Greetings
Uwe Jung
On Sat., 24 Aug. 2019 at 13:54, Uwe Jung <jung.uwe@gmail.com> wrote:
Hello,
As the importance of Wikidata increases, so do the demands on the quality of the data. I
would like to put the following proposal up for discussion.
Two basic ideas:
1. Each Wikidata page (item) is scored after each edit. This score should
express different dimensions of data quality in a way that can be grasped
quickly.
2. A property is created via which the item refers to the score value. Certain
qualifiers can be used for a more detailed description (e.g. time of calculation,
algorithm used to calculate the score value, etc.).
The score can be calculated either within Wikibase after each data change or
"externally" by a bot. Among other things, the calculation could draw on the
number of constraints, the completeness of references, the degree of
completeness relative to the underlying ontology, etc. There are already some
interesting discussions on the question of data quality that can be drawn on
here (see
https://www.wikidata.org/wiki/Wikidata:Item_quality;
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc).
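As an illustration only, here is a toy scoring of a single item along two of the
dimensions just mentioned (referenced statements and constraints). The item
structure is a deliberate simplification invented for this sketch, not
Wikibase's real data model, and the scoring formulas are assumptions.

```python
# Toy sketch: score one item after an edit along two assumed quality
# dimensions. The dict-based "statement" shape is hypothetical.

def reference_completeness(statements):
    """Fraction of statements that carry at least one reference."""
    if not statements:
        return 0.0
    referenced = sum(1 for s in statements if s.get("references"))
    return referenced / len(statements)

def constraint_score(statements):
    """1.0 when no statement violates a constraint, decreasing linearly
    with the share of violating statements."""
    if not statements:
        return 0.0
    violations = sum(1 for s in statements if s.get("violates_constraint"))
    return 1.0 - violations / len(statements)

# A pretend item with two statements, only one of them referenced.
item = [
    {"property": "P3217", "references": ["SBL"], "violates_constraint": False},
    {"property": "P569", "references": [], "violates_constraint": False},
]
print(reference_completeness(item))  # 0.5
print(constraint_score(item))        # 1.0
```

A bot recomputing such per-dimension values after every edit would produce
exactly the kind of continuum discussed above, rather than a single opaque
super-indicator.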
Advantages
* Users get a quick overview of the quality of a page (item).
* SPARQL can be used to query only those items that meet a certain quality level.
* The idea would probably be relatively easy to implement.
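The SPARQL advantage could look roughly like the sketch below, which assembles
such a query as a string. Note that the quality-score property used here
(P9999) is entirely made up for the example; no such property exists on
Wikidata.

```python
# Hedged sketch: build a SPARQL query that keeps only items whose
# (hypothetical) quality-score value reaches a user-chosen threshold.
# "P9999" is an invented property ID, not a real Wikidata property.

def quality_filter_query(score_property="P9999", threshold=0.8):
    """Return a SPARQL query string filtering items by a minimum score."""
    return f"""
SELECT ?item ?score WHERE {{
  ?item wdt:P31 wd:Q5 .               # instances of human, as an example
  ?item wdt:{score_property} ?score .  # hypothetical quality-score property
  FILTER(?score >= {threshold})
}}
""".strip()

print(quality_filter_query())
```

This is the sense in which the score, once stored as an ordinary statement,
becomes queryable together with the data itself.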
Disadvantages:
* In a way, the data model is abused: statements are generated that no longer
describe the item itself but rather the representation of that item in Wikidata.
* Additional computing power must be provided for the regular recalculation of
all changed items.
* The score only points to the quality of a page; if the quality is
insufficient, corrections still have to be made manually.
I would now be interested in the following:
1. Is this idea suitable to effectively help solve existing quality problems?
2. Which quality dimensions should the score value represent?
3. Which quality dimensions can be calculated with reasonable effort?
4. How should they be calculated and represented?
5. Which is the most suitable way to further discuss and implement this idea?
Many thanks in advance.
Uwe Jung (UJung, https://www.wikidata.org/wiki/User:UJung)
www.archivfuehrer-kolonialzeit.de/thesaurus
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata