Proposal for the introduction of a practicable Data Quality Indicator in Wikidata - Wikidata

24 Aug 2019

Hello,

As the importance of Wikidata increases, so do the demands on the quality
of the data. I would like to put the following proposal up for discussion.

Two basic ideas:

   1. Each Wikidata page (item) is scored after each edit. This score
   should express different dimensions of data quality in a form that can
   be grasped quickly.
   2. A property is created through which the item links to its score
   value. Qualifiers can describe the score in more detail (e.g. time of
   calculation, algorithm used to calculate the score value, etc.).
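
To make the second idea more concrete, here is a minimal Python sketch of
what such a score statement could carry. It is not the Wikibase JSON
format. The property ID P9999 is purely hypothetical (no score property
exists), and the class and function names are invented for illustration.
P585 (point in time) and P459 (determination method) are existing
properties that might serve as qualifiers, but whether they fit is open.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

SCORE_PROPERTY = "P9999"  # hypothetical: no quality-score property exists


@dataclass
class ScoreStatement:
    """Simplified stand-in for a statement with qualifiers (not Wikibase JSON)."""
    item: str                      # e.g. "Q42"
    value: float                   # score in [0, 1]
    prop: str = SCORE_PROPERTY     # property linking the item to its score
    qualifiers: dict = field(default_factory=dict)


def make_score_statement(item, value, algorithm, computed_at=None):
    """Build a score statement with time and algorithm recorded as qualifiers."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    computed_at = computed_at or datetime.now(timezone.utc)
    return ScoreStatement(
        item=item,
        value=value,
        qualifiers={
            "P585": computed_at.isoformat(),  # point in time: when it was calculated
            "P459": algorithm,                # determination method: which algorithm
        },
    )
```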

The score value can be calculated either within Wikibase after each data
change or "externally" by a bot. Possible inputs for the calculation
include the number of constraints, the completeness of references, the
degree of completeness in relation to the underlying ontology, etc. There
are already some interesting discussions on the question of data quality
which can be used here (see https://www.wikidata.org/wiki/Wikidata:Item_quality;
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.).
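
As a rough illustration of such a calculation, the following Python sketch
combines three of the inputs named above into a single value between 0 and
1. Everything in it is an assumption made for the example: the field names,
the reading of "constraints" as the share of statements without a
constraint violation, and the weights. It is one possible scoring scheme,
not an agreed one.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    """Hypothetical per-item counts; none of these names come from Wikibase."""
    statements: int             # total statements on the item
    referenced_statements: int  # statements with at least one reference
    constraint_violations: int  # statements violating a property constraint
    expected_properties: int    # properties the ontology expects for the item
    present_properties: int     # expected properties actually present


def ratio(part: int, whole: int) -> float:
    """Return part / whole clamped to [0, 1]; 0.0 if whole is not positive."""
    if whole <= 0:
        return 0.0
    return max(0.0, min(1.0, part / whole))


def quality_score(stats: ItemStats, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted mean of three quality dimensions, each in [0, 1]."""
    references = ratio(stats.referenced_statements, stats.statements)
    # An item without statements has no violations, so this dimension is 1.0.
    constraints = 1.0 - ratio(stats.constraint_violations, stats.statements)
    completeness = ratio(stats.present_properties, stats.expected_properties)
    w_ref, w_con, w_comp = weights
    total = w_ref + w_con + w_comp
    return (w_ref * references + w_con * constraints + w_comp * completeness) / total
```

For example, an item with 10 statements, 5 of them referenced, 2 constraint
violations, and 6 of 8 expected properties present would score about 0.665
with these weights.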

Advantages

   - Users get a quick overview of the quality of a page (item).
   - SPARQL can be used to query only those items that meet a certain
   quality level.
   - The idea would probably be relatively easy to implement.
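
The SPARQL advantage could look roughly like the sketch below, which builds
a query for the Wikidata Query Service in Python. P31 (instance of) and the
wd:/wdt: prefixes are real. The score property P9999 is hypothetical, as is
the assumption that it holds a plain number between 0 and 1.

```python
import re

SCORE_PROPERTY = "P9999"  # hypothetical: no quality-score property exists


def build_quality_query(class_qid: str, min_score: float, limit: int = 100) -> str:
    """Return a SPARQL query for instances of class_qid with score >= min_score."""
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError("class_qid must look like 'Q5'")
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must lie in [0, 1]")
    return (
        "SELECT ?item ?score WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} ;\n"
        f"        wdt:{SCORE_PROPERTY} ?score .\n"
        f"  FILTER(?score >= {min_score})\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Humans (Q5) whose hypothetical score is at least 0.8.
    print(build_quality_query("Q5", 0.8))
```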

Disadvantages

   - In a way, the data model is abused by generating statements that no
   longer describe the item itself, but make statements about the
   representation of this item in Wikidata.
   - Additional computing power must be provided for the regular
   calculation of all changed items.
   - The score only reports the quality of a page. If the quality is
   insufficient, the improvements still have to be made manually.

I would now be interested in the following:

   1. Is this idea suitable to effectively help solve existing quality
   problems?
   2. Which quality dimensions should the score value represent?
   3. Which quality dimensions can be calculated with reasonable effort?
   4. How should they be calculated and represented?
   5. Which is the most suitable way to further discuss and implement this
   idea?

Many thanks in advance.

Uwe Jung  (UJung <https://www.wikidata.org/wiki/User:UJung>)
www.archivfuehrer-kolonialzeit.de/thesaurus