TL;DR: it would be useful, but extremely hard to create rules for every domain.
4. How to calculate and represent them?
imho: it depends on the data domain.
For geodata ( human settlements/rivers/mountains/... ) with GPS coordinates, my simple rules are:
- if it has a "local" Wikipedia page or a page on any big-language ["EN/FR/PT/ES/RU/.."] Wikipedia, then it is OK.
- if it is only in "cebuano" AND outside of the "cebuano BBOX" -> then this is lower quality
- only {shwiki+srwiki} AND outside of the "sh" & "sr" BBOX -> this is lower quality
- only {huwiki} AND outside of the CentralEuropeBBOX -> this is lower quality
- geodata without GPS coordinates -> ...
- ....
So my rules are based on Wikipedia pages and language areas ... and I prefer
Wikidata items with local Wikipedia pages (see the rough sketch below).
This is based on my experience adding Wikidata ID concordances to NaturalEarth
( https://www.naturalearthdata.com/blog/ ).
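To make the BBOX idea concrete, here is a rough Python sketch of how such a check could look. The bounding boxes and the "big wiki" list are only illustrative guesses, not authoritative values:

    # Minimal sketch of the sitelink/BBOX heuristic described above.
    # BIG_WIKIS and the bounding boxes are illustrative assumptions only.
    BIG_WIKIS = {"enwiki", "frwiki", "ptwiki", "eswiki", "ruwiki", "dewiki"}

    # (min_lon, min_lat, max_lon, max_lat) -- rough, made-up extents
    BBOX = {
        "cebwiki": (116.0, 4.0, 127.0, 21.0),   # roughly the Philippines
        "shwiki":  (13.0, 41.0, 24.0, 47.0),    # roughly former Yugoslavia
        "srwiki":  (13.0, 41.0, 24.0, 47.0),
        "huwiki":  (9.0, 44.0, 27.0, 51.0),     # roughly Central Europe
    }

    def inside(bbox, lon, lat):
        min_lon, min_lat, max_lon, max_lat = bbox
        return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

    def geo_quality(sitelinks, lon=None, lat=None):
        """Classify a geo item from its set of sitelinks and its coordinates."""
        if lon is None or lat is None:
            return "needs review: no GPS coordinate"
        if sitelinks & BIG_WIKIS:
            return "OK: has a big-language Wikipedia page"
        if sitelinks and sitelinks <= {"cebwiki"} and not inside(BBOX["cebwiki"], lon, lat):
            return "lower quality: cebwiki-only outside its BBOX"
        if sitelinks and sitelinks <= {"shwiki", "srwiki"} and not (
            inside(BBOX["shwiki"], lon, lat) or inside(BBOX["srwiki"], lon, lat)
        ):
            return "lower quality: sh/sr-only outside their BBOX"
        if sitelinks == {"huwiki"} and not inside(BBOX["huwiki"], lon, lat):
            return "lower quality: huwiki-only outside Central Europe"
        return "OK"

For example, geo_quality({"cebwiki"}, lon=2.35, lat=48.85) would be flagged as lower quality.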
5. Which is the most suitable way to further discuss and implement this idea?
imho: load the Wikidata dump into a local database, and create
- some "proof of concept" data-quality indicators,
- some "meta" rules,
- some "real" statistics,
so the community can decide whether it is useful or not.
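A first "proof of concept" indicator could even be computed while streaming the JSON dump, before loading anything into a database. A minimal sketch, assuming the standard latest-all.json.bz2 dump (one entity per line, wrapped in "[" ... "]"), using the share of referenced statements as the indicator:

    # Stream the Wikidata JSON dump and print one simple indicator per item.
    import bz2, json

    def iter_entities(path):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip().rstrip(",")
                if not line or line in ("[", "]"):
                    continue
                yield json.loads(line)

    def referenced_share(entity):
        statements = [s for claims in entity.get("claims", {}).values() for s in claims]
        if not statements:
            return 0.0
        referenced = sum(1 for s in statements if s.get("references"))
        return referenced / len(statements)

    for e in iter_entities("latest-all.json.bz2"):
        if e.get("type") == "item":
            print(e["id"], round(referenced_share(e), 2))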
Imre
Uwe Jung <jung.uwe(a)gmail.com> wrote (on 2019. Aug 24., Sat, 14:55):
Hello,
As the importance of Wikidata increases, so do the demands on the quality
of the data. I would like to put the following proposal up for discussion.
Two basic ideas:
1. Each Wikidata page (item) is scored after each edit. This score
should express different dimensions of data quality in a quickly graspable
way.
2. A property is created via which the item refers to the score value.
Certain qualifiers can be used for a more detailed description (e.g. time
of calculation, algorithm used to calculate the score value, etc.).
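For illustration, such a statement could look roughly like the following; all property IDs here are invented placeholders, since no such properties exist yet:

    # Hypothetical shape of a "quality score" statement with qualifiers.
    # P9999 (score), P9998 (time of calculation) and P9997 (algorithm)
    # are placeholders, not real Wikidata properties.
    score_statement = {
        "property": "P9999",          # hypothetical: "quality score"
        "value": 0.78,                # e.g. a normalized 0..1 score
        "qualifiers": {
            "P9998": "2019-08-24T00:00:00Z",   # hypothetical: time of calculation
            "P9997": "item-quality-model-v1",  # hypothetical: algorithm used
        },
    }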
The score value can be calculated either within Wikibase after each data
change or "externally" by a bot. Among other things, the calculation could
use: the number of constraints, the completeness of references, the degree of
completeness in relation to the underlying ontology, etc. There are already
some interesting discussions on the question of data quality which can be
used here (see
https://www.wikidata.org/wiki/Wikidata:Item_quality;
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.).
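A very naive version of such an "external" bot-side calculation could combine two of these indicators with arbitrary weights; a sketch only, not a finished scoring model:

    # Fetch one item over the public wbgetentities API and combine
    # reference completeness with statement richness. Weights are arbitrary.
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def fetch_item(qid):
        r = requests.get(API, params={
            "action": "wbgetentities", "ids": qid, "format": "json"})
        r.raise_for_status()
        return r.json()["entities"][qid]

    def score(entity):
        statements = [s for claims in entity.get("claims", {}).values() for s in claims]
        if not statements:
            return 0.0
        ref_share = sum(1 for s in statements if s.get("references")) / len(statements)
        richness = min(len(statements) / 50.0, 1.0)   # arbitrary cap at 50 statements
        return round(0.7 * ref_share + 0.3 * richness, 2)

    print(score(fetch_item("Q42")))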
Advantages:
- Users get a quick overview of the quality of a page (item).
- SPARQL can be used to query only those items that meet a certain
quality level.
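For example, if the score were stored via a (still hypothetical) property such as P9999, a query like the following could filter on a minimum score:

    # Sketch of the SPARQL advantage, run against the public WDQS endpoint.
    # wdt:P9999 is the hypothetical "quality score" property from above.
    import requests

    QUERY = """
    SELECT ?item ?score WHERE {
      ?item wdt:P9999 ?score .
      FILTER(?score >= 0.8)
    } LIMIT 100
    """

    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"})
    for b in r.json()["results"]["bindings"]:
        print(b["item"]["value"], b["score"]["value"])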
- The idea would probably be relatively easy to implement.
Disadvantages:
- In a way, the data model is abused by generating statements that no
longer describe the item itself, but make statements about the
representation of this item in Wikidata.
- Additional computing power must be provided for the regular
calculation of all changed items.
- The score only refers to the quality of a page. If the quality is
insufficient, the changes still have to be made manually.
I would now be interested in the following:
1. Is this idea suitable to effectively help solve existing quality
problems?
2. Which quality dimensions should the score value represent?
3. Which quality dimension can be calculated with reasonable effort?
4. How to calculate and represent them?
5. Which is the most suitable way to further discuss and implement
this idea?
Many thanks in advance.
Uwe Jung (UJung <https://www.wikidata.org/wiki/User:UJung>)
www.archivfuehrer-kolonialzeit.de/thesaurus