Hi David,

On 24.08.19 21:23, David Abián wrote:
Hi,

If we accept that the quality of the data is the "fitness for use",
which in my opinion is the best and most commonly used definition (as
stated in the article linked by Ettore), then it will never be possible
to define a number that objectively represents data quality. We can
define a number that is the result of an arbitrary weighted average of
different metrics related to various dimensions of quality arbitrarily
captured and transformed, and we can fool ourselves by saying that this
number represents data quality, but it will not, nor will it be an
approximation of what data quality means, nor will this number be able
to order Wikidata entities matching any common, understandable,
high-level criterion. The quality of the data depends on the use, it's
relative to each user, and can't be measured globally and objectively in
any way that is better than another.

True that, but there are these two aspects:

1. What you describe sounds more like an inherent problem of "measuring" quality in general: as soon as you measure it, it becomes a quantity. That is not specific to data. Still, measurement is a useful auxiliary construct, i.e. you measure something and then, in the interpretation, you judge the quality for better or for worse. There is some merit in quantification for data, e.g. [1] and [2].

[1] SHACL predecessor: http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf

[2] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
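
To give a flavour of what such a quantification can look like, here is a minimal sketch in the spirit of [1], not taken from the paper itself: a test case such as "every human should have a date of birth" becomes a SPARQL query whose result count is a measurable violation metric.

# Minimal sketch of a test-driven quality check in the spirit of [1]:
# a "test case" (humans should have a date of birth) is expressed as a
# SPARQL query whose result is a countable violation metric.
import requests

WDQS = "https://query.wikidata.org/sparql"

TEST_CASE = """
SELECT (COUNT(?item) AS ?violations) WHERE {
  { SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10000 }  # sample of humans, kept small
  FILTER NOT EXISTS { ?item wdt:P569 [] }                     # missing date of birth
}
"""

def count_violations(query):
    r = requests.get(WDQS,
                     params={"query": query},
                     headers={"Accept": "application/sparql-results+json",
                              "User-Agent": "quality-sketch/0.1"})
    r.raise_for_status()
    return int(r.json()["results"]["bindings"][0]["violations"]["value"])

print("violations in sample:", count_violations(TEST_CASE))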

2. Data is, in my opinion, not well understood yet. There is no good model for measuring its value once it becomes information.

Please see below:


As an alternative, however, I can suggest that you separately study some
quality dimensions assuming a particular use case for your study; this
will be correct, doable and greatly appreciated. :-) Please feel free to
ask for help in case you need it, either personally or via this list or
other means. And thanks for your interest in improving Wikidata!

We are studying this at the moment in GlobalFactSync: https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/SyncTargets

The next step here is to define 10 or more sync targets for Wikidata and then assess them as a baseline (a crude per-item check is sketched after the list below). Then we will extend the prototype so that these sync targets become:

- in sync with Wikipedia's infoboxes, i.e. all information, data as well as citations, has been transferred between the wiki pages

- near perfect, if we find a good reference source to integrate
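
As a very crude illustration of what a per-item baseline check could look like (the item, property and plain string match below are only illustrative assumptions, not the actual GlobalFactSync tooling): fetch a value from Wikidata and test whether it also appears in the corresponding Wikipedia article.

# Crude sketch of a per-item sync check: does a Wikidata value also appear
# in the corresponding Wikipedia article? Item (Q2079, Leipzig), property
# (P1082, population) and the plain string match are only illustrative;
# a real pipeline would parse the infobox properly.
import requests

def wikidata_value(qid, pid):
    data = requests.get(f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json").json()
    claims = data["entities"][qid]["claims"].get(pid, [])
    return claims[0]["mainsnak"]["datavalue"]["value"] if claims else None

def enwiki_wikitext(title):
    r = requests.get("https://en.wikipedia.org/w/api.php",
                     params={"action": "parse", "page": title,
                             "prop": "wikitext", "format": "json"})
    return r.json()["parse"]["wikitext"]["*"]

value = wikidata_value("Q2079", "P1082")            # quantity dict with an 'amount' string
amount = value["amount"].lstrip("+") if value else None
print("in sync (naively):", amount is not None and amount in enwiki_wikitext("Leipzig"))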

Does anyone have a suggestion, e.g. a certain part of Wikidata that could serve as a testbed and that we can improve together?

If so, please send it to us.

-- Sebastian



Regards,
David


On 8/24/19 13:54, Uwe Jung wrote:
Hello,

As the importance of Wikidata increases, so do the demands on the
quality of the data. I would like to put the following proposal up for
discussion.

Two basic ideas:

 1. Each Wikidata page (item) is scored after each edit. This score
    should express different dimensions of data quality in a way that is
    quick to grasp.
 2. A property is created via which the item refers to the score value.
    Certain qualifiers can be used for a more detailed description, e.g.
    the time of calculation, the algorithm used to calculate the score
    value, etc. (a rough sketch of such a statement follows below).
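
Roughly sketched, such a statement could be written with pywikibot as below; the score property ID is purely hypothetical, since no such property exists yet:

# Hypothetical sketch: attach a quality score to an item via a dedicated
# property, with "point in time" (P585) as a qualifier. "P9999999" is a
# placeholder; such a score property would first have to be created.
import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

item = pywikibot.ItemPage(repo, "Q42")

score_claim = pywikibot.Claim(repo, "P9999999")          # placeholder score property
score_claim.setTarget(pywikibot.WbQuantity(amount="0.87", site=repo))
item.addClaim(score_claim, summary="add quality score (sketch)")

calculated_at = pywikibot.Claim(repo, "P585")            # point in time
calculated_at.setTarget(pywikibot.WbTime(year=2019, month=8, day=25))
score_claim.addQualifier(calculated_at)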


The score value could be calculated either within Wikibase after each data
change or "externally" by a bot. Among other things, the calculation could
draw on the number of constraint violations, the completeness of references,
the degree of completeness with respect to the underlying ontology, etc.
There are already some interesting discussions on the question of data
quality that can be used here (see
https://www.wikidata.org/wiki/Wikidata:Item_quality;
https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.).
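
A naive version of one such component, the completeness of references, could look like the following sketch over the public item JSON; the weighting and combination with other dimensions is left open here:

# Naive sketch of one score component: the fraction of statements on an
# item that carry at least one reference. Other dimensions (constraint
# violations, ontology completeness, ...) would be combined by weighting.
import requests

def referenced_ratio(qid):
    data = requests.get(f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json").json()
    claims = data["entities"][qid]["claims"]
    statements = [s for group in claims.values() for s in group]
    if not statements:
        return 0.0
    referenced = [s for s in statements if s.get("references")]
    return len(referenced) / len(statements)

print(round(referenced_ratio("Q42"), 2))   # a value between 0 and 1; changes with every edit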

Advantages

  * Users get a quick overview of the quality of a page (item).
  * SPARQL can be used to query only those items that meet a certain
    quality level (see the query sketch after this list).
  * The idea would probably be relatively easy to implement.
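
For example, assuming the hypothetical score property from the sketch above, such a filter could look like:

# Sketch: restrict a query to items whose (hypothetical) quality score
# exceeds a threshold. "P9999999" is the same placeholder property as above.
import requests

QUERY = """
SELECT ?item ?score WHERE {
  ?item wdt:P31 wd:Q5 ;                 # e.g. humans
        wdt:P9999999 ?score .           # hypothetical quality-score property
  FILTER(?score >= 0.8)
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY},
                 headers={"Accept": "application/sparql-results+json",
                          "User-Agent": "quality-sketch/0.1"})
for row in r.json()["results"]["bindings"]:
    print(row["item"]["value"], row["score"]["value"])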


Disadvantage:

  * In a way, the data model is abused: such statements no longer
    describe the item itself, but rather the representation of this item
    in Wikidata.
  * Additional computing power must be provided for the regular
    calculation of all changed items.
  * The score only refers to the quality of a page. If the quality is
    insufficient, the corrections still have to be made manually.


I would now be interested in the following:

 1. Is this idea suitable to effectively help solve existing quality
    problems?
 2. Which quality dimensions should the score value represent?
 3. Which quality dimensions can be calculated with reasonable effort?
 4. How should they be calculated and represented?
 5. What is the most suitable way to further discuss and implement this
    idea?


Many thanks in advance.

Uwe Jung  (UJung <https://www.wikidata.org/wiki/User:UJung>)
http://www.archivfuehrer-kolonialzeit.de/thesaurus



_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org