Hello,
thank you very much for your contributions and comments. I would endorse
most of your remarks without hesitation.
But I would like to clarify a few points:
- The importance of Wikidata grows with its acceptance by an
"unspecialized" audience. This also includes many people who decide on
project funding or donations. As a rule, they have little time to inform
themselves thoroughly about the problems of measuring data quality. In
these hectic times, it is unfortunately common for such an audience to
demand solutions that are as simple and as quick to analyse as possible.
(I will leave this last sentence here as a hypothesis.) I think it is
important to try to meet these expectations.
- Recoin is well known. And yes, it only addresses the dimension of
*relative* completeness. At present, however, it is aimed primarily at
people who enter data manually, so it remains invisible or unusable for
many others. To stay with the idea - would it not be possible to calculate
a one- or multi-dimensional value from the Recoin information, which could
then be stored on the item as a literal via a property "relative
completeness"? The advantage would be that this value could be queried via
SPARQL together with the item (see the query sketch below this list).
Decision-makers from the field of "jam science", for example, could thus
gain an overview of how complete the data from this field are in Wikidata
and for which data-completion projects funds might still have to be
provided. As described in my last article, a single property "relative
completeness" is not sufficient to describe data quality.
- I am sorry if I expressed this in a misleading way. I use this mailing
list to get feedback on an idea. It may be "my" idea (or not), but it is
far from being "my" project. However, if the idea should ever be realized
by anyone in any way, I would be glad to make my own modest contribution.
- It's true that the current number of Wikidata items is hard to imagine.
If a single process needed just one minute per item to calculate the
various quality scores, it would take about 113 years to cover them all
(a rough calculation follows below this list). The fact that many items
are modified again and again and would therefore have to be recalculated
is not even taken into account here. An implementation would therefore
have to use strategies that make the first results visible with less
effort. One possibility is to concentrate initially on the part of the
data that is actually being used. This brings us to the question of
dynamic quality.
- People need support so that they can use the data and find and fix its
flaws. In the foreseeable future there will not be enough volunteers to
manually check all 60 million items for errors. This is another reason why
information about the quality of the data should be queryable together
with the data itself.
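
To make the Recoin point above more concrete, here is a minimal sketch of
how such a value could be queried once it existed. The property "Pxxxx"
and the 0.5 threshold are purely hypothetical placeholders; only the
SPARQL / Wikidata Query Service mechanics and the Python SPARQLWrapper
library are real:

# Minimal sketch in Python using SPARQLWrapper.
# "Pxxxx" stands in for a hypothetical "relative completeness" property;
# no such property exists in Wikidata today.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?itemLabel ?completeness WHERE {
  ?item wdt:P31 wd:Q5 .             # example class: humans
  ?item wdt:Pxxxx ?completeness .   # hypothetical completeness score
  FILTER(?completeness < 0.5)       # items that look less than half complete
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="completeness-sketch/0.1 (example)")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["completeness"]["value"])

A decision-maker could then aggregate such scores per field (GROUP BY plus
AVG over ?completeness) to see where data-completion projects would pay
off most.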
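
And to put the scale argument in numbers, a quick back-of-the-envelope
calculation; the one minute per item is of course only an assumed cost,
not a measurement:

# Rough single-process estimate; the per-item cost is an assumption.
ITEMS = 60_000_000            # approximate number of Wikidata items
SECONDS_PER_ITEM = 60         # assumed time to compute all quality scores

total_seconds = ITEMS * SECONDS_PER_ITEM
years = total_seconds / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years")   # roughly 114 years with these round numbers

Even dramatic parallelisation only shifts the problem, which is why
prioritising the data that is actually being queried seems unavoidable.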
Thanks
Uwe Jung