Hello,
thank you very much for your contributions and comments. I would endorse
most of your remarks without hesitation.
But I would like to clarify a few points:
- The importance of Wikidata grows with its acceptance by an
"unspecialized" audience. This also includes many people who decide on
project funding or donations. As a rule, they have little time to inform
themselves thoroughly about the problems of measuring data quality. In
these hectic times, it is unfortunately common for such an audience to
demand solutions that are as simple and as quick to analyse as possible.
(I will leave this last sentence here as a hypothesis.) I think it is
important to try to meet these expectations.
- Recoin is well known. And yes, it only addresses the dimension of
*relative* completeness. At present, however, it is aimed primarily at
people who enter data manually, so it remains invisible or unusable for
many others. To stay with the idea - would it not be possible to calculate
a one- or multi-dimensional value from the Recoin information, which could
then be stored on the item as a literal via a property "relative
completeness"? The advantage would be that this value could be queried via
SPARQL together with the item (see the query sketch below this list).
Decision-makers from the field of "jam science", for example, could thus
gain an overview of how complete the data from this field are in Wikidata
and for which data-completion projects funds might still have to be
provided. As described in my last article, a single property "relative
completeness" is not sufficient to describe data quality.
- I am sorry if I expressed this in a misleading way. I use this mailing
list to get feedback on an idea. It may be "my" idea (or not), but it is
far from being "my" project. However, if the idea should ever be realized
by anyone in any way, I would be glad to make my own modest contribution.
- It's true that the current number of Wikidata items is hard to imagine.
If a single process needed just one minute per item to calculate the
various quality scores, it would take about 113 years to cover them all
(a rough calculation follows below this list). The fact that many items
are modified again and again and would therefore have to be recalculated
is not even taken into account here. An implementation would therefore
have to use strategies that make the first results visible with less
effort. One possibility is to concentrate initially on the part of the
data that is actually being used. This brings us to the question of
dynamic quality.
- People need support so that they can use the data and find and fix its
flaws. In the foreseeable future there will not be enough volunteers to
manually check all 60 million items for errors. This is another reason why
information about the quality of the data should be queryable together
with the data itself.
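
To make the Recoin point above more concrete, here is a minimal sketch of
how such a value could be queried once it existed. The property "Pxxxx"
and the 0.5 threshold are purely hypothetical placeholders; only the
SPARQL / Wikidata Query Service mechanics and the Python SPARQLWrapper
library are real:

# Minimal sketch in Python using SPARQLWrapper.
# "Pxxxx" stands in for a hypothetical "relative completeness" property;
# no such property exists in Wikidata today.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?itemLabel ?completeness WHERE {
  ?item wdt:P31 wd:Q5 .             # example class: humans
  ?item wdt:Pxxxx ?completeness .   # hypothetical completeness score
  FILTER(?completeness < 0.5)       # items that look less than half complete
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="completeness-sketch/0.1 (example)")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["itemLabel"]["value"], row["completeness"]["value"])

A decision-maker could then aggregate such scores per field (GROUP BY plus
AVG over ?completeness) to see where data-completion projects would pay
off most.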
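
And to put the scale argument in numbers, a quick back-of-the-envelope
calculation; the one minute per item is of course only an assumed cost,
not a measurement:

# Rough single-process estimate; the per-item cost is an assumption.
ITEMS = 60_000_000            # approximate number of Wikidata items
SECONDS_PER_ITEM = 60         # assumed time to compute all quality scores

total_seconds = ITEMS * SECONDS_PER_ITEM
years = total_seconds / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years")   # roughly 114 years with these round numbers

Even dramatic parallelisation only shifts the problem, which is why
prioritising the data that is actually being queried seems unavoidable.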
Thanks
Uwe Jung