Hi,
If we accept that data quality means "fitness for use", which in my opinion is the best and most commonly used definition (as stated in the article linked by Ettore), then it will never be possible to define a single number that objectively represents data quality. We can define a number that is an arbitrarily weighted average of metrics for different quality dimensions, each captured and transformed in an arbitrary way, and we can fool ourselves into saying that this number represents data quality, but it will not; it will not even be an approximation of what data quality means, nor will it be able to rank Wikidata entities according to any common, understandable, high-level criterion. The quality of the data depends on the use, it is relative to each user, and it cannot be measured globally and objectively in any way that is better than another.
As an alternative, however, I can suggest that you study some quality dimensions separately, assuming a particular use case for your study; that would be sound, doable and greatly appreciated. :-) Please feel free to ask for help if you need it, either personally, via this list or by other means. And thanks for your interest in improving Wikidata!
Regards, David
On 8/24/19 13:54, Uwe Jung wrote:
Hello,
As the importance of Wikidata increases, so do the demands on the quality of the data. I would like to put the following proposal up for discussion.
Two basic ideas:
- Each Wikidata page (item) is scored after each edit. This score should express different dimensions of data quality in a form that is quick to grasp.
- A property is created through which the item refers to its score value. Qualifiers can be used for a more detailed description (e.g. time of calculation, algorithm used to compute the score, etc.); see the schematic sketch after this list.
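To make the second idea concrete, here is a schematic sketch in Python of what such a statement could carry. It is not the exact Wikibase JSON serialization, just an illustration: the score property "P9999" and the algorithm item "Q99999999" are hypothetical placeholders, while P585 (point in time) and P459 (determination method) are existing properties that could serve as the qualifiers mentioned above.

# Schematic only -- not the real Wikibase JSON serialization.
proposed_statement = {
    "property": "P9999",   # hypothetical "quality score" property
    "value": 0.82,         # the computed score
    "qualifiers": {
        "P585": "2019-08-24T00:00:00Z",  # time of calculation
        "P459": "Q99999999",             # hypothetical item describing the scoring algorithm
    },
}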
The score value can be calculated either within Wikibase after each data change or "externally" by a bot. Among other things, the calculation could use the number of constraints, the completeness of references, the degree of completeness in relation to the underlying ontology, etc. There are already some interesting discussions on the question of data quality that can be used here (see https://www.wikidata.org/wiki/Wikidata:Item_quality; https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.).
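As a very rough illustration of the "externally by a bot" variant, the Python sketch below fetches an item's JSON from the public Special:EntityData endpoint and combines two of the signals mentioned above, the share of referenced statements and the label/description coverage, into a toy score. The chosen signals, caps and weights are arbitrary placeholders, not a proposal for the actual formula.

import requests

def toy_quality_score(qid):
    # Fetch the full entity JSON for one item (public endpoint, no login needed).
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    entity = requests.get(url).json()["entities"][qid]

    # Share of statements that carry at least one reference.
    statements = [s for claims in entity.get("claims", {}).values() for s in claims]
    referenced = [s for s in statements if s.get("references")]
    ref_share = len(referenced) / len(statements) if statements else 0.0

    # Label and description coverage, capped at 10 languages each.
    label_share = min(len(entity.get("labels", {})), 10) / 10
    desc_share = min(len(entity.get("descriptions", {})), 10) / 10

    # Arbitrary weighting of the three signals.
    return 0.5 * ref_share + 0.25 * label_share + 0.25 * desc_share

print(toy_quality_score("Q42"))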
Advantages
- Users get a quick overview of the quality of a page (item).
- SPARQL can be used to query only those items that meet a certain quality level (see the query sketch after this list).
- The idea would probably be relatively easy to implement.
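To illustrate the SPARQL point, a query along the following lines could filter on the score once a score property exists; "P9999" again stands in for that hypothetical property. The sketch wraps the query in a small Python call to the Wikidata Query Service.

import requests

# P9999 stands in for the (not yet existing) quality-score property.
QUERY = """
SELECT ?item ?score WHERE {
  ?item wdt:P9999 ?score .
  FILTER(?score >= 0.8)
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "quality-score-sketch/0.1 (example)"},
)
for row in response.json()["results"]["bindings"]:
    print(row["item"]["value"], row["score"]["value"])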
Disadvantages:
- In a way, the data model is misused by generating statements that no longer describe the item itself but instead describe the representation of that item in Wikidata.
- Additional computing power must be provided to regularly recalculate the score for all changed items.
- The score only describes the quality of a page; if the quality is insufficient, improvements still have to be made manually.
I would now be interested in the following:
- Is this idea suitable to effectively help solve existing quality problems?
- Which quality dimensions should the score value represent?
- Which quality dimensions can be calculated with reasonable effort?
- How should they be calculated and represented?
- What is the most suitable way to further discuss and implement this idea?
Many thanks in advance.
Uwe Jung (UJung, https://www.wikidata.org/wiki/User:UJung) http://www.archivfuehrer-kolonialzeit.de/thesaurus