Hello,
As the importance of Wikidata increases, so do the demands on the quality of the data. I would like to put the following proposal up for discussion.
Two basic ideas:
1. Each Wikidata page (item) is scored after every edit. This score should express different dimensions of data quality in a way that can be grasped at a glance.
2. A property is created via which the item refers to the score value. Certain qualifiers can be used for a more detailed description (e.g. time of calculation, algorithm used to calculate the score, etc.).
The score can be calculated either within Wikibase after each data change or "externally" by a bot. Among other things, the calculation could draw on the number of constraints, the completeness of references, the degree of completeness relative to the underlying ontology, etc. There are already some interesting discussions on data quality that can be used here (see https://www.wikidata.org/wiki/Wikidata:Item_quality; https://www.wikidata.org/wiki/Wikidata:WikiProject_Data_Quality, etc.).
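For illustration, here is a minimal sketch (my own, not part of the proposal itself) of how a bot could derive such a score from the JSON representation of a single entity; the two dimensions and their equal weights are placeholder assumptions, not an agreed-upon metric:

```python
# Illustrative only: a crude per-item score computed from the Wikidata JSON
# representation of an entity. Dimensions and weights are placeholder choices.
def item_score(entity: dict) -> float:
    """Return a 0..1 score mixing reference completeness and statement breadth."""
    statements = [s for claims in entity.get("claims", {}).values() for s in claims]
    if not statements:
        return 0.0
    referenced = sum(1 for s in statements if s.get("references"))
    reference_share = referenced / len(statements)            # share of referenced statements
    breadth = min(len(entity.get("claims", {})) / 20.0, 1.0)  # distinct properties, capped at 20
    return 0.5 * reference_share + 0.5 * breadth              # arbitrary equal weights
```

The result would then be written back via the proposed property, with qualifiers recording the time of calculation and the algorithm used.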
Advantages:
- Users get a quick overview of the quality of a page (item).
- SPARQL can be used to query only those items that meet a certain quality level.
- The idea would probably be relatively easy to implement.
Disadvantages:
- In a way, the data model is abused by generating statements that no longer describe the item itself but rather its representation in Wikidata.
- Additional computing power must be provided to regularly recalculate all changed items.
- The score only reports on quality; if quality is insufficient, the improvements still have to be made manually.
I would now be interested in the following:
1. Is this idea suitable to effectively help solve existing quality problems?
2. Which quality dimensions should the score value represent?
3. Which quality dimensions can be calculated with reasonable effort?
4. How should they be calculated and represented?
5. What is the most suitable way to further discuss and implement this idea?
Many thanks in advance.
Uwe Jung (UJung https://www.wikidata.org/wiki/User:UJung) www.archivfuehrer-kolonialzeit.de/thesaurus
Hello,
Very interesting idea. Just to feed the discussion, here is a very recent literature survey on data quality in Wikidata: https://opensym.org/wp-content/uploads/2019/08/os19-paper-A17-piscopo.pdf
Cheers,
Ettore Rizza
On Sat, 24 Aug 2019 at 13:55, Uwe Jung jung.uwe@gmail.com wrote:
Hoi, What is it that you hope to achieve by this? It will add to the time it takes to process an edit. That is a luxury we cannot afford. It is also not something that would influence my edits. Thanks, GerardM
On Sat, 24 Aug 2019 at 13:55, Uwe Jung jung.uwe@gmail.com wrote:
Hi,
If we accept that the quality of data is its "fitness for use", which in my opinion is the best and most commonly used definition (as stated in the article linked by Ettore), then it will never be possible to define a number that objectively represents data quality. We can define a number that is an arbitrarily weighted average of different metrics, related to various quality dimensions that are themselves arbitrarily captured and transformed, and we can fool ourselves by saying that this number represents data quality. But it will not, nor will it be an approximation of what data quality means, nor will this number be able to order Wikidata entities according to any common, understandable, high-level criterion. The quality of the data depends on the use; it is relative to each user and cannot be measured globally and objectively in any way that is better than another.
As an alternative, however, I can suggest that you separately study some quality dimensions assuming a particular use case for your study; this will be correct, doable and greatly appreciated. :-) Please feel free to ask for help in case you need it, either personally or via this list or other means. And thanks for your interest in improving Wikidata!
Regards, David
On 8/24/19 13:54, Uwe Jung wrote:
Hi David,
On 24.08.19 21:23, David Abián wrote:
True that, but there are these two aspects:
1. What you describe sounds more like an inherent problem of "measuring" quality: as soon as you measure it, it becomes a quantity. This is not specific to data. However, measuring is a viable helping construct, i.e. you measure something and, in the interpretation, you can judge quality for better or for worse. There is some merit in quantification for data, e.g. [1] and [2].
[1] SHACL predecessor: http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf
[2] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf
2. Data is not well understood yet, in my opinion. There is no good model yet for measuring its value once it becomes information.
Please see below:
We are studying this in Global Fact Sync at the moment https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/Sync...
The next step here is to define 10 or more sync targets for Wikidata and then assess them as a baseline. Then we will extend the prototype in a way that aims for these sync targets to be:
- in sync with Wikipedia's infoboxes, i.e. all info was transferred in terms of data and citations on wikipages
- near perfect, if we find a good reference source to integrate
Is there someone who has a suggestion? Like a certain part of Wikidata that can serve as a testbed and that we can improve?
Please send it to us.
-- Sebastian
TL;DR: it would be useful, but it is extremely hard to create rules for every domain.
IMHO, it depends on the data domain.
For geodata (human settlements/rivers/mountains/...) with GPS coordinates, my simple rules are:
- if it has a local Wikipedia page or a page on any big-language ["EN/FR/PT/ES/RU/.."] Wikipedia, then it is OK;
- if it is only in "cebuano" AND outside the "cebuano" bounding box, then it is lower quality;
- only {shwiki+srwiki} AND outside the "sh"/"sr" bounding boxes: lower quality;
- only {huwiki} AND outside the Central Europe bounding box: lower quality;
- geodata without GPS coordinates: ...
- ...
So my rules are based on Wikipedia pages and language areas, and I prefer Wikidata items with local Wikipedia pages. (A rough sketch of these rules in code follows below.)
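Here is one way such rules could be rendered in code (my own sketch; the bounding boxes and score values are made-up placeholders, not calibrated data):

```python
# Rough rendering of the heuristic above, applied to a Wikidata JSON entity.
# Bounding boxes and scores are illustrative placeholders only.
BIG_WIKIS = {"enwiki", "frwiki", "ptwiki", "eswiki", "ruwiki"}
# (min_lon, min_lat, max_lon, max_lat) -- rough, illustrative boxes
BBOXES = {"cebwiki": (116.0, 4.0, 127.0, 21.0), "huwiki": (5.0, 42.0, 30.0, 55.0)}

def inside(bbox, lon, lat):
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def geo_plausibility(entity: dict) -> float:
    """0..1 plausibility score for a geo item, based on sitelinks and coordinates."""
    claims = entity.get("claims", {})
    sitelinks = set(entity.get("sitelinks", {}))
    if "P625" not in claims:                       # P625 = coordinate location
        return 0.0                                 # geodata without GPS coordinates
    value = claims["P625"][0].get("mainsnak", {}).get("datavalue", {}).get("value", {})
    if not value:
        return 0.0
    lon, lat = value["longitude"], value["latitude"]
    if sitelinks & BIG_WIKIS:                      # page on a big-language Wikipedia
        return 1.0
    if sitelinks == {"cebwiki"} and not inside(BBOXES["cebwiki"], lon, lat):
        return 0.2                                 # cebuano-only, outside cebuano area
    if sitelinks == {"huwiki"} and not inside(BBOXES["huwiki"], lon, lat):
        return 0.3                                 # huwiki-only, outside Central Europe
    return 0.6                                     # no rule matched
```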
This is based on my experience - adding Wikidata ID concordances to NaturalEarth ( https://www.naturalearthdata.com/blog/ )
An idea:
IMHO: load the Wikidata dump into a local database and create
- some "proof of concept" quality indicators,
- some "meta" rules,
- some "real" statistics,
so the community can decide whether it is useful or not. (A minimal streaming sketch follows below.)
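A possible starting point, assuming the standard latest-all.json.gz dump (one entity per line inside a JSON array); the indicator sketched in the comments is only a placeholder:

```python
# Streams the Wikidata JSON dump (latest-all.json.gz: one entity per line,
# wrapped in a JSON array) so indicators can be computed without a full import.
import gzip
import json

def iter_entities(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue                     # skip the array brackets / empty lines
            yield json.loads(line)

# Placeholder "proof of concept" indicator: share of items with coordinates
# that also have a sitelink on a big-language Wikipedia.
# with_big, total = 0, 0
# for e in iter_entities("latest-all.json.gz"):
#     if "P625" in e.get("claims", {}):      # P625 = coordinate location
#         total += 1
#         if set(e.get("sitelinks", {})) & {"enwiki", "frwiki", "eswiki", "ruwiki"}:
#             with_big += 1
# print(with_big / total)
```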
Imre
On Sat, 24 Aug 2019 at 14:55, Uwe Jung <jung.uwe@gmail.com> wrote:
Hi Imre,
We can encode these rules using the JSON MongoDB database we created in the GlobalFactSync project (https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE), which serves as the basis for the GFS Data Browser. The database has open read access.
Is there a list of geodata issues somewhere? Can you give some examples? GFS focuses on both overall quality measures and very domain-specific adaptations. We will also try to flag these issues for Wikipedians.
So I see that there is some notion of what is good and what is not, by source. Do you have a reference dataset as well, or would that be NaturalEarth itself? What would help you to measure completeness when adding concordances to NaturalEarth?
-- Sebastian
On 24.08.19 21:26, Imre Samu wrote:
Hi Sebastian,
Is there a list of geodata issues, somewhere? Can you give some example?
My main "pain" points:
- the cebuano geo duplicates: https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2017/10#Cebuano https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/06#A_propos...
- detecting "anonym" editings of the wikidata labels from wikidata JSON dumps. As I know - Now it is impossible, - no similar information in the JSON dump, so I cant' create a score. This is similar problem like the original posts ; ( ~ quality score ) but I would like to use the original editing history and implementing/tuning my scoring algorithm.
When somebody renaming some city names (trolls) , then my matching algorithm not find them, and in this cases I can use the previous "better" state of the wikidata. It is also important for merging openstreetmap place-names with wikidata labels for end users.
Do you have a reference dataset as well, or would that be NaturalEarth itself?
Sorry, I don't have a reference dataset, and NaturalEarth is only a subset of "reality" - it does not contain all cities, rivers, etc. But maybe you can use OpenStreetMap as the best resource. Sometimes I also add Wikidata concordances to the https://www.whosonfirst.org/ (WOF) gazetteer, but that data originates mostly from similar sources (GeoNames, ...), so it can't be used as a quality indicator.
If you need an easy example, "airports" are probably a good start for checking Wikidata completeness (P238 IATA airport code; P239 ICAO airport code; P240 FAA airport code; P931 place served by transport hub; P131 located in the administrative territorial entity).
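For instance, a quick completeness check along these lines could be run against the public query service (my own sketch; Q1248784 is the "airport" class, and a broad class query like this may need a LIMIT or a narrower class in practice to avoid timeouts):

```python
# Counts airports without an IATA code (P238) via the Wikidata Query Service.
# Illustrative sketch; very broad class queries may time out in practice.
import requests

QUERY = """
SELECT (COUNT(?airport) AS ?missingIata) WHERE {
  ?airport wdt:P31/wdt:P279* wd:Q1248784 .         # instance of (a subclass of) airport
  FILTER NOT EXISTS { ?airport wdt:P238 ?iata . }  # no IATA airport code
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "completeness-example/0.1 (mailing list sketch)"},
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"][0]["missingIata"]["value"])
```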
What would help you to measure completeness for adding concordances to NaturalEarth?
I have created my own tools/scripts, because waiting for the community to fix the cebwiki data problems takes a lot of time.
I import the Wikidata JSON dumps into PostGIS (SPARQL is not flexible/scalable enough for geo matching), add some scoring based on cebwiki/srwiki/..., and create some sheets for manual checking. But this process is like a "fuzzy left join", with a lot of hacky code and manual tuning.
If I don't find some NaturalEarth/WOF object in Wikidata, I have to debug manually. The most common problems are:
- different transliterations / spellings / English vs. local names,
- some trolling by anonymous users (mostly from mobile phones),
- problems with GPS coordinates,
- changes in the real data (cities joining / splitting),
so a lot of background research is needed.
best, Imre
On Wed, 28 Aug 2019 at 11:11, Sebastian Hellmann <hellmann@informatik.uni-leipzig.de> wrote:
Hello,
many thanks for the answers to my contribution from 24.8. I think that all four opinions contain important things to consider.
@David Abián I have read the article and agree that in the end the users decide which data is good for them or not.
@GerardM It is true that in a possible implementation of the idea, the aspect of computing load must be taken into account right from the beginning.
Please note that I have not given up on the idea yet. With regard to the acceptance of Wikidata, I consider a quality indicator of some kind to be absolutely necessary. There are probably a lot of ordinary users who would like to see something like this.
At the same time I completely agree with David: (almost) every chosen indicator is subject to a certain arbitrariness in its selection. There won't be one easy-to-understand super-indicator. So, let's approach things from the other side. Instead of a global indicator, a separate indicator should be developed for each quality dimension to be considered. For some dimensions this should be relatively easy; for others it could take years until we have agreed on an algorithm for their calculation.
Furthermore, the indicators should not be discrete values but a continuum of values. No traffic-light statements (i.e. good, medium, bad) should be made. Rather, when displaying an indicator, its value could be related to the values of all other objects (e.g. the value x for the current data object in relation to the overall average across all objects for this indicator). The advantage is that the overall average can increase over time, meaning that the relative position of an individual object's value can also decrease over time, even if the value itself is unchanged.
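As a toy illustration of this relative reporting (my own example, not from the thread):

```python
# Toy example: report an indicator relative to the current project-wide average
# instead of as a fixed traffic-light category.
from statistics import mean

def relative_position(value, all_values):
    """Ratio of an item's indicator value to the average over all items."""
    return value / mean(all_values)

# 0.6 against today's average of 0.4 -> 1.5 (above average); if the average
# later rises to 0.7, the same unchanged item drops below average.
print(relative_position(0.6, [0.2, 0.4, 0.6]))  # 1.5
print(relative_position(0.6, [0.6, 0.7, 0.8]))  # ~0.857
```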
Another advantage: users can define the required quality level themselves. If, for example, you have high demands on accuracy but low demands on the completeness of statements, you can express exactly that.
However, it remains important that these indicators (i.e. the evaluation of the individual item) must be stored together with the item and can be queried together with the data using SPARQL.
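A sketch of what such a query could look like, assuming a per-dimension indicator property that does not exist today (wdt:Pxxxx below is a pure placeholder):

```python
# Hypothetical query: filter items by a per-dimension quality indicator.
# "Pxxxx" is a placeholder; no such property exists on Wikidata today.
import requests

QUERY = """
SELECT ?item ?refCompleteness WHERE {
  ?item wdt:P31 wd:Q5 .                   # e.g. humans
  ?item wdt:Pxxxx ?refCompleteness .      # placeholder indicator property
  FILTER(?refCompleteness >= 0.8)         # user-chosen threshold
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "indicator-query-example/0.1"},
)
print(resp.json()["results"]["bindings"])  # empty until such a property exists
```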
Greetings
Uwe Jung
On Sat, 24 Aug 2019 at 13:54, Uwe Jung <jung.uwe@gmail.com> wrote:
Uwe, I feel this is more and more important: quality and provenance, and also communicating within Wikidata the quality of our data.
I have added what is maybe the best source for biographies in Sweden, P3217, to about 7500 person items in Wikidata. Those 7500 objects are used in more than 200 different language Wikipedias; we need a "layer" explaining that data confirmed with P3217 ("SBL" from Sweden) has very high trust.
See https://phabricator.wikimedia.org/T222142
I can also see this quality problem: Nobelprize.org and Wikidata have more than 30 differences, and it is sometimes difficult to understand the quality of the sources in Wikidata; plus, the fact that Nobelprize.org itself cites no sources makes the equation difficult: https://phabricator.wikimedia.org/T200668
Regards, Magnus Sälgö 0046-705937579 Salgo60@msn.com
A blog post I wrote: https://minancestry.blogspot.com/2018/04/wikidata-has-design-problem.html
Hoi, Shopping for recognition for "your" project. Making it an issue that has to affect everything else because of the quality of "your" project is highly problematic given a few basic facts: Wikidata has 59,284,641 items; this effort is about 7,500 people. They drown in a sea of other people and items. Statistically, the numbers involved are insignificant.
HOWEVER, when your effort has a practical application, it is that application, the use of the data, that ensures that the data will be maintained and hopefully ensures that the quality of this subset is maintained. When you want quality, static quality is achieved by restricting at the gate; dynamic quality is achieved by making sure that the data is actually used. Scholia is an example of functionality that supports existing data, and everyone who uses it will see the flaws in the data. It is why we need to import additional data, merge scientists when there are duplicates, and add additional authorities.
Yes, we need to achieve quality results. They will be achieved when people use the data, find its flaws and consequently append and amend. Recognition of quality is done best by supporting and highlighting the application of our data, and by being particularly thankful for the consequential updates we receive. The users that help us do better are our partners; all others ensure our relevance. Thanks, GerardM
On Wed, 28 Aug 2019 at 00:52, Magnus Sälgö salgo60@msn.com wrote:
@Uwe: I'm sorry if I'm stating trivialities, but are you familiar with the Recoin tool [1]? It seems to be quite close to what you describe, but only for the data quality dimension of completeness (or, more precisely, *relative* completeness), and it could perhaps serve as a model for what you are considering. It is also a good example of a data quality tool that is directly useful to editors, as it often allows them to identify and add missing statements on an item.
Regards,
Ettore Rizza
[1] https://www.wikidata.org/wiki/Wikidata:Recoin
On Tue, 27 Aug 2019 at 21:49, Uwe Jung jung.uwe@gmail.com wrote:
Hi Uwe,
I would agree with Gerard's concern about resources. Actually embedding it within Wikidata - stored on the item with a property and queryable by SPARQL - implies that we handle it as a statement. So each edit that materially changed the quality score would prompt another edit to update the scoring. Presumably not all edits would change anything (e.g. label changes wouldn't be relevant), but even if only 10% made a material difference, that's basically 10% more edits, 10% more contribution to query service updates, etc. And that's quite a substantial chunk of resources for a "nice to have" feature!
So... maybe this suggests a different approach.
You could set up a separate Wikibase installation (or any other kind of linked-data store) to store the quality ratings, and make that accessible through a federated SPARQL search. The WDQS is capable of handling federated searches reasonably efficiently (see e.g. https://w.wiki/7at), so you could allow people to do a search using both sets of data ("find me all ABCs on Wikidata, and only return those with a value of X > 0.5 on Wikidata-Scores").
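A rough sketch of such a federated query (the score endpoint URL and the ex:qualityScore predicate are invented placeholders; note also that WDQS only federates with allow-listed endpoints, so a real setup would either need the score store added to that list or run the query from the score store's side instead):

```python
# Hypothetical federated query joining Wikidata items with scores kept in a
# separate triple store; the endpoint URL and predicate are placeholders.
FEDERATED_QUERY = """
PREFIX ex: <https://scores.example.org/ontology/>
SELECT ?item ?score WHERE {
  ?item wdt:P31 wd:Q1248784 .                    # e.g. airports on Wikidata
  SERVICE <https://scores.example.org/sparql> {  # placeholder external score store
    ?item ex:qualityScore ?score .
  }
  FILTER(?score > 0.5)
}
LIMIT 10
"""
# Send to https://query.wikidata.org/sparql (as in the airport example earlier
# in the thread) once the score endpoint is allow-listed, or run it from the
# external store federating out to Wikidata instead.
```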
Andrew.
On Tue, 27 Aug 2019 at 20:50, Uwe Jung jung.uwe@gmail.com wrote:
-- Andrew Gray andrew@generalist.org.uk