We are releasing the alpha version of the Wikidata Knowledge Imbalance Dashboard (https://prowd.netlify.com/). The tool measures knowledge imbalances on Wikidata using the Gini index, computed from property existence over entities.
In the examples, data about infectious diseases [1] is shown to be imbalanced (Gini = 0.26), whereas data about countries [2] is balanced (Gini = 0.14). Other examples include programming languages [3] (Gini = 0.37) and association football clubs [4] (Gini = 0.29).
The tool also supports focusing on specific properties of interest instead of analyzing all possible properties of entities. With respect to diseases [5], for example, analyzing the existence of the properties has effect (P1542), possible treatment (P924), drug used for treatment (P2176), and symptoms (P780) reveals that Wikidata is heavily imbalanced.
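For readers who want to see the idea in code, here is a minimal Python sketch of a Gini index over property existence. It assumes that each entity's score is the number of distinct properties it has (optionally restricted to a set of properties of interest); the dashboard's actual computation may differ. The item labels and the data are made up for illustration, and the function names are hypothetical.

```python
def gini(values):
    """Gini index of a list of non-negative numbers (0 = perfectly balanced)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values, with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def property_counts(entities, properties=None):
    """Count the properties present on each entity.

    entities: dict mapping an item label to the set of property IDs it has.
    properties: optional iterable of property IDs to restrict the count to.
    """
    focus = set(properties) if properties is not None else None
    counts = []
    for props in entities.values():
        if focus is not None:
            props = props & focus
        counts.append(len(props))
    return counts


# Made-up sample data (not taken from Wikidata).
entities = {
    "disease_a": {"P31", "P1542", "P924", "P2176", "P780"},
    "disease_b": {"P31", "P780"},
    "disease_c": {"P31"},
    "disease_d": {"P31", "P924", "P780"},
}

# The four properties of interest named in the text above.
FOCUS = ["P1542", "P924", "P2176", "P780"]

gini_all = gini(property_counts(entities))
gini_focus = gini(property_counts(entities, FOCUS))
print(f"all properties: {gini_all:.2f}, focus properties: {gini_focus:.2f}")
```

On this toy data the focused analysis yields a higher Gini than the all-properties one, because every item shares P31 and that shared property flattens the overall distribution.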
As the tool is still in a preliminary stage, we invite you to give it a try at https://prowd.netlify.com/ and welcome any constructive feedback!
Regards, Fariz
Example links:
[1] https://commons.wikimedia.org/wiki/File:Screenshot-prowd.netlify.com-2020.03...
[2] https://commons.wikimedia.org/wiki/File:Screenshot-prowd.netlify.com-2020.03...
[3] https://commons.wikimedia.org/wiki/File:Screenshot-prowd.netlify.com-2020.03...
[4] https://commons.wikimedia.org/wiki/File:Screenshot-prowd.netlify.com-2020.03...
[5] https://commons.wikimedia.org/wiki/File:Screenshot-prowd.netlify.com-2020.03...
Hi Fariz,
Nice work!
While running a few properties, one question I had is how the tool handles the 10,000-item limit. There does not seem to be an explanation of how the 10,000 items are selected, whether they are the first 10,000 items returned by a Wikidata query or chosen by some other means of sampling. It is important to understand this in order to consider what bias that selection introduces into the analysis, since the full item list is not being analysed (or maybe it is and you're only displaying 10,000); it just isn't clear.
Additionally, adding a "count" column to the property gap would provide a quick overview of how extensive the bias is per property, similar to how the count of properties is listed per item.
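As a rough sketch of what such a count could look like, the Python below tallies, for each property, how many items lack it. This is not how the dashboard is implemented; the function name, the item labels, and the data are all hypothetical, and only the property IDs come from the announcement.

```python
from collections import Counter


def property_gap_counts(entities, properties):
    """Return (property, number of items missing it) pairs, largest gap first.

    entities: dict mapping an item label to the set of property IDs it has.
    properties: iterable of property IDs to check.
    """
    properties = list(properties)
    # Start every property at zero so fully covered properties still appear.
    missing = Counter({p: 0 for p in properties})
    for props in entities.values():
        for p in properties:
            if p not in props:
                missing[p] += 1
    return missing.most_common()


# Made-up sample data (not taken from Wikidata).
sample_items = {
    "disease_a": {"P1542", "P924", "P2176", "P780"},
    "disease_b": {"P780"},
    "disease_c": set(),
    "disease_d": {"P924", "P780"},
}

gaps = property_gap_counts(sample_items, ["P1542", "P924", "P2176", "P780"])
for prop, count in gaps:
    print(f"{prop}: missing on {count} of {len(sample_items)} items")
```

Sorting by the count puts the most widely missing properties at the top, which is where a contributor would probably want to start.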
I assume you have more planned for this given it's in alpha for feedback, so here are just a few thoughts on what I would like to do but don't see as possible at the moment.
1. It would also be useful to provide an export of the list of items and the associated properties that are missing for each, along with their deviation percentile so the list can be ordered by deviation. If these were also saved as part of the tool, it could become a working list to focus on to counter the bias.
2. Do you have plans for adding filters or grouped comparisons between values of properties or subclasses? Currently, running on lakes (Q23397) results in more than 10,000 items. It also doesn't afford an analysis across administrative regions of the world, such as between states in the United States of America. This also prompted the above suggestion to clarify how the 10,000 items are handled in the tool, as I was unsure whether this was a complete analysis or just an analysis of 10,000 lakes that may be located in different geographic regions with different levels of organization to work on these sets. In the USA, New York, Minnesota, and Michigan seem to have had significantly more lake presence than other states, so analysis by region, by lake size, or by another property would make it clearer what type of bias is present for which dataset. The top property may not provide the bias analysis needed for particular use cases with more nuanced bias.
Lake subclasses https://angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q...
Cheers, Jonathan Brier
PhD Student, University of Maryland College Park, College of Information Studies
Visiting Researcher, University of Gothenburg