Hi Fariz,

Nice work!

While running a few properties, one question I had is how the tool handles the 10,000-item limit. There does not seem to be an explanation of how the 10,000 items are selected, whether they are the first 10,000 items returned by a Wikidata query or chosen by some other means of sampling. It is important to understand this in order to consider what bias the selection introduces into the analysis, since the full item list is not being analysed (or maybe it is, and you are only displaying 10,000); it just isn't clear.
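To make the concern concrete, here is a minimal sketch with made-up data. It is not how the tool works (that is exactly what I don't know). It only assumes, for illustration, that items returned earlier by a query are better described than later ones, and compares the coverage of one property (P625, coordinate location) in the first 10,000 items against a random sample of 10,000.

```python
import random

def coverage(items, prop):
    """Fraction of items whose property set contains `prop`."""
    return sum(prop in props for props in items) / len(items)

# Hypothetical data: 30,000 items in "query order". The probabilities
# below are assumptions chosen only to illustrate the effect.
P_EARLY, P_LATE = 0.9, 0.3
LIMIT = 10_000

rng = random.Random(0)  # fixed seed so the sketch is repeatable
items = []
for i in range(30_000):
    p = P_EARLY if i < LIMIT else P_LATE
    items.append({"P625"} if rng.random() < p else set())

first_n = items[:LIMIT]              # "first 10,000 returned"
random_n = rng.sample(items, LIMIT)  # uniform random sample

full_cov = coverage(items, "P625")
first_cov = coverage(first_n, "P625")
random_cov = coverage(random_n, "P625")

print(f"full set:     {full_cov:.2f}")
print(f"first 10,000: {first_cov:.2f}")
print(f"random 10,000: {random_cov:.2f}")
```

Under these assumptions the first-10,000 figure overstates coverage, while the random sample stays close to the full set, which is why knowing the selection method matters.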

Additionally, adding a "count" column to the property gap table would provide a quick overview of how extensive the bias is per property, similar to how the count of properties is listed per item.

I assume you have more planned for this given it's in alpha for feedback, so here are just a few thoughts on what I would like to do but don't see as possible at the moment.

1. It would also be useful to provide an export of the list of items with the properties missing from each, along with their deviation percentile so the list can be ordered by deviation. If these were also saved as part of the tool, the export could become a working list to focus on to counter the bias.

2. Do you have plans for adding filters or grouped comparisons between values of properties or between subclasses?

Currently, running the tool on lakes (Q23397) results in more than 10,000 items. It also doesn't afford an analysis across administrative regions of the world, such as between states in the United States of America. This is what prompted the above suggestion to clarify how the 10,000 items are handled in the tool, as I was unsure whether this was a complete analysis or an analysis of just 10,000 lakes, which may be located in different geographic regions with different levels of organization working on these sets. In the USA, New York, Minnesota, and Michigan seem to have significantly more lakes represented than other states, so analysis by region, by lake size, or by another property would make it clearer what type of bias is present in which dataset. The top property alone may not provide the bias analysis needed for particular use cases with more nuanced bias.
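As a rough sketch of the kind of grouped comparison I have in mind (with a "count" alongside each gap), assuming items are already available as simple records: the item IDs, the `state` field, and the `gap_by_group` helper below are all hypothetical, and the data is invented for illustration.

```python
from collections import defaultdict

def gap_by_group(items, group_key, prop):
    """Return {group: (missing_count, total_count)} for one property."""
    counts = defaultdict(lambda: [0, 0])
    for item in items:
        group = item.get(group_key, "unknown")
        counts[group][1] += 1            # total items in this group
        if prop not in item["properties"]:
            counts[group][0] += 1        # items missing the property
    return {group: tuple(c) for group, c in counts.items()}

# Invented records; P625 is coordinate location, P2046 is area.
lakes = [
    {"id": "lake-1", "state": "Minnesota", "properties": {"P625", "P2046"}},
    {"id": "lake-2", "state": "Minnesota", "properties": {"P625"}},
    {"id": "lake-3", "state": "Michigan", "properties": {"P625", "P2046"}},
    {"id": "lake-4", "state": "Nevada", "properties": set()},
    {"id": "lake-5", "state": "Nevada", "properties": {"P625"}},
]

area_gap = gap_by_group(lakes, "state", "P2046")
for state, (missing, total) in sorted(area_gap.items()):
    print(f"{state}: {missing} of {total} lakes missing P2046")
```

The same grouping could be applied to lake size bands or subclasses instead of states. The point is only that a per-group count shows where a gap is concentrated, which a single overall figure hides.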

Lake subclasses: https://angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q23397&limit=5&mode=reverse

Cheers,

Jonathan Brier

PhD Student
University of Maryland College Park, College of Information Studies

Visiting Researcher

University of Gothenburg