Hoi, There is a lot of knowledge about quality in online databases. It is known that all of them have a certain error rate; this is as true for Wikidata as for any other source.
My question is: is there a way to track Wikidata quality improvements over time? I blogged about one approach [1]. It is, however, only an approach to improving quality, not an approach to determining quality and tracking its improvement.
The good news is that there are many dumps of Wikidata, so it is possible to compare the current Wikidata with how it was in the past.
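To make this concrete, here is a rough sketch of one such longitudinal signal: the share of statements that carry at least one reference, computed per dump. It assumes the standard one-entity-per-line JSON dump format; the file names are placeholders, not real dumps.

import bz2
import json

def referenced_share(dump_path, sample=10000):
    """Fraction of statements with at least one reference in a JSON dump."""
    total = referenced = 0
    with bz2.open(dump_path, "rt") as f:
        for i, line in enumerate(f):
            if i >= sample:  # look at a sample of entities, not the whole dump
                break
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):  # skip the array brackets
                continue
            entity = json.loads(line)
            for claims in entity.get("claims", {}).values():
                for claim in claims:
                    total += 1
                    referenced += bool(claim.get("references"))
    return referenced / total if total else 0.0

for path in ["wikidata-2014.json.bz2", "wikidata-2015.json.bz2"]:
    print(path, referenced_share(path))

Plotting a number like that per dump would give exactly the kind of trend line I am asking about.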
Would this be something that makes sense for Wikimedia research to get into, particularly in light of Wikidata becoming more easily available to Wikipedia? Thanks, GerardM
[1] http://ultimategerardm.blogspot.nl/2015/08/wikidata-quality-probability-and-...
Hi Gerard,
Your blog post got me thinking about designing a Wikidata fact-checking tool. The idea would be to rank the facts to be checked by a human using some combination of a fact-importance score and a fact-uncertainty score. Do you know of any work that has already been done in this space? Do you think such a tool would be used? What are the current systems for quality control in Wikidata?
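Concretely, I imagine something like the sketch below: order statements by importance times uncertainty, so the riskiest, most visible facts reach a human first. The scores and field names are purely illustrative, not an existing API.

from dataclasses import dataclass

@dataclass
class Fact:
    item: str          # e.g. "Q42"
    statement: str     # e.g. "P569 = 1952-03-11"
    importance: float  # e.g. normalized page views of articles using the fact
    uncertainty: float # e.g. disagreement with external sources, in [0, 1]

def review_queue(facts):
    """Order facts so that important, uncertain ones are checked first."""
    return sorted(facts, key=lambda f: f.importance * f.uncertainty, reverse=True)

facts = [
    Fact("Q42", "P569 = 1952-03-11", importance=0.9, uncertainty=0.1),
    Fact("Q1339", "P570 = 1750-07-28", importance=0.6, uncertainty=0.8),
]
for f in review_queue(facts):
    print(f.item, f.statement)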
As an aside, estimating fact uncertainty may reduce to estimating Wikidata quality as a whole.
Best,
Ellery
Hi Ellery and Gerard,
Interesting thread!
I don't know if this is of interest to you guys, but I have a recent paper where we explore a very simple approach to fact-checking against a knowledge base. We compute the shortest path between entities and convert the distance into a semantic proximity. It's very rudimentary (for one, it doesn't take the different relations into account), but it seems to work well on core topics (geography, US presidents, movies, etc.), and it even gives a signal when you ask about things that are not contained in the knowledge network itself:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0128193
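For a flavor of the approach, here is a minimal sketch using networkx. Note it is a simplification: the published measure also down-weights paths that run through high-degree hub nodes, which I omit here, and the toy graph is invented.

import networkx as nx

G = nx.Graph()
# Toy knowledge network: edges are (subject, object) pairs, relations ignored.
G.add_edges_from([
    ("Barack Obama", "United States"),
    ("United States", "Washington, D.C."),
    ("Barack Obama", "Honolulu"),
])

def semantic_proximity(graph, u, v):
    """Convert the shortest-path distance between u and v into a proximity in (0, 1]."""
    try:
        d = nx.shortest_path_length(graph, u, v)
    except nx.NetworkXNoPath:
        return 0.0  # no connection at all -> zero proximity
    return 1.0 / (1.0 + d)

print(semantic_proximity(G, "Barack Obama", "Washington, D.C."))  # 1/(1+2) = 0.33...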
In the paper we use data from DBpedia dumps, but adapting it to Wikidata would be straightforward. There is a lot of research on this problem, and it goes under several related names (link prediction, relation extraction, knowledge base construction, etc.). A recent overview of the literature can be found here:
http://arxiv.org/abs/1503.00759
The review is authored by, among others, some of the people working on Google's Knowledge Vault.
Cheers,
Giovanni
Giovanni Luca Ciampaglia
Center for Complex Networks and Systems Research School of Informatics and Computing, Indiana University
✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA ☞ http://www.glciampaglia.com/ ✆ +1 812 855-7261 ✉ gciampag@indiana.edu
Hoi, There is work being done on software that compares data from Wikidata with other, external sources; it is by someone connected to Wikidata. The details are not clear to me, but it is supposed to become available in 2015.
What I am looking for is a way to learn at what level the quality is. The problem we face is that Wikipedians are not convinced of the quality of Wikidata because there are no sources. Their observation is correct: there is hardly any credible source information. But that does not necessarily imply that the quality is worse than that of other sources or of Wikipedia itself; the two things are not really related. My blog post is an approach to quality. What it does is make it plausible that comparing data with the data in linked sources may help improve quality. It does not, however, provide an argument that is easy to digest: it does not rate quality in percentages, and it does not indicate in numbers how quality improves when this approach is taken. Those are the kinds of arguments that may convince Wikipedians that Wikidata is safe to use even without the sources they seek.
I am NOT saying that sources on statements are not good to have. What I am saying is that it is unlikely that the many millions of statements will have credible sources any time soon. Consequently, it is best to work on sourcing potentially problematic statements and to keep statistics on the statements flagged as problematic by comparisons with other sources. With numbers like these, we can encourage people to do the hard work by showing how much of a difference they make.
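To illustrate the kind of statistics I mean, a hypothetical sketch: compare Wikidata values against an external database and report a mismatch rate per property. The triples here are toy data, not real comparison results.

from collections import Counter

# (item, property) -> value triples from Wikidata and from an external database.
wikidata = {("Q64", "P1082"): "3644826", ("Q64", "P17"): "Q183"}
external = {("Q64", "P1082"): "3520031", ("Q64", "P17"): "Q183"}

checked, mismatched = Counter(), Counter()
for (item, prop), value in wikidata.items():
    if (item, prop) in external:  # only statements both sources cover
        checked[prop] += 1
        if external[(item, prop)] != value:
            mismatched[prop] += 1

for prop in checked:
    rate = mismatched[prop] / checked[prop]
    print(f"{prop}: {mismatched[prop]}/{checked[prop]} mismatches ({rate:.0%})")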
Finding such numbers is exactly what research is about. This is why I put this challenge to you: I am not a scientist, nor do I have the right skills. Thanks, GerardM
Hello,
If I consider a qualitative approach, it might be useful to talk to Wikipedians about what exactly their concerns are and what could be done to overcome them. A lot of Wikidata information does not really need a 'source', just as in Wikipedia itself.
I believe that a lot of resistance against Wikidata in Wikipedia communities stems from the fact that WP and WD are different wikis. Wikipedians feel uncomfortable on WD, where they are newbies.
Sometimes I also recognize a certain pride in one's own Wikipedia language version and a certain disdain for some other language versions. For example: "We on the great Wikipedia in language A are superb at referencing and vandalism fighting, while the stupid people of the Wikipedia in language B let all the rubbish in." And via WD, that "rubbish" enters the superb Wikipedia in language A; that is the fear.
(In my own theorizing, I am asking myself in which ways wikis can be connected with each other, in a technical way, in a social way.)
It would be good to have a look at those feelings in order to understand "community resistance" better. Feelings that are not totally irrational, by the way.
Kind regards Ziko
Am Mittwoch, 26. August 2015 schrieb Gerard Meijssen :
Hoi, There is work done on software that compares data from Wikidata and other external sources. This is by someone connected to Wikidata. The details are not clear to me. It is supposed to become available in 2015.
What I am looking for is a way to learn at what level the quality is. The problem we face is that Wikipedians are not convinced by the quality of Wikidata because there are no sources. Their observation is correct, there is hardly any credible source information but that does not necessarily imply that quality is worse than info at other sources or in Wikipedia itself. The two things are not really related. My blogpost is an approach to quality. What it does do is make it plausible that an approach where data is compared with data in linked sources may aid in improving quality.It does however not provide an argument that is easy to digest. It does not rate quality in percentages, it does not indicate in numbers how quality is improving when this approach is taken. They are the kind of arguments that may convince Wikipedians that Wikidata is safe to use even without the sources they seek.
I am NOT saying that sources on statements are not good to have, What I am saying is that it is unlikely for the many millions of statements to have credible sources any time soon. Consequently it is best to work on sourcing potential problematic statements and have statistics on problematic statements due to comparisons with other sources. With numbers like this, we encourage people to do the hard work by showing how much of a difference they make.
Finding such numbers is exactly what research is about. This is why I put this challenge to you as I am not a scientist nor do I have the right skills. Thanks, GerardM
On 25 August 2015 at 23:11, Ellery Wulczyn <ewulczyn@wikimedia.org javascript:_e(%7B%7D,'cvml','ewulczyn@wikimedia.org');> wrote:
Hi Gerard,
your blog post got me thinking about designing a Wikidata fact checking tool. The idea would be to rank facts to be checked by a human by some combination of a fact importance score and a fact uncertainty score. Do you know of any work that has already been done in this space? Do you think such a tool would be used? What are the current systems for quality control in Wikidata?
As an aside, estimating fact uncertainty may reduce to estimating Wikidata quality as a whole.
Best,
Ellery
On Mon, Aug 24, 2015 at 11:24 PM, Gerard Meijssen < gerard.meijssen@gmail.com javascript:_e(%7B%7D,'cvml','gerard.meijssen@gmail.com');> wrote:
Hoi, There is a lot of knowledge on quality in online databases. It is known that all of them have a certain error rate. This is true for Wikidata as much as any other source.
My question is: is there a way to track Wikidata quality improvements over time. One approach I blogged about [1]. It is however only an approach to improve quality not an approach to determine quality and track the improvement of quality.
The good news is that there are many dumps of Wikidata so it is possible to compare current Wikidata with how it was in the past.
Would this be something that makes sense to get into for Wikimedia research. particularly in the light of Wikidata becoming more easily available to Wikipedia? Thanks, GerardM
[1] http://ultimategerardm.blogspot.nl/2015/08/wikidata-quality-probability-and-...
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org javascript:_e(%7B%7D,'cvml','Wiki-research-l@lists.wikimedia.org'); https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
Wiki-research-l mailing list Wiki-research-l@lists.wikimedia.org javascript:_e(%7B%7D,'cvml','Wiki-research-l@lists.wikimedia.org'); https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
+1 to that! I think in many ways we are our own worst enemies when it comes to cross-project pollination of ideas.