Hey everyone :)
We'll be doing the next Wikidata office hour on September 23rd at 17:00 UTC. See http://www.timeanddate.com/worldclock/fixedtime.html?hour=17&min=00&... for your timezone. We'll be meeting in #wikimedia-office on Freenode IRC.
As usual I'll start with an overview of what's been happening around Wikidata since the last office hour and then we'll have time for questions and discussions.
If there is a particular topic you'd like to have on the agenda please let me know.
Cheers Lydia
Hi Lydia,
If you think it's interesting to have the {{#Invoke:OSM|.... }} module on the agenda, I can say a word or 2 about it, if you like.
It helps to find Openstreetmap objects, related in various ways to Wikipedia articles by means of the Wikidata Q-numbers.
tagging OSM objects with wikidata tags isn't very widespread at the moment, but I like to experiment with new possibilities as they come along.
Cheers,
Jo
2015-09-08 12:34 GMT+02:00 Lydia Pintscher lydia.pintscher@wikimedia.de:
Hey everyone :)
We'll be doing the next Wikidata office hour on September 23rd at 17:00 UTC. See http://www.timeanddate.com/worldclock/fixedtime.html?hour=17&min=00&... for your timezone. We'll be meeting in #wikimedia-office on Freenode IRC.
As usual I'll start with an overview of what's been happening around Wikidata since the last office hour and then we'll have time for questions and discussions.
If there is a particular topic you'd like to have on the agenda please let me know.
Cheers Lydia
-- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata
Wikimedia Deutschland e.V. Tempelhofer Ufer 23-24 10963 Berlin www.wikimedia.de
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On Tue, Sep 8, 2015 at 3:22 PM, Jo winfixit@gmail.com wrote:
Hi Lydia,
If you think it's interesting to have the {{#Invoke:OSM|.... }} module on the agenda, I can say a word or 2 about it, if you like.
It helps to find Openstreetmap objects, related in various ways to Wikipedia articles by means of the Wikidata Q-numbers.
tagging OSM objects with wikidata tags isn't very widespread at the moment, but I like to experiment with new possibilities as they come along.
Sounds cool. Please come to the office hour and talk about it :)
Cheers Lydia
On Tue, Sep 8, 2015 at 12:34 PM, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
Hey everyone :)
We'll be doing the next Wikidata office hour on September 23rd at 17:00 UTC. See http://www.timeanddate.com/worldclock/fixedtime.html?hour=17&min=00&... for your timezone. We'll be meeting in #wikimedia-office on Freenode IRC.
As usual I'll start with an overview of what's been happening around Wikidata since the last office hour and then we'll have time for questions and discussions.
If there is a particular topic you'd like to have on the agenda please let me know.
A quick reminder that this is happening in 2h and 45 minutes. See you there! :)
Cheers Lydia
On Tue, Sep 8, 2015 at 12:34 PM, Lydia Pintscher lydia.pintscher@wikimedia.de wrote:
Hey everyone :)
We'll be doing the next Wikidata office hour on September 23rd at 17:00 UTC. See http://www.timeanddate.com/worldclock/fixedtime.html?hour=17&min=00&... for your timezone. We'll be meeting in #wikimedia-office on Freenode IRC.
As usual I'll start with an overview of what's been happening around Wikidata since the last office hour and then we'll have time for questions and discussions.
If there is a particular topic you'd like to have on the agenda please let me know.
And here is the log for anyone who missed it yesterday and wants to catch up: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.201...
Cheers Lydia
On Thu, Sep 24, 2015 at 5:43 AM, Lydia Pintscher < lydia.pintscher@wikimedia.de> wrote:
And here is the log for anyone who missed it yesterday and wants to catch up: https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.201...
Thanks! Is there any more information on the issue with MusicBrainz?
17:26:27 <DanielK_WMDE> sjoerddebruin: yes, we went for MusicBrainz first, but it turned out to be impractical. you basically have to run their software in order to use their dumps
MusicBrainz was a major source of information for Freebase, so they appear to have been able to figure out how to parse the dumps (and they already have the MusicBrainz & Wikipedia IDs correlated).
Is there more detail, perhaps in a bug somewhere?
Tom
On Thu, Sep 24, 2015 at 7:54 PM, Tom Morris tfmorris@gmail.com wrote:
Thanks! Is there any more information on the issue with MusicBrainz?
17:26:27 <DanielK_WMDE> sjoerddebruin: yes, we went for MusicBrainz first, but it turned out to be impractical. you basically have to run their software in order to use their dumps
MusicBrainz was a major source of information for Freebase, so they appear to have been able to figure out how to parse the dumps (and they already have the MusicBrainz & Wikipedia IDs correlated).
Is there more detail, perhaps in a bug somewhere?
The issue is that they do offer dumps but you need to set up your own musicbrainz server to really use it. This was too time-intensive and complicated for the students to make progress on during their project. Because of this they decided to instead opt for another dataset instead to get started. In the future Musicbrainz should still get done. If anyone wants to work on adding more datasets to the tool please let me know.
Cheers Lydia
On 09/24/2015 10:59 AM, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 7:54 PM, Tom Morris tfmorris@gmail.com wrote:
Thanks! Is there any more information on the issue with MusicBrainz?
17:26:27 <DanielK_WMDE> sjoerddebruin: yes, we went for MusicBrainz first, but it turned out to be impractical. you basically have to run their software in order to use their dumps
MusicBrainz was a major source of information for Freebase, so they appear to have been able to figure out how to parse the dumps (and they already have the MusicBrainz & Wikipedia IDs correlated).
Is there more detail, perhaps in a bug somewhere?
The issue is that they do offer dumps but you need to set up your own musicbrainz server to really use it. This was too time-intensive and complicated for the students to make progress on during their project. Because of this they decided to instead opt for another dataset instead to get started. In the future Musicbrainz should still get done. If anyone wants to work on adding more datasets to the tool please let me know.
Cheers Lydia
This is to add MusicBrainz to the primary source tool, not anything else?
peter
On Thu, Sep 24, 2015 at 2:18 PM, Peter F. Patel-Schneider < pfpschneider@gmail.com> wrote:
On 09/24/2015 10:59 AM, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 7:54 PM, Tom Morris tfmorris@gmail.com wrote:
Thanks! Is there any more information on the issue with MusicBrainz?
17:26:27 <DanielK_WMDE> sjoerddebruin: yes, we went for MusicBrainz
first,
but it turned out to be impractical. you basically have to run their software in order to use their dumps
MusicBrainz was a major source of information for Freebase, so they
appear
to have been able to figure out how to parse the dumps (and they already have the MusicBrainz & Wikipedia IDs correlated).
Is there more detail, perhaps in a bug somewhere?
The issue is that they do offer dumps but you need to set up your own musicbrainz server to really use it. This was too time-intensive and complicated for the students to make progress on during their project. Because of this they decided to instead opt for another dataset instead to get started. In the future Musicbrainz should still get done. If anyone wants to work on adding more datasets to the tool please let me know.
Cheers Lydia
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :)17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
Tom
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com wrote:
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
Are there any stats/progress graphs as to how the actual import is in fact going?
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com wrote:
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
On 24.09.2015 23:48, James Heald wrote:
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
Are there any stats/progress graphs as to how the actual import is in fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com wrote:
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hoi,
From experience with Wikidata data I often find that data is wrong or of
questionable quality. Duh. Wikidata and Freebase are of a similar quality. What we know is that there is some data Freebase has we are not interested in.. For instance things that double the data by having it in two places.
We know that data can be easily improved by comparing it with other sources. This process is available for Wikidata but not for the primary sources tool as far as I am aware.
The problem with the primary sources tool is that it does not lead to imports in Wikidata and therefore it is one big miserable failure. To compound this issue, it is an article of faith that we "need" it and it is therefore not a subject that is talked about. Importing it one statement at a time is an absolute waste of time. It makes the user experience horrible. Thanks, GerardM
On 25 September 2015 at 01:02, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 24.09.2015 23:48, James Heald wrote:
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
Are there any stats/progress graphs as to how the actual import is in fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com wrote:
This is to add MusicBrainz to the primary source tool, not anything
else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool.
On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 24.09.2015 23:48, James Heald wrote:
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
Are there any stats/progress graphs as to how the actual import is in fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com wrote:
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to
a
statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is
MusicBrainz
considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted.
+1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well.
Thad +ThadGuidry https://www.google.com/+ThadGuidry
On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas jasondouglas@google.com wrote:
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool.
On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 24.09.2015 23:48, James Heald wrote:
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
Are there any stats/progress graphs as to how the actual import is in fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com
wrote:
This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little
warning
icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next
to a
statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is
MusicBrainz
considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry thadguidry@gmail.com wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted.
+1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well.
Thad +ThadGuidry https://www.google.com/+ThadGuidry
On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas jasondouglas@google.com wrote:
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool.
On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
On 24.09.2015 23:48, James Heald wrote:
Has anybody actually done an assessment on Freebase and its
reliability?
Is it *really* too unreliable to import wholesale?
From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
Are there any stats/progress graphs as to how the actual import is in fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus
-- James.
On 24/09/2015 19:35, Lydia Pintscher wrote:
On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris tfmorris@gmail.com
wrote:
> This is to add MusicBrainz to the primary source tool, not anything > else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little
warning
icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next
to a
statement in case it is problematic 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35 <sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is
MusicBrainz
considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
The primary sources tool and the extension that helps us check against other databases are two independent things. Imports from Musicbrainz have been happening since a very long time already.
Cheers Lydia
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems). +1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
First ... it looks like you REALLY need my help to finish the Freebase mapping ? Hardly anything looks done...and I have the time and knowledge to fill it all in completely... https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Class_mapping
Markus, do you want me to start on that ? Probably take me this week to fill it out.
Thad +ThadGuidry https://www.google.com/+ThadGuidry
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have. As I care about "polders" I will spend time on it BECAUSE I care and am invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is much more effective for me to add new data than work on data that is already in the primary sources tool. Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On Sep 28, 2015 20:03, "Gerard Meijssen" gerard.meijssen@gmail.com wrote:
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality
and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in
Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have.
As i have said before the extension to check against third party databases is being worked on. This is not an argument against the primary sources tool. It is simply something very different.
As I care about "polders" I will spend time on it BECAUSE I care and am
invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I
will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the
primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be
about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is
much more effective for me to add new data than work on data that is already in the primary sources tool.
Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch <
markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase
import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough
community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young
community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of
errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I
cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe
there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The
Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work
through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in
Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable
data they might take from Freebase.
Freebase is a much better resource than many other data resources we are
already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates
you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are
> being approved (and stay in Wikidata after a while), and
whether there
> is a pattern (100% approval on some type of fact that could
then be
> merged more quickly; or very low approval on something else
that would
> maybe better revisited for mapping errors or other systematic
problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Markus, Lydia...
It looks like TPT had another page where the WD Properties were being mapped to Freebase here: https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping
Do you need help in filling that out more ?
Thad +ThadGuidry https://www.google.com/+ThadGuidry
Hi Thad,
thanks for your support. I think this can be really useful. Now just to clarify: I am not developing or maintaining the Primary Sources tool, I just want to see more Freebase data being migrated :-) I think making the mapping more complete is clearly necessary and valuable, but maybe someone with more insights into the current progress on that level can make a more insightful comment.
Markus
On 28.09.2015 20:44, Thad Guidry wrote:
Markus, Lydia...
It looks like TPT had another page where the WD Properties were being mapped to Freebase here: https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping
Do you need help in filling that out more ?
Thad +ThadGuidry https://www.google.com/+ThadGuidry
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
I think more fundamentally there is the issue that Wikidata doesn't serve end users well because the end users are not paying for it. (Contrast an NGO that would be doing things for people in Africa without asking the people what they want as opposed to a commercial operation that is going to fly or die based on the ability to serve identified needs of Africans.)
I am by no means a market fundamentalist but when you look at Amazon.com, you see there is a virtuous circle where small incremental improvements that make the store better put money on the bottom line, linking career advancement to customer success, etc. Over time the incremental changes snowball. (Alternatively we could have exponential convergence instead of expansion)
I was looking around for API management solutions, and they all address things like "creating stubs for the end user", "increasing developer engagement", "converting XML to JSON and vice versa" and the always dubious idea that adding a proxy server of some kind on the public internet would help you meet an SLA. None of them support the minimal viable product function of 'charging people to use the API' at a basic level, although if you talk to the sales people maybe they will help you with a "monetization engine" (who knows if it puts ads in the results) but you will pay at least as much a month for this feature as the Silk Road spent on software development (unfortunately earning it back in the form of marked bitcoins)
And the API management sites are dealing with big name companies like Target and Clorox, all of these companies that are avaricious and smart about money are not charging people for APIs.
If you are not the customer, you are the product.
"End user" is a fuzzy word though because that Dutch guy who is interested in Polders is not the ordinary end user, although you practically need to bring people like that into things like Wikidata because you need their curation. Another tough problem is that we all have our specialties, so one person really needs a good database of wine regions, another one ski areas, another one cares about books and another couldn't care less about books but is into video games. (The person who wants to contribute or pay for improvements for area Z does not care about area Y)
Freebase was not particularly successful at getting unpaid help to improve their database because of these fundamental economics; you might make the case that friction in the form of "this data format is different from everything else" or "the UI sux" or "the rest of the world hasn't caught up with us on tooling" is the main problem, but people would overcome those problems if the motivation existed.
Anyhow, there is this funny little thing that the gap between "5 cents" and free is bigger than the gap between "5 cents" and $1000, so you have the Bloombergs and Elseviers of the world charging $1000 for what somebody could provide for much less. This problem exists for the human readable web and so far advertising has been the answer, but it has not been solved for open data.
On Mon, Sep 28, 2015 at 2:01 PM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have. As I care about "polders" I will spend time on it BECAUSE I care and am invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is much more effective for me to add new data than work on data that is already in the primary sources tool. Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On 9/28/15 2:36 PM, Paul Houle wrote:
Anyhow, there is this funny little thing that the gap between "5 cents" and free is bigger than the gap between "5 cents" and $1000, so you have the Bloombergs and Elseviers of the world charging $1000 for what somebody could provide for much less. This problem exists for the human readable web and so far advertising has been the answer, but it has not been solved for open data.
There is a solution for Open Data, trouble is that attention is increasingly mercurial.
You need Identity [1], Tickets [2], and ACLs [3].
All doable using existing Web Architecture.
Links:
[1] http://linkeddata.uriburner.com/c/9DV22GPS -- About Tickets [2] http://linkeddata.uriburner.com/c/9G36GVL -- About WebID [3] http://linkeddata.uriburner.com/c/9DFX6GKO -- Attribute-Based Access Controls (ABAC)
Gerard,
Why do you spend so much energy on criticising the work of other volunteers and companies that want to help Wikidata? Switching off Primary Sources would not achieve any progress towards what you want. I have made some proposals in my email on what else could be done to speed things up. You could work on realising some of these ideas, you could propose other activities to the community, or you could just help elsewhere on Wikidata. Focussing on a tool you don't like and don't want to use will not make you (or the rest of us) happy.
Markus
On 28.09.2015 20:01, Gerard Meijssen wrote:
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have. As I care about "polders" I will spend time on it BECAUSE I care and am invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is much more effective for me to add new data than work on data that is already in the primary sources tool. Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all, The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future. The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned. Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time. Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome. What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas: (1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data. (2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI? (3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements. (4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea. (5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase. Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution. Cheers, Markus [1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man On 26.09.2015 18:31, Gerard Meijssen wrote: Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included. Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well. I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable. We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com <mailto:thadguidry@gmail.com> <mailto:thadguidry@gmail.com <mailto:thadguidry@gmail.com>>> wrote: Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com> <mailto:jasondouglas@google.com <mailto:jasondouglas@google.com>>> wrote: > It would indeed be interesting to see which percentage of proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems). +1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org> <mailto:markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com> <mailto:tfmorris@gmail.com <mailto:tfmorris@gmail.com>>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi!
Thank you Thad for your support!
First some pieces of news about the current progress:
The work on Primary Sources and the Freebase mapping is currently on hold since the last day of my Google internship (in late August). We have already a lot (13.7M) statements in the Primary Sources tool and I think that we should maybe try to make Wikidata adopt them before creating some other ones.
Some answer:
First ... it looks like you REALLY need my help to finish the Freebase mapping ? Hardly anything looks done...and I have the time and knowledge to fill it all in completely... https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Class_mapping
This page is an attempt to map Freebase types to Wikidata classes. But it seems to me that it won't lead to any big addition of new good statements: the class hierarchy of Wikidata is very different from the Freebase type hierarchy making the mapping difficult. I have already done something for people by creating a file with the Qids of Wikidata items mapped to a /people/person but without P31 Q5. Something like an half of these were not, in fact, items about a person (it's a wet finger estimation) so I decided not to add these data into Primary Sources. But I have given this file to Magnus who has imported them into his "person" game (thank you Magnus :-)).
It looks like TPT had another page where the WD Properties were being mapped to Freebase here: https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping Do you need help in filling that out more ?
I believe that the top properties are now mapped (we have 360 properties mapped). For example, if I take the dataset of facts tagged as reviewed in the dump [1] that have as subject a mapped topic, I am able to map 92% of them to Wikidata claims. So, if you have time to improve the mapping it would be a very nice task but I don't think it'll be the most rewarding. I believe that a task to improve the mapping between Freebase topics and Wikidata item will lead to far more additions (the mapping used to create the current content of the Primary Sources tool has only 4.56M connections).
This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
Thank you very much for all these ideas. I am currently working on these two sides in order to move forward the importation of the already mapped statements:
1. Import some "good" datasets using my bot. I have already done it for the "simple" facts about humans (birth date, birth place...) that are tagged as reviewed in the Freebase dump [1]. I have created a wiki page to coordinate this work: https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Good_datasets
2. Optimize the Primary Sources tool in order to make it more usable. I have done some work in order to decrease the load time and my aim is now to try to avoid the unneeded page reloads.
Cheers,
Thomas
[1] See http://www.freebase.com/freebase/valuenotation/is_reviewed
Le 28 sept. 2015 à 21:36, Markus Krötzsch markus@semantic-mediawiki.org a écrit :
Gerard,
Why do you spend so much energy on criticising the work of other volunteers and companies that want to help Wikidata? Switching off Primary Sources would not achieve any progress towards what you want. I have made some proposals in my email on what else could be done to speed things up. You could work on realising some of these ideas, you could propose other activities to the community, or you could just help elsewhere on Wikidata. Focussing on a tool you don't like and don't want to use will not make you (or the rest of us) happy.
Markus
On 28.09.2015 20:01, Gerard Meijssen wrote:
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have. As I care about "polders" I will spend time on it BECAUSE I care and am invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is much more effective for me to add new data than work on data that is already in the primary sources tool. Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included. Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well. I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable. We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com <mailto:thadguidry@gmail.com> <mailto:thadguidry@gmail.com <mailto:thadguidry@gmail.com>>> wrote: Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com> <mailto:jasondouglas@google.com <mailto:jasondouglas@google.com>>> wrote: > It would indeed be interesting to see which percentage of proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems). +1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org> <mailto:markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com> <mailto:tfmorris@gmail.com <mailto:tfmorris@gmail.com>>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hoi, So far no argument is given why the primary sources tool WOULD work. People will be interested in curating Wikidata. People will not be interested in checking the primary sources tool one item / statement at a time. It is a numbers game; there is too much to do in this way.
I know that comparing Wikidata against other sources is a different tool. It however provides a sane way of working on data. When used in an iterative way it provides a clean process to integrate this data effectively. It has our communitity involved and concentrated on things where human effort makes a difference.
I have asked time and again to provide arguments why the primary sources tool would work. Arguably it does not function at all and the statistics prove this. I have argued for a different approach and as there are no arguments there is silence. I do not want to rubbish the work of others but given that it is the only route, the official route to import data from other sources at some stage there is no alternative.
At some stage a tool like the primary sources tool becomes a liability. In my mind it certainly is when we have the announced tool for comparing data against sources. When the only argument for the primary sources tool is the effort people / companies put in there, it is a pitiful argument. Pity for the people involved but that is it. Thanks, GerardM
On 28 September 2015 at 21:36, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Gerard,
Why do you spend so much energy on criticising the work of other volunteers and companies that want to help Wikidata? Switching off Primary Sources would not achieve any progress towards what you want. I have made some proposals in my email on what else could be done to speed things up. You could work on realising some of these ideas, you could propose other activities to the community, or you could just help elsewhere on Wikidata. Focussing on a tool you don't like and don't want to use will not make you (or the rest of us) happy.
Markus
On 28.09.2015 20:01, Gerard Meijssen wrote:
Hoi,
Sorry I disagree with your analysis. The fundamental issue is not quality and it is not the size of our community. The issue is that we have our priorities wrong. As far as I am concerned the "primary sources tool" is a wrong approach for a dataset like Freebase or DBpedia.
What we should concentrate on is find likely issues that exist in Wikidata. Make people aware of them and have a proper workflow that will point people to the things they care about. When I care about "polders" show me content where another source disagrees with what we have. As I care about "polders" I will spend time on it BECAUSE I care and am invited to resolve issues. I will be challenged because every item I touch has an issue. I do not mind to do this when the data in Wikidata differs from DBpedia, Freebase or whatever.. My time is well spend. THAT is why I will be challenged, that is why I will be willing to work on this.
I will not do this for new data in the primary sources tool. At most I will give it a glance and accept it. I would only do this where data in the primary sources tool differs. That however is exactly the same scenario that I just described.
I am not willing to look at data in Wikidata Freebase or DBpedia in the primary sources tool one item/statement at a time; we know that they are of a similar quality as Wikidata. The percentages make it a waste of time. With iterative comparisons of other sources we will find the booboos easy enough. We will spend the time of our communities effectively and we will increase quality and quality and community.
The approach of the primary sources tool is wrong. It should only be about linking data and define how this is done.
The problem is indeed with the community. Its time is wasted and it is much more effective for me to add new data than work on data that is already in the primary sources tool. Thanks, GerardM
On 28 September 2015 at 16:52, Markus Krötzsch <markus@semantic-mediawiki.org mailto:markus@semantic-mediawiki.org>
wrote:
Hi Gerard, hi all, The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future. The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned. Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time. Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome. What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas: (1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data. (2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI? (3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements. (4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea. (5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase. Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution. Cheers, Markus [1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man On 26.09.2015 18:31, Gerard Meijssen wrote: Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of
the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we
have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly
deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com <mailto:thadguidry@gmail.com> <mailto:thadguidry@gmail.com <mailto:thadguidry@gmail.com>>>
wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com> <mailto:jasondouglas@google.com <mailto:jasondouglas@google.com>>> wrote: > It would indeed be interesting to see which percentage of proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems). +1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the
primary sources tool.
On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org> <mailto:markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure,
which apparently was just wrong).
> > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com> <mailto:tfmorris@gmail.com <mailto:tfmorris@gmail.com>>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more
people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org
<mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> <mailto:Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Another; make a kind of worklist on Wikidata that reflect the watchlist on the clients (Wikipedias) but then, we often have items on our watchlist that we don't know much about. (Digression: Somehow we should be able to sort out those things we know (the place we live, the persons we have meet) from those things we have done (edited, copy-pasted).)
I been trying to get some interest in the past for worklists on Wikipedia, it isn't much interest to make them. It would speed up tedious tasks of finding the next page to edit after a given edit is completed. It is the same problem with imports from Freebase on Wikidata, locate the next item on Wikidata with the same queued statement from Freebase, but within some worklist that the user has some knowledge about.
Imagine "municipalities within a county" or "municipalities that is also on the users watchlist", and combine that with available unhandled Freebase-statements.
On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad jeblad@gmail.com wrote:
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi Gerard,
given the statistics you cite from
https://tools.wmflabs.org/wikidata-primary-sources/status.html
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
Given that there are 12.4M statements in the tool, this means that about 2.5M statements will turn out to be unsuitable for inclusion in Wikidata (if the current ratio holds). Are you suggesting to upload all of these statements to Wikidata?
Tpt already did upload pieces of the data which have sufficient quality outside the primary sources tool, and more is planned. But for the data where the suitability for Wikidata seems questionable, I would not know what other approach to use. Do you have a suggestion?
Once you have a suggestion and there is community consensus in doing it, no one will stand in the way of implementing that suggestion.
Cheers, Denny
On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad jeblad@gmail.com wrote:
Another; make a kind of worklist on Wikidata that reflect the watchlist on the clients (Wikipedias) but then, we often have items on our watchlist that we don't know much about. (Digression: Somehow we should be able to sort out those things we know (the place we live, the persons we have meet) from those things we have done (edited, copy-pasted).)
I been trying to get some interest in the past for worklists on Wikipedia, it isn't much interest to make them. It would speed up tedious tasks of finding the next page to edit after a given edit is completed. It is the same problem with imports from Freebase on Wikidata, locate the next item on Wikidata with the same queued statement from Freebase, but within some worklist that the user has some knowledge about.
Imagine "municipalities within a county" or "municipalities that is also on the users watchlist", and combine that with available unhandled Freebase-statements.
On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad jeblad@gmail.com wrote:
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> > https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Actually, my suggestion would be to switch on Primary Sources as a default tool for everyone. That should increase exposure and turnover, without compromising quality of data.
On Mon, Sep 28, 2015 at 2:23 PM Denny Vrandečić vrandecic@google.com wrote:
Hi Gerard,
given the statistics you cite from
https://tools.wmflabs.org/wikidata-primary-sources/status.html
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
Given that there are 12.4M statements in the tool, this means that about 2.5M statements will turn out to be unsuitable for inclusion in Wikidata (if the current ratio holds). Are you suggesting to upload all of these statements to Wikidata?
Tpt already did upload pieces of the data which have sufficient quality outside the primary sources tool, and more is planned. But for the data where the suitability for Wikidata seems questionable, I would not know what other approach to use. Do you have a suggestion?
Once you have a suggestion and there is community consensus in doing it, no one will stand in the way of implementing that suggestion.
Cheers, Denny
On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad jeblad@gmail.com wrote:
Another; make a kind of worklist on Wikidata that reflect the watchlist on the clients (Wikipedias) but then, we often have items on our watchlist that we don't know much about. (Digression: Somehow we should be able to sort out those things we know (the place we live, the persons we have meet) from those things we have done (edited, copy-pasted).)
I been trying to get some interest in the past for worklists on Wikipedia, it isn't much interest to make them. It would speed up tedious tasks of finding the next page to edit after a given edit is completed. It is the same problem with imports from Freebase on Wikidata, locate the next item on Wikidata with the same queued statement from Freebase, but within some worklist that the user has some knowledge about.
Imagine "municipalities within a county" or "municipalities that is also on the users watchlist", and combine that with available unhandled Freebase-statements.
On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad jeblad@gmail.com wrote:
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's
data submissions should probably not be trusted.
+1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their
entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well.
Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable,
but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested
to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
> > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source
tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Yes! +1
On Mon, Sep 28, 2015 at 11:27 PM, Denny Vrandečić vrandecic@gmail.com wrote:
Actually, my suggestion would be to switch on Primary Sources as a default tool for everyone. That should increase exposure and turnover, without compromising quality of data.
On Mon, Sep 28, 2015 at 2:23 PM Denny Vrandečić vrandecic@google.com wrote:
Hi Gerard,
given the statistics you cite from
https://tools.wmflabs.org/wikidata-primary-sources/status.html
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
Given that there are 12.4M statements in the tool, this means that about 2.5M statements will turn out to be unsuitable for inclusion in Wikidata (if the current ratio holds). Are you suggesting to upload all of these statements to Wikidata?
Tpt already did upload pieces of the data which have sufficient quality outside the primary sources tool, and more is planned. But for the data where the suitability for Wikidata seems questionable, I would not know what other approach to use. Do you have a suggestion?
Once you have a suggestion and there is community consensus in doing it, no one will stand in the way of implementing that suggestion.
Cheers, Denny
On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad jeblad@gmail.com wrote:
Another; make a kind of worklist on Wikidata that reflect the watchlist on the clients (Wikipedias) but then, we often have items on our watchlist that we don't know much about. (Digression: Somehow we should be able to sort out those things we know (the place we live, the persons we have meet) from those things we have done (edited, copy-pasted).)
I been trying to get some interest in the past for worklists on Wikipedia, it isn't much interest to make them. It would speed up tedious tasks of finding the next page to edit after a given edit is completed. It is the same problem with imports from Freebase on Wikidata, locate the next item on Wikidata with the same queued statement from Freebase, but within some worklist that the user has some knowledge about.
Imagine "municipalities within a county" or "municipalities that is also on the users watchlist", and combine that with available unhandled Freebase-statements.
On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad jeblad@gmail.com wrote:
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work....
some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted.
+1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their
entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well.
Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were
much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool.
On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable,
but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested
to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
> > Are there any stats/progress graphs as to how the
actual import is in > fact going?
It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that
could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems).
Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source
tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Denny Vrandečić, 28/09/2015 23:27:
Actually, my suggestion would be to switch on Primary Sources as a default tool for everyone.
Yes, it's a desirable aim to have one-click suggested actions (à la Wikidata game) embedded into items for everyone. As for this tool, unrelatedly from the data used, at least slowness and misleading messaging need to be fixed first: https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool
(Compare: we already have very easy "remove" buttons on all statements on all items. So the interface for large-scale easy correction of mistakes is already there, while for *insertion* it's still missing. Which is also the gist of Gerard's argument, I believe. I agree with Lydia we can eventually do both, of course.)
Nemo
Tpt did take a few datasets that have a high-enough quality from the Freebase dataset and uploaded it directly. These numbers do not appear in the Primary Sources tool, because they were uploaded directly - each set going through the normal community process.
The Primary Sources Tool is left with the datasets where we were not able to establish a high enough threshold of quality. For any dataset where this quality can be demonstrated to the community, I assume they will agree with a direct upload.
I am not sure what else to do here.
I am very thankful to Nemo for his rephrasing of the discussion and to pull it to a constructive and actionable level.
Gerard, regarding your arguments:
- why would someone work on data in the primary sources tool when it is more effective to add data directly
Can you explain what you mean with "add data directly". I am really not sure what you mean with this argument. Are you suggesting to upload the whole dataset without further review?
- why is data that is over 90% good denied access to Wikidata (ie as good as Wikidata itself)
But it is not over 90% good! We have a rejection rate of almost 20%. Also, 10% errors means more than 1 Million errors. I yet need to see consensus to upload this.
- how do you justify the pst when so little data was included in Wikidata
The tool has been used to add thousands of statements and references to Wikidata, and that by a rather small set of people (because you need to intentionally install it). I would think that if we switch it on per default, the throughput should grow considerably. Nemo identified a few issues for that, and it would be good if we would work on these. Everyone is invited to help out with that.
- why not have Kian learn from the data set of Freebase and Wikidata and have smart suggestions
Kian is free to learn from the datasets. The data of Freebase has been available for years, and Kian would by far not be the first ML tool to use it for training purposes. If there is anything hindering Kian to use the Freebase data, let me know, I will try to fix it.
- why waste people's time adding one item/statement at a time when you can focus on the statements that are in doubt (either in Freebase or in Wikidata
Because we don't know which ones are which. If you could tell me which of the 12 Million statements are good and which ones are not, and if there is consensus about that assessment, I'd be happy to upload them.
I hope that this answers your arguments.
Again, I do not understand what your proposal is. I am going through the process to release the data in an easy to use way. If the community agrees with that, it can then be directly imported to Wikidata - I certainly won't stop anyone from doing so and never had.
My feeling is that you are frustrated by what you perceive as slow progress. You keep yelling at people that their ideas and work are not good. I remember how much you attacked me about Wikidata and all the things I have been doing wrong about it. Gerard, if you think you are motivating me with your constant attacks, I have to tell you, you are not. I am not speaking for anyone else, but I am getting tired of this. I appreciate a critical voice, but not in the tone you are often delivering it.
So, instead of telling everyone how we are supposed to spend our volunteer time in order to get things done better, and how we are doing things wrong, why don't you lead by example, and do it right? All the data, all the tools, for anything you want to get done are available to you for free. It is a pretty amazing world - all you need is at click away. So go ahead and do what you want to get done.
On Tue, Sep 29, 2015 at 1:07 AM Federico Leva (Nemo) nemowiki@gmail.com wrote:
Denny Vrandečić, 28/09/2015 23:27:
Actually, my suggestion would be to switch on Primary Sources as a default tool for everyone.
Yes, it's a desirable aim to have one-click suggested actions (à la Wikidata game) embedded into items for everyone. As for this tool, unrelatedly from the data used, at least slowness and misleading messaging need to be fixed first: https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool
(Compare: we already have very easy "remove" buttons on all statements on all items. So the interface for large-scale easy correction of mistakes is already there, while for *insertion* it's still missing. Which is also the gist of Gerard's argument, I believe. I agree with Lydia we can eventually do both, of course.)
Nemo
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hi all,
Note: as far as I can tell, the stats available at https://tools.wmflabs.org/wikidata-primary-sources/status.html so far do not differentiate between "fact wrong" (as in "Barack Obama is president of Croatia" [fact wrong]) and "source wrong" ("Barack Obama is president of the United States", "according to http://www.theonion.com/" [fact correct, source wrong]). From anecdotal evidence: most rejected facts were rejected due to shady sources… A big problem is also Citogenesis (https://xkcd.com/978/).
Cheers, Tom
Thomas Steiner, 28/09/2015 23:32:
Note: as far as I can tell, the stats available at https://tools.wmflabs.org/wikidata-primary-sources/status.html so far do not differentiate between "fact wrong" (as in "Barack Obama is president of Croatia" [fact wrong]) and "source wrong" ("Barack Obama is president of the United States", "according to http://www.theonion.com/" [fact correct, source wrong]).
Indeed. I only briefly tested "primary sources" because it's frustratingly slow, but the statements I rejected were not wrong, just ugly: for instance redundant references where we already had some. I'd dare calling them formatting issues, which a bot can certainly filter. But maybe I was lucky!
Nemo
Hi!
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
From my (limited) experience with Primary Sources, there are several
kinds of things there that I had rejected:
- Unsourced statements that contradict what is written in Wikidata - Duplicate claims already existing in Wikidata - Duplicate claims with worse data (i.e. less accurate location, less specific categorization, etc) or unnecessary qualifiers (such as adding information which is already contained in the item to item's qualifiers - e.g. zip code for a building) - Source references that do not exist (404, etc.) - Source references that do exist but either duplicate existing one (a number of sources just refer to different URL of the same data) or do not contain the information they should (e.g. link to newspaper's homepage instead of specific article) - Claims that are almost obviously invalid (e.g. "United Kingdom" as a genre of a play)
I think at least some of these - esp. references that do not exist and duplicates with no refs - could be removed automatically, thus raising the relative quality of the remaining items.
OTOH, some of the entries can be made self-evident - i.e. if we talk about movie and Freebase has IMDB ID or Netflix ID, it may be quite easy to check if that ID is valid and refers to a movie by the same name, which should be enough to merge it.
Not sure if those one-off things worth bothering with, just putting it out there to consider.
I would like to add old URLs that seems to be a source but does not reference anything in the claim. For example in an item about a person, the name or the birth date of the person does not appear on the page still the page is used as a source for the persons birth date.
On Mon, Sep 28, 2015 at 11:44 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
From my (limited) experience with Primary Sources, there are several kinds of things there that I had rejected:
- Unsourced statements that contradict what is written in Wikidata
- Duplicate claims already existing in Wikidata
- Duplicate claims with worse data (i.e. less accurate location, less
specific categorization, etc) or unnecessary qualifiers (such as adding information which is already contained in the item to item's qualifiers
- e.g. zip code for a building)
- Source references that do not exist (404, etc.)
- Source references that do exist but either duplicate existing one (a
number of sources just refer to different URL of the same data) or do not contain the information they should (e.g. link to newspaper's homepage instead of specific article)
- Claims that are almost obviously invalid (e.g. "United Kingdom" as a
genre of a play)
I think at least some of these - esp. references that do not exist and duplicates with no refs - could be removed automatically, thus raising the relative quality of the remaining items.
OTOH, some of the entries can be made self-evident - i.e. if we talk about movie and Freebase has IMDB ID or Netflix ID, it may be quite easy to check if that ID is valid and refers to a movie by the same name, which should be enough to merge it.
Not sure if those one-off things worth bothering with, just putting it out there to consider.
-- Stas Malyshev smalyshev@wikimedia.org
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
If we want more domain-specific wikidata curators we absolutely have to improve the flow of: (1) viewing an article on Wikipedia (2) discovering the associated item on wikidata (3) making useful contributions to the item and the items surrounding it in the graph
That little link on the side of every article in Wikipedia is literally invaluable... and is the main thing that distinguishes wikidata from freebase (IMHO). The (large) technical differences pale in comparison. I know that people are already working on that flow, but I think its worth emphasizing here as we consider the requirements for scaling up community as we scale up data.
2 cents.. -Ben
On Mon, Sep 28, 2015 at 4:12 PM, John Erling Blad jeblad@gmail.com wrote:
I would like to add old URLs that seems to be a source but does not reference anything in the claim. For example in an item about a person, the name or the birth date of the person does not appear on the page still the page is used as a source for the persons birth date.
On Mon, Sep 28, 2015 at 11:44 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Hi!
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
From my (limited) experience with Primary Sources, there are several kinds of things there that I had rejected:
- Unsourced statements that contradict what is written in Wikidata
- Duplicate claims already existing in Wikidata
- Duplicate claims with worse data (i.e. less accurate location, less
specific categorization, etc) or unnecessary qualifiers (such as adding information which is already contained in the item to item's qualifiers
- e.g. zip code for a building)
- Source references that do not exist (404, etc.)
- Source references that do exist but either duplicate existing one (a
number of sources just refer to different URL of the same data) or do not contain the information they should (e.g. link to newspaper's homepage instead of specific article)
- Claims that are almost obviously invalid (e.g. "United Kingdom" as a
genre of a play)
I think at least some of these - esp. references that do not exist and duplicates with no refs - could be removed automatically, thus raising the relative quality of the remaining items.
OTOH, some of the entries can be made self-evident - i.e. if we talk about movie and Freebase has IMDB ID or Netflix ID, it may be quite easy to check if that ID is valid and refers to a movie by the same name, which should be enough to merge it.
Not sure if those one-off things worth bothering with, just putting it out there to consider.
-- Stas Malyshev smalyshev@wikimedia.org
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Hoi, I have seen the statistics. The quality of Freebase cannot be understood by simply looking at the problems. People have been looking for problems and been identifying them. As a consequence more data ended up in the error bucket than in the good bucket. I have for instance added a lot of statements as "wrong" because they were exactly the same as the value already present. Consequently the error rate is not representative.
Denny, I have a suggestion. It is backed by math, it is backed by how people think. All the arguments are on my side. I have not heard your arguments and the "primary sources tool" was announced as a good thing and the community never agreed to having it. So leave the community out of it and focus on arguments.
- why would someone work on data in the primary sources tool when it is more effective to add data directly - why is data that is over 90% good denied access to Wikidata (ie as good as Wikidata itself) - how do you justify the pst when so little data was included in Wikidata - why not have Kian learn from the data set of Freebase and Wikidata and have smart suggestions - why waste people's time adding one item/statement at a time when you can focus on the statements that are in doubt (either in Freebase or in Wikidata
The notion of having all new data go through the primary sources tool will see me leave the project when this is realised. I will feel that my time and intelligence is wasted.
Thanks,
GerardM
On 28 September 2015 at 22:54, Denny Vrandečić vrandecic@google.com wrote:
Hi Gerard,
given the statistics you cite from
https://tools.wmflabs.org/wikidata-primary-sources/status.html
I see that 19.6k statements have been approved through the tool, and 5.1k statements have been rejected - which means that about 1 in 5 statements is deemed unsuitable by the users of primary sources.
Given that there are 12.4M statements in the tool, this means that about 2.5M statements will turn out to be unsuitable for inclusion in Wikidata (if the current ratio holds). Are you suggesting to upload all of these statements to Wikidata?
Tpt already did upload pieces of the data which have sufficient quality outside the primary sources tool, and more is planned. But for the data where the suitability for Wikidata seems questionable, I would not know what other approach to use. Do you have a suggestion?
Once you have a suggestion and there is community consensus in doing it, no one will stand in the way of implementing that suggestion.
Cheers, Denny
On Mon, Sep 28, 2015 at 1:19 PM John Erling Blad jeblad@gmail.com wrote:
Another; make a kind of worklist on Wikidata that reflect the watchlist on the clients (Wikipedias) but then, we often have items on our watchlist that we don't know much about. (Digression: Somehow we should be able to sort out those things we know (the place we live, the persons we have meet) from those things we have done (edited, copy-pasted).)
I been trying to get some interest in the past for worklists on Wikipedia, it isn't much interest to make them. It would speed up tedious tasks of finding the next page to edit after a given edit is completed. It is the same problem with imports from Freebase on Wikidata, locate the next item on Wikidata with the same queued statement from Freebase, but within some worklist that the user has some knowledge about.
Imagine "municipalities within a county" or "municipalities that is also on the users watchlist", and combine that with available unhandled Freebase-statements.
On Mon, Sep 28, 2015 at 10:09 PM, John Erling Blad jeblad@gmail.com wrote:
Could it be possible to create some kind of info (notification?) in a wikipedia article that additional data is available in a queue ("freebase") somewhere?
If you have the article on your watch-list, then you will get a warning that says "You lazy boy, get your ass over here and help us out!" Or perhaps slightly rephrased.
On Mon, Sep 28, 2015 at 4:52 PM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's
data submissions should probably not be trusted.
+1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their
entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well.
Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable,
but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts.
An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested
to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong).
> > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source
tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >> > > > _______________________________________________ > Wikidata mailing list > Wikidata@lists.wikimedia.org mailto:Wikidata@lists.wikimedia.org > https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata _______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:
Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
_______________________________________________ Wikidata mailing list Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org> https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
On Tue, Sep 29, 2015 at 8:15 AM, Gerard Meijssen gerard.meijssen@gmail.com wrote:
Hoi, I have seen the statistics. The quality of Freebase cannot be understood by simply looking at the problems. People have been looking for problems and been identifying them. As a consequence more data ended up in the error bucket than in the good bucket. I have for instance added a lot of statements as "wrong" because they were exactly the same as the value already present. Consequently the error rate is not representative.
Denny, I have a suggestion. It is backed by math, it is backed by how people think. All the arguments are on my side. I have not heard your arguments and the "primary sources tool" was announced as a good thing and the community never agreed to having it. So leave the community out of it and focus on arguments.
why would someone work on data in the primary sources tool when it is more effective to add data directly
There are several reasons. My personal reasons for using it: It is more convenient for me and I have more security that what I am putting into Wikidata is actually useful and correct.
why is data that is over 90% good denied access to Wikidata (ie as good as Wikidata itself)
Do you have any way to back up this 90% claim?
how do you justify the pst when so little data was included in Wikidata
pst?
why not have Kian learn from the data set of Freebase and Wikidata and have smart suggestions
No-one is preventing that.
why waste people's time adding one item/statement at a time when you can focus on the statements that are in doubt (either in Freebase or in Wikidata
You consider it wasting people's time. Please recognize that other people do not consider it wasting their time.
The notion of having all new data go through the primary sources tool will see me leave the project when this is realised. I will feel that my time and intelligence is wasted.
No-one said all new data should go through it as far as I know. I do want it to be a major part of our workflows but that is not the same thing as not allowing anything else.
Cheers Lydia
Thanks for creating a dedicated thread, Markus. It saddens me to see this opportunity squandered and I'd love to be able to help, but I find the project so opaque that it's difficult to find a way to engage. Perhaps it's just an artifact of the lack of transparency, but the current approach seems very ad hoc to me. It's difficult to tease apart which problems are due to bad Freebase data, which are due to the way the Freebase data is being processed for import, and which are due to the attitudes of the reviewers.
As Jason Douglas said on the other thread, the Freebase data isn't homogenous in terms of quality or importance and the appropriate way to evaluate and import the data is by segmenting it, whether that be by property, or data source, or whatever. The only analysis that seems to have been done so far is to rank properties by the number of values they have which: a) isn't a good proxy for quality and b) isn't even a good proxy for importance (there are a bunch of high frequency things which are basically dead/obsolete).
The two things that I think would greatly improve things are: - document the current process & methodology - adopt a systematic, iterative, evaluation and improvement feedback loop
Since data is what drives this whole process understanding how the existing data has been evaluated, filtered, transformed, etc before being loaded into the primary sources tool is critical to understanding what the starting basis is. After that, understanding the meaning of the stats (and fixing them if they don't have the right meanings) is necessary to know how things need to be improved.
I'm having a hard time understanding the existing stats as well as correlating them with both people's anecdotal accounts and my understanding of the strengths and weaknesses of the Freebase data. Additionally, the stats represent, as I understand it, a single user's opinion of the quality of the fact, the property mapping, the source URL and probably other factors like their mood, how hungry they are, etc. It's going to include both false negatives and false positives.
When I look at one recent "approved" Freebase primary sources fact, I see that it was reverted the next day https://www.wikidata.org/w/index.php?title=Q464371&dir=prev&offset=20140524064128&action=history as a duplicate, but I also see that Maryse Condé's occupation (P106) has a long and tortured history on Wikidata with Dexbot importing "Woman of letters" from Italian Wikipedia, Brackibot switching it to "Author," then Rezabot, and a few more users all taking a shot at changing it to what they thought was best.
My gut feeling is that the bulk of the problems that people are complaining about the Freebase-derived data that's been loaded into the Primary Sources tool are due to the tool chain that's preparing the data, without better stats and insight into the processes it's really impossible to say. A systematic analysis is needed, not a bunch of recitations of anecdotes.
Tom
On Mon, Sep 28, 2015 at 10:52 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >>
Hi Tom,
we are in the process of writing a document and preparing a release of the whole pipieline. Tpt only finished his internship a few weeks ago, and things simply take a bit of time to go through the review processes.
The release of the data and the document should allow for the insights you are asking for.
I hope that will help with most of the questions that are open. I am sorry we appeared intransparent - this was certainly not our intention. We were frequently asking for input and putting data and code out there.
Cheers, Denny
On Tue, Sep 29, 2015 at 12:38 PM Tom Morris tfmorris@gmail.com wrote:
Thanks for creating a dedicated thread, Markus. It saddens me to see this opportunity squandered and I'd love to be able to help, but I find the project so opaque that it's difficult to find a way to engage. Perhaps it's just an artifact of the lack of transparency, but the current approach seems very ad hoc to me. It's difficult to tease apart which problems are due to bad Freebase data, which are due to the way the Freebase data is being processed for import, and which are due to the attitudes of the reviewers.
As Jason Douglas said on the other thread, the Freebase data isn't homogenous in terms of quality or importance and the appropriate way to evaluate and import the data is by segmenting it, whether that be by property, or data source, or whatever. The only analysis that seems to have been done so far is to rank properties by the number of values they have which: a) isn't a good proxy for quality and b) isn't even a good proxy for importance (there are a bunch of high frequency things which are basically dead/obsolete).
The two things that I think would greatly improve things are:
- document the current process & methodology
- adopt a systematic, iterative, evaluation and improvement feedback loop
Since data is what drives this whole process understanding how the existing data has been evaluated, filtered, transformed, etc before being loaded into the primary sources tool is critical to understanding what the starting basis is. After that, understanding the meaning of the stats (and fixing them if they don't have the right meanings) is necessary to know how things need to be improved.
I'm having a hard time understanding the existing stats as well as correlating them with both people's anecdotal accounts and my understanding of the strengths and weaknesses of the Freebase data. Additionally, the stats represent, as I understand it, a single user's opinion of the quality of the fact, the property mapping, the source URL and probably other factors like their mood, how hungry they are, etc. It's going to include both false negatives and false positives.
When I look at one recent "approved" Freebase primary sources fact, I see that it was reverted the next day https://www.wikidata.org/w/index.php?title=Q464371&dir=prev&offset=20140524064128&action=history as a duplicate, but I also see that Maryse Condé's occupation (P106) has a long and tortured history on Wikidata with Dexbot importing "Woman of letters" from Italian Wikipedia, Brackibot switching it to "Author," then Rezabot, and a few more users all taking a shot at changing it to what they thought was best.
My gut feeling is that the bulk of the problems that people are complaining about the Freebase-derived data that's been loaded into the Primary Sources tool are due to the tool chain that's preparing the data, without better stats and insight into the processes it's really impossible to say. A systematic analysis is needed, not a bunch of recitations of anecdotes.
Tom
On Mon, Sep 28, 2015 at 10:52 AM, Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase import would be data quality. It is actually community support. The goal of the current slow import process is for the Wikidata community to "adopt" the Freebase data. It's not about "storing" the data somewhere, but about finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough community power for a quick import. This is regrettable, but not something that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young community. We really need your help to digest this huge amount of data! I am absolutely convinced from the emails I saw here that none of the former Freebase editors on this list would support low quality standards. They have fought hard to fix errors and avoid issues coming into their data for a long time.
Nobody believes that either Freebase or Wikidata can ever be free of errors, and this is really not the point of this discussion at all [1]. The experienced community managers among us know that it is not about the amount of data you have. Data is cheap and easy to get, even free data with very high quality. But the value proposition of Wikidata is not that it can provide storage space for lot of data -- it is that we have a functioning community that can maintain it. For the Freebase data donation, we do not seem to have this community yet. We need to find a way to engage people to do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting a lot of effort into integrating the data already. This is great, and we should thank these people because they are the ones who are now working on what we are just talking about here. In addition, we should think about ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there are cases where Freebase already had a working import infrastructure that could be migrated to Wikidata? This would also solve the community support problem in one way. We just need to import the maintenance infrastructure together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata Games have attracted so many contributions. Could some of the Freebase data be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data they might take from Freebase.
Freebase is a much better resource than many other data resources we are already using with similar approaches as (1)-(5) above, and yet it seems many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates you think are typical or acceptable for Freebase and Wikidata, respectively. Without giving actual numbers you just produce empty strawman arguments (for example: claiming that anyone would think that Wikidata is better quality than Freebase and then refuting this point, which nobody is trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
Hoi, When you analyse the statistics, it shows how bad the current state of affairs is. Slightly over one in a thousanths of the content of the primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be improved. Where we differ is that the same can be said for Wikidata. It is not much better and by including the data from Freebase we have a much improved coverage of facts. The same can be said for the content of DBpedia probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are maintained. We should do this repeatedly and concentrate on workflows that seek the differences and provide workflows that help our community to improve what we have. What we have is the sum of all available knowledge and by splitting it up, we are weakened as a result. Thanks, GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com mailto:thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some where passing users, some tried harder, but made lots of erroneous entries (battling against our Experts at times). We could probably provide a list of those sorta community blacklisted users who's data submissions should probably not be trusted. +1 for looking at better maintained specific properties. +1 for being cautious for some Freebase usernames and their entries. +1 for trusting wholesale all of the Freebase Experts submissions. We policed each other quite well. Thad +ThadGuidry <https://www.google.com/+ThadGuidry> On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas <jasondouglas@google.com <mailto:jasondouglas@google.com>> wrote: > It would indeed be interesting to see which percentage of
proposals are > being approved (and stay in Wikidata after a while), and whether there > is a pattern (100% approval on some type of fact that could then be > merged more quickly; or very low approval on something else that would > maybe better revisited for mapping errors or other systematic problems).
+1, I think that's your best bet. Specific properties were much better maintained than others -- identify those that meet the bar for wholesale import and leave the rest to the primary sources tool. On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch <markus@semantic-mediawiki.org <mailto:markus@semantic-mediawiki.org>> wrote: On 24.09.2015 23:48, James Heald wrote: > Has anybody actually done an assessment on Freebase and its reliability? > > Is it *really* too unreliable to import wholesale? From experience with the Primary Sources tool proposals, the quality is mixed. Some things it proposes are really very valuable, but other things are also just wrong. I added a few very useful facts and fitting references based on the suggestions, but I also rejected others. Not sure what the success rate is for the cases I looked at, but my feeling is that some kind of "supervised import" approach is really needed when considering the total amount of facts. An issue is that it is often fairly hard to tell if a suggestion is true or not (mainly in cases where no references are suggested to check). In other cases, I am just not sure if a fact is correct for the property used. For example, I recently ended up accepting "architect: Charles Husband" for Lovell Telescope (Q555130), but to be honest I am not sure that this is correct: he was the leading engineer contracted to design the telescope, which seems different from an architect; no official web site uses the word "architect" it seems; I could not find a better property though, and it seemed "good enough" to accept it (as opposed to the post code of the location of this structure, which apparently was just wrong). > > Are there any stats/progress graphs as to how the actual import is in > fact going? It would indeed be interesting to see which percentage of proposals are being approved (and stay in Wikidata after a while), and whether there is a pattern (100% approval on some type of fact that could then be merged more quickly; or very low approval on something else that would maybe better revisited for mapping errors or other systematic problems). Markus > > -- James. > > > On 24/09/2015 19:35, Lydia Pintscher wrote: >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >>>> This is to add MusicBrainz to the primary source tool, not anything >>>> else? >>> >>> >>> It's apparently worse than that (which I hadn't realized until I >>> re-read the >>> transcript). It sounds like it's just going to generate little warning >>> icons for "bad" facts and not lead to the recording of any new facts >>> at all. >>> >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension >>> deployed that >>> will help with checking against 3rd party databases >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks >>> against 3rd >>> party databases will then be used to display little indicators next to a >>> statement in case it is problematic >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of >>> issues and >>> can help fix them >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that are >>> supported? :) >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german >>> national library. it can be extended later >>> >>> >>> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz >>> considered trustworthy enough to import directly or will its facts >>> need to >>> be dripped through the primary source soda straw one at a time too? >> >> The primary sources tool and the extension that helps us check against >> other databases are two independent things. >> Imports from Musicbrainz have been happening since a very long time >> already. >> >> >> Cheers >> Lydia >>
Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Thanks, that sounds like it will be useful. Is there a date for when this will happen?
On Tue, Sep 29, 2015 at 3:54 PM, Denny Vrandečić vrandecic@google.com wrote:
I hope that will help with most of the questions that are open. I am sorry we appeared intransparent - this was certainly not our intention. We were frequently asking for input and putting data and code out there.
Perhaps I've missed some of what has already been published. Is there stuff beyond these two pages?
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool
The only request for input that I saw was for help with property mapping:
https://lists.wikimedia.org/pipermail/wikidata/2015-June/006503.html
which I and several others commented on, but questions like "Has this been reviewed by anyone at Google familiar with the Freebase schema?" were ignored and comments like "Don't select properties based solely on frequency" and "Ignore deprecated properties" were rejected.
My closing comment was that there was a bunch of really basic groundwork to be done before it made sense to ask for wider review, but, as far as I know, that was the last thread addressing the project. When I look at the mapping page https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping is still see a ton of duplicate entries, deprecated properties, user & base properties which don't make sense to import and other cruft that people shouldn't have to wade through.
If there have been frequent requests for input, they've been in fora that I don't follow.
Tom
On Thu, Sep 24, 2015 at 11:48 PM, James Heald j.heald@ucl.ac.uk wrote:
Has anybody actually done an assessment on Freebase and its reliability?
Is it *really* too unreliable to import wholesale?
My own experience matches Markus'.
Are there any stats/progress graphs as to how the actual import is in fact going?
https://tools.wmflabs.org/wikidata-primary-sources/status.html and https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool/URL_blacklist
Cheers Lydia
James Heald, 24/09/2015 23:48:
Has anybody actually done an assessment on Freebase and its reliability?
Perhaps: http://www.semantic-web-journal.net/system/files/swj1141.pdf Help need to publish a review of it: https://etherpad.wikimedia.org/p/WRN201509
Nemo
On 09/24/2015 11:31 AM, Tom Morris wrote:
On Thu, Sep 24, 2015 at 2:18 PM, Peter F. Patel-Schneider <pfpschneider@gmail.com mailto:pfpschneider@gmail.com> wrote:
On 09/24/2015 10:59 AM, Lydia Pintscher wrote: > On Thu, Sep 24, 2015 at 7:54 PM, Tom Morris <tfmorris@gmail.com <mailto:tfmorris@gmail.com>> wrote: >> Thanks! Is there any more information on the issue with MusicBrainz? >> >> 17:26:27 <DanielK_WMDE> sjoerddebruin: yes, we went for MusicBrainz first, >> but it turned out to be impractical. you basically have to run their >> software in order to use their dumps >> >> >> MusicBrainz was a major source of information for Freebase, so they appear >> to have been able to figure out how to parse the dumps (and they already >> have the MusicBrainz & Wikipedia IDs correlated). >> >> Is there more detail, perhaps in a bug somewhere? > > The issue is that they do offer dumps but you need to set up your own > musicbrainz server to really use it. This was too time-intensive and > complicated for the students to make progress on during their project. > Because of this they decided to instead opt for another dataset > instead to get started. In the future Musicbrainz should still get > done. If anyone wants to work on adding more datasets to the tool > please let me know. > > > Cheers > Lydia > This is to add MusicBrainz to the primary source tool, not anything else?
It's apparently worse than that (which I hadn't realized until I re-read the transcript). It sounds like it's just going to generate little warning icons for "bad" facts and not lead to the recording of any new facts at all.
17:22:33<Lydia_WMDE> we'll also work on getting the extension deployed that will help with checking against 3rd party databases 17:23:33<Lydia_WMDE> the result of constraint checks and checks against 3rd party databases will then be used to display little indicators next to a statement in case it is problematic 17:23:47<Lydia_WMDE> i hope this way more people become aware of issues and can help fix them 17:24:35<sjoerddebruin> Do you have any names of databases that are supported? :) 17:24:59<Lydia_WMDE> sjoerddebruin: in the first version the german national library. it can be extended later
I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz considered trustworthy enough to import directly or will its facts need to be dripped through the primary source soda straw one at a time too?
Tom
I wonder how these warnings will work. I can see lots and lots of warnings due to minor variations in names of artists.
I do agree that MusicBrainz data should pass the Wikidata bar, as the data in MusicBrainz appear to me to be noteworthy and related to information in some Wiki (and true, although this is not part of the Wikidata bar as far as I know).
peter
On Thu, Sep 24, 2015 at 9:17 PM, Peter F. Patel-Schneider pfpschneider@gmail.com wrote:
I wonder how these warnings will work. I can see lots and lots of warnings due to minor variations in names of artists.
The software will take aliases into account as well as minor spelling changes. We'll need to see how it behaves with live-data and then tweak but I am confident it'll not be too huge a problem.
Cheers Lydia