Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

28 Sep 2015

Gerard,

Why do you spend so much energy on criticising the work of other 
volunteers and companies that want to help Wikidata? Switching off 
Primary Sources would not achieve any progress towards what you want. I 
have made some proposals in my email on what else could be done to speed 
things up. You could work on realising some of these ideas, you could 
propose other activities to the community, or you could just help 
elsewhere on Wikidata. Focussing on a tool you don't like and don't want 
to use will not make you (or the rest of us) happy.

Markus

On 28.09.2015 20:01, Gerard Meijssen wrote:
...
  Hoi,

 Sorry I disagree with your analysis. The fundamental issue is not
 quality and it is not the size of our community. The issue is that we
 have our priorities wrong. As far as I am concerned the "primary sources
 tool" is a wrong approach for a dataset like Freebase or DBpedia.

 What we should concentrate on is finding likely issues that exist in
 Wikidata. Make people aware of them and have a proper workflow that will
 point people to the things they care about. When I care about "polders"
 show me content where another source disagrees with what we have. As I
 care about "polders" I will spend time on it BECAUSE I care and am
 invited to resolve issues. I will be challenged because every item I
 touch has an issue. I do not mind doing this when the data in Wikidata
 differs from DBpedia, Freebase or whatever. My time is well spent. THAT
 is why I will be challenged, that is why I will be willing to work on this.

 I will not do this for new data in the primary sources tool. At most I
 will give it a glance and accept it. I would only do this where data in
 the primary sources tool differs. That however is exactly the same
 scenario that I just described.

 I am not willing to look at data in Wikidata, Freebase or DBpedia in the
 primary sources tool one item/statement at a time; we know that they are
 of a similar quality as Wikidata. The percentages make it a waste of
 time. With iterative comparisons of other sources we will find the
 booboos easy enough. We will spend the time of our communities
 effectively and we will increase quality and quality and community.

 The approach of the primary sources tool is wrong. It should only be
 about linking data and defining how this is done.

 The problem is indeed with the community. Its time is wasted and it is
 much more effective for me to add new data than work on data that is
 already in the primary sources tool.
 Thanks,
         GerardM

 On 28 September 2015 at 16:52, Markus Krötzsch
 <markus(a)semantic-mediawiki.org> wrote:

     Hi Gerard, hi all,

     The key misunderstanding here is that the main issue with the
     Freebase import would be data quality. It is actually community
     support. The goal of the current slow import process is for the
     Wikidata community to "adopt" the Freebase data. It's not about
     "storing" the data somewhere, but about finding a way to maintain it
     in the future.

     The import statistics show that Wikidata does not currently have
     enough community power for a quick import. This is regrettable, but
     not something that we can fix by dumping in more data that will then
     be orphaned.

     Freebase people: this is not a small amount of data for our young
     community. We really need your help to digest this huge amount of
     data! I am absolutely convinced from the emails I saw here that none
     of the former Freebase editors on this list would support low
     quality standards. They have fought hard to fix errors and avoid
     issues coming into their data for a long time.

     Nobody believes that either Freebase or Wikidata can ever be free of
     errors, and this is really not the point of this discussion at all
     [1]. The experienced community managers among us know that it is not
     about the amount of data you have. Data is cheap and easy to get,
     even free data with very high quality. But the value proposition of
     Wikidata is not that it can provide storage space for a lot of data --
     it is that we have a functioning community that can maintain it. For
     the Freebase data donation, we do not seem to have this community
     yet. We need to find a way to engage people to do this. Ideas are
     welcome.

     What I can see from the statistics, however, is that some users (and
     I cannot say if they are "Freebase users" or "Wikidata users" ;-)
     are putting a lot of effort into integrating the data already. This
     is great, and we should thank these people because they are the ones
     who are now working on what we are just talking about here. In
     addition, we should think about ways of engaging more community in
     this. Some ideas:

     (1) Find a way to clean and import some statements using bots. Maybe
     there are cases where Freebase already had a working import
     infrastructure that could be migrated to Wikidata? This would also
     solve the community support problem in one way. We just need to
     import the maintenance infrastructure together with the data.

     (2) Find a way to expose specific suggestions to more people. The
     Wikidata Games have attracted so many contributions. Could some of
     the Freebase data be solved in this way, with a dedicated UI?

     (3) Organise Freebase edit-a-thons where people come together to
     work through a bunch of suggested statements.

     (4) Form wiki projects that discuss a particular topic domain in
     Freebase and how it could be imported faster using (1)-(3) or any
     other idea.

     (5) Connect to existing Wiki projects to make them aware of valuable
     data they might take from Freebase.

     Freebase is a much better resource than many other data resources we
     are already using with similar approaches as (1)-(5) above, and yet
     it seems many people are waiting for Google alone to come up with a
     solution.

     Cheers,

     Markus

     [1] Gerard, if you think otherwise, please let us know which error
     rates you think are typical or acceptable for Freebase and Wikidata,
     respectively. Without giving actual numbers you just produce empty
     strawman arguments (for example: claiming that anyone would think
     that Wikidata is better quality than Freebase and then refuting this
     point, which nobody is trying to make). See
     https://en.wikipedia.org/wiki/Straw_man

     On 26.09.2015 18:31, Gerard Meijssen wrote:

         Hoi,
         When you analyse the statistics, it shows how bad the current state
         of affairs is. Slightly over one thousandth of the content of the
         primary sources tool has been included.

         Markus, Lydia and myself agree that the content of Freebase may be
         improved. Where we differ is that the same can be said for Wikidata.
         It is not much better and by including the data from Freebase we
         have a much improved coverage of facts. The same can be said for the
         content of DBpedia and probably other sources as well.

         I seriously hate this procrastination and the denial of the efforts
         of others. It is one type of discrimination that is utterly
         deplorable.

         We should concentrate on comparing Wikidata with other sources that
         are maintained. We should do this repeatedly and concentrate on
         workflows that seek the differences and provide workflows that help
         our community to improve what we have. What we have is the sum of
         all available knowledge and by splitting it up, we are weakened as a
         result.
         Thanks,
                 GerardM

         On 26 September 2015 at 03:32, Thad Guidry <thadguidry(a)gmail.com>
         wrote:

              Also, Freebase users themselves who did daily, weekly work....
              some were passing users, some tried harder, but made lots of
              erroneous entries (battling against our Experts at times).  We
              could probably provide a list of those sorta community
              blacklisted users whose data submissions should probably not be
              trusted.

              +1 for looking at better maintained specific properties.
              +1 for being cautious for some Freebase usernames and their
              entries.
              +1 for trusting wholesale all of the Freebase Experts
              submissions. We policed each other quite well.

              Thad
              +ThadGuidry <https://www.google.com/+ThadGuidry>

              On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas
              <jasondouglas(a)google.com> wrote:

                  > It would indeed be interesting to see which percentage of
                  > proposals are being approved (and stay in Wikidata after a
                  > while), and whether there is a pattern (100% approval on
                  > some type of fact that could then be merged more quickly;
                  > or very low approval on something else that would maybe
                  > better revisited for mapping errors or other systematic
                  > problems).

                  +1, I think that's your best bet. Specific properties were
                  much better maintained than others -- identify those that
                  meet the bar for wholesale import and leave the rest to the
                  primary sources tool.

                  On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch
                  <markus(a)semantic-mediawiki.org> wrote:

                      On 24.09.2015 23:48, James Heald wrote:
                      > Has anybody actually done an assessment on Freebase
                      > and its reliability?
                      >
                      > Is it *really* too unreliable to import wholesale?

                      From experience with the Primary Sources tool proposals,
                      the quality is mixed. Some things it proposes are really
                      very valuable, but other things are also just wrong. I
                      added a few very useful facts and fitting references
                      based on the suggestions, but I also rejected others.
                      Not sure what the success rate is for the cases I looked
                      at, but my feeling is that some kind of "supervised
                      import" approach is really needed when considering the
                      total amount of facts.

                      An issue is that it is often fairly hard to tell if a
                      suggestion is true or not (mainly in cases where no
                      references are suggested to check). In other cases, I am
                      just not sure if a fact is correct for the property
                      used. For example, I recently ended up accepting
                      "architect: Charles Husband" for Lovell Telescope
                      (Q555130), but to be honest I am not sure that this is
                      correct: he was the leading engineer contracted to
                      design the telescope, which seems different from an
                      architect; no official web site uses the word
                      "architect" it seems; I could not find a better property
                      though, and it seemed "good enough" to accept it (as
                      opposed to the post code of the location of this
                      structure, which apparently was just wrong).

                      > Are there any stats/progress graphs as to how the
                      > actual import is in fact going?

                      It would indeed be interesting to see which percentage
                      of proposals are being approved (and stay in Wikidata
                      after a while), and whether there is a pattern (100%
                      approval on some type of fact that could then be merged
                      more quickly; or very low approval on something else
                      that would maybe better revisited for mapping errors or
                      other systematic problems).

                      Markus

    -- James.

 On 24/09/2015 19:35, Lydia Pintscher wrote:
 > On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris(a)gmail.com> wrote:
 >>> This is to add MusicBrainz to the primary source tool, not anything
 >>> else?
 >>
 >>
 >> It's apparently worse than that (which I hadn't realized until I
 >> re-read the transcript).  It sounds like it's just going to generate
 >> little warning icons for "bad" facts and not lead to the recording of
 >> any new facts at all.
 >>
 >> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension deployed
 >> that will help with checking against 3rd party databases
 >> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks against
 >> 3rd party databases will then be used to display little indicators next
 >> to a statement in case it is problematic
 >> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of issues
 >> and can help fix them
 >> 17:24:35 <sjoerddebruin> Do you have any names of databases that are
 >> supported? :)
 >> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the german
 >> national library. it can be extended later
 >>
 >>
 >> I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz
 >> considered trustworthy enough to import directly or will its facts need
 >> to be dripped through the primary source soda straw one at a time too?
 >
 > The primary sources tool and the extension that helps us check against
 > other databases are two independent things.
 > Imports from Musicbrainz have been happening since a very long time
 > already.
 >
 > Cheers
 > Lydia

 _______________________________________________
 Wikidata mailing list
 Wikidata(a)lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata
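
The per-property statistics asked about in the thread (which share of
proposals is approved, and whether some properties sit near 100% while
others are mostly rejected) could be computed roughly as below. This is
only a sketch under stated assumptions: the input format (one
(property, approved) pair per reviewed proposal), the function names, the
thresholds and the sample data are all hypothetical and do not describe
how the primary sources tool stores or reports its data. It also looks
only at approvals, not at whether statements stay in Wikidata afterwards.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return {property: (approved, total, rate)} from (property, approved) pairs."""
    counts = defaultdict(lambda: [0, 0])  # property -> [approved, total]
    for prop, approved in decisions:
        counts[prop][1] += 1
        if approved:
            counts[prop][0] += 1
    return {p: (a, t, a / t) for p, (a, t) in counts.items()}


def triage(rates, min_reviews=50, bulk_at=0.98, revisit_below=0.5):
    """Sort properties into rough buckets; all thresholds are hypothetical."""
    buckets = {}
    for prop, (_, total, rate) in rates.items():
        if total < min_reviews:
            buckets[prop] = "too few reviews"
        elif rate >= bulk_at:
            buckets[prop] = "candidate for faster import"
        elif rate < revisit_below:
            buckets[prop] = "revisit mapping"
        else:
            buckets[prop] = "keep in manual review"
    return buckets


# Invented review decisions, for illustration only.
sample = (
    [("property A", True)] * 4
    + [("property B", True)] + [("property B", False)] * 3
    + [("property C", True)] * 2 + [("property C", False)]
    + [("property D", True)]
)
for prop, bucket in triage(approval_rates(sample), min_reviews=3).items():
    print(prop, bucket)
```

The min_reviews guard is there so that a property with a single accepted
proposal is not mistaken for one with a reliable 100% approval rate.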

