Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

29 Sep 2015

Thanks for creating a dedicated thread, Markus.  It saddens me to see this
opportunity squandered and I'd love to be able to help, but I find the
project so opaque that it's difficult to find a way to engage.  Perhaps
it's just an artifact of the lack of transparency, but the current approach
seems very ad hoc to me.  It's difficult to tease apart which problems are
due to bad Freebase data, which are due to the way the Freebase data is
being processed for import, and which are due to the attitudes of the
reviewers.
As Jason Douglas said on the other thread, the Freebase data isn't
homogeneous in terms of quality or importance and the appropriate way to
evaluate and import the data is by segmenting it, whether that be by
property, or data source, or whatever.  The only analysis that seems to
have been done so far is to rank properties by the number of values they
have, which (a) isn't a good proxy for quality and (b) isn't even a good
proxy for importance (there are a bunch of high frequency things which are
basically dead/obsolete).
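As a minimal sketch of what segmenting by property could look like, the Python below computes a per-property approval rate and a retention rate (approved facts that were not later reverted). Everything in it is an assumption made for illustration: the record layout, the function name `approval_by_property`, and the sample rows are invented and are not real Primary Sources data or an existing tool's output.

```python
from collections import defaultdict

# Hypothetical review records, invented for illustration only:
# (property_id, approved_by_reviewer, still_present_after_a_while)
reviews = [
    ("P106", True, False),   # approved, then reverted
    ("P106", True, True),
    ("P106", False, False),  # rejected
    ("P569", True, True),
    ("P569", True, True),
    ("P84", False, False),
]


def approval_by_property(records):
    """Group review records by property and compute simple rates."""
    counts = defaultdict(lambda: {"reviewed": 0, "approved": 0, "retained": 0})
    for prop, approved, retained in records:
        c = counts[prop]
        c["reviewed"] += 1
        if approved:
            c["approved"] += 1
            if retained:
                c["retained"] += 1

    stats = {}
    for prop, c in counts.items():
        stats[prop] = {
            "reviewed": c["reviewed"],
            "approval_rate": c["approved"] / c["reviewed"],
            # Undefined when nothing was approved for this property.
            "retention_rate": (
                c["retained"] / c["approved"] if c["approved"] else None
            ),
        }
    return stats


if __name__ == "__main__":
    for prop, s in sorted(approval_by_property(reviews).items()):
        print(prop, s)
```

A table like this, built from the real review logs, would make it possible to tell well-maintained properties apart from ones with mapping or processing problems, which a ranking by value count cannot do.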
The two things that I think would greatly improve things are:
- document the current process & methodology
- adopt a systematic, iterative, evaluation and improvement feedback loop
Since data is what drives this whole process, understanding how the existing
data has been evaluated, filtered, transformed, etc. before being loaded
into the primary sources tool is critical to understanding what the
starting basis is.  After that, understanding the meaning of the stats (and
fixing them if they don't have the right meanings) is necessary to know how
things need to be improved.
I'm having a hard time understanding the existing stats as well as
correlating them with both people's anecdotal accounts and my understanding
of the strengths and weaknesses of the Freebase data.  Additionally, the
stats represent, as I understand it, a single user's opinion of the quality
of the fact, the property mapping, the source URL and probably other
factors like their mood, how hungry they are, etc.  It's going to include
both false negatives and false positives.
When I look at one recent "approved" Freebase primary sources fact, I see
that it was reverted the next day
https://www.wikidata.org/w/index.php?title=Q464371&dir=prev&offset=20140524064128&action=history
as a duplicate, but I also see that Maryse Condé's occupation (P106) has a
long and tortured history on Wikidata with Dexbot importing "Woman of
letters" from Italian Wikipedia, Brackibot switching it to "Author," then
Rezabot, and a few more users all taking a shot at changing it to what they
thought was best.
My gut feeling is that the bulk of the problems that people are complaining
about with the Freebase-derived data that's been loaded into the Primary
Sources tool are due to the tool chain that's preparing the data, but without
better stats and insight into the processes it's really impossible to say.  A
systematic analysis is needed, not a bunch of recitations of anecdotes.
Tom
On Mon, Sep 28, 2015 at 10:52 AM, Markus Krötzsch <
markus@semantic-mediawiki.org> wrote:
...
Hi Gerard, hi all,
The key misunderstanding here is that the main issue with the Freebase
import would be data quality. It is actually community support. The goal of
the current slow import process is for the Wikidata community to "adopt"
the Freebase data. It's not about "storing" the data somewhere, but about
finding a way to maintain it in the future.
The import statistics show that Wikidata does not currently have enough
community power for a quick import. This is regrettable, but not something
that we can fix by dumping in more data that will then be orphaned.
Freebase people: this is not a small amount of data for our young
community. We really need your help to digest this huge amount of data! I
am absolutely convinced from the emails I saw here that none of the former
Freebase editors on this list would support low quality standards. They
have fought hard to fix errors and avoid issues coming into their data for
a long time.
Nobody believes that either Freebase or Wikidata can ever be free of
errors, and this is really not the point of this discussion at all [1]. The
experienced community managers among us know that it is not about the
amount of data you have. Data is cheap and easy to get, even free data with
very high quality. But the value proposition of Wikidata is not that it can
provide storage space for a lot of data -- it is that we have a functioning
community that can maintain it. For the Freebase data donation, we do not
seem to have this community yet. We need to find a way to engage people to
do this. Ideas are welcome.
What I can see from the statistics, however, is that some users (and I
cannot say if they are "Freebase users" or "Wikidata users" ;-) are putting
a lot of effort into integrating the data already. This is great, and we
should thank these people because they are the ones who are now working on
what we are just talking about here. In addition, we should think about
ways of engaging more community in this. Some ideas:
(1) Find a way to clean and import some statements using bots. Maybe there
are cases where Freebase already had a working import infrastructure that
could be migrated to Wikidata? This would also solve the community support
problem in one way. We just need to import the maintenance infrastructure
together with the data.
(2) Find a way to expose specific suggestions to more people. The Wikidata
Games have attracted so many contributions. Could some of the Freebase data
be solved in this way, with a dedicated UI?
(3) Organise Freebase edit-a-thons where people come together to work
through a bunch of suggested statements.
(4) Form wiki projects that discuss a particular topic domain in Freebase
and how it could be imported faster using (1)-(3) or any other idea.
(5) Connect to existing Wiki projects to make them aware of valuable data
they might take from Freebase.
Freebase is a much better resource than many other data resources we are
already using with similar approaches as (1)-(5) above, and yet it seems
many people are waiting for Google alone to come up with a solution.
Cheers,
Markus
[1] Gerard, if you think otherwise, please let us know which error rates
you think are typical or acceptable for Freebase and Wikidata,
respectively. Without giving actual numbers you just produce empty strawman
arguments (for example: claiming that anyone would think that Wikidata is
better quality than Freebase and then refuting this point, which nobody is
trying to make). See https://en.wikipedia.org/wiki/Straw_man
On 26.09.2015 18:31, Gerard Meijssen wrote:
...
Hoi,
When you analyse the statistics, they show how bad the current state of
affairs is. Slightly over one thousandth of the content of the
primary sources tool has been included.
Markus, Lydia and myself agree that the content of Freebase may be
improved. Where we differ is that the same can be said for Wikidata. It
is not much better and by including the data from Freebase we have a
much improved coverage of facts. The same can be said for the content of
DBpedia and probably other sources as well.
I seriously hate this procrastination and the denial of the efforts of
others. It is one type of discrimination that is utterly deplorable.
We should concentrate on comparing Wikidata with other sources that are
maintained. We should do this repeatedly and concentrate on workflows
that seek the differences and provide workflows that help our community
to improve what we have. What we have is the sum of all available
knowledge and by splitting it up, we are weakened as a result.
Thanks,
       GerardM
On 26 September 2015 at 03:32, Thad Guidry <thadguidry@gmail.com> wrote:
Also, Freebase users themselves who did daily, weekly work.... some
were passing users, some tried harder, but made lots of erroneous
entries (battling against our Experts at times).  We could probably
provide a list of those sorta community blacklisted users whose data
submissions should probably not be trusted.

+1 for looking at better maintained specific properties.
+1 for being cautious for some Freebase usernames and their entries.
+1 for trusting wholesale all of the Freebase Experts submissions.
We policed each other quite well.

Thad
+ThadGuidry <https://www.google.com/+ThadGuidry>

On Fri, Sep 25, 2015 at 11:45 AM, Jason Douglas
<jasondouglas@google.com> wrote:

    > It would indeed be interesting to see which percentage of proposals
    > are being approved (and stay in Wikidata after a while), and whether
    > there is a pattern (100% approval on some type of fact that could
    > then be merged more quickly; or very low approval on something else
    > that would maybe better revisited for mapping errors or other
    > systematic problems).

    +1, I think that's your best bet. Specific properties were much
    better maintained than others -- identify those that meet the
    bar for wholesale import and leave the rest to the primary
    sources tool.

    On Thu, Sep 24, 2015 at 4:03 PM Markus Krötzsch
    <markus@semantic-mediawiki.org> wrote:

        On 24.09.2015 23:48, James Heald wrote:
         > Has anybody actually done an assessment on Freebase and its
         > reliability?
         >
         > Is it *really* too unreliable to import wholesale?

        From experience with the Primary Sources tool proposals, the
        quality is mixed. Some things it proposes are really very
        valuable, but other things are also just wrong. I added a few
        very useful facts and fitting references based on the
        suggestions, but I also rejected others. Not sure what the
        success rate is for the cases I looked at, but my feeling is
        that some kind of "supervised import" approach is really needed
        when considering the total amount of facts.

        An issue is that it is often fairly hard to tell if a suggestion
        is true or not (mainly in cases where no references are
        suggested to check). In other cases, I am just not sure if a
        fact is correct for the property used. For example, I recently
        ended up accepting "architect: Charles Husband" for Lovell
        Telescope (Q555130), but to be honest I am not sure that this is
        correct: he was the leading engineer contracted to design the
        telescope, which seems different from an architect; no official
        web site uses the word "architect" it seems; I could not find a
        better property though, and it seemed "good enough" to accept it
        (as opposed to the post code of the location of this structure,
        which apparently was just wrong).

         >
         > Are there any stats/progress graphs as to how the actual import
         > is in fact going?

        It would indeed be interesting to see which percentage of
        proposals are being approved (and stay in Wikidata after a
        while), and whether there is a pattern (100% approval on some
        type of fact that could then be merged more quickly; or very low
        approval on something else that would maybe better revisited for
        mapping errors or other systematic problems).

        Markus

         >
         >    -- James.
         >
         >
         > On 24/09/2015 19:35, Lydia Pintscher wrote:
         >> On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris <tfmorris@gmail.com>
         >> wrote:
         >>>> This is to add MusicBrainz to the primary source tool, not
         >>>> anything else?
         >>>
         >>>
         >>> It's apparently worse than that (which I hadn't realized until
         >>> I re-read the transcript).  It sounds like it's just going to
         >>> generate little warning icons for "bad" facts and not lead to
         >>> the recording of any new facts at all.
         >>>
         >>> 17:22:33 <Lydia_WMDE> we'll also work on getting the extension
         >>> deployed that will help with checking against 3rd party databases
         >>> 17:23:33 <Lydia_WMDE> the result of constraint checks and checks
         >>> against 3rd party databases will then be used to display little
         >>> indicators next to a statement in case it is problematic
         >>> 17:23:47 <Lydia_WMDE> i hope this way more people become aware of
         >>> issues and can help fix them
         >>> 17:24:35 <sjoerddebruin> Do you have any names of databases that
         >>> are supported? :)
         >>> 17:24:59 <Lydia_WMDE> sjoerddebruin: in the first version the
         >>> german national library. it can be extended later
         >>>
         >>>
         >>> I know Freebase is deemed to be nasty and unreliable, but is
         >>> MusicBrainz considered trustworthy enough to import directly or
         >>> will its facts need to be dripped through the primary source soda
         >>> straw one at a time too?
         >>
         >> The primary sources tool and the extension that helps us check
         >> against other databases are two independent things.
         >> Imports from MusicBrainz have been happening for a very long time
         >> already.
         >>
         >>
         >> Cheers
         >> Lydia
         >>
