Hi Denny, all,
here is the second prototype of the new overarching DBpedia approach:
Datasets are grouped by property, and the DBpedia ontology is used where it
exists. The data contains all Wikipedia languages mapped via DBpedia, Wikidata
where mapped, and some properties from DNB, MusicBrainz, and GeoNames.
We normalized the subjects based on the sameAs links, with some quality
control. Datatypes will be normalized by rules plus machine learning in
the future.
As soon as we make some adjustments, we can load it into the GFS GUI.
We are also working on an export using Wikidata Q's and P's so it is
easier to ingest into Wikidata. More datasets from LOD will follow.
All the best,
Sebastian
On 04.10.19 01:23, Sebastian Hellmann wrote:
Hi Denny,
here are some initial points:
1. There is also the generic dataset from last month:
https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2019.08.30
(we still need to copy the documentation onto the Databus). This has the
highest coverage, but the lowest consistency. English has around 50k
parent properties, maybe more if you count child, inverse, and other
variants. We would need to check the mappings at
http://mappings.dbpedia.org, which we are doing at the moment anyhow.
It could take only an hour to map some healthy chunks into the
mappings dataset.
curl
https://downloads.dbpedia.org/repo/lts/generic/infobox-properties/2019.08.3…
| bzcat | grep "/parent"
The result is here: http://temporary.dbpedia.org/temporary/parentrel.nt.bz2
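As a runnable sketch of the same bzcat-and-grep pattern (using a tiny made-up
sample file instead of the real dump, since the full download URL is truncated
above):

```shell
# Sketch of the pipeline above on a tiny sample file; the real dump is
# bz2-compressed N-Triples. The two triples below are made up for illustration.
cat > sample.nt <<'EOF'
<http://dbpedia.org/resource/A> <http://dbpedia.org/property/parent> <http://dbpedia.org/resource/B> .
<http://dbpedia.org/resource/C> <http://dbpedia.org/property/population> "1000" .
EOF
bzip2 -f sample.nt            # produces sample.nt.bz2

# Same filter as the curl pipeline, counting matching lines:
bzcat sample.nt.bz2 | grep -c "/parent"   # prints 1
```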
Normally this dataset is messy, but still quite useful, because you
can write queries with alternatives (see dbo:position|dbp:position)
in a way that makes them usable, like this query that has worked for
13 years:
soccer players who were born in a country with more than 10 million
inhabitants, who played as goalkeeper for a club that has a stadium
with more than 30,000 seats, and whose club's country is different
from their birth country
<http://dbpedia.org/snorql/?query=SELECT+distinct+%3Fsoccerplayer+%3FcountryOfBirth+%3Fteam+%3FcountryOfTeam+%3Fstadiumcapacity%0D%0A{+%0D%0A%3Fsoccerplayer+a+dbo%3ASoccerPlayer+%3B%0D%0A+++dbo%3Aposition|dbp%3Aposition+%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FGoalkeeper_%28association_football%29%3E+%3B%0D%0A+++dbo%3AbirthPlace%2Fdbo%3Acountry*+%3FcountryOfBirth+%3B%0D%0A+++%23dbo%3Anumber+13+%3B%0D%0A+++dbo%3Ateam+%3Fteam+.%0D%0A+++%3Fteam+dbo%3Acapacity+%3Fstadiumcapacity+%3B+dbo%3Aground+%3FcountryOfTeam+.+%0D%0A+++%3FcountryOfBirth+a+dbo%3ACountry+%3B+dbo%3ApopulationTotal+%3Fpopulation+.%0D%0A+++%3FcountryOfTeam+a+dbo%3ACountry+.%0D%0AFILTER+%28%3FcountryOfTeam+!%3D+%3FcountryOfBirth%29%0D%0AFILTER+%28%3Fstadiumcapacity+%3E+30000%29%0D%0AFILTER+%28%3Fpopulation+%3E+10000000%29%0D%0A}+order+by+%3Fsoccerplayer>
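For readability, here is the query decoded from the snorql URL above (dbo: is
the mapped DBpedia ontology namespace, dbp: the raw infobox property
namespace):

```sparql
SELECT DISTINCT ?soccerplayer ?countryOfBirth ?team ?countryOfTeam ?stadiumcapacity
WHERE {
  ?soccerplayer a dbo:SoccerPlayer ;
     dbo:position|dbp:position <http://dbpedia.org/resource/Goalkeeper_(association_football)> ;
     dbo:birthPlace/dbo:country* ?countryOfBirth ;
     # dbo:number 13 ;
     dbo:team ?team .
  ?team dbo:capacity ?stadiumcapacity ;
        dbo:ground ?countryOfTeam .
  ?countryOfBirth a dbo:Country ;
                  dbo:populationTotal ?population .
  ?countryOfTeam a dbo:Country .
  FILTER (?countryOfTeam != ?countryOfBirth)
  FILTER (?stadiumcapacity > 30000)
  FILTER (?population > 10000000)
}
ORDER BY ?soccerplayer
```

The `dbo:position|dbp:position` alternative is what makes the messy generic
data usable: it matches whichever of the mapped or raw property is present.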
Maybe we could also evaluate some queries which can be answered by
one system or the other? Can you do the query above in Wikidata?
2. We also have an API to get all references from infoboxes now, as a
partial result of the GFS project. See point 5 here:
https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE
3. This particular dataset (generic/infobox-properties) above is also
a good measure of non-adoption of Wikidata in Wikipedia. In total, it
has over 500 million statements for all languages. Having a statement
here means that the data comes from an infobox template parameter and
no Wikidata is used. The dataset is still extracted with the same
algorithm, so we can check whether it got bigger or smaller. The fact
that this still works and has a decent size indicates that Wikidata
adoption by Wikipedians is low.
4. I need to look at the parent example in detail. However, I have to
say that the property lends itself well to the Wikidata approach,
since it is easily understood, has a sort of truthiness, and is easy
to research and add.
I am not sure if it is representative, as e.g. "employer" is more
difficult to model (time-scoped). For example, my data here is outdated:
https://www.wikidata.org/wiki/Q39429171
Also, I don't yet see how this will become a more systematic approach
that shows where to optimize, but I still need to read it fully.
We can start with this one, however.
-- Sebastian
On 01.10.19 01:13, Denny Vrandečić wrote:
Hi all,
as promised, now that I am back from my trip, here's my draft of the
comparison of Wikidata, DBpedia, and Freebase.
It is a draft, it is obviously potentially biased given my
background, etc., but I hope that we can work on it together to get
it into a good shape.
Markus, amusingly I took pretty much the same example that you went
for, the parent predicate. So yes, I was also surprised by the
results, and would love to have Sebastian or Kingsley look into it
and see if I conducted it fairly.
SJ, Andra, thanks for offering to take a look. I am sure you all can
contribute your own unique background and make suggestions on how to
improve things and whether the results ring true.
Marco, I totally agree with what you said - the project has stalled,
and there is plenty of opportunity to harvest more data from Freebase
and bring it to Wikidata, and this should be reignited. Sebastian, I
also agree with you, and the numbers do so too, the same is true with
the extraction results from DBpedia.
Sebastian, Kingsley, I tried to describe how I understand DBpedia,
and all steps should be reproducible. As it seems that the two of you
also have to discuss one or the other thing about DBpedia's identity,
I am relieved that my confusion is not entirely unjustified. So I
tried to use both the last stable DBpedia release as well as a
new-style DBpedia fusion dataset for the comparison. But I might have
gotten the whole procedure wrong. I am happy to be corrected.
On Sat, Sep 28, 2019 at 12:28 AM <hellmann@informatik.uni-leipzig.de> wrote:
Meanwhile, Google crawls all the references and
extracts facts from
there. We don't
have that available, but there is Linked Open
Data.
Potentially, not a bad idea, but we don't do that.
Everyone, this is the first time I share a Colab notebook, and I have
no idea if I did it right. So any feedback of the form "oh you didn't
switch on that bit over here" or "yes, this works, thank you" is very
welcome, because I have no clue what I am doing :) Also, I never did
this kind of analysis so transparently, which is kinda both totally
cool and rather scary, because now you can all see how dumb I am :)
So everyone is invited to send Pull Requests (I guess that's how this
works?), and I would love for us to create a result together that we
agree on. I see the result of this exercise to be potentially twofold:
1) a publication we can point people to who ask about the differences
between Wikidata, DBpedia, and Freebase
2) to reignite or start projects and processes to reduce these
differences
So, here is the link to my Colab notebook:
https://github.com/vrandezo/colabs/blob/master/Comparing_coverage_and_accur…
Ideally, the third goal could be to get to a deeper understanding of
how these three projects relate to each other - in my point of view,
Freebase is dead and outdated, Wikidata is the core knowledge base
that anyone can edit, and DBpedia is the core project to weave
value-adding workflows on top of Wikidata or other datasets from the
linked open data cloud together. But that's just a proposal.
Cheers,
Denny
On Sat, Sep 28, 2019 at 12:28 AM <hellmann@informatik.uni-leipzig.de> wrote:
Hi Gerard,
I was not trying to judge here. I was just saying that it wasn't
much data in the end.
For me Freebase was basically cherry-picked.
Meanwhile, the data we extract is more pertinent to the goal of
having Wikidata cover the info boxes. We still have ~ 500 million
statements left. But none of it is used yet. Hopefully we can
change that.
Meanwhile, Google crawls all the references and extracts facts
from there. We don't have that available, but there is Linked
Open Data.
--
Sebastian
On September 27, 2019 5:26:43 PM GMT+02:00, Gerard Meijssen
<gerard.meijssen@gmail.com> wrote:
Hoi,
I totally reject the assertion that it was so bad. I have always had
the opinion that the main issue was an atrocious user interface. Add
to this the people who have Wikipedia notions about quality. They have
had, and still have, a detrimental effect on both the quantity and
quality of Wikidata.
When you add the functionality that is being built by the
data wranglers at DBpedia, it becomes easier to compare the data from
the Wikipedias with Wikidata (and, why not, Freebase), add what has
consensus, and curate the differences. This will enable a true data
sense of quality and allow us to provide a much improved service.
Thanks,
GerardM
On Fri, 27 Sep 2019 at 15:54, Marco Fossati <fossati@spaziodati.eu> wrote:
Hey Sebastian,
On 9/20/19 10:22 AM, Sebastian Hellmann wrote:
Not much of Freebase did end up in Wikidata.
Dropping here some pointers to shed light on the migration of Freebase
to Wikidata, since I was partially involved in the process:
1. the WikiProject [1];
2. the paper behind it [2];
3. the datasets to be migrated [3].
I can confirm that the migration has stalled: as of today, *528
thousand* Freebase statements have been curated by the community, out
of *10 million*. By 'curated', I mean approved or rejected.
These numbers come from two queries against the primary
sources tool
database.
The stall is due to several causes: in my opinion, the
most important
one was the bad quality of sources [4,5] coming from the
Knowledge Vault
project [6].
Cheers,
Marco
[1]
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase
[2]
http://static.googleusercontent.com/media/research.google.com/en//pubs/arch…
[3]
https://www.wikidata.org/wiki/Wikidata:Primary_sources_tool/Version_1#Data
[4]
https://www.wikidata.org/wiki/Wikidata_talk:Primary_sources_tool/Archive/20…
[5]
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Semi-automatic_…
[6]
https://www.cs.ubc.ca/~murphyk/Papers/kv-kdd14.pdf
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--
All the best,
Sebastian Hellmann
Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects:
http://dbpedia.org,
http://nlp2rdf.org,
http://linguistics.okfn.org,
https://www.w3.org/community/ld4lt
Homepage:
http://aksw.org/SebastianHellmann
Research Group:
http://aksw.org