Re: [Wikidata] Freebase to Wikidata: Results from Tpt internship

21 Feb 2016

On Fri, Oct 2, 2015 at 11:59 AM, Tom Morris <tfmorris(a)gmail.com> wrote:

...
 Denny/Thomas - Thanks for publishing these artefacts. I'll look forward
 to the report with the metrics.

This is now, finally, available:
http://static.googleusercontent.com/media/research.google.com/en//pubs/arch…

...
 Are there plans for next steps or is this the end of the project as far as
 the two of you go?

I'm going to assume that the lack of answer to this question over the last
four months, the lack of updates on the project, and the fact no one is
even bothering to respond to issues
<https://github.com/google/primarysources/issues> means that this project
is dead and abandoned.  That's pretty sad. For an internship, it sounds
like a cool project and a decent result. As an actual serious attempt to
make productive use of the Freebase data, it's a weak, half-hearted effort
by Google.

Is there any interest in the Wikidata community for making use of the
Freebase data now that Google has abandoned their effort, or is there too
much negative sentiment against it to make it worth the effort?

Tom

p.s. I'm surprised that none of the stuff mentioned below is addressed in
the paper. Was it already submitted by the beginning of October?

Comments on individual items inline below:
...

 On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrandecic(a)google.com>
 wrote:

 The scripts that were created and used can be found here:

 https://github.com/google/freebase-wikidata-converter

 Oh no!  Not PHP!! :-)  One thing that concerns me is that the scripts seem
 to work on the Freebase RDF dump which is a derivative artefact subject to a
 lossy transform.  I assumed that one of the reasons for having this work
 hosted at Google was that it would allow direct access to the Freebase
 graphd quads.  Is that not what happened?  There's a bunch of provenance
 information which is very valuable for quality analysis in the graphd graph
 which gets lost during the RDF transformation.

This isn't addressed in the paper and represents a significant loss of
provenance information.

...

https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis…
 The actual missing statements, including URLs for sources, are in this
 file. This was filtered against statements already existing in Wikidata,
 and the statements are mapped to Wikidata IDs. This contains about 14.3M
 statements (214MB gzipped, 831MB unzipped). These are created using the
 mappings below in addition to the mappings already in Wikidata. The quality
 of these statements is rather mixed.

 From my brief exposure and the comments of others, the quality seems
 highly problematic, but the issue seems to mainly be with the URLs
 proposed, which are of unknown provenance.  Presumably in whatever Google
 database these were derived from, they were tagged with the tool/pipeline
 that produced them and some type of probability of relevance.  Including
 this information in the data set would help pick the most relevant URLs to
 present and also help identify low-quality sources as voting feedback is
 collected.  Also, filtering the URLs for known unacceptable citations (485K
 IMDB references, BBC Music entries which consist solely of EN Wikipedia
 snippets, etc) would cut down on a lot of the noise.
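
To illustrate the kind of URL filtering suggested above, here is a minimal sketch in Python. The blocklist entries and the function names are hypothetical, chosen only to mirror the two examples mentioned (IMDB references and BBC Music pages). It is not taken from the Primary Sources tool or the converter scripts.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist of (domain, path prefix) pairs. An empty prefix
# blocks the whole site. These two entries only mirror the examples above.
BLOCKED = [
    ("imdb.com", ""),
    ("bbc.co.uk", "/music"),
]

def is_blocked(url, blocked=BLOCKED):
    """Return True if the URL falls under a blocked domain and path prefix."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    for domain, prefix in blocked:
        host_match = host == domain or host.endswith("." + domain)
        path_match = path == prefix or path.startswith(prefix + "/")
        if host_match and path_match:
            return True
    return False

def filter_sources(urls):
    """Keep only the URLs that are not on the blocklist."""
    return [u for u in urls if not is_blocked(u)]
```

Matching on the parsed host rather than on a substring of the URL avoids dropping unrelated pages that merely mention a blocked domain in their path or query string.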

 Some quick stats in addition to the 14.3M statements: 2.3M entities, 183
 properties, 284K different web sites.

 Additional datasets that we know meet a higher quality bar have been
 previously released and uploaded directly to Wikidata by Tpt, following
 community consultation.

 Is there a pointer to these?

https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.…
 Contains additional mappings between Freebase MIDs and Wikidata QIDs,
 which are not available in Wikidata. These are mappings based on
 statistical methods and single interwiki links. Unlike the first set of
 mappings we had created and published previously (which required multiple
 interwiki links at least), these mappings are expected to have a lower
 quality - sufficient for a manual process, but probably not sufficient for
 an automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB
 unzipped).

 I was really excited when I saw this because the first step in the
 Freebase migration project should be to increase the number of topic
 mappings between the two databases and 3.4M would almost double the number
 of existing mappings.  Then I looked at the first 10K Q numbers and found
 that, of the 7,500 "new" mappings, almost 6,700 were already in Wikidata.

 Fortunately when I took a bigger sample, things improved.  For a 4%
 sample, it looks like just under 30% are already in Wikidata, so if the
 quality of the remainder is good, that would yield an additional 2.4M
 mappings, which is great!  Interestingly, there was also a smattering of
 Wikidata 404s (25), redirects (71), and values which conflicted with
 Wikidata (530).  A cursory analysis of the latter showed that they were
 mostly the result of merges on the Freebase end (so the entity now has two
 MIDs).
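
For anyone checking the arithmetic behind the 2.4M estimate, a small sketch follows. The function name is made up for illustration, and the 30% figure is the rounded overlap from the 4% sample, so the result is only as good as that sample.

```python
def estimate_new_mappings(total, already_present_rate):
    """Scale a sampled overlap rate up to the whole mapping file."""
    return total * (1 - already_present_rate)

# Figures quoted in this thread.
total_mappings = 3_400_000          # size of the additional mapping file
sample_overlap = 0.30               # "just under 30%" in the 4% sample

estimate = estimate_new_mappings(total_mappings, sample_overlap)
print(f"Estimated new mappings: {estimate / 1e6:.1f}M")

# For comparison, the overlap in the first 10K Q numbers was much higher.
first_sample_overlap = 6_700 / 7_500
print(f"Overlap in first sample: {first_sample_overlap:.0%}")
```

The gap between the two overlap rates suggests that low Q numbers are unrepresentative, presumably because older items are more likely to have their Freebase IDs recorded already.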

It's not clear to me if these additional mappings are being used in the
Primary Sources tool (or anywhere else). Are they?

...

https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels…
 This file includes labels and aliases for Wikidata items which seem to be
 currently missing. The quality of these labels is undetermined. The file
 contains about 860k labels in about 160 languages, with 33 languages having
 more than 10k labels each (14MB gzipped, 32MB unzipped).

 Their provenance is available in the Freebase graph.  The most likely
 source is other language Wikipedias, but this could be easily confirmed.

 https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m…
 This is an interesting file as it includes a quality signal for the
 statements in Freebase. What you will find here are ordered pairs of
 Freebase mids and properties, each indicating that the given pair were
 going through a review process and likely have a higher quality on average.
 This is only for those pairs that are missing from Wikidata. The file
 includes about 1.4M pairs, and this can be used for importing part of the
 data directly (6MB gzipped, 52MB unzipped).

 This appears to be a dump of the instances for the property
 /freebase/valuenotation/is_reviewed but it's not usable as is because of
 the intended semantics of the property.  The property indicates that *when
 the triple was written* the reviewer asserts that the current value of
 the named property is correct.  This means that you need to use the creation
 date of the triple to extract the right property value from the graph for
 the named property (and because there's no write protection, probably only
 reviewers who are members of groups like "Staff IC" or "Staff OD" should be
 counted).
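
A rough sketch of the lookup described above, in Python. The data layout is an assumption made for illustration: a per-property write history as sorted (timestamp, value) pairs, plus the review triple's creation time and the reviewer's group memberships. It does not reflect the actual graphd schema or any Freebase API.

```python
from bisect import bisect_right

# Group names taken from the message above; treating exactly these two as
# trusted is an assumption.
TRUSTED_GROUPS = frozenset({"Staff IC", "Staff OD"})

def value_at(history, when):
    """Return the value in force at `when`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp;
    ISO 8601 date strings work because they sort chronologically.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, when)
    return history[i - 1][1] if i else None

def reviewed_value(history, review_time, reviewer_groups):
    """Return the value the reviewer vouched for, or None if not trusted."""
    if not TRUSTED_GROUPS & set(reviewer_groups):
        return None
    return value_at(history, review_time)

# Example: the value was changed again after the review was written.
history = [
    ("2010-01-01", "A"),
    ("2012-06-01", "B"),
    ("2014-03-01", "C"),
]
print(reviewed_value(history, "2013-01-01", ["Staff IC"]))
```

In this example the reviewed value is the one written in 2012, not the current one, which is why a plain dump of (mid, property) pairs cannot be used to import the present value directly.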
