Freebase to Wikidata: Results from Tpt internship - Wikidata

2 Oct 2015


      Hi all,
as you know, Tpt has been working as an intern this summer at Google. He
finished his work a few weeks ago and I am happy to announce today the
publication of all scripts and the resulting data he has been working on.
Additionally, we publish a few novel visualizations of the data in Wikidata
and Freebase. We are still working on the actual report summarizing the
effort and providing numbers on its effectiveness and progress. This will
take another few weeks.
First, thanks to Tpt for his amazing work! I have not expected to see such
rich results. He has exceeded my expectations by far, and produced much
more transferable data than I expected. Additionally, he also was working
on the primary sources tool directly and helped Marco Fossati to upload a
second, sports-related dataset (you can select that by clicking on the
gears icon next to the Freebase item link in the sidebar on Wikidata, when
you switch on the Primary Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache license v2.
The following data files are also released. All data is released under the
CC0 license (in order to make this explicit, a comment has been added to
the start of each file, stating the copyright and the license. If any
script dealing with the files hiccups due to that line, simply remove the
first line).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-miss...
The actual missing statements, including URLs for sources, are in this
file. This was filtered against statements already existing in Wikidata,
and the statements are mapped to Wikidata IDs. This contains about 14.3M
statements (214MB gzipped, 831MB unzipped). These are created using the
mappings below in addition to the mappings already in Wikidata. The quality
of these statements is rather mixed.
Additional datasets that we know meet a higher quality bar have been
previously released and uploaded directly to Wikidata by Tpt, following
community consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p...
Contains additional mappings between Freebase MIDs and Wikidata QIDs, which
are not available in Wikidata. These are mappings based on statistical
methods and single interwiki links. Unlike the first set of mappings we had
created and published previously (which required multiple interwiki links
at least), these mappings are expected to have a lower quality - sufficient
for a manual process, but probably not sufficient for an automatic upload.
This contains about 3.4M mappings (30 MB gzipped, 64MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels....
This file includes labels and aliases for Wikidata items which seem to be
currently missing. The quality of these labels is undetermined. The file
contains about 860k labels in about 160 languages, with 33 languages having
more than 10k labels each (14MB gzipped, 32MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-mi...
This is an interesting file as it includes a quality signal for the
statements in Freebase. What you will find here are ordered pairs of
Freebase mids and properties, each indicating that the given pair were
going through a review process and likely have a higher quality on average.
This is only for those pairs that are missing from Wikidata. The file
includes about 1.4M pairs, and this can be used for importing part of the
data directly (6MB gzipped, 52MB unzipped).
Now anyone can take the statements, analyse them, slice and dice them,
upload them, use them for your own tools and games, etc. They remain
available through the primary sources tool as well, which has already led
to several thousand new statements in the last few weeks.
Additionally, Tpt and I created in the last few days of his internship a
few visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis (up/down)
represents time, the x-axis (left/right) represents space / geolocation.
The further down, the closer you are to the present, the further up the
more you go in the past. Time is given in a rational scale - the 20th
century gets much more space than the 1st century. The x-axis represents
longitude, with the prime meridian in the center of the image.
Every item is being put at its longitude (averaged, if several) and at its
earliest point of time mentioned on the item. For items without either,
neighbouring items propagate their value to them (averaging, if necessary).
This is done repeatedly until the items are saturated.
In order to understand that a bit better, the following image offers a
supporting grid: each line from left to right represents a century (up to
the first century), and each line from top to bottom represent a meridian
(with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color....
The same visualizations has also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color....
In order to compare the two graphs, we also overlaid them over each other.
I will leave the interpretation to you, but you can easily see the
strengths of weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebas...
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidat...
The programs for creating the visualizations are all available in the
Github repository mentioned above (plenty of RAM is recommended to run it).
Enjoy the visualizations, the data and the script! Tpt and I are available
to answer questions. I hope this will help with understanding and analysing
some of the results of the work that we did this summer.
Cheers,
Denny