On 1 Oct 2015, at 21:10, Stéphane Corlosquet
<scorlosquet(a)gmail.com> wrote:
This is great work! Who is Tpt?
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić <vrandecic(a)google.com> wrote:
As you know, Tpt has been working as an intern this summer at Google. He finished his
work a few weeks ago and I am happy to announce today the publication of all scripts and
the resulting data he has been working on. Additionally, we publish a few novel
visualizations of the data in Wikidata and Freebase. We are still working on the actual
report summarizing the effort and providing numbers on its effectiveness and progress.
This will take another few weeks.
First, thanks to Tpt for his amazing work! I had not expected to see such rich results.
He has exceeded my expectations by far, and produced much more transferable data than I
expected. He also worked directly on the Primary Sources tool and
helped Marco Fossati to upload a second, sports-related dataset (you can select that by
clicking on the gears icon next to the Freebase item link in the sidebar on Wikidata, when
you switch on the Primary Sources tool).
The scripts that were created and used can be found here:
All scripts are released under the Apache license v2.
The following data files are also released. All data is released under the CC0 license.
To make this explicit, a comment stating the copyright and the license has been added to
the start of each file. If any script dealing with the files hiccups due to that line,
simply remove the first line.
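For anyone who would rather not edit the files by hand, here is a minimal sketch of skipping that first line while reading. It assumes the license comment starts with "#" and that the files are UTF-8 text; neither is confirmed above, and the function name is made up for illustration.

```python
import gzip
from typing import Iterator


def read_data_lines(path: str, comment_prefix: str = "#") -> Iterator[str]:
    """Yield lines from a gzipped text file, skipping a leading comment line.

    Assumption: the license comment is the first line and starts with
    `comment_prefix`. Only the first line is ever skipped.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index == 0 and line.startswith(comment_prefix):
                continue  # drop the copyright/license comment
            yield line.rstrip("\n")
```

Reading this way leaves the downloaded file untouched, so the license notice stays with the data.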
The actual missing statements, including URLs for sources, are in this file. This was
filtered against statements already existing in Wikidata, and the statements are mapped to
Wikidata IDs. This contains about 14.3M statements (214MB gzipped, 831MB unzipped). These
are created using the mappings below in addition to the mappings already in Wikidata. The
quality of these statements is rather mixed.
Additional datasets that we know meet a higher quality bar have been previously released
and uploaded directly to Wikidata by Tpt, following community consultation.
This file contains additional mappings between Freebase MIDs and Wikidata QIDs, which are not
available in Wikidata. These are mappings based on statistical methods and single
interwiki links. Unlike the first set of mappings we had created and published previously
(which required multiple interwiki links at least), these mappings are expected to have a
lower quality - sufficient for a manual process, but probably not sufficient for an
automatic upload. This contains about 3.4M mappings (30 MB gzipped, 64MB unzipped).
This file includes labels and aliases for Wikidata items which seem to be currently
missing. The quality of these labels is undetermined. The file contains about 860k labels
in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped,
This is an interesting file as it includes a quality signal for the statements in
Freebase. What you will find here are ordered pairs of Freebase MIDs and properties, each
indicating that the given pair went through a review process and likely has a
higher quality on average. This is only for those pairs that are missing from Wikidata.
The file includes about 1.4M pairs, and this can be used for importing part of the data
directly (6MB gzipped, 52MB unzipped).
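As a rough illustration of using the reviewed pairs to pick out statements for a direct import, here is a small sketch. The in-memory shapes are assumptions, not the actual file formats: statements are taken to be (QID, property, value) tuples, reviewed pairs (MID, property) tuples, and the property identifiers in both are assumed to be directly comparable. The function name and the IDs in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Statement = Tuple[str, str, str]  # (Wikidata QID, property, value), assumed shape
ReviewedPair = Tuple[str, str]    # (Freebase MID, property), assumed shape


def select_reviewed(
    statements: Iterable[Statement],
    reviewed_pairs: Iterable[ReviewedPair],
    mid_to_qid: Dict[str, str],
) -> List[Statement]:
    """Keep only statements whose (item, property) pair was reviewed."""
    # Translate reviewed MIDs to QIDs; pairs without a known mapping are dropped.
    reviewed_keys = {
        (mid_to_qid[mid], prop)
        for mid, prop in reviewed_pairs
        if mid in mid_to_qid
    }
    return [s for s in statements if (s[0], s[1]) in reviewed_keys]
```

For example, with a mapping of "/m/x1" to "Q1" and a reviewed pair ("/m/x1", "P1"), only the statements about Q1 with property P1 would be kept.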
Now anyone can take the statements, analyse them, slice and dice them, upload them, use
them for their own tools and games, etc. They remain available through the primary sources
tool as well, which has already led to several thousand new statements in the last few
Additionally, Tpt and I created in the last few days of his internship a few
visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
The visualization needs a bit of explanation, I guess. The y-axis (up/down) represents
time, the x-axis (left/right) represents space / geolocation. The further down, the closer
you are to the present; the further up, the further back you go into the past. Time is given in a
rational scale - the 20th century gets much more space than the 1st century. The x-axis
represents longitude, with the prime meridian in the center of the image.
Every item is placed at its longitude (averaged, if several) and at the earliest point
in time mentioned on the item. For items without either, neighbouring items propagate
their value to them (averaging, if necessary). This is done repeatedly until the items are
In order to understand that a bit better, the following image offers a supporting grid:
each line from left to right represents a century (up to the first century), and each line
from top to bottom represents a meridian (with London in the middle of the graph).
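To make the placement of items without their own longitude or date a bit more concrete, here is a minimal sketch of the neighbour propagation for a single coordinate. It is not the code from the repository: the data structures, the function name, and the stopping rule (stop once a round changes nothing, or after `max_rounds`) are assumptions for illustration. It also averages longitudes naively and ignores the wrap-around at 180 degrees.

```python
from typing import Dict, List, Optional


def propagate(
    values: Dict[str, Optional[float]],
    neighbours: Dict[str, List[str]],
    max_rounds: int = 100,
) -> Dict[str, Optional[float]]:
    """Fill missing values with the average of the neighbours' known values.

    Runs in rounds; values assigned in one round only become visible to
    other items in the next round. Items that cannot be reached stay None.
    """
    result = dict(values)
    for _ in range(max_rounds):
        updates: Dict[str, float] = {}
        for item, value in result.items():
            if value is not None:
                continue  # already placed
            known = [
                result[n]
                for n in neighbours.get(item, [])
                if result.get(n) is not None
            ]
            if known:
                updates[item] = sum(known) / len(known)
        if not updates:
            break  # nothing left to propagate
        result.update(updates)
    return result
```

The same function would be run once for the longitude and once for the time value of each item.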
The same visualization has also been created for Freebase:
In order to compare the two graphs, we also overlaid them over each other. I will leave
the interpretation to you, but you can easily see the strengths and weaknesses of both
The programs for creating the visualizations are all available in the GitHub repository
mentioned above (plenty of RAM is recommended to run them).
Enjoy the visualizations, the data and the scripts! Tpt and I are available to answer
questions. I hope this will help with understanding and analysing some of the results of
the work that we did this summer.
Wikidata mailing list