Hi all,
as you know, Tpt has been working as an intern this summer at Google. He finished his work a few weeks ago and I am happy to announce today the publication of all scripts and the resulting data he has been working on. Additionally, we publish a few novel visualizations of the data in Wikidata and Freebase. We are still working on the actual report summarizing the effort and providing numbers on its effectiveness and progress. This will take another few weeks.
First, thanks to Tpt for his amazing work! I had not expected to see such rich results. He has exceeded my expectations by far, and produced much more transferable data than I expected. Additionally, he also worked on the Primary Sources tool directly and helped Marco Fossati upload a second, sports-related dataset (you can select it by clicking on the gears icon next to the Freebase item link in the sidebar on Wikidata, once you have switched on the Primary Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache license v2.
The following data files are also released. All data is released under the CC0 license. To make this explicit, a comment stating the copyright and the license has been added to the start of each file. If any script dealing with the files hiccups due to that line, simply remove the first line.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-miss... The actual missing statements, including URLs for sources, are in this file. This was filtered against statements already existing in Wikidata, and the statements are mapped to Wikidata IDs. This contains about 14.3M statements (214MB gzipped, 831MB unzipped). These are created using the mappings below in addition to the mappings already in Wikidata. The quality of these statements is rather mixed.
Additional datasets that we know meet a higher quality bar have been previously released and uploaded directly to Wikidata by Tpt, following community consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p... Contains additional mappings between Freebase MIDs and Wikidata QIDs, which are not available in Wikidata. These are mappings based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required at least multiple interwiki links), these mappings are expected to have a lower quality: sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.... This file includes labels and aliases for Wikidata items which seem to be currently missing. The quality of these labels is undetermined. The file contains about 860k labels in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped, 32MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-mi... This is an interesting file as it includes a quality signal for the statements in Freebase. What you will find here are ordered pairs of Freebase MIDs and properties, each indicating that the given pair has gone through a review process and likely has a higher quality on average. This is only for those pairs that are missing from Wikidata. The file includes about 1.4M pairs, and this can be used for importing part of the data directly (6MB gzipped, 52MB unzipped).
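For anyone whose script trips over the license comment at the start of the data files, here is a minimal Python sketch of reading a gzipped file while dropping its first line. The function name iter_data_lines and the skip_first_line parameter are made up for illustration, and the sketch assumes the files are UTF-8 text with one record per line; it does not assume anything about the format of the records themselves.

```python
import gzip
from typing import Iterator


def iter_data_lines(path: str, skip_first_line: bool = True) -> Iterator[str]:
    """Yield the lines of a gzipped text file without trailing newlines.

    Assumption: the first line is the copyright/license comment described
    in the announcement, so it is discarded by default.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        if skip_first_line:
            # Discard the license comment; the default avoids an error on an empty file.
            next(handle, None)
        for line in handle:
            yield line.rstrip("\n")
```

Called on a locally downloaded copy of one of the files, this streams the records one at a time, so the 831MB statements file does not need to fit in memory.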
Now anyone can take the statements, analyse them, slice and dice them, upload them, use them for your own tools and games, etc. They remain available through the primary sources tool as well, which has already led to several thousand new statements in the last few weeks.
Additionally, Tpt and I created in the last few days of his internship a few visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis (up/down) represents time, the x-axis (left/right) represents space / geolocation. The further down, the closer you are to the present, the further up the more you go in the past. Time is given in a rational scale - the 20th century gets much more space than the 1st century. The x-axis represents longitude, with the prime meridian in the center of the image.
Every item is being put at its longitude (averaged, if several) and at its earliest point of time mentioned on the item. For items without either, neighbouring items propagate their value to them (averaging, if necessary). This is done repeatedly until the items are saturated.
In order to understand that a bit better, the following image offers a supporting grid: each line from left to right represents a century (up to the first century), and each line from top to bottom represents a meridian (with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color....
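The placement and propagation steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the repository: the function names, the dictionary-based graph, and the choice to update all items once per round are mine. It handles a single coordinate (longitude or time); in practice you would run it once for each.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def initial_position(
    longitudes: Sequence[float], years: Sequence[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Average the longitudes and take the earliest point in time, if present."""
    lon = sum(longitudes) / len(longitudes) if longitudes else None
    year = min(years) if years else None
    return lon, year


def propagate(
    values: Dict[str, Optional[float]], neighbours: Dict[str, List[str]]
) -> Dict[str, Optional[float]]:
    """Fill in missing values from neighbouring items until nothing changes.

    Assumption: an item without a value receives the plain average of the
    neighbours that already have one; items that have a value keep it.
    """
    result = dict(values)
    while True:
        updates = {}
        for item, value in result.items():
            if value is not None:
                continue
            known = [
                result[n]
                for n in neighbours.get(item, [])
                if result.get(n) is not None
            ]
            if known:
                updates[item] = sum(known) / len(known)
        if not updates:
            # Saturated: no remaining item can receive a value.
            return result
        result.update(updates)
```

Items that are not connected to anything with a value simply stay empty. Note that a plain average of longitudes behaves badly near the 180th meridian; the sketch ignores that.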
The same visualizations have also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color....
In order to compare the two graphs, we also overlaid them on each other. I will leave the interpretation to you, but you can easily see the strengths and weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebas...
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidat...
The programs for creating the visualizations are all available in the GitHub repository mentioned above (plenty of RAM is recommended to run them).
Enjoy the visualizations, the data and the scripts! Tpt and I are available to answer questions. I hope this will help with understanding and analysing some of the results of the work that we did this summer.
Cheers, Denny