On 1 Oct 2015, at 21:10, Stéphane Corlosquet
<scorlosquet(a)gmail.com>
wrote:
Hi Denny,
This is great work! Who is Tpt?
Steph.
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić
<vrandecic(a)google.com> wrote:
Hi all,
as you know, Tpt has been working as an intern this summer at
Google. He finished his work a few weeks ago and I am happy to
announce today the publication of all scripts and the resulting data
he has been working on. Additionally, we publish a few novel
visualizations of the data in Wikidata and Freebase. We are still
working on the actual report summarizing the effort and providing
numbers on its effectiveness and progress. This will take another
few weeks.
First, thanks to Tpt for his amazing work! I had not expected to
see such rich results. He has exceeded my expectations by far, and
produced much more transferable data than I expected. Additionally,
he also worked on the primary sources tool directly and helped
Marco Fossati to upload a second, sports-related dataset (you can
select that by clicking on the gears icon next to the Freebase item
link in the sidebar on Wikidata, when you switch on the Primary
Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache license v2.
The following data files are also released. All data is released
under the CC0 license. To make this explicit, a comment has been
added to the start of each file, stating the copyright and the
license. If any script dealing with the files hiccups due to that
line, simply remove the first line.
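For scripts that would rather skip the license line than edit the files, here is a minimal sketch. It assumes the comment line starts with "#", which is not stated above, so check the first line of the file you downloaded. The function names are made up for illustration.

```python
import gzip
import io


def iter_data_lines(stream, comment_prefix="#"):
    """Yield lines from a text stream, skipping a leading license comment.

    comment_prefix is an assumption; check the first line of the real file.
    """
    for index, line in enumerate(stream):
        if index == 0 and line.startswith(comment_prefix):
            continue  # drop only the first line, and only if it is a comment
        yield line.rstrip("\n")


def iter_gzipped_data_lines(path, comment_prefix="#"):
    """Same as iter_data_lines, but reads a gzipped file from a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        yield from iter_data_lines(stream, comment_prefix)


# Small in-memory demonstration with made-up content.
sample = io.StringIO("# license comment (made-up example)\nrow one\nrow two\n")
demo_lines = list(iter_data_lines(sample))
```

Only the first line is ever dropped, so a later line that happens to start with the same prefix is passed through unchanged.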
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-mis…
The actual missing statements, including URLs for sources, are in
this file. This was filtered against statements already existing in
Wikidata, and the statements are mapped to Wikidata IDs. This
contains about 14.3M statements (214MB gzipped, 831MB unzipped).
These are created using the mappings below in addition to the
mappings already in Wikidata. The quality of these statements is
rather mixed.
Additional datasets that we know meet a higher quality bar have been
previously released and uploaded directly to Wikidata by Tpt,
following community consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.…
Contains additional mappings between Freebase MIDs and Wikidata
QIDs, which are not available in Wikidata. These are mappings based
on statistical methods and single interwiki links. Unlike the first
set of mappings we had created and published previously (which
required multiple interwiki links at least), these mappings are
expected to have a lower quality - sufficient for a manual process,
but probably not sufficient for an automatic upload. This contains
about 3.4M mappings (30 MB gzipped, 64MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels…
This file includes labels and aliases for Wikidata items which seem
to be currently missing. The quality of these labels is
undetermined. The file contains about 860k labels in about 160
languages, with 33 languages having more than 10k labels each (14MB
gzipped, 32MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-m…
This is an interesting file as it includes a quality signal for the
statements in Freebase. What you will find here are ordered pairs of
Freebase mids and properties, each indicating that the given pair
went through a review process and likely has a higher quality
on average. This is only for those pairs that are missing from
Wikidata. The file includes about 1.4M pairs, and this can be used
for importing part of the data directly (6MB gzipped, 52MB unzipped).
Now anyone can take the statements, analyse them, slice and dice
them, upload them, use them for their own tools and games, etc. They
remain available through the primary sources tool as well, which has
already led to several thousand new statements in the last few weeks.
Additionally, Tpt and I created in the last few days of his
internship a few visualizations of the current data in Wikidata and
in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis
(up/down) represents time, the x-axis (left/right) represents space
/ geolocation. The further down, the closer you are to the present;
the further up, the further you go into the past. Time is given in a
rational scale - the 20th century gets much more space than the 1st
century. The x-axis represents longitude, with the prime meridian in
the center of the image.
Every item is placed at its longitude (averaged, if several) and
at its earliest point of time mentioned on the item. For items
without either, neighbouring items propagate their value to them
(averaging, if necessary). This is done repeatedly until the items
are saturated.
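The propagation step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the repository: the function name, the dictionary-based graph, and the choice to update all items once per round are made up here, and it handles a single value (say, longitude) per item.

```python
def propagate(known_values, neighbours):
    """Fill in missing values from neighbouring items until nothing changes.

    known_values: dict mapping item -> float for items that have a value.
    neighbours:   dict mapping item -> list of neighbouring items.
    Both structures are hypothetical stand-ins for the real data.
    """
    values = dict(known_values)
    while True:
        newly_assigned = {}
        for item, adjacent in neighbours.items():
            if item in values:
                continue  # items with a value keep it
            available = [values[n] for n in adjacent if n in values]
            if available:
                # Average when several neighbours contribute a value.
                newly_assigned[item] = sum(available) / len(available)
        if not newly_assigned:
            break  # saturated: no further item can receive a value
        values.update(newly_assigned)
    return values


# Made-up example: "c" sits between "a" and "b", "d" only touches "c",
# and "e" has no neighbours, so it never receives a value.
example = propagate(
    {"a": 10.0, "b": 20.0},
    {"c": ["a", "b"], "d": ["c"], "e": []},
)
```

Items that cannot be reached from any item with a value simply stay without one in this sketch.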
In order to understand that a bit better, the following image offers
a supporting grid: each line from left to right represents a century
(up to the first century), and each line from top to bottom
represents a meridian (with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color…
The same visualizations have also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color…
In order to compare the two graphs, we also overlaid them over each
other. I will leave the interpretation to you, but you can easily
see the strengths and weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freeba…
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikida…
The programs for creating the visualizations are all available in
the GitHub repository mentioned above (plenty of RAM is recommended
to run them).
Enjoy the visualizations, the data and the scripts! Tpt and I are
available to answer questions. I hope this will help with
understanding and analysing some of the results of the work that we
did this summer.
Cheers,
Denny
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
--
Steph.