Hi all,
as you know, Tpt has been working as an intern at Google this summer. He finished his work a few weeks ago, and I am happy to announce today the publication of all the scripts and the resulting data he worked on. Additionally, we are publishing a few novel visualizations of the data in Wikidata and Freebase. We are still working on the actual report summarizing the effort and providing numbers on its effectiveness and progress; this will take another few weeks.
First, thanks to Tpt for his amazing work! I had not expected to see such rich results. He has exceeded my expectations by far and produced much more transferable data than I expected. Additionally, he also worked directly on the Primary Sources tool and helped Marco Fossati to upload a second, sports-related dataset (you can select it by clicking on the gears icon next to the Freebase item link in the sidebar on Wikidata, once you have switched on the Primary Sources tool).
The scripts that were created and used can be found here:
https://github.com/google/freebase-wikidata-converter
All scripts are released under the Apache License v2.0.
The following data files are also released. All data is released under the CC0 license. To make this explicit, a comment stating the copyright and the license has been added to the start of each file; if any script processing the files hiccups due to that line, simply remove the first line.
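For example, in Python, streaming one of the dumps while skipping that header could look like this (the "#" comment prefix and the file name are placeholders, not the actual format):

```python
import gzip

# Minimal sketch: stream one of the released dumps while skipping the
# copyright/license comment at the top. The "#" prefix and the file name
# are assumptions - check the first line of the file you downloaded.
def read_dump(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):  # the added license header
                continue
            yield line.rstrip("\n")

# Usage (hypothetical file name):
for line in read_dump("freebase-dump.tsv.gz"):
    print(line)
    break
```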
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-miss... The actual missing statements, including URLs for sources, are in this file. It was filtered against statements already existing in Wikidata, and the statements are mapped to Wikidata IDs. It contains about 14.3M statements (214MB gzipped, 831MB unzipped). These were created using the mappings below in addition to the mappings already in Wikidata. The quality of these statements is rather mixed.
Additional datasets that we know meet a higher quality bar have been previously released and uploaded directly to Wikidata by Tpt, following community consultation.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p... Contains additional mappings between Freebase MIDs and Wikidata QIDs which are not yet available in Wikidata. These mappings are based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required at least multiple interwiki links), these mappings are expected to be of lower quality - sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels.... This file includes labels and aliases that seem to be currently missing from Wikidata items. The quality of these labels is undetermined. The file contains about 860k labels in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped, 32MB unzipped).
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-mi... This is an interesting file, as it includes a quality signal for the statements in Freebase. What you will find here are ordered pairs of Freebase MIDs and properties, each indicating that the given pair went through a review process and is therefore likely of higher quality on average. Only pairs that are missing from Wikidata are included. The file includes about 1.4M pairs, and it can be used for importing part of the data directly (6MB gzipped, 52MB unzipped).
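As a rough illustration, the reviewed pairs could be loaded as a filter set along these lines (the two-column tab-separated layout and the file name are assumptions, not a documented format):

```python
import gzip

# Rough sketch: load the reviewed (MID, property) pairs as a set, to be
# used as a quality filter over other Freebase-derived data. The
# two-column tab-separated layout and the file name are assumptions.
def load_reviewed_pairs(path):
    pairs = set()
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):  # skip the license header
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.add((fields[0], fields[1]))  # (mid, property)
    return pairs

reviewed = load_reviewed_pairs("reviewed-pairs.tsv.gz")  # hypothetical name
print(len(reviewed))  # should be around 1.4M
```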
Now anyone can take the statements, analyse them, slice and dice them, upload them, use them for their own tools and games, etc. They also remain available through the Primary Sources tool, which has already led to several thousand new statements in the last few weeks.
Additionally, Tpt and I created, in the last few days of his internship, a few visualizations of the current data in Wikidata and in Freebase.
First, the following is a visualization of the whole of Wikidata:
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-color.png
The visualization needs a bit of explanation, I guess. The y-axis (up/down) represents time, and the x-axis (left/right) represents space/geolocation. The further down, the closer to the present; the further up, the further into the past. Time is given on a rational scale - the 20th century gets much more space than the 1st century. The x-axis represents longitude, with the prime meridian in the center of the image.
Every item is placed at its longitude (averaged, if it has several) and at the earliest point in time mentioned on the item. Items lacking either value get it propagated from their neighbouring items (averaged, if necessary). This is done repeatedly until the values are saturated.
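For illustration, the propagation step looks roughly like the following sketch, with assumed data structures; this is not the actual script from the repository:

```python
# Illustrative sketch of the propagation step, under assumed data
# structures: `values` maps an item to its longitude (or None), and
# `neighbours` maps an item to the items linked from it.
def propagate(values, neighbours):
    changed = True
    while changed:  # repeat until the values are saturated
        changed = False
        for item in values:
            if values[item] is not None:
                continue
            known = [values[n] for n in neighbours.get(item, ())
                     if values[n] is not None]
            if known:
                values[item] = sum(known) / len(known)  # average of neighbours
                changed = True
    return values

# Example: B has no longitude of its own and inherits the average of A and C.
print(propagate({"A": 10.0, "B": None, "C": 30.0},
                {"B": ["A", "C"]}))  # {'A': 10.0, 'B': 20.0, 'C': 30.0}
```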
In order to understand that a bit better, the following image offers a supporting grid: each horizontal line represents a century (back to the first century), and each vertical line represents a meridian (with London in the middle of the graph).
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-grid-color....
The same visualizations have also been created for Freebase:
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-color.png
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-grid-color....
In order to compare the two graphs, we also overlaid them on each other. I will leave the interpretation to you, but you can easily see the strengths and weaknesses of both knowledge bases.
https://tools.wmflabs.org/wikidata-primary-sources/data/wikidata-red-freebas...
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-red-wikidat...
The programs for creating the visualizations are all available in the GitHub repository mentioned above (plenty of RAM is recommended to run them).
Enjoy the visualizations, the data and the scripts! Tpt and I are available to answer questions. I hope this will help with understanding and analysing some of the results of the work we did this summer.
Cheers, Denny
Hi Denny,
This is great work! Who is Tpt?
Steph.
It's me.
https://www.wikidata.org/wiki/User:Tpt https://twitter.com/Tpt93
Cheers,
Thomas
Out of interest, is there still a live Freebase SPARQL endpoint?
And is it kept up to date with which items have been matched to Wikidata?
Both of these would be useful, I think.
-- James.
http://lod.openlinksw.com/sparql does contain data from Freebase and Wikidata.
Kingsley
Thanks a lot for all this. It's helping make Wikidata even better. Looking forward to the report.
Cheers, Lydia
Denny/Thomas - Thanks for publishing these artefacts. I'll look forward to the report with the metrics. Are there plans for next steps or is this the end of the project as far as the two of you go?
Comments on individual items inline below:
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić vrandecic@google.com wrote:
The scripts that were created and used can be found here:
Oh no! Not PHP!! :-) One thing that concerns me is that the scripts seem to work on the Freebase RDF dump, which is a derivative artefact subject to a lossy transform. I assumed that one of the reasons for having this work hosted at Google was that it would allow direct access to the Freebase graphd quads. Is that not what happened? There's a bunch of provenance information in the graphd graph which is very valuable for quality analysis and which gets lost during the RDF transformation.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-mapped-miss...
The actual missing statements, including URLs for sources, are in this file. It was filtered against statements already existing in Wikidata, and the statements are mapped to Wikidata IDs. It contains about 14.3M statements (214MB gzipped, 831MB unzipped). These were created using the mappings below in addition to the mappings already in Wikidata. The quality of these statements is rather mixed.
From my brief exposure and the comments of others, the quality seems highly problematic, but the issue seems to lie mainly with the proposed URLs, which are of unknown provenance. Presumably, in whatever Google database these were derived from, they were tagged with the tool/pipeline that produced them and some type of probability of relevance. Including this information in the data set would help pick the most relevant URLs to present, and it would also help identify low-quality sources as voting feedback is collected. Also, filtering the URLs for known unacceptable citations (485K IMDB references, BBC Music entries which consist solely of EN Wikipedia snippets, etc.) would cut down on a lot of the noise.
Some quick stats in addition to the 14.3M statements: 2.3M entities, 183 properties, 284K different web sites.
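To make the filtering idea concrete, here is a rough sketch; the assumption that the source URL sits in the last tab-separated field (and the file name) may not match the real layout:

```python
import gzip
from urllib.parse import urlparse

# Sketch of the domain filtering suggested above: drop statements whose
# source URL comes from a known-unacceptable site. The column layout and
# the file name are assumptions, not the documented format.
BLACKLIST = {"imdb.com", "www.imdb.com"}  # extend as feedback accumulates

def acceptable(line):
    url = line.rstrip("\n").split("\t")[-1]
    return urlparse(url).netloc not in BLACKLIST

with gzip.open("missing-statements.tsv.gz", "rt", encoding="utf-8") as f:
    kept = [line for line in f
            if not line.startswith("#") and acceptable(line)]
print(len(kept))
```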
Additional datasets that we know meet a higher quality bar have been previously released and uploaded directly to Wikidata by Tpt, following community consultation.
Is there a pointer to these?
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p...
Contains additional mappings between Freebase MIDs and Wikidata QIDs which are not yet available in Wikidata. These mappings are based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required at least multiple interwiki links), these mappings are expected to be of lower quality - sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB unzipped).
I was really excited when I saw this, because the first step in the Freebase migration project should be to increase the number of topic mappings between the two databases, and 3.4M would almost double the number of existing mappings. Then I looked at the first 10K Q numbers and found that, of the 7,500 "new" mappings, almost 6,700 were already in Wikidata.
Fortunately, when I took a bigger sample, things improved. For a 4% sample, it looks like just under 30% are already in Wikidata, so if the quality of the remainder is good, that would yield an additional 2.4M mappings, which is great! Interestingly, there were also a smattering of Wikidata 404s (25), redirects (71), and values which conflicted with Wikidata (530); a cursory analysis of the latter showed that they were mostly the result of merges on the Freebase end (so the entity now has two MIDs).
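For reference, my spot checks were essentially of the following form, against the public wbgetentities API (illustrative values; error handling and rate limiting omitted):

```python
import json
from urllib.request import urlopen

# Sketch of the spot check: does a QID still exist, is it a redirect, and
# does its Freebase ID claim (P646) agree with the proposed MID?
API = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetentities&format=json&props=claims&ids=")

def check_mapping(qid, mid):
    data = json.load(urlopen(API + qid))
    entity = next(iter(data["entities"].values()))
    if "missing" in entity:
        return "404"
    if entity["id"] != qid:  # the API resolved a redirect
        return "redirect to " + entity["id"]
    claims = entity.get("claims", {}).get("P646", [])
    existing = {c["mainsnak"]["datavalue"]["value"] for c in claims
                if c["mainsnak"].get("snaktype") == "value"}
    if not existing:
        return "new mapping"
    return "already in Wikidata" if mid in existing else "conflict"

print(check_mapping("Q42", "/m/0282x"))  # illustrative values
```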
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-new-labels....
This file includes labels and aliases that seem to be currently missing from Wikidata items. The quality of these labels is undetermined. The file contains about 860k labels in about 160 languages, with 33 languages having more than 10k labels each (14MB gzipped, 32MB unzipped).
Their provenance is available in the Freebase graph. The most likely source is other-language Wikipedias, but this could be easily confirmed.
https://tools.wmflabs.org/wikidata-primary-sources/data/freebase-reviewed-mi... This is an interesting file, as it includes a quality signal for the statements in Freebase. What you will find here are ordered pairs of Freebase MIDs and properties, each indicating that the given pair went through a review process and is therefore likely of higher quality on average. Only pairs that are missing from Wikidata are included. The file includes about 1.4M pairs, and it can be used for importing part of the data directly (6MB gzipped, 52MB unzipped).
This appears to be a dump of the instances of the property /freebase/valuenotation/is_reviewed, but it's not usable as-is because of the intended semantics of the property. The property indicates that *when the triple was written* the reviewer asserted that the then-current value of the named property was correct. This means that you need to use the creation date of the triple to extract the right property value from the graph for the named property (and because there's no write protection, probably only reviewers who are members of groups like "Staff IC" or "Staff OD" should be counted).
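In code terms, recovering the vouched-for value would require something like the sketch below, where the record layout is an assumed stand-in for the graphd data, not a real API:

```python
from datetime import datetime

# Sketch of the temporal semantics described above: a review only vouches
# for the value that was current when the review triple was written, so
# pick the latest value written before the review timestamp.
def value_at_review(value_history, review_time):
    earlier = [v for v in value_history if v["written"] <= review_time]
    return max(earlier, key=lambda v: v["written"])["value"] if earlier else None

history = [
    {"value": "1879-03-14", "written": datetime(2008, 5, 1)},
    {"value": "1879-03-15", "written": datetime(2012, 7, 9)},
]
# A review written in 2010 vouches for the first value, not the later edit.
print(value_at_review(history, datetime(2010, 1, 1)))  # -> 1879-03-14
```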
Additionally, Tpt and I created, in the last few days of his internship, a few visualizations of the current data in Wikidata and in Freebase.
What are the visualizations designed to show? What, if any, insights did you derive from them?
Thanks again for the work and the interesting data sets. I'll look forward to the full report.
Tom
To my eyes, it shows that the Asian continent is still largely devoid of useful machine-readable knowledge, in either Freebase or Wikidata (or anywhere else). But this is already a known state of affairs, and it probably will not improve until 1 million US students learn Mandarin. :)
Seriously, however... it is a sad state of affairs that non-English-speaking areas of our world have such low availability of machine-readable knowledge.
Thad +ThadGuidry https://www.google.com/+ThadGuidry
It also shows that Wikidata and Freebase have different opinions on where the centre of Europe is (or maybe one of the two has tons of statements about Cape Town! I am too lazy to manually calculate the labels on the axes).
Nemo
On Fri, Oct 2, 2015 at 11:59 AM, Tom Morris tfmorris@gmail.com wrote:
Denny/Thomas - Thanks for publishing these artefacts. I'll look forward to the report with the metrics.
This is now, finally, available: http://static.googleusercontent.com/media/research.google.com/en//pubs/archi...
Are there plans for next steps or is this the end of the project as far as the two of you go?
I'm going to assume that the lack of an answer to this question over the last four months, the lack of updates on the project, and the fact that no one is even bothering to respond to issues (https://github.com/google/primarysources/issues) means that this project is dead and abandoned. That's pretty sad. For an internship, it sounds like a cool project and a decent result. As an actual serious attempt to make productive use of the Freebase data, it's a weak, half-hearted effort by Google.
Is there any interest in the Wikidata community in making use of the Freebase data now that Google has abandoned their effort, or is there too much negative sentiment against it to make it worth the effort?
Tom
p.s. I'm surprised that none of the stuff mentioned below is addressed in the paper. Was it already submitted by the beginning of October?
Comments on individual items inline below:
On Thu, Oct 1, 2015 at 2:09 PM, Denny Vrandečić vrandecic@google.com wrote:
The scripts that were created and used can be found here:
Oh no! Not PHP!! :-) One thing that concerns me is that the scripts seem to work on the Freebase RDF dump, which is a derivative artefact subject to a lossy transform. I assumed that one of the reasons for having this work hosted at Google was that it would allow direct access to the Freebase graphd quads. Is that not what happened? There's a bunch of provenance information in the graphd graph which is very valuable for quality analysis and which gets lost during the RDF transformation.
This isn't addressed in the paper and represents a significant loss of provenance information.
https://tools.wmflabs.org/wikidata-primary-sources/data/additional-mapping.p... Contains additional mappings between Freebase MIDs and Wikidata QIDs which are not yet available in Wikidata. These mappings are based on statistical methods and single interwiki links. Unlike the first set of mappings we had created and published previously (which required at least multiple interwiki links), these mappings are expected to be of lower quality - sufficient for a manual process, but probably not sufficient for an automatic upload. This contains about 3.4M mappings (30MB gzipped, 64MB unzipped).
It's not clear to me if these additional mappings are being used in the Primary Sources tool (or anywhere else). Are they?
Hi Tom, all,
This is now, finally, available: http://static.googleusercontent.com/media/research.google.com/en//pubs/archi...
Yes.
I'm going to assume that the lack of an answer to this question over the last four months, the lack of updates on the project, and the fact that no one is even bothering to respond to issues means that this project is dead and abandoned. That's pretty sad. For an internship, it sounds like a cool project and a decent result. As an actual serious attempt to make productive use of the Freebase data, it's a weak, half-hearted effort by Google.
You have a very fair point, and I apologize for our silence. Thomas P.-T. was working on this full-time during his internship; all other project members contributed based on Google's famous-infamous (1)20% time arrangement. Not as an excuse, but as an explanation. I started triaging, assigning, and working on issues yesterday, and I plan to do more in the coming days.
Is there any interest in the Wikidata community in making use of the Freebase data now that Google has abandoned their effort, or is there too much negative sentiment against it to make it worth the effort?
Please see Marco Fossati's email and my explanations from above.
p.s. I'm surprised that none of the stuff mentioned below is addressed in the paper. Was it already submitted by the beginning of October?
That is indeed the case: it was initially submitted to the WWW Research Track (http://www2016.ca/calls-for-papers/call-for-research-papers.html) and then re-routed to the Industry Track.
Cheers, Denny