Hi Marco,
On October 1, 2019 11:48:02 PM GMT+02:00, Marco Fossati fossati@spaziodati.eu wrote:
Hi Denny,
Thanks for publishing your Colab notebook! I went through it and would like to share my first thoughts here. We can then move further discussion somewhere else.
- in general, how can we compare datasets with totally different timestamps? Wikidata is alive, Freebase is dead, and the latest DBpedia dump is old;
DBpedia made monthly releases for the past three months, which will continue to improve and grow in an agile manner; we focused on debugging and integration. Max age would be 30 days. I think that is OK. Denny validated against the live endpoint. This is OK to drive growth, but it is not scientifically reproducible the way a comparison against dumps is.
- given that all datasets contain Wikipedia links, perhaps we could use them as a bridge for the comparison, instead of Wikidata mappings. I'm assuming that Freebase and DBpedia entities with Wikidata mappings are subsets of the whole datasets (but this should be verified);
- we could use record linkage techniques to connect Wikidata entities with Freebase and DBpedia ones, then assess the agreement in terms of statements per entity. There has been some experimental work (different use case and goal) in the soweego project: https://soweego.readthedocs.io/en/latest/validator.html
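To illustrate the bridging and agreement ideas above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy data, the function names `bridge_by_wikipedia` and `agreement`, the property labels, and the choice of Jaccard overlap as the agreement measure. It presumes that Wikipedia titles and statement values have already been normalized, which is the hard part in practice, and it is not how soweego or the Colab notebook work.

```python
from typing import Dict, Set, Tuple

# A statement is a (property, value) pair; both are assumed to be
# normalized to a shared vocabulary beforehand (hypothetical setup).
Statement = Tuple[str, str]


def bridge_by_wikipedia(links_a: Dict[str, str],
                        links_b: Dict[str, str]) -> Dict[str, str]:
    """Pair entity IDs from two datasets that link to the same Wikipedia title."""
    shared_titles = links_a.keys() & links_b.keys()
    return {links_a[title]: links_b[title] for title in shared_titles}


def agreement(stmts_a: Set[Statement], stmts_b: Set[Statement]) -> float:
    """Jaccard overlap of two statement sets (1.0 when both are empty)."""
    union = stmts_a | stmts_b
    return len(stmts_a & stmts_b) / len(union) if union else 1.0


# Toy data: Wikipedia title -> entity ID in each dataset.
wikidata_links = {"Douglas_Adams": "Q42", "Berlin": "Q64"}
dbpedia_links = {"Douglas_Adams": "dbr:Douglas_Adams", "Rome": "dbr:Rome"}

# Toy statements per entity, with made-up property labels.
wikidata_stmts = {"Q42": {("birthPlace", "Cambridge"), ("occupation", "writer")}}
dbpedia_stmts = {"dbr:Douglas_Adams": {("birthPlace", "Cambridge"),
                                       ("occupation", "novelist")}}

pairs = bridge_by_wikipedia(wikidata_links, dbpedia_links)
scores = {
    (wd_id, db_id): agreement(wikidata_stmts.get(wd_id, set()),
                              dbpedia_stmts.get(db_id, set()))
    for wd_id, db_id in pairs.items()
}
print(scores)
```

Only entities present on both sides of the bridge get a score, so the share of unmatched entities would need to be reported separately to check the subset assumption mentioned above.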
On 10/1/19 1:13 AM, Denny Vrandečić wrote:
Marco, I totally agree with what you said - the project has stalled, and there is plenty of opportunity to harvest more data from Freebase and bring it to Wikidata, and this should be reignited.
Yeah, that would be great. There is known work to do, but it's hard to sustain such a big project without allocated resources: https://phabricator.wikimedia.org/maniphest/query/CPiqkafGs5G./#R
BTW, there is also version 2 of the Wikidata primary sources tool that needs love, although I'm now skeptical that it will be an effective way to achieve the Freebase harvesting. We should probably rethink the whole thing, and restart small with very simple use cases, pretty much like the Harvest templates tool you mentioned: https://tools.wmflabs.org/pltools/harvesttemplates/
Cheers,
Marco
P.S.: I *might* have found the freshest relevant DBpedia datasets: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects I said *might* because it was really painful to find a download button and to guess among multiple versions of the same dataset: https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09... @Sebastian may know if it's the good one :-)