Hi all,

as we wrote in a previous email, we are currently applying for a Wikimedia grant to use the DBpedia Extraction Software to synchronize infoboxes between Wikipedias, as well as between Wikipedia and Wikidata.

During the discussion on the talk page (https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE), the concern was raised that we at DBpedia have too much of a bird's-eye view on things, which is true: we are used to bulk-extracting and working with large amounts of data rather than with individual records.

The main problem is that, for us, the prototype first of all shows that we have the data, which we could exploit in several ways; for Wikipedians and Wikidata users, however, the main focus is the process of using it. We assumed that an article-centric view would be best for Wikipedians, i.e. you can directly compare one article's infobox with all other language versions and with Wikidata. For Wikidata, however, the article/entity-centric view does not seem practical, and we would like feedback on this. The options for GlobalFactSync are:

  1. entity-centric view, as it is now: the same infobox across all Wikipedias and Wikidata for one article/entity
  2. template-centric (this will not work, as there are few or no equivalent infoboxes across Wikipedias)
  3. template-parameter-centric: the current focus of Harvest Templates, i.e. one parameter of one template in one language (https://tools.wmflabs.org/pltools/harvesttemplates/)
  4. multilingual-template-parameter-centric or Wikidata-property-centric, i.e. one parameter / one Wikidata property across multiple templates and multiple languages (see the sketch below this list). This is a supercharged Harvest Templates, but since it is a power tool for syncing, it gets more complex and keeping an overview is difficult.
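
To make option 4 more concrete, below is a minimal sketch (in Scala, like the extraction framework) of what a Wikidata-property-centric view could look like. It is not the actual GlobalFactSync data model, and the template/parameter names are illustrative examples, not verified mappings:

  // one place in Wikipedia where the fact is maintained
  case class TemplateParam(language: String, template: String, parameter: String)

  // one Wikidata property plus all template parameters that hold the same fact
  case class PropertySyncView(wikidataProperty: String, sources: Seq[TemplateParam])

  object PropertyCentricExample extends App {
    // P571 = "inception"; the template/parameter names are examples only
    val inception = PropertySyncView(
      wikidataProperty = "P571",
      sources = Seq(
        TemplateParam("en", "Infobox company", "founded"),
        TemplateParam("de", "Infobox Unternehmen", "Gründungsdatum"),
        TemplateParam("fr", "Infobox Société", "date de création")
      )
    )
    // a sync tool would compare the value extracted from each source with the
    // Wikidata statement for P571 and surface mismatches for review
    println(inception)
  }

The unit of work here becomes one Wikidata property rather than one article, which is why it resembles a supercharged Harvest Templates.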

All feedback is welcome; we also created a topic for this here: https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE#Focus_of_Tool_for_Wikidata

# Motivation, Wikidata adoption report

One goal of Wikidata is to support the infoboxes. We are now doing monthly releases at DBpedia and are able to provide statistics about Wikidata adoption, or the lack of it, in Wikipedia:

https://docs.google.com/spreadsheets/d/1_aNjgExJW_b0MvDSQs5iSXHYlwnZ8nU2zrQMxZ5edrQ/edit#gid=0

In total, 584 million facts are still maintained directly in Wikipedia infoboxes without using Wikidata. Where such a fact already exists in Wikidata, the same fact is maintained in two or more places, multiplying the maintenance work (unless the fact is static).

Code used to extract: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala

Data: http://downloads.dbpedia.org/repo/dev/generic-spark/infobox-properties/2018.11.01/

Stat generation (run in the download directory; each line of res.csv gets the language code derived from the file name, a tab, and the triple count of that file):

echo -n "" > res.csv
for i in *.bz2 ; do
  # language code + tab, then the line/triple count of the decompressed file
  echo -n "$i" | sed 's/infobox-properties-2018.11.01_//;s/\.ttl\.bz2/\t/' >> res.csv
  lbzip2 -dc "$i" | wc -l >> res.csv
done


--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org