Hi all,
as we wrote in a previous email, we are currently applying
for a Wikimedia grant to use the DBpedia Extraction Software
to synchronize infoboxes between Wikipedias as well as between
Wikipedia and Wikidata.
During the discussion on the talk page
(https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE)
the concern was raised that we at DBpedia have too much of a
bird's eye view on things, and this is true: we are used to
bulk-extracting and working with large amounts of data rather
than with individual records.
The main problem is that, for us, the prototype first of all shows that we have the data, which could be exploited in several ways; for Wikipedians and Wikidata users, however, the process of using it is the main focus.

We assumed that an article-centric view would be best for Wikipedians, i.e. one where you can directly compare one article's infobox with the infoboxes of all other articles and with Wikidata. For Wikidata, however, the article/entity-centric view does not seem practical, and we would like feedback on this. The options for GlobalFactSync are:
All feedback is welcome; we have also created a topic here:
https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE#Focus_of_Tool_for_Wikidata
# Motivation, Wikidata adoption report
One goal of Wikidata is to support Wikipedia's infoboxes. At DBpedia we are now doing monthly releases and can provide statistics about Wikidata adoption, or the lack of it, in Wikipedia:
https://docs.google.com/spreadsheets/d/1_aNjgExJW_b0MvDSQs5iSXHYlwnZ8nU2zrQMxZ5edrQ/edit#gid=0
In total, 584 million facts are still maintained in Wikipedia
without using Wikidata. Where such a fact is already in Wikidata,
the same fact is maintained in two or more places,
multiplying maintenance work (unless the fact is static).
Code used to extract: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala
Data:
http://downloads.dbpedia.org/repo/dev/generic-spark/infobox-properties/2018.11.01/
Stat generation:

  echo -n "" > res.csv
  for i in *.bz2; do
    echo -n "$i" | sed 's/infobox-properties-2018.11.01_//;s/.ttl.bz2/\t/' >> res.csv
    lbzip2 -dc "$i" | wc -l >> res.csv
  done
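For anyone who prefers not to rely on the shell one-liner, the same per-language triple counts can be produced with a short Python script. This is only a sketch: the filename prefix is taken from the dump naming above, while the function names and the res.csv output path are our own choices, not part of any existing tool.

```python
import bz2
import os

def count_triples(path):
    # Count lines in a bzip2-compressed N-Triples file;
    # in N-Triples, one line corresponds to one triple.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

def write_stats(directory, out_csv="res.csv",
                prefix="infobox-properties-2018.11.01_"):
    # Write "language<TAB>triple count" for every *.ttl.bz2 dump file,
    # mirroring what the shell loop above produces in res.csv.
    with open(out_csv, "w") as out:
        for name in sorted(os.listdir(directory)):
            if not name.endswith(".ttl.bz2"):
                continue
            lang = name[len(prefix):-len(".ttl.bz2")] if name.startswith(prefix) else name
            count = count_triples(os.path.join(directory, name))
            out.write(f"{lang}\t{count}\n")
```

Run write_stats() in the directory holding the downloaded dump files to regenerate the table behind the spreadsheet linked above.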