Dear all,
we would like to share consolidated updates for the GlobalFactSync (GFS) project with you (copied from https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/News)
We polished everything for our presentation at Wikimania tomorrow: https://wikimania.wikimedia.org/wiki/2019:Technology_outreach_%26_innovation...
All feedback welcome!
-- Sebastian (with the team: Tina, Włodzimierz, Krzysztof, Johannes and Marvin)
User Script, Data Browser, Reference Web Service (15. August 2019)
After the kick-off note at the end of July https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/News#Kick-off_note_(25._Juli_2019), which described our first edit and the overall concept in more detail, we have shaped the technical microservices and data into more concise tools that are easier to use and to demo during our Wikimania presentation https://wikimania.wikimedia.org/wiki/2019:Technology_outreach_%26_innovation/GlobalFactSync:
1. The User Script https://en.wikipedia.org/wiki/User_scripts, available at User:JohannesFre/global.js https://meta.wikimedia.org/wiki/User:JohannesFre/global.js, adds links from each article and from Wikidata to the Data Browser and the Reference Web Service.
[Screenshot: User Script linking to the GFS Data Browser]
2. The GFS Data Browser https://global.dbpedia.org/ (GitHub: https://github.com/dbpedia/gfs) now accepts any subject URI from Wikipedia, DBpedia or Wikidata; see the Boys Don't Cry example from the kick-off note https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F2nrbo&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FreleaseDate&src=general, Berlin's geo-coordinates (lat https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F4pafr&p=http%3A%2F%2Fwww.w3.org%2F2003%2F01%2Fgeo%2Fwgs84_pos%23lat&src=general, long https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F4pafr&p=http%3A%2F%2Fwww.w3.org%2F2003%2F01%2Fgeo%2Fwgs84_pos%23long&src=general), and Albert Einstein's religion https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F55LmB&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Freligion&src=general. *Not live yet, edits/fixes are not reflected.*
3. The Reference Web Service (Albert Einstein: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=htt...) extracts (1) all references from a Wikipedia page, (2) matches them to the corresponding infobox parameters, and (3) also extracts the fact they support. The service will remain stable, so you can use it.
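If you want to script against the Reference Web Service, here is a minimal Python sketch. It only assumes the endpoint shown above and the '&format=json' / '&format=tsv' parameter mentioned in the prototype list further down; the exact layout of the returned JSON is deliberately not assumed.

import requests

# Minimal sketch: fetch the extracted references for one article from the
# Reference Web Service (endpoint as linked above). The exact JSON layout
# of the response is intentionally not assumed here.
ENDPOINT = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"

def fetch_references(article_url):
    """Return the reference extraction result for a given Wikipedia article."""
    resp = requests.get(
        ENDPOINT,
        params={"article": article_url, "format": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

print(fetch_references("https://en.wikipedia.org/wiki/Albert_Einstein"))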
Furthermore, we are designing a friendly fork of HarvestTemplates https://github.com/Pascalco/harvesttemplates to effectively import all that data into Wikidata.
Kick-off note (25. Juli 2019)
*GlobalFactSync - Synchronizing Wikidata and Wikipedia's infoboxes*
How is data edited in Wikipedia/Wikidata? Where does it come from? And how can we synchronize it globally?
The GlobalFactSync https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE (GFS) Project — funded by the Wikimedia Foundation — started in June 2019 and has two goals:
* Answer the above-mentioned three questions.
* Build an information system to synchronize facts between all Wikipedia language editions and Wikidata.
Now we are seven weeks into the project (10+ more months to go) and we are releasing our first prototypes to gather feedback.
/How – Synchronization vs Consensus/
We follow an absolute *Human(s)-in-the-loop* approach when we talk about synchronization. The final decision whether to synchronize a value or not should rest with a human editor who understands consensus and the implications. There will be no automatic imports. Our focus is to drastically reduce the time to research all references for individual facts.
A trivial example is the release date of the single “Boys Don’t Cry” (March 16th, 1989) in the English https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song), Japanese https://ja.wikipedia.org/wiki/%E6%B6%99%E3%82%92%E3%81%BF%E3%81%9B%E3%81%AA%E3%81%84%E3%81%A7_%E3%80%9CBoys_Don't_Cry%E3%80%9C, and French https://fr.wikipedia.org/wiki/Namida_wo_Misenaide_(Boys_Don%27t_Cry) Wikipedia, in Wikidata https://www.wikidata.org/wiki/Q3020026#P577, and finally in the external open database MusicBrainz https://musicbrainz.org/artist/e57182dc-2693-46fc-a739-a81c734a4326. A human editor might need 15-30 minutes to find and open all the different sources, while our current prototype can spot the differences and display them in 5 seconds.
We already had our first successful edit, where a Wikipedia editor fixed a discrepancy spotted with our prototype: “I’ve updated Wikidata so that all five sources are in agreement.” We are now working on the following tasks:
* Scaling the system to all infoboxes, Wikidata and selected external databases (see below on the difficulties there)
* Making the system:
  o “live” without stale information
  o “reliable” with fewer technical errors when extracting and indexing data
  o “better referenced” by not only synchronizing facts but also references
/Contributions and Feedback/
To ensure that GlobalFactSync will serve and help the Wikiverse, we encourage everyone to try our data and microservices and leave us some feedback, either on our Meta-Wiki page https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE or via gfs@infai.org mailto:gfs@infai.org. Over the following 10+ months, we intend to improve and build upon these initial results. At the same time, these microservices are available to every developer to exploit and to hack useful applications with. The most promising contributions will be rewarded and receive the book “Engineering Agile Big-Data Systems”. Please post feedback or any tool or GUI you build here. In case you need changes to be made to the API, please let us know, too. For the ambitious future developers among you, we have some budget left that we will dedicate to an internship. In order to apply, just mention it in your feedback post.
Finally, to talk to us and other GlobalFactSync users, you may want to visit WikidataCon and Wikimania, where we will present the latest developments and the progress of our project.
/Data, APIs & Microservices (Technical prototypes)/
Data Processing and Infobox Extraction:
For GlobalFactSync we use data from Wikipedia infoboxes of different languages, as well as Wikidata and DBpedia, and fuse them into one big, consolidated dataset – a PreFusion dataset https://databus.dbpedia.org/dbpedia/prefusion (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf. One of our next steps is to integrate MusicBrainz into this process as an external dataset. We hope to add even more such external datasets to increase the amount of available information and references.
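To give a rough idea of the shape of a PreFusion entry, here is an illustrative Python sketch. The field paths (subject.@id, predicate.@id, objects.object.@value) mirror the ones used by the query examples in the next section; the name of the per-value provenance key and the placeholder values are assumptions for illustration only, not a verbatim dump of the dataset.

# Illustrative sketch of one PreFusion entry (not a verbatim dump).
# The paths "subject.@id", "predicate.@id" and "objects.object.@value"
# mirror the query examples below; the "source" key is an assumed name
# for the per-value provenance that the dataset records.
prefusion_entry = {
    "subject": {"@id": "https://global.dbpedia.org/id/9QwA"},  # Grimma
    "predicate": {"@id": "http://dbpedia.org/ontology/populationTotal"},
    "objects": [
        {"object": {"@value": "<value extracted from en.wikipedia>"}, "source": "en.wikipedia.org"},
        {"object": {"@value": "<value extracted from de.wikipedia>"}, "source": "de.wikipedia.org"},
        {"object": {"@value": "<value taken from wikidata>"}, "source": "wikidata.org"},
    ],
}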
*First microservices:*
We deployed a set of microservices to show the current state of our toolchain.
* [Initial User Interface] The GFS Data Browser is our GlobalFactSync UI prototype (available at http://global.dbpedia.org) which shows all extracted information available for one entity across different sources. It can be used to analyze the factual consensus between different Wikipedia articles for the same thing. Example: Look at the variety of population counts for Grimma https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F9QwA&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal&src=general.
* [PreFusion JSON API] While the UI allows simple, fast and easy browsing for one entity at a time, we also provide raw access to the underlying data (PreFusion dump). The query UI (http://global.dbpedia.org:8990, user: read, pw: gfs) can be utilized to run simple analytical queries (a small Python sketch of such a query follows after this list). For example, we can determine the number of locations having at least one population value http://global.dbpedia.org:8990/db/prefusion/provenance?query=%7B%0D%0A++++%22predicate.%40id%22%3A+%22http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal%22%2C%0D%0A%7D&projection=%7B%0D%0A++%22subject.%40id%22+%3A+1%0D%0A++%22objects.object.%40value%22%3A+1%0D%0A%7D (1,194,007), but we can also focus on examples with data quality problems (e.g. one of the 4,268 locations with more than 10 population values http://global.dbpedia.org:8990/db/prefusion/provenance?query=%7B%0D%0A++++%22predicate.%40id%22%3A+%22http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal%22%2C%0D%0A++++%24where%3A+%22this.objects.length+%3E++10%22%0D%0A%7D&projection=%7B%0D%0A++%22subject.%40id%22+%3A+1%0D%0A++%22objects.object.%40value%22%3A+1%0D%0A%7D). Moreover, documentation about the PreFusion dataset and the download link for the data are available on the Databus website https://databus.dbpedia.org/dbpedia/prefusion.
* [Reference Data Download] We ran the Reference Extraction Service over 10 Wikipedia languages. Download dumps here http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.07.01/.
* [Reference Extraction Service] Good references are crucial for an import of facts from Wikipedia to Wikidata. We are currently working with colleagues from Poznań University of Economics and Business on reference extraction for facts from Wikipedia. A current development version of the reference extraction microservice http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json shows all references and the location where they were spotted in the infobox – ad hoc – for a given article: http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=htt... (‘&format=tsv’ is also available)
* [Infobox Extraction Service] A similar ad hoc extraction of factual information from infoboxes and other Wikipedia article information is available here. This microservice displays information which can be extracted with the help of DBpedia mappings from an infobox e.g. from the German Facebook Wikipedia article: http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?t.... See here for more options: http://dbpedia.informatik.uni-leipzig.de:9999/server/extraction/.
* [ID service] Last but not least, we offer the Global ID Resolution Service https://global.dbpedia.org/same-thing/lookup/?uri=http://dbpedia.org/resource/Facebook. It ties together all available identifiers for one thing (i.e. at the moment all DBpedia/Wikipedia and Wikidata identifiers – MusicBrainz coming soon…) and shows their stable DBpedia Global ID.
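As a usage note for the ID service above, the lookup can be called like any other HTTP endpoint. The Python sketch below only assumes the URL and the 'uri' parameter shown in the link, not any particular response keys.

import requests

# Minimal sketch: resolve all known identifiers for one resource via the
# Global ID Resolution Service (URL and "uri" parameter as linked above).
resp = requests.get(
    "https://global.dbpedia.org/same-thing/lookup/",
    params={"uri": "http://dbpedia.org/resource/Facebook"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # DBpedia/Wikipedia/Wikidata identifiers plus the stable DBpedia Global ID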
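And here is the Python query sketch promised in the PreFusion JSON API item above: it sends the same MongoDB-style query and projection as the example links, using the read-only credentials given there. Result paging and the exact response envelope are not assumed.

import json
import requests

# Minimal sketch of a PreFusion query: same query/projection as the example
# links above (locations with at least one population value), sent with the
# read-only credentials (user: read, pw: gfs).
BASE = "http://global.dbpedia.org:8990/db/prefusion/provenance"

query = {"predicate.@id": "http://dbpedia.org/ontology/populationTotal"}
projection = {"subject.@id": 1, "objects.object.@value": 1}

resp = requests.get(
    BASE,
    params={"query": json.dumps(query), "projection": json.dumps(projection)},
    auth=("read", "gfs"),
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # subjects and their population values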
/Finding sync targets/
In order to test our algorithms, we started by looking at various groups of subjects, our so-called sync targets. Based on the different subjects, a set of problems with varying layers of complexity was identified:
* identity check/check for ambiguity — Are we talking about the same entity?
* fixed vs. varying property — Some properties vary depending on nationality (e.g., release dates) or point in time (e.g., population count).
* reference — Depending on the entity’s identity check and the property’s fixed or varying state, the reference might vary. Also, for some targets, no query-able online reference might be available.
* normalization/conversion of values — Depending on the language/nationality of the article, properties can have varying units (e.g., currency, metric vs. imperial system).
The check for ambiguity is the most crucial step to ensure that the infoboxes being compared actually refer to the same entity. We found instances where the Wikipedia page and the infobox shown on that page were presenting information about different subjects (e.g., see here https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song)).
/Examples/
The group ‘NBA players’ was identified as a good sync target to start with. There are no ambiguity issues, it is a clearly defined group of persons, and the number of varying properties is very limited. Information seems to be derived mainly from two websites (nba.com and basketball-reference.com), and normalization is only a minor issue. ‘Video games’ also proved to be an easy sync target, with the main problem being varying properties, such as different release dates for different platforms (Microsoft Windows, Linux, MacOS X, XBox) and different regions (NA vs. EU).
More difficult topics, such as ‘cars’, ‘music albums’, and ‘music singles’, showed more potential for ambiguity as well as property variability. A major concern we found was Wikipedia pages that contain multiple infoboxes (often seen for pages referring to a certain type of car, such as this one https://en.wikipedia.org/wiki/Volkswagen_Polo). Reference and fact extraction can be done for each infobox, but currently we run into trouble once we fuse this data.
Further information about sync targets and their challenges can be found on our Meta-Wiki discussion page https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE/Timeline/Tasks#Preliminary_study_-_sync_targets, where Wikipedians who deal with infoboxes on a regular basis can also share their insights on the matter. Some issues were also found regarding the mapping of properties. In order to make GlobalFactSync as applicable as possible, we rely on the DBpedia community to help us improve the mappings. If you are interested in participating, we will connect with you at http://mappings.dbpedia.org and in the DBpedia forum https://forum.dbpedia.org/.
Bottom line – we value your feedback!