DBpedia Databus (alpha version) - Wikidata

8 May 2018

  DBpedia Databus (alpha version)

The DBpedia Databus is a platform for exchanging, curating and
accessing data among multiple stakeholders. Any data entering the bus
will be versioned, cleaned, mapped and linked, and its licenses and
provenance will be tracked. Hosting in multiple formats will be provided
so that the data can be accessed either as a dump download or through an
API. Data governance stays with the data contributors.

    Vision

Working with data is hard and repetitive. We envision a hub where
everybody can upload data. Useful operations like versioning, cleaning,
transformation, mapping, linking, merging and hosting are then done
automagically on a central communication system (the bus), and the
results are dispersed again through a decentralized network to consumers
and applications.

On the databus, data flows from data producers through the platform to
the consumers (left to right). Errors and feedback flow in the opposite
direction and reach the data source, which provides a continuous
integration service and improves the data at the source.

    Open Data vs. Closed (paid) Data

We have studied the data network for 10 years now, and we conclude that
organisations with open data struggle to work together properly.
They could and should collaborate, but they are hindered by technical and
organisational barriers, so they duplicate work on the same data. On the
other hand, companies selling data cannot do so in a scalable way. The
loser is the consumer, who must choose between inferior open data and
buying from a jungle-like market.

    Publishing data on the databus

If you are grinding your teeth about how to publish data on the web, you 
can just use the databus to do so. Data loaded on the bus will be highly 
visible, available and queryable. You should think of it as a service:

  *

    Visibility guarantees that your citations and reputation go up

  *

    Besides a web download, we can also provide a Linked Data interface,
    SPARQL endpoint, Lookup (autocomplete) or many other means of
    availability (like AWS or Docker images)

  *

    Any distribution we do will funnel feedback and collaboration
    opportunities your way to improve your dataset and your internal
    data quality

  *

    You will receive an enriched dataset, which is connected and
    complemented with any other available data (see the same folder
    names in data and fusion folders).

    Data Sellers

If you are selling data, the databus provides numerous opportunities for
you. You can link your offering to the open entities in the databus.
This helps consumers discover your services, because they are shown
with each request.

    Data Consumers

Open data on the databus will be a commodity. We are greatly reducing the
cost of understanding, retrieving and reformatting the data. We are
constantly extending the ways of using the data and are willing to
implement any formats and APIs you need.

If you are lacking a certain kind of data, we can also scout for it and 
load it onto the databus.

    How the Databus works at the moment

We are still in an initial state, but we already load 10 datasets (6 
from DBpedia, 4 external) on the bus using these phases:

 1.

    Acquisition: data is downloaded from the source and logged in

 2.

    Conversion: data is converted to N-Triples and cleaned (Syntax
    parsing, datatype validation and SHACL)

 3.

    Mapping: the vocabulary is mapped to the DBpedia Ontology and
    converted (we have been doing this for Wikipedia’s infoboxes and
    Wikidata, but now we do it for other datasets as well)

 4.

    Linking: Links are mainly collected from the sources, cleaned and
    enriched

 5.

    IDying: All entities found are given a new Databus ID for tracking

 6.

    Clustering: IDs are merged into clusters using one of the Databus
    IDs as cluster representative

 7.

    Data Comparison: Each dataset is compared with all other datasets.
    We have an algorithm that decides on the best value, but the main
    goal here is transparency, i.e. to see which data value was chosen
    and how it compares to the other sources.

 8.

    A main knowledge graph fused from all the sources, i.e. a
    transparent aggregate

 9.

    For each source, we are producing a local fused version called the
    “Databus Complement”. This is a major feedback mechanism for all
    data providers, where they can see what data they are missing, what
    data differs in other sources and what links are available for their
    IDs.

10.

    You can compare all data via a webservice (early prototype,
    currently works only for the Eiffel Tower):

_http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general_
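The query string of the prototype URL above can be reproduced with a few lines of Python. This is only a sketch: reading `s` as the subject (a Databus global ID), `p` as the property and `src` as a source selector is an assumption based on the parameter values, and the function name `comparison_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint of the early prototype as given in the text; it may move or vanish.
COMPARE_ENDPOINT = "http://88.99.242.78:9000/"


def comparison_url(subject, predicate, src="general"):
    """Build a comparison URL from a subject IRI and a property IRI.

    The parameter meanings (s = subject, p = property, src = source
    selector) are assumptions inferred from the example URL.
    """
    query = urlencode({"s": subject, "p": predicate, "src": src})
    return f"{COMPARE_ENDPOINT}?{query}"


if __name__ == "__main__":
    print(comparison_url(
        "http://id.dbpedia.org/global/12HpzV",
        "http://dbpedia.org/ontology/architect",
    ))
```

Since the service is described as working for a single entity only, other subject or property IRIs may well return nothing.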

We aim for a real-time system, but at the moment we are doing a monthly 
cycle.
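Phases 5 to 7 of the list above (IDying, Clustering, Data Comparison) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sequential `id-N` scheme, the union-find structure used for clustering, and the majority vote used to pick a value. The actual Databus ID format and selection algorithm are not described here.

```python
from collections import Counter


def assign_ids(source_iris):
    """Phase 5 (IDying): give every source IRI its own ID.

    The sequential "id-N" scheme is hypothetical and only for illustration.
    """
    return {iri: f"id-{n}" for n, iri in enumerate(sorted(set(source_iris)))}


class Clusters:
    """Phase 6 (Clustering): union-find where each root is the representative."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        """Return the cluster representative of item."""
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def link(self, a, b):
        """Merge two clusters; the lexicographically smaller root is kept."""
        keep, drop = sorted((self.find(a), self.find(b)))
        self.parent[drop] = keep


def fuse(values_by_source):
    """Phase 7 (Data Comparison): pick a value but keep every source visible.

    Majority vote is a stand-in for the unspecified Databus algorithm.
    """
    counts = Counter(values_by_source.values())
    # Most frequent value wins; ties are broken alphabetically.
    chosen = min(counts, key=lambda value: (-counts[value], value))
    return {"chosen": chosen, "by_source": dict(values_by_source)}


if __name__ == "__main__":
    ids = assign_ids(["http://a.example/tower", "http://b.example/tower"])
    clusters = Clusters()
    clusters.link(ids["http://a.example/tower"], ids["http://b.example/tower"])
    print(clusters.find(ids["http://b.example/tower"]))
    print(fuse({"a.example": "value-1", "b.example": "value-1",
                "c.example": "value-2"}))
```

Returning the per-source values next to the chosen one mirrors the stated goal of transparency: a consumer can always see which value was selected and how it compares to the other sources.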

    Is it free?

Maintaining the Databus is a lot of work, and the servers incur high
costs. As a rule of thumb, we provide everything for free that we
can afford to provide for free. DBpedia provided everything for
free in the past, but this is not a healthy model, as we can neither
maintain quality properly nor grow.

On the Databus, everything is provided “as is” without any guarantees or
warranty. Improvements can be made by the volunteer community. The
DBpedia Association will provide a business interface to allow
guarantees, major improvements, stable maintenance and hosting.

    License

Final databases are licensed under ODC-By. This covers our work on the
recomposition of data. Each fact is individually licensed, e.g.
Wikipedia abstracts are CC-BY-SA, while other facts are CC-BY-NC or
copyrighted. This means that the data is available for research,
informational and educational purposes. We recommend contacting us for
any professional use of the data (clearing), so we can guarantee that
legal matters are handled correctly. Otherwise, professional use is at
your own risk.

    Current Stats

      Download

The databus data is available at _http://downloads.dbpedia.org/databus/_ 
ordered into three main folders:

  *

    Data: the data that is loaded on the databus at the moment

  *

    Global: a folder that contains provenance data and the mappings to
    the new IDs

  *

    Fusion: the output of the databus

Most notably you can find:

  *

    Provenance mapping of the new IDs in
    _global/persistence-core/cluster-iri-provenance-ntriples/_
    <http://downloads.dbpedia.org/databus/global/persistence-core/cluster-iri-provenance-ntriples/>
    and _global/persistence-core/global-ids-ntriples/_
    <http://downloads.dbpedia.org/databus/global/persistence-core/global-ids-ntriples/>

  *

    The final fused version for the core: _fusion/core/fused/_
    <http://downloads.dbpedia.org/databus/fusion/core/fused/>

  *

    A detailed JSON-LD file for data comparison: _fusion/core/json/_
    <http://downloads.dbpedia.org/databus/fusion/core/json/>

  *

    Complements, e.g. the enriched Dutch DBpedia version:
    _fusion/core/nl.dbpedia.org/_
    <http://downloads.dbpedia.org/databus/fusion/core/nl.dbpedia.org/>

(Note that the file and folder structure is still subject to change)
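The folder layout listed above can be captured in a small lookup table, so that scripts do not hard-code full URLs. This is a sketch only: the dictionary keys and the `folder_url` helper are invented for illustration, and the relative paths are copied from the list above and may change along with the folder structure.

```python
from urllib.parse import urljoin

# Base download location as given in the text.
DATABUS_BASE = "http://downloads.dbpedia.org/databus/"

# Relative folder paths from the list above; the keys are hypothetical names.
FOLDERS = {
    "cluster_provenance": "global/persistence-core/cluster-iri-provenance-ntriples/",
    "global_ids": "global/persistence-core/global-ids-ntriples/",
    "fused_core": "fusion/core/fused/",
    "comparison_json": "fusion/core/json/",
    "complement_nl": "fusion/core/nl.dbpedia.org/",
}


def folder_url(key):
    """Return the absolute URL of one of the known Databus folders."""
    return urljoin(DATABUS_BASE, FOLDERS[key])


if __name__ == "__main__":
    for name in FOLDERS:
        print(name, folder_url(name))
```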

      Sources

      Glue

Source                                                 Target                                         Amount

_de.dbpedia.org_ <http://de.dbpedia.org/>              _www.viaf.org_ <http://www.viaf.org/>          387,106
_diffbot.com_ <http://diffbot.com/>                    _www.wikidata.org_ <http://www.wikidata.org/>  516,493
_d-nb.info_ <http://d-nb.info/>                        _viaf.org_ <http://viaf.org/>                  5,382,783
_d-nb.info_ <http://d-nb.info/>                        _dbpedia.org_ <http://dbpedia.org/>            80,497
_d-nb.info_ <http://d-nb.info/>                        _sws.geonames.org_ <http://sws.geonames.org/>  50,966
_fr.dbpedia.org_ <http://fr.dbpedia.org/>              _www.viaf.org_ <http://www.viaf.org/>          266
_sws.geonames.org_ <http://sws.geonames.org/>          _dbpedia.org_ <http://dbpedia.org/>            545,815
_kb.nl_ <http://kb.nl/>                                _viaf.org_ <http://viaf.org/>                  2,607,255
_kb.nl_ <http://kb.nl/>                                _www.wikidata.org_ <http://www.wikidata.org/>  121,012
_kb.nl_ <http://kb.nl/>                                _dbpedia.org_ <http://dbpedia.org/>            37,676
_www.wikidata.org_ <http://www.wikidata.org/>          _https://permid.org_ <https://permid.org/>     5,133
_wikidata.dbpedia.org_ <http://wikidata.dbpedia.org/>  _www.wikidata.org_ <http://www.wikidata.org/>  45,344,233
_wikidata.dbpedia.org_ <http://wikidata.dbpedia.org/>  _sws.geonames.org_ <http://sws.geonames.org/>  3,495,358
_wikidata.dbpedia.org_ <http://wikidata.dbpedia.org/>  _viaf.org_ <http://viaf.org/>                  1,179,550
_wikidata.dbpedia.org_ <http://wikidata.dbpedia.org/>  _d-nb.info_ <http://d-nb.info/>                601,665

    Plan for the next releases

  *

    Include more existing data from DBpedia

  *

    Renew all DBpedia releases in a separate fashion:

      o

        DBpedia Wikidata is running already: _http://78.46.100.7/wikidata/_

      o

        Basic extractors like infobox properties and mapping will be
        active soon

      o

        Text extraction will take a while

  *

    Load all data in the comparison tool:

_http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general_

  *

    Load all data into a SPARQL endpoint

  *

    Create simple open source software that lets everybody push data
    onto the databus in an automated way