DBpedia Databus (alpha version)


The DBpedia Databus is a platform that allows multiple stakeholders to exchange, curate and access data. Any data entering the bus is versioned, cleaned, mapped and linked, and its licenses and provenance are tracked. The data is hosted in multiple formats and can be accessed either as a dump download or via an API. Data governance stays with the data contributors.


Vision

Working with data is hard and repetitive. We envision a hub where everybody can upload data, and where useful operations like versioning, cleaning, transformation, mapping, linking, merging and hosting are done automagically on a central communication system (the bus) and then dispersed again through a decentralized network to consumers and applications.

On the databus, data flows from data producers through the platform to the consumers (left to right); any errors or feedback flow in the opposite direction and reach the data source, providing a continuous integration service that improves the data at its source.


Open Data vs. Closed (paid) Data

We have studied the data network for 10 years now, and we conclude that organisations with open data are struggling to work together properly. Although they could and should cooperate, they are hindered by technical and organisational barriers and duplicate work on the same data. On the other hand, companies selling data cannot do so in a scalable way. The loser is the consumer, left with the choice between inferior open data and buying from a jungle-like market.

Publishing data on the databus

If you are grinding your teeth over how to publish data on the web, you can simply use the databus to do so. Data loaded onto the bus will be highly visible, available and queryable. You should think of it as a service.



Data Sellers

If you are selling data, the databus provides numerous opportunities for you. You can link your offering to the open entities on the databus, which allows consumers to discover your services more easily, since your offering is shown with each request.


Data Consumers

Open data on the databus will be a commodity. We are greatly lowering the cost of understanding, retrieving and reformatting the data. We are constantly extending the ways the data can be used and are willing to implement any formats and APIs you need.

If you lack a certain kind of data, we can also scout for it and load it onto the databus.



How the Databus works at the moment

We are still at an early stage, but we already load 10 datasets (6 from DBpedia, 4 external) onto the bus using these phases:

  1. Acquisition: data is downloaded from the source and logged

  2. Conversion: data is converted to N-Triples and cleaned (syntax parsing, datatype validation and SHACL validation)

  3. Mapping: the vocabulary is mapped onto the DBpedia Ontology and the data converted accordingly (we have been doing this for Wikipedia’s infoboxes and Wikidata, but now we do it for other datasets as well)

  4. Linking: Links are mainly collected from the sources, cleaned and enriched

  5. IDying: All entities found are given a new Databus ID for tracking

  6. Clustering: IDs are merged into clusters, with one of the Databus IDs acting as the cluster representative

  7. Data Comparison: Each dataset is compared with all other datasets. We have an algorithm that decides on the best value, but the main goal here is transparency, i.e. to see which data value was chosen and how it compares to the other sources.

  8. Fusion: a main knowledge graph is fused from all the sources, i.e. a transparent aggregate

  9. For each source, we produce a local fused version called the “Databus Complement”. This is a major feedback mechanism for all data providers: they can see which data they are missing, which data differs in other sources and which links are available for their IDs.

  10. You can compare all data via a web service (early prototype; currently only works for the Eiffel Tower): http://88.99.242.78:9000/?s=http%3A%2F%2Fid.dbpedia.org%2Fglobal%2F12HpzV&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Farchitect&src=general
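To make phases 5 and 6 more concrete, here is a minimal sketch of ID minting and cluster merging. The actual Databus implementation is not shown in this document, so everything below (the hash-based `mint_databus_id`, the union-find `Clusters` helper) is an illustrative assumption, not the real code:

```python
# Illustrative sketch of phases 5 (IDying) and 6 (Clustering).
# All identifiers below are hypothetical, not the actual Databus code.

import hashlib


def mint_databus_id(source_uri: str) -> str:
    """Phase 5: derive a stable Databus ID from a source URI."""
    digest = hashlib.sha256(source_uri.encode("utf-8")).hexdigest()[:10]
    return f"http://id.dbpedia.org/global/{digest}"


class Clusters:
    """Phase 6: union-find over Databus IDs; the root of each set
    serves as the cluster representative."""

    def __init__(self):
        self.parent = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra  # ra becomes the representative


# Example: two sources describe the same entity and are linked,
# so their Databus IDs end up in the same cluster.
eiffel_dbpedia = mint_databus_id("http://dbpedia.org/resource/Eiffel_Tower")
eiffel_wikidata = mint_databus_id("http://www.wikidata.org/entity/Q243")

clusters = Clusters()
clusters.union(eiffel_dbpedia, eiffel_wikidata)
assert clusters.find(eiffel_wikidata) == clusters.find(eiffel_dbpedia)
```

Union-find keeps the merge step near-linear even for millions of IDs, which matters at the scale of the link counts listed further below.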


We aim for a real-time system, but at the moment we run a monthly cycle.
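Phase 2 above (conversion and cleaning) can also be illustrated. The sketch below is a deliberately simplified syntax check for N-Triples lines; a real pipeline would use a full parser implementing the W3C N-Triples grammar, so treat the regex and the `is_valid_ntriple` helper as assumptions for illustration only:

```python
# Illustrative sketch of phase 2 (conversion/cleaning): a minimal
# syntax check for N-Triples lines. This regex only accepts the
# simple IRI/literal forms and is NOT a complete N-Triples parser.

import re

NTRIPLE = re.compile(
    r'^<[^<>"{}|^`\\\s]+>\s+'                        # subject IRI
    r'<[^<>"{}|^`\\\s]+>\s+'                         # predicate IRI
    r'(<[^<>"{}|^`\\\s]+>'                           # object IRI ...
    r'|"(?:[^"\\]|\\.)*"'                            # ... or a literal
    r'(?:@[A-Za-z][A-Za-z0-9-]*|\^\^<[^<>\s]+>)?)'   # lang tag / datatype
    r'\s*\.\s*$'
)


def is_valid_ntriple(line: str) -> bool:
    """Accept a line if it matches the simplified N-Triples pattern."""
    return bool(NTRIPLE.match(line))


good = ('<http://dbpedia.org/resource/Eiffel_Tower> '
        '<http://dbpedia.org/ontology/architect> '
        '<http://dbpedia.org/resource/Stephen_Sauvestre> .')
bad = '<http://dbpedia.org/resource/Eiffel_Tower> missing-predicate .'
assert is_valid_ntriple(good)
assert not is_valid_ntriple(bad)
```

Lines that fail the syntax check would be set aside for the error-feedback channel described in the Vision section, so the problem reaches the data source rather than propagating downstream.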


Is it free?

Maintaining the Databus is a lot of work, and the servers incur a high cost. As a rule of thumb, we provide for free everything that we can afford to provide for free. DBpedia provided everything for free in the past, but this is not a healthy model, as we can neither maintain quality properly nor grow.

On the Databus, everything is provided “as is”, without any guarantees or warranty. Improvements can be made by the volunteer community. The DBpedia Association will provide a business interface to allow for guarantees, major improvements, stable maintenance and hosting.


License

Final databases are licensed under ODC-By. This covers our work on the recomposition of data. Each fact is licensed individually: Wikipedia abstracts, for example, are CC BY-SA, some facts are CC BY-NC, and some are copyrighted. This means that the data is available for research, informational and educational purposes. We recommend contacting us for any professional use of the data (clearing), so that we can guarantee legal matters are handled correctly. Otherwise, professional use is at your own risk.



Current Stats

Download

The databus data is available at http://downloads.dbpedia.org/databus/, ordered into three main folders:


Most notably you can find:


(Note that the file and folder structure are still subject to change)




Sources


Glue


Source                 Target                   Amount
de.dbpedia.org         www.viaf.org                387,106
diffbot.com            www.wikidata.org            516,493
d-nb.info              viaf.org                  5,382,783
d-nb.info              dbpedia.org                  80,497
d-nb.info              sws.geonames.org             50,966
fr.dbpedia.org         www.viaf.org                    266
sws.geonames.org       dbpedia.org                 545,815
kb.nl                  viaf.org                  2,607,255
kb.nl                  www.wikidata.org            121,012
kb.nl                  dbpedia.org                  37,676
www.wikidata.org       https://permid.org            5,133
wikidata.dbpedia.org   www.wikidata.org         45,344,233
wikidata.dbpedia.org   sws.geonames.org          3,495,358
wikidata.dbpedia.org   viaf.org                  1,179,550
wikidata.dbpedia.org   d-nb.info                   601,665



Plan for the next releases