Wikidata Query implementation - another approach - Wikidata-tech

13 Apr 2015

Hi,

I followed the most recent discussion about the implementation of the 
Wikidata Query service. Albeit, you've chosen Blazegraph as database for 
implementing it, I would like to show what we've done in the backend of 
our open-source datamanagement platform D:SWARM [1].

The Wikidata data model and the current state of our data model [2] are 
pretty similar, i.e., we mainly rely on statements that belong to 
resources and we would like to keep qualified attributes about the 
statements. We decided to build our graph data model on top of the 
property graph model and utilise the RDF concepts as well [3], i.e. we 
are RDF compatible. The main difference right now between both data 
models is that our graph data model currently only makes use of a fixed 
set of qualified attributes for the statements (whereby the claims of 
the Wikidata data model can use anything for qualified attributes; 
however, this can be changed/opened in our data model rather easily).

At the current state of our implementation we make use of Neo4j. 
Therefore, we provide an unmanaged extension [4] to offer specific HTTP 
APIs at the Neo4j server to consume and read the data (etc.). 
Furthermore, batch insert is implemented to speed-up the import for huge 
amounts of data [5]. At the data import we create various indices [6] 
that can later be utilised to boost various queries. Finally, we also 
experimentally support versioning for our data [7].

Currently, we are working on improving the performance of the import. 
Most recently, we were able to load 115M RDF statements in ~107 minutes 
on a commodity machine (incl. indexing; 16GB ram, SSD, 8 cores; 
single-threaded for now (!)) [8]. I know that this is no landmark (I'm 
not really a performance guy at all ;) ), since many triples stores are 
much faster. On the other side, the data is now in a property graph. 
Hence, we can make use of the advantages of this approach (rather then 
dealing with the "disadavantages/misconceptions" of the current RDF data 
model (namely reification*) ;) ).

Maybe, we can join forces at certain challenges. From what I've seen so 
far the Wikidata dataset is about 223M statements, or? So it should 
still be possible to load it at a single commodity machine into Neo4j. 
The only "disadvantage" right re. your current decisions for 
implementation is that you need to write Cypher queries instead of 
SPARQL queries (our you need to write a preprocessor to transform SPARQL 
queries into Cypher queries).

Feel free to ask further questions about details our implementation. I'm 
looking forward to your response.

Cheers,

Bo/T

*) RDR is also "only" an experiment right now ;)

[1] http://www.dswarm.org/
[2] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model
[3] 
https://github.com/dswarm/dswarm-documentation/wiki/Comparison-RDF-and-GDM-…
[4] https://github.com/dswarm/dswarm-graph-neo4j
[5] 
https://github.com/dswarm/dswarm-graph-neo4j/tree/master/src/main/java/org/…
[6] 
https://github.com/dswarm/dswarm-documentation/wiki/Graph-Exploration#use-o…
[7] 
https://github.com/dswarm/dswarm-documentation/wiki/Versioning#implementati…
[8] https://github.com/zazi/dswarm-graph-neo4j/tree/mapdb