Re: [Wikidata-tech] Wikidata Query implementation - another approach

15 Apr 2015


Hi Markus,
thanks a lot for your reply.
On 4/15/2015 6:31 PM, Markus Krötzsch wrote:
...
Hi Bo,
Thanks for the information. More query services are never a bad thing,
and I agree that property graph is closest to Wikidata in terms of data
model. However, in our own tests with Neo4j (at TUD, not at WMF), we were
not so impressed by raw query performance. In particular, there seemed
to be a lack of query optimization (writing queries in different ways
led to very different runtimes). Some queries we could not run at all.
Have you tried Neo4j 2.2 as well? They introduced a new cost-based query 
optimizer [1] there. Maybe you should give it another try. In general, 
it often depends on the applied data model and how it is indexed. 
Furthermore, avoiding OPTIONAL and progressively narrowing the
intermediate results via WITH often helps.
...
However, I am sure that Neo4j is improving all the time and maybe it is
ready for Wikidata on a public endpoint now/soon. Do you already have a
public query service available to try out?
No, not yet. The RDF import is more or less a side product of D:SWARM. 
Currently, we are working hard on getting out version 0.9 [2], which
should be released at SLUB in June this year.
Loading Wikidata RDF with our batch importer into Neo4j should already 
be possible. Maybe I can do this over the weekend (but I have no server 
for hosting it).
However, I would rather recommend creating a fork of our Neo4j
unmanaged extension [3]. Then we should generalize the handling of the
qualified attributes of a statement. Finally, we need to write a parser
for your Wikidata JSON format to fully import all the available
knowledge, since (afaik) the RDF export is somewhat abridged, right?
...
How did you translate
everything into Neo4j (the key decision there seems to be which things
to model as graph nodes, and which things to model as property values
[in the sense of "property graph" edge annotations]).
Currently, it is rather naive and straightforward (from my POV), i.e. a
statement (subject-predicate-object) is transformed into node-edge-node
and qualified attributes are edge properties. We create separate indices
for resources* (URI resources), resource types (classes; but they are
node labels as well) and values (literal values). These are node
indices. Furthermore, every statement gets a UUID that is indexed as
well. This is a relationship (/edge) index. To guarantee that no triple
will be inserted twice, we calculate a statement hash from the available
information of the statement. The hashes are currently kept in a set
that is persisted via MapDB. See also [4, 5] for more background about
our data model design and [6] for an example.
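A minimal Python sketch of the mapping described above, for illustration
only; it is not the D:SWARM code. The names Edge, StatementStore and
statement_hash are hypothetical, the in-memory set merely stands in for
the set persisted via MapDB, and hashing subject, predicate, object and
qualifiers together with SHA-256 is an assumption about what the
"available information" of a statement covers.

```python
import hashlib
import json
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Edge:
    """One statement, stored as a relationship between two nodes."""
    subject: str
    predicate: str
    obj: str
    properties: dict = field(default_factory=dict)  # qualified attributes
    statement_uuid: str = field(default_factory=lambda: str(uuid4()))


def statement_hash(subject, predicate, obj, qualifiers):
    """Hash the statement content (assumed: s, p, o plus sorted qualifiers)."""
    payload = json.dumps([subject, predicate, obj, sorted(qualifiers.items())])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StatementStore:
    def __init__(self):
        self.resources = set()    # node index: URI resources
        self.values = set()       # node index: literal values
        self.edges_by_uuid = {}   # relationship index: statement UUID -> Edge
        self.seen_hashes = set()  # stand-in for the MapDB-persisted set

    def add(self, subject, predicate, obj, qualifiers=None, literal=False):
        """Insert a statement once; return the Edge, or None for a duplicate."""
        qualifiers = dict(qualifiers or {})
        digest = statement_hash(subject, predicate, obj, qualifiers)
        if digest in self.seen_hashes:
            return None
        self.seen_hashes.add(digest)
        self.resources.add(subject)
        (self.values if literal else self.resources).add(obj)
        edge = Edge(subject, predicate, obj, qualifiers)
        self.edges_by_uuid[edge.statement_uuid] = edge
        return edge


store = StatementStore()
first = store.add("http://example.org/A", "http://example.org/knows",
                  "http://example.org/B", {"order": "1"})
again = store.add("http://example.org/A", "http://example.org/knows",
                  "http://example.org/B", {"order": "1"})
print(first is not None, again is None, len(store.edges_by_uuid))
```

The second add call is rejected because its hash is already in the set,
so only one edge (with one UUID) ends up in the relationship index.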
...
We are also trying to compile a list of test case queries that might be
interesting for you to benchmark your solution (both in terms of
performance and in terms of coverage).
Yes, this might be interesting. We can try to translate those queries 
into Cypher queries.
Cheers,
Bo/T
*) we also have a separate index for finding a certain resource in
a specific data model (i.e. subgraph / named graph)
[1] http://neo4j.com/blog/neo4j-2-2-0-scalability-performance/
[2] 
https://jira.slub-dresden.de/secure/RapidBoard.jspa?rapidView=1&view=pla...
[3] https://github.com/zazi/dswarm-graph-neo4j/tree/mapdb (this is the 
latest dev branch)
[4] 
https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model#gdm-in-...
[5] 
https://github.com/dswarm/dswarm-documentation/wiki/Graph-Exploration#availa...
[6] 
https://avgl.mybalsamiq.com/mockups/1802011.png?key=27106ea66faf01c9ad98a275...
...
Regards,
Markus
On 13.04.2015 22:55, Bo Ferri wrote:
...
Hi,
I followed the most recent discussion about the implementation of the
Wikidata Query service. Although you've chosen Blazegraph as the database
for implementing it, I would like to show what we've done in the backend
of our open-source data management platform D:SWARM [1].
The Wikidata data model and the current state of our data model [2] are
pretty similar, i.e., we mainly rely on statements that belong to
resources and we would like to keep qualified attributes about the
statements. We decided to build our graph data model on top of the
property graph model and utilise the RDF concepts as well [3], i.e. we
are RDF compatible. The main difference right now between both data
models is that our graph data model currently only makes use of a fixed
set of qualified attributes for the statements (whereby the claims of
the Wikidata data model can use anything for qualified attributes;
however, this can be changed/opened in our data model rather easily).
At the current state of our implementation we make use of Neo4j.
Therefore, we provide an unmanaged extension [4] to offer specific HTTP
APIs at the Neo4j server to consume and read the data (etc.).
Furthermore, batch insert is implemented to speed up the import of huge
amounts of data [5]. During the data import we create various indices [6]
that can later be utilised to boost various queries. Finally, we also
experimentally support versioning for our data [7].
Currently, we are working on improving the performance of the import.
Most recently, we were able to load 115M RDF statements in ~107 minutes
on a commodity machine (incl. indexing; 16GB ram, SSD, 8 cores;
single-threaded for now (!)) [8]. I know that this is no landmark (I'm
not really a performance guy at all ;) ), since many triple stores are
much faster. On the other hand, the data is now in a property graph.
Hence, we can make use of the advantages of this approach (rather than
dealing with the "disadvantages/misconceptions" of the current RDF data
model (namely reification*) ;) ).
Maybe we can join forces on certain challenges. From what I've seen so
far the Wikidata dataset is about 223M statements, right? So it should
still be possible to load it on a single commodity machine into Neo4j.
The only "disadvantage" right now regarding your current implementation
decisions is that you need to write Cypher queries instead of SPARQL
queries (or you need to write a preprocessor to transform SPARQL
queries into Cypher queries).
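As a rough back-of-the-envelope check of the load-time figures above,
the following Python sketch extrapolates from the reported run (115M
statements in ~107 minutes) to ~223M statements. It assumes that import
throughput scales linearly with dataset size, which is an assumption and
not a measured result; the function name estimate_load_minutes is
hypothetical.

```python
def estimate_load_minutes(statements, ref_statements=115_000_000,
                          ref_minutes=107.0):
    """Extrapolate load time linearly from one reference import run."""
    statements_per_minute = ref_statements / ref_minutes
    return statements / statements_per_minute


# Throughput of the reference run, in statements per second.
rate_per_second = 115_000_000 / (107 * 60)
print(round(rate_per_second))                     # 17913

# Estimated time for ~223M statements under the linear assumption.
print(round(estimate_load_minutes(223_000_000)))  # 207
```

Under that assumption the full dataset would take about three and a half
hours on the same machine; real imports may deviate as indices grow.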
Feel free to ask further questions about the details of our
implementation. I'm looking forward to your response.
Cheers,
Bo/T
*) RDR is also "only" an experiment right now ;)
[1] http://www.dswarm.org/
[2] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model
[3]
https://github.com/dswarm/dswarm-documentation/wiki/Comparison-RDF-and-GDM-m...
[4] https://github.com/dswarm/dswarm-graph-neo4j
[5]
https://github.com/dswarm/dswarm-graph-neo4j/tree/master/src/main/java/org/d...
[6]
https://github.com/dswarm/dswarm-documentation/wiki/Graph-Exploration#use-of...
[7]
https://github.com/dswarm/dswarm-documentation/wiki/Versioning#implementatio...
[8] https://github.com/zazi/dswarm-graph-neo4j/tree/mapdb

Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
