Hi all,
I was excited to learn about your plans to explore the use of
SPARQL-capable stores for serving Wikidata. I currently run Bio2RDF (
http://bio2rdf.org), an open source project that transforms over 30
biomedical databases into 11B triples of Linked Data and serves them.
For the past 10 years our project has relied on Virtuoso, primarily
because it performs well under most circumstances (lookup and simple
queries) and is open source. We are pleased to learn of the strides that
BigData has made with its BlazeGraph release, and we are currently
investigating whether it can support our project.
Our project currently loads each RDF dataset into a separate SPARQL
endpoint, which induces a high memory overhead, but seems to scale better
on a single server and also makes it vastly easier to update individual
datasets rather than having to delete/update a large triple store. As a
consequence, users must either use SPARQL federation to query across
datasets, or simply download the freely available data files and build
their own integrated database.
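To illustrate what federation across per-dataset endpoints involves, here
is a minimal Python sketch that builds a SPARQL 1.1 query with one SERVICE
clause per endpoint and unions the results. The function name and the
example.org endpoint URLs are hypothetical placeholders, not actual
Bio2RDF endpoints, and the sketch only constructs the query text; it does
not send it anywhere.

```python
def build_federated_query(endpoints, pattern, limit=10):
    """Build a SPARQL 1.1 query that evaluates `pattern` at every endpoint.

    Each endpoint gets its own SERVICE block; the blocks are combined
    with UNION so that matches from any dataset are returned.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    blocks = ["  { SERVICE <%s> { %s } }" % (url, pattern) for url in endpoints]
    body = "\n  UNION\n".join(blocks)
    return "SELECT * WHERE {\n%s\n}\nLIMIT %d" % (body, limit)


# Hypothetical endpoints, one per dataset (placeholders only).
ENDPOINTS = [
    "http://example.org/dataset-a/sparql",
    "http://example.org/dataset-b/sparql",
]

query = build_federated_query(ENDPOINTS, "?s ?p ?o")
print(query)
```

The resulting query can be submitted to any endpoint that supports SPARQL
1.1 Federated Query; that endpoint then contacts the others on the user's
behalf.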
We have just begun the process of seriously analyzing our user logs in
order to better understand the kinds of queries that our users formulate,
and the content that they are interested in. We hope that our work will
provide insight into access patterns, data quality, and overall
performance. What I can say so far is that most queries are relatively
simple (a SELECT with a single triple pattern) and that, unsurprisingly,
frequency decreases exponentially with increasing complexity. However, if
the goal is to provide fast access, you might also consider
http://linkeddatafragments.org/ , which we are looking into ourselves.
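Since most queries consist of a single triple pattern, they map naturally
onto a triple-pattern lookup over plain HTTP. The Python sketch below
builds such a request URL. It assumes the server accepts query parameters
named `subject`, `predicate` and `object`, which is a common convention
but not guaranteed (clients are meant to discover the parameter names from
the server's hypermedia controls). The base URL is a hypothetical
placeholder.

```python
from urllib.parse import urlencode


def fragment_url(base, subject=None, predicate=None, obj=None):
    """Build a URL that asks for all triples matching one triple pattern.

    Positions left as None act as wildcards and are omitted from the URL.
    Parameter names are an assumption; a real client should read them
    from the server's hypermedia controls.
    """
    params = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    if not params:
        return base
    return base + "?" + urlencode(params)


# Hypothetical fragment server; pattern is "?s rdfs:label ?o".
url = fragment_url(
    "http://example.org/fragments/dataset",
    predicate="http://www.w3.org/2000/01/rdf-schema#label",
)
print(url)
```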
I noticed your discussion about representation. I concur with Jerven that
you should consider using explicit data structures that decompose complex
concepts into computable fragments. We have described our approach in
applying ontology design patterns built from the Semanticscience Integrated
Ontology (SIO) to represent arbitrary knowledge (
http://sio.semanticscience.org/), which is also friendly to reasoning with
OWL ontologies. I would be happy to discuss this in greater detail if you
are interested.
Finally, given the overlap between Bio2RDF and the content in Wikidata, I
would like to investigate ways in which we can interlink our repositories.
It would be useful if Wikipedia/Wikidata users could automatically discover
related content in Bio2RDF, and vice versa. One way is for each of us to
dynamically ask the other whether it knows about an entity; another is to
share a data identifier registry (see identifiers.org). It would be great
to hear your ideas on this!
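As a rough sketch of both options in Python: the first function builds a
SPARQL ASK query that checks whether a store has any triple about an
entity, and the other two derive URIs for the same record from a shared
(namespace, identifier) pair. The URI templates are assumptions based on
the commonly seen forms http://bio2rdf.org/namespace:id and
http://identifiers.org/namespace/id, and the namespace and identifier used
are purely illustrative.

```python
def ask_entity_query(entity_uri):
    """Build an ASK query that is true if the store has the entity as a subject."""
    return "ASK { <%s> ?p ?o }" % entity_uri


def bio2rdf_uri(namespace, identifier):
    """Assumed Bio2RDF URI template: http://bio2rdf.org/<namespace>:<id>."""
    return "http://bio2rdf.org/%s:%s" % (namespace, identifier)


def identifiers_org_uri(namespace, identifier):
    """Assumed identifiers.org URI template: http://identifiers.org/<namespace>/<id>."""
    return "http://identifiers.org/%s/%s" % (namespace, identifier)


# Illustrative namespace and identifier only.
namespace, identifier = "drugbank", "DB00001"

local_uri = bio2rdf_uri(namespace, identifier)
shared_uri = identifiers_org_uri(namespace, identifier)
query = ask_entity_query(local_uri)

print(local_uri)
print(shared_uri)
print(query)
```

With a shared registry, either side only needs to store the (namespace,
identifier) pair and can derive the other side's URI without a live query.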
Cheers!
m.
Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford
University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com