I was excited to learn about your plans to explore the use of SPARQL-capable stores for serving Wikidata. I currently run Bio2RDF (http://bio2rdf.org), an open source project that transforms over 30 biomedical databases into 11B triples of Linked Data and makes them publicly available.
For the past 10 years our project has relied on Virtuoso, primarily because it performs well under most circumstances (lookups and simple queries) and is open source. We are pleased to learn of the strides that BigData has made with its BlazeGraph release, and we are currently investigating whether it can support our project.
Our project currently loads each RDF dataset into a separate SPARQL endpoint. This incurs a high memory overhead, but it seems to scale better on a single server and makes it vastly easier to update individual datasets than having to delete from and update one large triple store. As a result, users must either use SPARQL federation to query across the graph or simply download the freely available data files and build their own integrated database.
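To make the federation point concrete, here is a minimal sketch in Python that only assembles the text of a SPARQL 1.1 federated query, with one SERVICE block per endpoint and shared variables acting as the join. The endpoint URLs and predicates are placeholders I made up for illustration, not actual Bio2RDF endpoints or vocabulary.

```python
def build_federated_query(select_vars, service_patterns):
    """Assemble a SPARQL 1.1 query with one SERVICE block per endpoint.

    select_vars      -- variable names to project, e.g. ["?drug", "?label"]
    service_patterns -- list of (endpoint_url, triple_pattern) pairs;
                        variables shared between patterns join the results.
    """
    lines = ["SELECT " + " ".join(select_vars) + " WHERE {"]
    for endpoint, pattern in service_patterns:
        lines.append("  SERVICE <" + endpoint + "> {")
        lines.append("    " + pattern)
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical endpoints and predicates, for illustration only.
query = build_federated_query(
    ["?drug", "?label"],
    [
        ("http://drugs.example.org/sparql",
         "?drug <http://example.org/vocab/target> ?target ."),
        ("http://proteins.example.org/sparql",
         "?target <http://example.org/vocab/label> ?label ."),
    ],
)
print(query)
```

The resulting string would be sent to any endpoint that supports SPARQL 1.1 Federated Query; how well the join performs depends heavily on the store doing the federation.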
We have just begun the process of seriously analyzing our user logs in order to better understand the kinds of queries that our users formulate and the content that they are interested in. We hope that our work will provide insight into access patterns, data quality, and overall performance. What I can say so far is that most queries are relatively simple (a SELECT with a single triple pattern) and that, unsurprisingly, frequency decreases exponentially with increasing complexity. If the goal is to provide fast access, you might also look into http://linkeddatafragments.org/. It's something that we're looking into ourselves.
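As a rough illustration of why single-triple-pattern access is cheap to serve, a Linked Data Fragments style request for one pattern can be expressed as a plain HTTP URL. The sketch below assumes the commonly used `subject`, `predicate`, and `object` query parameters and a made-up fragment server URL; a real client should read the parameter names from the hypermedia controls in the server's response rather than hard-coding them.

```python
from urllib.parse import urlencode


def fragment_url(base, subject=None, predicate=None, obj=None):
    """Build a URL requesting all triples that match one triple pattern.

    Unset positions act as wildcards. The parameter names are an
    assumption (a common convention), not something a client may rely on.
    """
    params = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    if not params:
        return base
    return base + "?" + urlencode(params)


# Hypothetical fragment server and predicate, for illustration only.
url = fragment_url(
    "http://fragments.example.org/dataset",
    predicate="http://example.org/vocab/target",
)
print(url)
```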
I noticed your discussion about representation, and I concur with Jerven that you should consider using explicit data structures that decompose complex concepts into computable fragments. We have described our approach of applying ontology design patterns built from the Semanticscience Integrated Ontology (SIO, http://sio.semanticscience.org/) to represent arbitrary knowledge in a way that is also friendly to reasoning with OWL ontologies. I would be happy to discuss this in greater detail if you are interested.
Finally, given the overlap between Bio2RDF and the content in Wikidata, I would like to investigate ways in which we can interlink our repositories. It would be useful if Wikipedia/Wikidata users could automatically discover related content in Bio2RDF, and vice versa. One way is for each of us to dynamically ask whether the other knows about an entity; another is that we share a data identifier registry (see identifiers.org). It would be great to hear your ideas on this!
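Both interlinking options can be sketched in a few lines of Python. The first helper builds a SPARQL ASK query that tests whether an endpoint holds any triple with the entity as subject; the second builds a shared URI under the assumption that the registry follows the `http://identifiers.org/<prefix>/<id>` pattern. The function names are hypothetical, and the exact URI pattern should be checked against the registry before relying on it.

```python
def entity_known_query(entity_uri):
    """Return a SPARQL ASK query: does the store describe this entity?"""
    return "ASK { <" + entity_uri + "> ?p ?o }"


def registry_uri(prefix, local_id):
    """Build a registry-style URI (assumed pattern, see note above)."""
    return "http://identifiers.org/" + prefix + "/" + local_id


# Example values chosen for illustration only.
shared = registry_uri("uniprot", "P12345")
ask = entity_known_query(shared)
print(shared)
print(ask)
```

Either side could send the ASK query to the other's endpoint and link the two records whenever the answer is true.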
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group