I'm wondering if we still need to support PHP 5.3.
I'd rather bump the minimum version up to PHP 5.5, and I know people have
been talking about doing the same for MediaWiki itself. My question is whether
this can already be done without making Wikibase undeployable on WMF
servers. Is everything running HHVM yet, or is there some stuff relevant to
Wikibase that still runs an unsupported version of PHP?
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
Evil software architect at Wikimedia Germany
After the initial enthusiasm, I have grown increasingly wary of the prospect of
exposing a SPARQL endpoint as Wikidata's canonical query interface. I decided to
share my (personal and unfinished) thoughts about this on this list, as food for
thought and a basis for discussion.
Basically, I fear that exposing SPARQL will lock us in with respect to the
backend technology we use. Once it's there, people will rely on it, and taking
it away would be very harsh. That would make it practically impossible to move
to, say, Neo4J in the future. This is even more true if we expose
vendor-specific extensions like RDR/SPARQL*.
Also, exposing SPARQL as our primary query interface probably means abruptly
discontinuing support for WDQ. It's pretty clear that the original WDQ service
is not going to be maintained once the WMF offers infrastructure for Wikidata
queries. So, when SPARQL appears, WDQ would go away, and dozens of tools would
need major modifications, or would simply die.
So, my proposal is to expose a WDQ-like service as our primary query interface.
This follows the general principle of having narrow interfaces to make it easy to
swap out the implementation.
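To make the comparison concrete: a WDQ query along the lines of
claim[31:5] AND claim[106:82955] ("instance of human, occupation politician")
would, in SPARQL, look roughly like the sketch below. The property namespace
here is a placeholder; the actual RDF mapping for Wikidata has not been settled.

  PREFIX entity: <http://www.wikidata.org/entity/>
  PREFIX claim:  <http://example.org/claim/>   # placeholder property namespace

  SELECT ?item WHERE {
    ?item claim:P31  entity:Q5 .       # instance of (P31): human (Q5)
    ?item claim:P106 entity:Q82955 .   # occupation (P106): politician (Q82955)
  }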
But the power of SPARQL should not be lost: A (sandboxed) SPARQL endpoint could
be exposed to Labs, just like we provide access to replicated SQL databases
there: on Labs, you get "raw" access, with added performance and flexibility,
but no guarantees about interface stability.
In terms of development resources and timeline, exposing WDQ may actually get us
a public query endpoint more quickly: sandboxing full SPARQL will likely turn out
to be a lot harder than sandboxing the more limited set of queries WDQ allows.
Finally, why WDQ and not something else, say, MQL? Because WDQ is specifically
tailored to our domain and use case, and there already is an ecosystem of tools
that use it. We'd want to refine it a bit I suppose, but by and large, it's
pretty much exactly what we need, because it was built around the actual demand
for querying wikidata.
Those are my current thoughts so far. Note that this is not a decision or recommendation
by the Wikidata team, just my personal take.
Senior Software Developer
Gesellschaft zur Förderung Freien Wissens e.V.
I was excited to learn about your plans to explore the use of SPARQL-capable
stores for serving Wikidata. I currently run Bio2RDF (
http://bio2rdf.org), an open source project that transforms over 30 biomedical
databases into 11B triples of Linked Data and makes them available.
For the past 10 years our project has relied on Virtuoso, primarily
because it performs well under most circumstances (lookup and simple
queries) and is open source. We are pleased to learn of the strides that
BigData has made with its BlazeGraph release, and we are currently
investigating its feasibility for supporting our project.
Our project currently loads each RDF dataset into a separate SPARQL
endpoint, which induces a high memory overhead, but seems to scale better
on a single server and also makes it vastly easier to update individual
datasets rather than having to delete/update a large triple store. Thus,
users must use SPARQL federation in order to query across the graph, or
just download the freely available data files and build their own store.
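For readers less familiar with SPARQL federation: a single query can combine
several endpoints with the SERVICE keyword. A rough sketch of what that looks
like, with the endpoint URLs and the predicates chosen purely for illustration:

  PREFIX dct: <http://purl.org/dc/terms/>

  SELECT ?compound ?title ?pathway WHERE {
    SERVICE <http://drugbank.bio2rdf.org/sparql> {
      ?compound dct:title ?title .
    }
    SERVICE <http://kegg.bio2rdf.org/sparql> {
      ?compound dct:relation ?pathway .
    }
  }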
We have just begun the process of seriously analyzing our user logs in
order to better understand the kinds of queries that our users formulate,
and the content that they are interested in. We hope that our work will
provide insight into access patterns, data quality, and overall
performance. But what I can say is that most queries are relatively simple
(select + 1 triple pattern) and that, unsurprisingly, frequency decreases
exponentially with increased complexity. However, if the goal is to
provide fast access, you might also look at
http://linkeddatafragments.org/ ; it's something we're investigating as well.
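A typical example of those simple one-triple-pattern queries is just "tell me
everything you know about this resource"; the URI below follows our usual
Bio2RDF pattern and is only an example:

  SELECT ?predicate ?object WHERE {
    <http://bio2rdf.org/drugbank:DB00001> ?predicate ?object .
  }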
I noticed your discussion about representation, and I concur with Jerven that
you should consider using explicit data structures that decompose complex
concepts into computable fragments. We have described our approach in
applying ontology design patterns built from the Semanticscience Integrated
Ontology (SIO) to represent arbitrary knowledge (
http://sio.semanticscience.org/), which is also friendly to reasoning with
OWL ontologies. I would be happy to discuss this in greater detail if there is interest.
Finally, given the overlap of Bio2RDF with content in Wikidata, I would
like to investigate ways in which we can interlink our repositories. It
would be useful if Wikipedia/Wikidata users could automatically discover
related content in Bio2RDF, and vice versa. One way is for us to
dynamically ask whether either of us knows about an entity; another is that
we share a data identifier registry (see identifiers.org). It would be great to
hear your ideas on this!
Associate Professor of Medicine (Biomedical Informatics), Stanford
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
Dear WikiData Developers,
I am very pleased to read of your intention to deploy a SPARQL-based API
for Wikidata. I am one of the developers behind sparql.uniprot.org, a
public, free-to-use SPARQL endpoint (with content under CC-ND, soon to be
made more liberal). UniProt can be seen as an encyclopaedia of facts
about proteins, and we have about 90 million entries, e.g.
http://www.uniprot.org/uniprot/P49407 (corresponds to
http://en.wikipedia.org/wiki/Arrestin). These 90 million entries in
UniProtKB (and another 200 million or so in our supporting related
datasets) lead to a bit more than 19 billion unique triples.
We currently use Virtuoso, but I completely understand why you selected
BlazeGraph; it seems to me a better fit for your mission and deployment
scenario. I evaluated what was then BigData 3 years ago and would have
selected it had we not needed to go for vertical scaling, since we had
just 1U available in our computer room.
I also understand the worries that you have about being able to run a
service that is resilient to DoS (intentional or not). We have not yet
needed to run any mitigation strategies more complex than banning an
IP + user-agent string. However, we do have a few ideas about how to make
that automatic, even if we haven't had the time to work on them yet. Our
query timeouts are generously above what a sparsely used HTTP connection
supports in practice. You should also not be worried about GRAPH
traversals; in our experience, simple DISTINCT queries are much more
painful, e.g. the common "count all distinct subjects" query (written out
in full below) that some people like to send daily can be a challenge on
large databases. We could institute draconian timeouts, but we don't,
because we want to get the difficult queries: the simple ones our users
can do on our main site, but the analytical ones require a
solution such as SPARQL.
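For reference, that query in full SPARQL is the one below; it has to touch
every triple in the store to deduplicate the subjects, which is why it is far
more expensive than it looks:

  SELECT (COUNT(DISTINCT ?subject) AS ?distinctSubjects) WHERE {
    ?subject ?predicate ?object .
  }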
Maintaining a SPARQL endpoint for public use requires you to focus on
client management, not query management. You can have one client with
relatively easy queries, but one that happily sends you 40,000 of these
per second. Others send you a query that looks complicated and expensive
but is actually very selective and ends up taking next to no time at all.
One of the solutions that we will look into long term is a query manager
in front of our endpoint. I will try to explain the idea behind that, but
if it's fuzzy just ask and I will try to explain better ;) Assume you
have a CPU budget available for hosting your SPARQL endpoint, let's say
100 ticks per day. If you have one user, that user should be able to get
all 100 ticks. If you have one hundred users, each user gets 1 tick. If
you limit your one user to 1 tick, you have wasted 99 ticks. If you let
the first user use 100 ticks and the other ninety-nine users their 1 tick
each, you have broken the budget by 99 ticks. One solution is therefore
to allow clients to optimistically run a query for longer than their
ticks, but the moment other clients arrive they get kicked off, e.g.
complex queries run fine on Sunday evening but not on Monday at 10am.
You can do this by having a service in front of your API that takes
SPARQL queries, queues them for execution in a priority queue, and sends
back a 303 response with Retry-After and Location headers so that the
HTTP connections don't die. The priority queue ensures everyone gets a
turn to run their queries, by reducing the priority of people who are
already running a query, etc. The query management service can then see
if a query returned within its tick and if anyone else has scheduled a
query. If someone else has a query outstanding, the query manager tells
the SPARQL engine to stop working on the query and send back any results
if possible. I don't know if BlazeGraph allows this, but I know how to
do it for GraphDB and have an idea about how to do it in Virtuoso; I am,
however, sure that Systap can add this to BlazeGraph on short notice.
Such a query manager will deal with all common forms of DoS attacks.
Now to the advantages of SPARQL over the other options, Gremlin included:
it is an open standard that is deployed in the wild, for use over HTTP by
complete strangers. Not just in academia but also e.g. at
http://open-data.europa.eu/en/linked-data, so it's more likely that you
can combine data in primary resources with data in Wikidata. Also you
won’t be the only ones worrying about attacks on your public endpoint
and will have a larger community to share knowledge and code with.
On the Reification business, we use it extensively and in our case about
15% of our triples are in reification quads. But that is the surface
representation; how it is actually stored on disk (or materialised on
demand) is completely different, and not something you should worry about
at this time. Most of the time you don't need reification, and it can
often be avoided by better modelling. For example, in our archive of
protein sequences we used to use reification to note that a protein
sequence can no longer be found in a database entry. Now we model an
obsolete database entry just like all the other active entries; we just
give it a new unique URI and an rdf:type to say it's obsolete and that it
corresponds to a version of an active database entry. When teaching
SPARQL to scientists I call this way of thinking "model the measurement
result, not the conclusion". E.g. instead of saying
  example:Jerven example:length_in_cm 197 .

say

  example:Jerven example:measurement
      [ example:length_measurement_result_in_cm 197 ;
        example:measurement_date "2015-02-27"^^xsd:date ] ,
      [ example:length_measurement_result_in_cm 60 ;
        example:measurement_date "1983-06-21"^^xsd:date ] .
As you can see, in the second model you don't have to worry about
invalidating data, as the measurements stay correct. You would have been
forced to model like this in practice with Titan, and you should do the
same in RDF. Just because reification is available does not mean you
must use it ;)
I hope that this mail was constructive and that some of your worries
will be lessened knowing that there are (possible) solutions for
maintaining a public SPARQL endpoint.
PS. There are more open source options for SPARQL, e.g. RDF4J.
Jerven Tjalling Bolleman
SIB | Swiss Institute of Bioinformatics
CMU - 1, rue Michel Servet - 1211 Geneva 4
t: +41 22 379 58 85 - f: +41 22 379 58 58
Jerven.Bolleman(a)isb-sib.ch - http://www.isb-sib.ch
TL/DR: We've selected BlazeGraph to back the next Wikidata Query Service.
After Titan evaporated about a month ago we went back to the drawing board
on backends for a new Wikidata Query Service. We took four weeks
(including a planned trip to Berlin) to settle on a backend. As you can see
from the spreadsheet, we've really blown out the number of options. As you
can also see, we didn't finish filling them all out. But we've still pretty
much settled on
BlazeGraph <http://www.blazegraph.com/> anyway. Let me first explain what
BlazeGraph is and then defend our decision to stop spreadsheet work.
BlazeGraph is a GPLed RDF triple store that natively supports SPARQL 1.1,
RDFS, some OWL, and some extensions. Those are all semantic web terms, and
they translate into "it's a graph database with an expressive, mostly
standardized query language and support for inferring stuff as data is
added to and removed from the graph". It also has some features that you'd
recognize from nice relational databases: join order rewriting, smart query
planner, hash and nested loop joins, query rewrite rules, group by, order
by, and aggregate functions.
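To give a feel for what those query features look like in practice, here's a
small SPARQL 1.1 aggregate query; it uses FOAF purely because it's a familiar
vocabulary, not because it says anything about how Wikidata will be modeled:

  PREFIX foaf: <http://xmlns.com/foaf/0.1/>

  # Top ten people by number of foaf:knows links, using GROUP BY,
  # an aggregate function, and ORDER BY.
  SELECT ?person (COUNT(?friend) AS ?friends) WHERE {
    ?person foaf:knows ?friend .
  }
  GROUP BY ?person
  ORDER BY DESC(?friends)
  LIMIT 10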
These are all cool features - really the kind of things that we thought we'd
need - but they come with an "interesting" price. Semantic Web is a very old
thing that's had a really odd degree of success. If you have an hour and a
half, Jim Hendler can explain <https://www.youtube.com/watch?v=oKiXpO2rbJM>
it to you. The upshot is that _tons_ of people have _tons_ of opinions.
The W3C standardizes RDF, SPARQL, RDFS, OWL, and about a billion other
things. There are (mostly non-W3C) standards for talking about people
<http://xmlns.com/foaf/spec/>, social connections
<http://rdfs.org/sioc/spec/>, and music
<http://musicontology.com/specification/>. And they all have rules. And
Wikidata doesn't. Not like these rules. One thing I've learned from this
project is that this lack of prescribed rules is one of Wikidata's founding
principles. It's worth it to allow openness. So you _can_ set gender to
"Bacon" or put geocoordinates on Amber
<https://www.wikidata.org/wiki/Q1053330>. Anyway! I argue that, at least
for now, we should ignore many of these standards. We need to think of
Wikidata Query Service as a tool to answer questions instead of as some
grand statement about the semantic web. Mapping existing ontologies onto
Wikidata is a task for another day.
I feel like these semantic web technologies and BlazeGraph in particular
are good fits for this project mostly because the quality of our "but what
about X?" questions is very very high. "How much inference should we do
instead of query rewriting?" instead of "Can we do inference? Can we do
query rewriting?" And "Which standard vocabularies should think about
mapping to Wikidata?" Holy cow! In any other system there aren't
"standard vocabularies" to even talk about mapping, much less a mechanism
for mapping them. Much less two! It's almost an overwhelming wealth, and as
I allude to above, it can be easy to bikeshed.
We've been reasonably careful to reach out to people we know are familiar with
this space. We're well aware of projects like the Wikidata Toolkit and its
RDF exports. We've been using those for testing. We've talked to so many
people about so many things. It's really consumed a lot more time than I'd
expected and made the search for the next backend very long. But I feel
comfortable that we're in a good place. We don't know all the answers but
we're sure there _are_ answers.
The BlazeGraph upstream has been super active with us. They've spent hours
with us over hangouts, had me out to their office (a house an hour and a half
from mine) to talk about data modeling, and spent a ton of time commenting
on Phabricator tickets. They've offered to donate a formal support
agreement as well. And to get together with us about writing any features
we might need to add to BlazeGraph. And they've added me as a committer (I
told them I had some typos to fix but I have yet to actually commit them).
And their code is well documented.
So by now you've realized I'm a fan. I believe that we should stop work on the
spreadsheet and just start work against BlazeGraph, because I think we have
phenomenal momentum with upstream. And it's a pretty clear winner on the
spreadsheet at this point. But there are two other triple stores which we
haven't fully filled out that might be viable: OpenLink Virtuoso Open
Source and Apache Jena. Virtuoso is open core, so I'm really loath to go
too deep into it at this point. Their HA features are not open source, which
implies that we'd have trouble with them as an upstream. Apache Jena just
isn't known <http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29>
to scale to data as large as BlazeGraph and Virtuoso handle. So I argue that
these are systems that, in the unlikely event that BlazeGraph goes the way
of Titan, we should start our third round of investigation against. As it
stands now I think we have a winner.
We created a phabricator task <https://phabricator.wikimedia.org/T90101>
with lots of children to run down our remaining questions. The biggest
remaining questions revolve around three areas:
1. Operational issues like "how should the cluster be deployed?" "do we
use HA at all?" "how are rolling restarts done in HA?"
2. How should we represent the data in the database? BlazeGraph (and only
BlazeGraph) has an extension that *could* help us here, called RDR. Should we
use it? (See the sketch after this list.)
3. Some folks have identified update rate as a risk. Not upstream, but
others familiar with triple stores in general.
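For context on what RDR buys us: with BlazeGraph's RDR (and the SPARQL*
syntax that goes with it), a statement can itself be the subject of other
statements, which maps fairly naturally onto Wikidata qualifiers and
references. A rough, hedged sketch of what such a query could look like; the
example property names are made up for illustration and are not a proposed
mapping:

  PREFIX entity:  <http://www.wikidata.org/entity/>
  PREFIX example: <http://example.org/>

  # Find the start-date qualifier attached to a single statement.
  # The item, property, and qualifier names are placeholders.
  SELECT ?spouse ?startDate WHERE {
    <<entity:Q42 example:spouse ?spouse>> example:startDate ?startDate .
  }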
Our plan is to work on #2 over the next few weeks because it really informs #1:
there are lots of working-set-size vs. CPU-time tradeoffs to
investigate. We'll start on #1 shortly as well. #3 is a potential risk
area so we'll be sure to investigate it soon.
I admit I'm not super happy to leave the spreadsheet in its current
unfilled-out state, but I'm excited to have something to work with
and think it's the right thing to do right now.
So thanks for reading all of this. Please reply with comments.