On Mar 5, 2015 8:50 PM, "Nikolas Everett" <neverett(a)wikimedia.org> wrote:
TL/DR: We've selected BlazeGraph to back the next Wikidata Query Service.
After Titan evaporated about a month ago we went back to the drawing
board on backends for a new Wikidata Query Service. We took four weeks
(including a planned trip to Berlin) to settle on a backend. As you can see
from the spreadsheet we've really blown out the number of options. As you
can also see we didn't finish filling them all out. But we've still pretty
much settled on BlazeGraph anyway. Let me first explain what BlazeGraph is
and then defend our decision to stop spreadsheet work.
BlazeGraph is a GPLed RDF triple store that natively supports SPARQL 1.1,
some OWL, and some extensions. Those are all semantic web terms; they
translate into "it's a graph database with an expressive, mostly
standardized query language and support for inferring stuff as data is
added to and removed from the graph". It also has some features that you'd
recognize from nice relational databases: join order rewriting, a smart
query planner, hash and nested loop joins, query rewrite rules, group by,
order by, and aggregate functions.
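To give a flavor of what those SPARQL 1.1 features look like, here's a hedged sketch of an aggregate query — the predicate names are placeholders, not the actual RDF mapping we'll end up using for Wikidata:

```sparql
# Count items by type and list the most common types first.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?type (COUNT(?item) AS ?count)
WHERE {
  ?item rdf:type ?type .
}
GROUP BY ?type
ORDER BY DESC(?count)
LIMIT 10
```

GROUP BY, ORDER BY, and COUNT here are all standard SPARQL 1.1; a good query planner decides the join order for us.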
These are all cool features - really the kind of things that we thought
we'd need - but they come with an "interesting" price. Semantic Web is a very
old thing that's had a really odd degree of success. If you have an hour
and a half, Jim Hendler can explain it to you. The upshot is that _tons_ of
people have _tons_ of opinions. The W3C standardizes RDF, SPARQL, RDFS,
OWL, and about a billion other things. There are (mostly non-W3C)
standards for talking about people, social connections, and music. And
they all have rules. And Wikidata doesn't. Not like these rules. One
thing I've learned from this project is that this lack of prescribed rules
is one of Wikidata's founding principles. It's worth it to allow openness.
So you _can_ set gender to "Bacon" or put GeoCoordinates on Amber.
Anyway! I argue that, at least for now, we should ignore many of these
standards. We need to think of Wikidata Query Service as a tool to answer
questions instead of as some grand statement about the semantic web.
Mapping existing ontologies onto Wikidata is a task for another day.
I feel like these semantic web technologies, and BlazeGraph in particular,
fit this project mostly because the quality of our "but what
about X?" questions is very very high. "How much inference should we do
instead of query rewriting?" instead of "Can we do inference? Can we do
query rewriting?" And "Which standard vocabularies should we think about
mapping to Wikidata?" Holy cow! In any other system there aren't
"standard vocabularies" to even talk about mapping, much less a mechanism
for mapping them. Much less two! It's almost an overwhelming wealth and,
as I allude to above, it can be easy to bikeshed.
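As one hedged illustration of the inference-vs-rewriting question (again with placeholder predicates, not a committed data model): instead of materializing inferred class-membership triples at load time, SPARQL 1.1 property paths let a query expand a transitive relation on the fly:

```sparql
# Query-rewriting style: walk "subclass of" transitively at query
# time with a property path, rather than pre-inferring the triples.
PREFIX ex: <http://example.org/>

SELECT ?item
WHERE {
  ?item ex:instanceOf/ex:subclassOf* ex:Person .
}
```

The tradeoff is roughly storage and update cost (materialized inference) versus per-query CPU time (path expansion); BlazeGraph can do either.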
We've been reasonably careful to reach out to people we know are familiar
with this space. We're well aware of projects like the Wikidata Toolkit
and its RDF exports. We've been using those for testing. We've talked to
so many people about so many things. It's really consumed a lot more time
than I'd expected and made the search for the next backend very long. But
I feel comfortable that we're in a good place. We don't know all the
answers but we're sure there _are_ answers.
The BlazeGraph upstream has been super active with us. They've spent time
with us over hangouts, had me out to their office (a house an hour
and a half from mine) to talk about data modeling, and spent a ton of time
commenting on Phabricator tickets. They've offered to donate a formal
support agreement as well. And to get together with us about writing any
features we might need to add to BlazeGraph. And they've added me as a
committer (I told them I had some typos to fix but I have yet to actually
commit them). And their code is well documented.
So by now you've realized I'm a fan. I believe that we should stop work on
the spreadsheet and just start work against BlazeGraph because I think we
have phenomenal momentum with upstream. And it's a pretty clear winner on
the spreadsheet at this point. But there are two other triple stores which
we haven't fully filled out that might be viable: OpenLink Virtuoso Open
Source and Apache Jena. Virtuoso is open core so I'm really loath to go
too deep into it at this point. Their HA features are not open source, which
implies that we'd have trouble with them as an upstream. Apache Jena just
isn't known to scale to data as large as BlazeGraph and Virtuoso. So I
argue that these are systems that, in the unlikely event that BlazeGraph
goes the way of Titan, we should start our third round of investigation
against. As it stands now I think we have a winner.
We created a Phabricator task with lots of children to run down our
questions. The biggest remaining questions revolve around three areas:
1. Operational issues like "how should the cluster be deployed?", "do we
use HA at all?", and "how are rolling restarts done in HA?"
2. How should we represent the data in the database? BlazeGraph (and
only BlazeGraph) has an extension that *could* help us called RDR.
I don't think RDR is compatible with the existing reification techniques
chosen, at least for the Wikidata Toolkit RDF exports.
3. Some folks have identified update rate as a risk. Not upstream, but
others familiar with triple stores in general.
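To make question #2 concrete, here's a hedged sketch (the `ex:` predicates are placeholders) of the two ways to attach a qualifier to a statement. Standard RDF reification spells the statement out as a separate node with four extra triples; BlazeGraph's RDR makes the triple itself addressable:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# Standard RDF reification: the statement "ex:Q1 ex:population 1000"
# becomes an rdf:Statement node that qualifiers can hang off of.
_:st a rdf:Statement ;
     rdf:subject   ex:Q1 ;
     rdf:predicate ex:population ;
     rdf:object    1000 ;
     ex:asOf       "2014"^^xsd:gYear .

# BlazeGraph RDR: the triple itself is addressable with << >>, and the
# qualifier attaches to it directly.
<< ex:Q1 ex:population 1000 >> ex:asOf "2014"^^xsd:gYear .
```

The two representations are not interchangeable, which is why RDR's compatibility with the exports' reification scheme matters.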
Our plan is to work on #2 over the next few weeks because it really informs
the other questions, and there are lots of working set size vs CPU time
tradeoffs to investigate. We'll start on #1 shortly as well. #3 is a
potential risk area so we'll be sure to investigate it soon.
I admit I'm not super happy to leave the spreadsheet in its unfilled-out
state, but I'm excited to have something to work with
and I think it's the right thing to do right now.
So thanks for reading all of this. Please reply with comments.
Wikidata-tech mailing list