Hello wikidata-tech!
tl;dr:
we're going to move forward building a query service against
Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
We finally finished up the exploratory phase of building a Wikidata Query
Service we can run on the cluster. As a reminder, these were our goals:
1. Horizontal scalability for handling more queries and large data sets
2. Active user community outside of WMF
The system has to be able to answer questions like "find me all the humans
without a date of death born more than 115 years ago", "list the 10 biggest
cities in Europe with female mayors", or "find me all writers who are not
authors".
On the community goal: we're perfectly happy if it's something we have to
hack on, but we don't want to be the _only_ ones using it. That lands you
in lsearchd-like situations where the open source world moves in a
different direction than you do and then no one knows how to support your
software. You become afraid to restart it, much less release new versions
of it. (BTW, Chad just pulled the last remaining trickle of traffic from
it today, party!)
So we identified three good candidates:
Titan (backed by Cassandra), OrientDB and ArangoDB. We prototyped
against both Titan and OrientDB and both worked pretty well. We didn't
have time to prototype against ArangoDB and we also had a communications
mixup with upstream. Anyway, when it came down to it we had two working
prototypes, so taking time to build a third felt a bit redundant. You
can see many of our notes here.
The trouble with two working prototypes is that you can't just flip a coin
to pick one. I guess you could, but instead we made a spreadsheet.
We rated Titan and OrientDB in 25ish categories, weighted the
categories, summed the results, and picked a winner. The process
wasn't perfect, but it was a thing of beauty to watch four people
simultaneously edit the spreadsheet, leaving comments explaining most of
the numbers. Titan eventually won by a fairly wide margin, so we're
proceeding with work on it.
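
For anyone who wants to picture the arithmetic, here is a minimal sketch of
that kind of weighted scoring. The category names, weights, ratings, and
candidate labels below are all made up for illustration; they are not the
numbers from our spreadsheet, and the real sheet has far more categories.

```python
# Weighted-sum scoring sketch. All names and numbers are hypothetical
# placeholders, NOT the values from the actual evaluation spreadsheet.

# How much each category matters (higher = more important).
weights = {
    "horizontal scalability": 5,
    "outside community": 4,
    "query language": 3,
}

# Per-candidate ratings in each category.
ratings = {
    "Candidate A": {
        "horizontal scalability": 4,
        "outside community": 3,
        "query language": 2,
    },
    "Candidate B": {
        "horizontal scalability": 3,
        "outside community": 4,
        "query language": 4,
    },
}


def weighted_total(scores, weights):
    """Multiply each category rating by its weight and add them up."""
    return sum(weights[category] * scores[category] for category in weights)


def pick_winner(ratings, weights):
    """Return the best-scoring candidate and the totals for everyone."""
    totals = {name: weighted_total(scores, weights)
              for name, scores in ratings.items()}
    winner = max(totals, key=totals.get)
    return winner, totals


winner, totals = pick_winner(ratings, weights)
print(winner, totals)
```

The weights matter as much as the ratings here: a candidate that loses the
most heavily weighted category can still win overall if it does well
enough everywhere else, which is why we argued about the weights too.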
I expect lots of people will have comments. Please reply here and/or
comment directly on the spreadsheet. Everyone in WMF can leave comments on
the sheet and most interested parties at WMDE have been given rights to do
so. I'm loath to set the document to world-commentable for some reason.