Next steps for complex queries on the cluster - Wikidata-tech

6 Jan 2015


Hello wikidata-tech!
tl;dr: we're going to move forward building a query service against
Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
We finally finished up the exploratory phase of building a Wikidata Query
Service we can run on the cluster.  As a reminder, these were our goals:
1.  Horizontal scalability for handling more queries and large data sets
2.  Active user community outside of WMF
The system has to be able to answer questions like "find me all the humans
without a date of death born more than 115 years ago", "list the 10
biggest cities in Europe with female mayors", or "find me all writers who
are not authors".
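To make the first of those questions concrete, here is a toy Python sketch that answers it over a small in-memory list. It is only an illustration of the question's shape: the records, field names, and the humans_missing_death_date helper are all made up for this example and say nothing about how the query service itself will store or query data.

```python
from datetime import date

# Hypothetical toy records; labels and dates are invented for illustration.
ENTITIES = [
    {"label": "Person A", "instance_of": "human", "born": date(1880, 5, 1), "died": None},
    {"label": "Person B", "instance_of": "human", "born": date(1890, 1, 1), "died": date(1950, 1, 1)},
    {"label": "Person C", "instance_of": "human", "born": date(1990, 3, 3), "died": None},
    {"label": "City D", "instance_of": "city", "born": None, "died": None},
]


def years_before(day, years):
    """Return the date `years` years before `day`, clamping 29 February to 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        # `day` is 29 February and the target year is not a leap year.
        return day.replace(year=day.year - years, day=28)


def humans_missing_death_date(entities, today, min_age_years=115):
    """Labels of humans born before the cutoff that have no date of death."""
    cutoff = years_before(today, min_age_years)
    return [
        e["label"]
        for e in entities
        if e["instance_of"] == "human"
        and e["born"] is not None
        and e["born"] < cutoff
        and e["died"] is None
    ]


if __name__ == "__main__":
    print(humans_missing_death_date(ENTITIES, date(2015, 1, 6)))
```

A linear scan like this obviously does not scale to the full data set, which is why horizontal scalability is the first goal on the list.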
We're perfectly happy if it's something we have to hack on, but we don't
want to be the _only_ ones using it.  That lands you in lsearchd-like
situations where the open source world moves in a different direction than
you do and then no one knows how to support your software.  You become
afraid to restart it, much less release new versions of it.  (BTW, Chad
just pulled the last remaining trickle of traffic from it today, party!)
So we identified three good candidates: Titan (backed by Cassandra),
OrientDB and ArangoDB.  We prototyped against both Titan and OrientDB and
both worked pretty well.  We didn't have time to prototype against ArangoDB
and we also had a communications mixup with upstream.  Anyway, when it came
down to it we had two working prototypes so taking time to build a third
felt a bit redundant.  You can see many of our notes here:
https://www.mediawiki.org/wiki/Wikibase/Indexing.
The trouble with two working prototypes is that you can't just flip a coin
to pick one.  I guess you could, but instead we made a spreadsheet
https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing.
We rated Titan and OrientDB in 25ish categories, weighted the categories,
added up the results, and picked a winner.  The process wasn't perfect, but
it was a thing of beauty to watch four people simultaneously edit the
spreadsheet, leaving comments explaining most of the numbers.  Titan
eventually won by a fairly wide margin so we're proceeding with work on it.
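For anyone who hasn't opened the spreadsheet, the arithmetic is just a weighted sum per candidate. The Python sketch below shows the idea under loud assumptions: the category names, weights, ratings, and the "Candidate A"/"Candidate B" labels are invented for illustration and are not the numbers from the sheet.

```python
def weighted_total(ratings, weights):
    """Sum of weight * rating over every weighted category."""
    return sum(weights[category] * ratings[category] for category in weights)


# Hypothetical weights and ratings; the real sheet has 25ish categories.
WEIGHTS = {"horizontal scalability": 3, "community": 3, "query language": 2}
RATINGS = {
    "Candidate A": {"horizontal scalability": 4, "community": 3, "query language": 4},
    "Candidate B": {"horizontal scalability": 3, "community": 4, "query language": 3},
}

totals = {name: weighted_total(r, WEIGHTS) for name, r in RATINGS.items()}
winner = max(totals, key=totals.get)

if __name__ == "__main__":
    print(totals, winner)
```

Changing a weight or a rating and re-running shows how sensitive the outcome is, which is the kind of check worth doing before arguing about any single number in the sheet.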
I expect lots of people will have comments.  Please reply here and/or
comment directly on the spreadsheet.  Everyone in WMF can leave comments on
the sheet and most interested parties at WMDE have been given rights to do
so.  I'm loath to set the document to world comment-able for some reason.
I don't think it's particularly likely we'll end up reworking the
spreadsheet to the point where Titan isn't still the victor, but if we do
let's try to do so quickly so we can stop work on it.
We've started to try and use our workboard
https://phabricator.wikimedia.org/project/board/891/ to keep track of
work left to do.  The next big steps are:
* Draw up an architecture so we know how many and what kind of servers to
ask ops for
* Port what we have from Titan 0.5.0 to Titan 0.9.0-M1 so we're on the
most current development line (also, 0.9.0-M1 supports reverse-i-search,
and who can live without that?)
* Implement incremental updates
* Start prototyping a public query API
So, any questions?
Note: I've also sent this email to WMF-ops but can't add both lists to the
same conversation because ops emails are moderated and any conversation here
would create a moderation nightmare.
Nik