Hello wikidata-tech!
tl;dr: we're going to move forward building a query service against Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
We finally finished up the exploratory phase of building a Wikidata Query Service we can run on the cluster. As a reminder, these were our goals:
1. Horizontal scalability for handling more queries and large data sets
2. Active user community outside of WMF
The system has to be able to answer questions like "find me all the humans without a date of death born more than 115 years ago" and "list the 10 biggest cities in Europe with female mayors" or "find me all writers who are not authors".
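To make the shape of the first question concrete, here is a rough sketch in plain Python over a handful of in-memory records. It is only an illustration of the filter, not how Titan or any of the candidates would run it. The property and item IDs (P31 "instance of", Q5 "human", P569 "date of birth", P570 "date of death", Q515 "city") are the real Wikidata ones, but the records, the function name and the as_of parameter are made up for the example.

```python
from datetime import date

HUMAN = "Q5"  # Wikidata item for "human"

# Hypothetical records; real data would come from the graph store.
items = [
    {"id": "example-1", "P31": "Q5", "P569": date(1890, 5, 1), "P570": None},
    {"id": "example-2", "P31": "Q5", "P569": date(1890, 5, 1), "P570": date(1950, 1, 1)},
    {"id": "example-3", "P31": "Q5", "P569": date(1980, 5, 1), "P570": None},
    {"id": "example-4", "P31": "Q515"},  # a city, so no birth or death date
]


def humans_missing_death_date(records, as_of, max_age_years=115):
    """Return ids of humans born more than max_age_years before as_of
    that have no date of death recorded."""
    try:
        cutoff = as_of.replace(year=as_of.year - max_age_years)
    except ValueError:
        # as_of is Feb 29 and the target year is not a leap year
        cutoff = as_of.replace(year=as_of.year - max_age_years, day=28)
    return [
        r["id"]
        for r in records
        if r.get("P31") == HUMAN
        and r.get("P570") is None
        and r.get("P569") is not None
        and r["P569"] < cutoff
    ]


print(humans_missing_death_date(items, as_of=date(2015, 1, 1)))
```

The other two questions add joins across items (city to mayor to gender) and set differences, which is where a graph query language earns its keep over a flat filter like this one.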
We're perfectly happy if it's something we have to hack on, but we don't want to be the _only_ ones using it. That lands you in lsearchd-like situations where the open source world moves in a different direction than you do and then no one knows how to support your software. You become afraid to restart it, much less release new versions of it. (BTW, Chad just pulled the last remaining trickle of traffic from it today, party!)
So we identified three good candidates: Titan (backed by Cassandra), OrientDB and ArangoDB. We prototyped against both Titan and OrientDB and both worked pretty well. We didn't have time to prototype against ArangoDB and we also had a communications mixup with upstream. Anyway, when it came down to it we had two working prototypes, so taking time to build a third felt a bit redundant. You can see many of our notes here https://www.mediawiki.org/wiki/Wikibase/Indexing.
The trouble with two working prototypes is that you can't just flip a coin to pick one. I guess you could, but instead we made a spreadsheet https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing. We rated Titan and OrientDB in 25ish categories, weighted the categories, summed the weighted ratings, and picked a winner. The process wasn't perfect, but it was a thing of beauty to watch four people simultaneously edit the spreadsheet, leaving comments explaining most of the numbers. Titan eventually won by a fairly wide margin so we're proceeding with work on it.
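For anyone who wants the arithmetic spelled out, the scoring boils down to a weighted sum per candidate. Here is a minimal sketch in Python. The category names, weights and ratings are placeholders invented for the example, and the candidates are deliberately anonymous; the real numbers live in the spreadsheet.

```python
def weighted_total(ratings, weights):
    """Sum rating * weight over every weighted category."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(ratings[category] * weight for category, weight in weights.items())


# Placeholder weights and ratings, NOT the values from the spreadsheet.
weights = {"scalability": 3, "community": 2, "query_language": 1}
candidates = {
    "candidate_a": {"scalability": 4, "community": 3, "query_language": 5},
    "candidate_b": {"scalability": 3, "community": 4, "query_language": 2},
}

totals = {name: weighted_total(r, weights) for name, r in candidates.items()}
winner = max(totals, key=totals.get)
print(totals, winner)
```

The weights matter as much as the ratings: nudging a single heavy category can flip a close result, which is why comments on the weights are as welcome as comments on the scores.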
I expect lots of people will have comments. Please reply here and/or comment directly on the spreadsheet. Everyone in WMF can leave comments on the sheet and most interested parties at WMDE have been given rights to do so. I'm loath to set the document to world comment-able for some reason.
I don't think it's particularly likely we'll end up reworking the spreadsheet to the point where Titan isn't still the victor, but if we do let's try to do so quickly so we can stop work on it.
We've started to use our workboard https://phabricator.wikimedia.org/project/board/891/ to keep track of work left to do. The next big steps are:
* Draw up an architecture so we know how many and what kind of servers to ask ops for
* Port what we have from Titan 0.5.0 to Titan 0.9.0-M1 so we're on the most current development line (also, 0.9.0-M1 supports reverse-i-search, and who can live without that?)
* Implement incremental updates
* Start prototyping a public query API
So, any questions?
Note: I've also sent this email to WMF-ops but can't add both lists to the same conversation because ops emails are moderated and any conversation here would create a moderation nightmare.
Nik