Hello wikidata-tech!
tl/dr: we're going to move forward building a query service against Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
We finally finished up the exploratory phase of building a Wikidata Query Service we can run on the cluster. As a reminder, these were our goals:
1. Horizontal scalability for handling more queries and large data sets
2. Active user community outside of WMF
The system has to be able to answer questions like "find me all the humans without a date of death born more than 115 years ago" and "list the 10 biggest cities in Europe with female mayors" or "find me all writers who are not authors".
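To make the first example question concrete, here is a minimal Python sketch of it as a filter over a tiny in-memory list. It is only an illustration of the question's shape, not how either prototype stores or queries data. The property and item IDs (P31 "instance of", Q5 "human", Q515 "city", P569 "date of birth", P570 "date of death") are the real Wikidata identifiers; the records, the X1..X4 item ids, and the function names are made up for the example.

```python
from datetime import date

# Hypothetical toy records, NOT real Wikidata data. Only the property/item
# IDs (P31, Q5, Q515, P569, P570) correspond to real Wikidata identifiers.
ITEMS = [
    {"id": "X1", "P31": "Q5", "P569": date(1880, 3, 1), "P570": None},
    {"id": "X2", "P31": "Q5", "P569": date(1880, 3, 1), "P570": date(1950, 1, 1)},
    {"id": "X3", "P31": "Q5", "P569": date(1990, 6, 15), "P570": None},
    {"id": "X4", "P31": "Q515", "P569": None, "P570": None},
]


def years_before(day, years):
    """Return the date `years` years before `day` (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def humans_missing_death_date(items, today, min_age_years=115):
    """Ids of humans with a known birth date before the cutoff and no death date."""
    cutoff = years_before(today, min_age_years)
    return [
        item["id"]
        for item in items
        if item.get("P31") == "Q5"           # instance of: human
        and item.get("P570") is None         # no date of death recorded
        and item.get("P569") is not None     # date of birth is known
        and item["P569"] < cutoff            # born more than N years ago
    ]


if __name__ == "__main__":
    print(humans_missing_death_date(ITEMS, today=date(2015, 1, 7)))
```

A real query service would answer this from indexes rather than a linear scan; the sketch only shows which conditions the question combines.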
We're perfectly happy if it's something we have to hack on, but we don't want to be the _only_ ones using it. That lands you in lsearchd-like situations where the open source world moves in a different direction than you do and then no one knows how to support your software. You become afraid to restart it, much less release new versions of it. (BTW, Chad just pulled the last remaining trickle of traffic from it today, party!)
So we identified three good candidates: Titan (backed by Cassandra), OrientDB and ArangoDB. We prototyped against both Titan and OrientDB and both worked pretty well. We didn't have time to prototype against ArangoDB and we also had a communications mixup with upstream. Anyway, when it came down to it we had two working prototypes so taking time to build a third felt a bit redundant. You can see many of our notes here https://www.mediawiki.org/wiki/Wikibase/Indexing.
The trouble with two working prototypes is that you can't just flip a coin to pick one. I guess you could, but instead we made a spreadsheet https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing. We rated Titan and OrientDB in 25ish categories, weighted the categories, and added the results and picked a winner. The process wasn't perfect, but it was a thing of beauty to watch four people simultaneously edit the spreadsheet, leaving comments explaining most of the numbers. Titan eventually won by a fairly wide margin so we're proceeding with work on it.
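For anyone who hasn't opened the sheet, the rate/weight/add-up approach is roughly a weighted sum per candidate. A minimal Python sketch of that idea follows. The category names, weights, ratings, and candidate labels "A" and "B" are all made up for illustration and are not the numbers from the spreadsheet; the function names are hypothetical too.

```python
def weighted_total(ratings, weights):
    """Sum of weight * rating over every weighted category."""
    return sum(weights[category] * ratings[category] for category in weights)


def pick_winner(candidates, weights):
    """Return (name of the highest-scoring candidate, all totals by name)."""
    totals = {
        name: weighted_total(ratings, weights)
        for name, ratings in candidates.items()
    }
    winner = max(totals, key=totals.get)
    return winner, totals


if __name__ == "__main__":
    # Hypothetical categories, weights, and ratings: NOT the real spreadsheet values.
    weights = {"scalability": 3, "community": 2, "ease_of_use": 1}
    candidates = {
        "A": {"scalability": 4, "community": 3, "ease_of_use": 2},
        "B": {"scalability": 2, "community": 4, "ease_of_use": 5},
    }
    winner, totals = pick_winner(candidates, weights)
    print(winner, totals)
```

The weights matter as much as the ratings: in this made-up example "B" has the higher unweighted sum, but "A" wins once scalability is weighted most heavily.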
I expect lots of people will have comments. Please reply here and/or comment directly on the spreadsheet. Everyone in WMF can leave comments on the sheet and most interested parties at WMDE have been given rights to do so. I'm loath to set the document to world comment-able for some reason.
I don't think it's particularly likely we'll end up reworking the spreadsheet to the point where Titan isn't still the victor, but if we do, let's try to do so quickly so we can stop work on it.
We've started using our workboard https://phabricator.wikimedia.org/project/board/891/ to keep track of the work left to do. The next big steps are:
* Draw up an architecture so we know how many and what kind of servers to ask ops for
* Port what we have from Titan 0.5.0 to Titan 0.9.0-M1 so we're on the most current development line (also, 0.9.0-M1 supports reverse-i-search, and who can live without that?)
* Implement incremental updates
* Start prototyping a public query API
So, any questions?
Note: I've also sent this email to WMF-ops but can't add both lists to the same conversation because ops emails are moderated and any conversation here would create a moderation nightmare.
Nik
Hi Nik!
Nikolas Everett wrote on 7-1-2015 at 0:01:
tl/dr: we're going to move forward building a query service against Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
Great to hear that you're making progress! One thing you didn't mention in your email or in the spreadsheet is that we already have a query service prototype at http://wdq.wmflabs.org/ . You do know about that, right?
Whatever you come up with will be compared with WikidataQuery. You need to manage expectations here and communicate. WikidataQuery is your baseline. If you lack certain features that are in WikidataQuery, the new query service sucks from a user perspective. If on the other hand you have things that are better (reliability? speed?), the new query service is a good next step from a user perspective.
I would strongly suggest you make a comparison of WikidataQuery and the new query service so the users know what to expect and can comment at an early stage. If you don't do this you risk ending up with a technically outstanding service that nobody is using.
Maarten
Hi Maarten,
they have evaluated, and rejected (rightly so) WDQ as one alternative. They also seem to have all my query commands covered, AFAICT. See the notes Nik linked to.
Personally, I will be relieved once running such a vital service is not on my part-time shoulders anymore :-) WDQ has been challenging and fun, but it should really be an official WMF service offering a Wikidata query API, just like Wikidata proper is.
Cheers, Magnus
On Thu, Jan 8, 2015 at 9:51 PM, Maarten Dammers maarten@mdammers.nl wrote:
Hi Nik!
Nikolas Everett wrote on 7-1-2015 at 0:01:
tl/dr: we're going to move forward building a query service against Titan/Cassandra rather than OrientDB or ArangoDB or anything else.
Great to hear that you're making progress! One thing you didn't mention in your email or in the spreadsheet is that we already have a query service prototype at http://wdq.wmflabs.org/ . You do know about that, right?
Whatever you come up with will be compared with WikidataQuery. You need to manage expectations here and communicate. WikidataQuery is your baseline. If you lack certain features that are in WikidataQuery, the new query service sucks from a user perspective. If on the other hand you have things that are better (reliability? speed?), the new query service is a good next step from a user perspective.
I would strongly suggest you make a comparison of WikidataQuery and the new query service so the users know what to expect and can comment at an early stage. If you don't do this you risk ending up with a technically outstanding service that nobody is using.
Maarten
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Hi Magnus,
Magnus Manske wrote on 8-1-2015 at 22:07:
Hi Maarten,
they have evaluated, and rejected (rightly so) WDQ as one alternative. They also seem to have all my query commands covered, AFAICT. See the notes Nik linked to.
I don't see WDQ as an alternative. Maybe I wasn't clear enough, I'm talking about the *functionality* of WikidataQuery. So for example WDQ has the feature that I can get a list of items for which certain claims are true. That's a feature.
Personally, I will be relieved once running such a vital service is not on my part-time shoulders anymore :-) WDQ has been challenging and fun, but it should really be an official WMF service offering a Wikidata query API, just like Wikidata proper is.
Of course, but if the new service is lacking features or just not better, people will continue using WDQ. It would be like the Toolserver -> Toollabs experience all over again :-(
Maarten
Cheers, Magnus
Hi!
I don't see WDQ as an alternative. Maybe I wasn't clear enough, I'm talking about the *functionality* of WikidataQuery. So for example WDQ has the feature that I can get a list of items for which certain claims are true. That's a feature.
Matching the features of WDQ is part of what we're looking for in the new service. We do not know yet what the public API would look like, but on the discussion page for https://www.mediawiki.org/wiki/Wikibase/Indexing we have some things that the internal engine can do and how they match what WDQ does. So far we're pretty confident we can do the same things WDQ does, but we don't know yet what exactly it would look like :)
Thanks, Stas