TL;DR: We've selected BlazeGraph to back the next Wikidata Query Service.
After Titan evaporated about a month ago we went back to the drawing board on back ends for a new Wikidata Query Service. We took four weeks (including a planned trip to Berlin) to settle on a backend. As you can see from the spreadsheet https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0 we've really blown out the number of options. As you can also see we didn't finish filling them all out. But we've still pretty much settled on BlazeGraph http://www.blazegraph.com/ anyway. Let me first explain what BlazeGraph is and then defend our decision to stop spreadsheet work.
BlazeGraph is a GPLed RDF triple store that natively supports SPARQL 1.1, RDFS, some OWL, and some extensions. Those are all semantic web terms, and they translate into "it's a graph database with an expressive, mostly standardized query language and support for inferring stuff as data is added to and removed from the graph". It also has some features that you'd recognize from nice relational databases: join order rewriting, a smart query planner, hash and nested loop joins, query rewrite rules, group by, order by, and aggregate functions.
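For anyone new to these terms: a triple store holds facts as (subject, predicate, object) triples and answers graph-pattern queries by joining patterns over them. Here is a toy sketch of that idea in Python — not BlazeGraph's actual machinery, and all identifiers below are invented for illustration:

```python
# Toy triple store: facts are (subject, predicate, object) tuples,
# queried by joining graph patterns -- roughly what SPARQL does at scale.
# All IDs below are made-up examples, not real Wikidata identifiers.
triples = [
    ("Q1", "instance_of", "city"),
    ("Q1", "country", "Q2"),
    ("Q2", "instance_of", "country"),
    ("Q3", "instance_of", "city"),
    ("Q3", "country", "Q2"),
]

def match(pattern):
    """Yield triples matching one pattern; None acts as a variable."""
    s, p, o = pattern
    for ts, tp, to in triples:
        if (s is None or s == ts) and (p is None or p == tp) and (o is None or o == to):
            yield (ts, tp, to)

# "All cities located in Q2" -- a nested-loop join of two patterns,
# one of the join strategies a planner like BlazeGraph's chooses between.
cities_in_q2 = [s for s, _, _ in match((None, "instance_of", "city"))
                if any(True for _ in match((s, "country", "Q2")))]
print(cities_in_q2)  # ['Q1', 'Q3']
```

A real engine replaces the lists with indexes and picks join orders by cost, but the shape of the problem is the same.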
These are all cool features - really the kind of things that we thought we need - but they come with an "interesting" price. Semantic Web is a very old thing that's had a really odd degree of success. If you have an hour and a half, Jim Hendler can explain https://www.youtube.com/watch?v=oKiXpO2rbJM it to you. The upshot is that _tons_ of people have _tons_ of opinions. The W3C standardizes RDF, SPARQL, RDFS, OWL, and about a billion other things. There are (mostly non-W3C) standards for talking about people http://xmlns.com/foaf/spec/, social connections http://rdfs.org/sioc/spec/, and music http://musicontology.com/specification/. And they all have rules. And Wikidata doesn't. Not like these rules. One thing I've learned from this project is that this lack of prescribed rules is one of Wikidata's founding principles. It's worth it to allow openness. So you _can_ set gender to "Bacon" or put geocoordinates on Amber https://www.wikidata.org/wiki/Q1053330. Anyway! I argue that, at least for now, we should ignore many of these standards. We need to think of Wikidata Query Service as a tool to answer questions instead of as some grand statement about the semantic web. Mapping existing ontologies onto Wikidata is a task for another day.
I feel like these semantic web technologies, and BlazeGraph in particular, are good fits for this project mostly because the quality of our "but what about X?" questions is very, very high. "How much inference should we do instead of query rewriting?" instead of "Can we do inference? Can we do query rewriting?" And "Which standard vocabularies should we think about mapping to Wikidata?" Holy cow! In any other system there aren't "standard vocabularies" to even talk about mapping, much less a mechanism for mapping them. Much less two! It's almost an overwhelming wealth, and as I allude to above it can be easy to bikeshed.
We've been reasonably careful to reach out to people we know are familiar with this space. We're well aware of projects like the Wikidata Toolkit and its RDF exports. We've been using those for testing. We've talked to so many people about so many things. It's really consumed a lot more time than I'd expected and made the search for the next backend very long. But I feel comfortable that we're in a good place. We don't know all the answers, but we're sure there _are_ answers.
The BlazeGraph upstream has been super active with us. They've spent hours with us over hangouts, had me out to their office (a house an hour and a half from mine) to talk about data modeling, and spent a ton of time commenting on Phabricator tickets. They've offered to donate a formal support agreement as well. And to get together with us about writing any features we might need to add to BlazeGraph. And they've added me as a committer (I told them I had some typos to fix but I have yet to actually commit them). And their code is well documented.
So by now you've realized I'm a fan. I believe that we should stop on the spreadsheet and just start work against BlazeGraph, because I think we have phenomenal momentum with upstream. And it's a pretty clear winner on the spreadsheet at this point. But there are two other triple stores which we haven't fully filled out that might be viable: OpenLink Virtuoso Open Source and Apache Jena. Virtuoso is open core, so I'm really loath to go too deep into it at this point. Their HA features are not open source, which implies that we'd have trouble with them as an upstream. Apache Jena just isn't known http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29 to scale to data as large as BlazeGraph and Virtuoso. So I argue that these are the systems that, in the unlikely event that BlazeGraph goes the way of Titan, we should start our third round of investigation against. As it stands now I think we have a winner.
We created a phabricator task https://phabricator.wikimedia.org/T90101 with lots of children to run down our remaining questions. The biggest remaining questions revolve around three areas:
1. Operational issues like "how should the cluster be deployed?", "do we use HA at all?", and "how are rolling restarts done in HA?"
2. How should we represent the data in the database? BlazeGraph (and only BlazeGraph) has an extension called RDR that could help us. Should we use it?
3. Some folks have identified update rate as a risk. Not upstream, but others familiar with triple stores in general.
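On #2: the appeal of RDR ("Reification Done Right") is that a triple can itself be the subject of further triples, which maps naturally onto Wikidata's qualified statements. A toy sketch of the tradeoff in Python data terms — identifiers and values are invented for illustration, and this only models the shapes, not BlazeGraph's storage:

```python
# Two ways to attach a qualifier ("as of 2014") to the statement
# "Q64 has population 3500000". All identifiers are illustrative only.

# 1. Standard RDF reification: mint a statement node and describe it
#    with extra triples -- portable, but multiplies the triple count.
reified = [
    ("stmt1", "subject", "Q64"),
    ("stmt1", "predicate", "population"),
    ("stmt1", "object", 3500000),
    ("stmt1", "as_of", 2014),
]

# 2. RDR-style: the triple itself is a first-class term that other
#    triples can reference -- one fact plus one qualifier.
rdr = [
    (("Q64", "population", 3500000), "as_of", 2014),
]

# The working-set-size tradeoff: reification stores four triples
# where RDR stores one nested one.
print(len(reified), len(rdr))  # 4 1
```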
Our plan is to work on #2 over the next few weeks because it really informs #1: there are lots of working set size vs. CPU time tradeoffs to investigate. We'll start on #1 shortly as well. #3 is a potential risk area, so we'll be sure to investigate it soon.
I admit I'm not super happy to leave the spreadsheet in its current unfilled-out state, but I'm excited to have something to work with and think it's the right thing to do right now.
So thanks for reading all of this. Please reply with comments.
Thanks again,
Nik
Hi Nik,
(I am a Data Architect by trade, fyi)
At first I saw Java with BlazeGraph and thought, "oh dear" - don't they realize what a PITA the JVM GC can be for HPC?
But then I read deeper and saw that BlazeGraph actually bypasses the GC, using C malloc through NIO. http://blog.blazegraph.com/?p=339
Then I thought, "Sweetness." They'll have no problems... as long as Wikidata puts an investment into arrayed system boards, like these from Intel http://ark.intel.com/compare/61022,61021 or similar.
BlazeGraph should be blazing as long as you give it the necessary hardware to perform C malloc through NIO with minimal contention across DIMMs (such as those system boards above, or something similar).
Hope this helps your system designers,
Thad +ThadGuidry https://www.google.com/+ThadGuidry
Nik,
Will you be incorporating MapGraph as well, with GPU hardware, as part of the scope of the Wikidata Query Service? Or is that out of scope until you know what the load limits will be, and you'll just use BlazeGraph as is with CPU-bound memory?
What are the scalability plans for also using MapGraph with GPUs and their memory in the future, in case the need for faster graph traversal arises?
Thad +ThadGuidry https://www.google.com/+ThadGuidry
On Thu, Mar 5, 2015 at 6:47 PM, Thad Guidry thadguidry@gmail.com wrote:
Nik,
Will you be incorporating MapGraph, as well, with GPU hardware as part of the scope of the Wikidata Query Service ? Or is that out of scope until you know what the load limits will be and just use BlazeGraph as is with CPU-bound memory ?
MapGraph isn't open source so we won't be using it.
What are the scalability plans for also using MapGraph with GPU's and their memory in the future, in case the need for faster graph traversal arises ?
So MapGraph is out, but otherwise the scalability plans are pretty standard stuff:
1. Instrument to find the slow stuff.
2. Fix the bugs that make it slow.
3. Buy more servers to scale out when #2 can't keep up.
These servers would just be replicas. This fails when the working set grows too large and that is something we'll be watching out for. BlazeGraph has some horizontal scaling features that we'll invoke if we get there.
Furthermore, this'll all be reasonably easy to run outside of the cluster, so if folks need to take it locally and do things with it that we can't (like MapGraph), then it should work well.
I'm certainly wary of Java. I've worked in Java for years and I'm really familiar with all of its baggage. BlazeGraph does a very reasonable job with it. It feels like half of the graph databases are written in Java and I've always wondered why. Locking down the SPARQL endpoint so it's "impossible" to overwhelm the system is high on our list of things to do, and Java makes that harder. BlazeGraph's analytic query mode should help there. Ultimately I see the JVM as a risk to mitigate in this case.
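To make the endpoint-lockdown point concrete, here is a minimal sketch of one common tactic, a hard per-query wall-clock deadline. This is an illustration only: `run_query` is a stand-in, not a BlazeGraph API, and the abandoned worker thread at the end also shows why true cancellation is hard on the JVM (and in Python):

```python
# Sketch of one endpoint-hardening tactic: run each query with a hard
# deadline so no single request can pin the service. run_query is a
# stand-in for real query evaluation, not a BlazeGraph API.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_query(sparql, delay):
    time.sleep(delay)  # pretend this is query evaluation work
    return f"results for {sparql!r}"

executor = ThreadPoolExecutor(max_workers=4)

def guarded(sparql, delay, deadline=0.5):
    future = executor.submit(run_query, sparql, delay)
    try:
        return future.result(timeout=deadline)
    except TimeoutError:
        # Note: the worker thread keeps running until its sleep ends;
        # we only stop waiting for it. Actually killing in-flight work
        # is the hard part this thread of discussion is about.
        return "rejected: query exceeded deadline"

print(guarded("SELECT ...", delay=0.01))  # fast query succeeds
print(guarded("SELECT ...", delay=2.0))   # slow query is cut off
```

A production endpoint would also cap result sizes and concurrent requests, but the deadline is the piece that protects against pathological queries.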
Nik
Good to know.
Thanks Nik. Nice work and forward plans, Team !
Thad +ThadGuidry https://www.google.com/+ThadGuidry
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
Thanks, Nik, for the writeup, and thanks so much to you, Stas, and everyone who helped you. It's great to see we're making progress on such an important piece of Wikidata.
Cheers Lydia
Hi,
Thanks for all the work. I think this is a sensible decision. What confused me at first is that I did not know BlazeGraph (and when you google for it, the first thing is an unrelated sourceforge project). An important insight for me thus was that "BlazeGraph" is the project that has up until very recently been called "Bigdata", and as such is not the new, unknown project that I first thought it was.
It seems clear that there are a few issues to address. In particular, among hundreds of known public SPARQL services [1], there does not seem to be one that identifies itself as using BlazeGraph/Bigdata. However, there is clearly potential here and it would be exciting to see the project maturing into a robust free RDF store and query engine.
Cheers,
Markus
[1] http://sparqles.okfn.org/discoverability
Yay progress! :-)
I'll try to keep WDQ alive until you have a production version up-and-running. Don't take too long...
Thanks Magnus! You've been heroic. We'll target getting a version up and running in labs as soon as we can so people can poke holes in it.
Regarding Markus' points:

BigData == BlazeGraph
-----------------------------------
I agree this is confusing; sorry I didn't mention it. When we started the evaluation it was BigData. They must have realized that the name was totally generic and ungooglable. The rename is pretty far from complete too - the code is all still in the com.bigdata package. I'm considering this an interesting quirk.
No Public Endpoints Report Using BigData or BlazeGraph
------------------------------------------------------------------------
BlazeGraph isn't ready to be exposed publicly. It'll take code, probably written by WMF and upstreamed, to do that. A public endpoint was never one of BlazeGraph's focuses before, but it has the right hooks to make it a reasonable task. I admit this doesn't bother me as much as it should, because we expected to have to do a lot of work in this area anyway. In fact it looks much easier with BlazeGraph than with Titan/Gremlin, which we were so enamoured with the first time around. The BlazeGraph code is very well documented and upstream is going to support us here as well. I'm under no illusion that this isn't going to be a pile of work, though.
The obvious question that comes from this point is "why not use Virtuoso? It is exposed publicly all over the place; you can talk to the DBpedia folks, they do it" - and this is a very compelling argument (in fact it reminds me I need to send yet more email (never ending...)). And it's an objection I can't refute. I can only say that I feel like it's a worthy trade for the upstream support we're getting. That's just my gut talking and it's not logical. And I'm certainly willing to be convinced I'm wrong. But I think the only way to really convince me would be to have the Virtuoso folks contact me and show the same kind of support we're getting from BlazeGraph.
Nik
On Fri, Mar 6, 2015 at 4:05 AM, Magnus Manske magnusmanske@googlemail.com wrote:
Yay progress! :-)
I'll try to keep WDQ alive until you have a production version up-and-running. Don't take too long...
On Fri, Mar 6, 2015 at 9:02 AM Markus Krötzsch < markus@semantic-mediawiki.org> wrote:
Hi,
Thanks for all the work. I think this is a sensible decision. What confused me at first is that I did not know BlazeGraph (and when you google for it, the first thing is an unrelated sourceforge project). An important insight for me thus was that "BlazeGraph" is the project that has up until very recently been called "Bigdata", and as such is not the new, unknown project that I first thought it was.
It seems clear that there are a few issues to address. In particular, among hundreds of known public SPARQL services [1], there does not seem to be one that identifies itself as using BlazeGraph/Bigdata. However, there is clearly potential here and it would be exciting to see the project maturing into a robust free RDF store and query engine.
Cheers,
Markus
[1] http://sparqles.okfn.org/discoverability
On 05.03.2015 19:49, Nikolas Everett wrote:
TL/DR: We're selected BlazeGraph to back the next Wikidata Query
Service.
After Titan evaporated about a month ago we went back to the drawing board on back ends for a new Wikidata Query Service. We took four weeks (including a planed trip to Berlin) to settle on a backend. As you can see from the spreadsheet <https://docs.google.com/a/wikimedia.org/spreadsheets/d/
1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0>
we've really blown out the number of options. As you can also see we didn't finish filling them all out. But we've still pretty much settled on BlazeGraph http://www.blazegraph.com/ anyway. Let me first
explain
what BlazeGraph is and then defend our decision to stop spreadsheet
work.
BlazeGraph is a GPLed RDF triple store that natively supports SPARQL 1.1, RDFS, some OWL, and some extensions. Those are all semantic web terms and they translate into a "its a graph database with an expressive, mostly standardized query language and support for inferring stuff as data is added and removed to the graph". It also has some features that you'd recognize from nice relational databases: join order rewriting, smart query planner, hash and nested loop joins, query rewrite rules, group by, order by, and aggregate functions.
These are all cool features - really the kind of things that we thought we need but they come with an "interesting" price. Semantic Web is a very old thing that's had a really odd degree of success. If you have an hour and half Jim Hendler can explain https://www.youtube.com/watch?v=oKiXpO2rbJM it to you. The upshot is that _tons_ of people have _tons_ of opinions. The W3C standardizes RDF, SPARQL, RDFS, OWL, and about a billion other things. There are (mostly non-W3C) standards for talking about people http://xmlns.com/foaf/spec/, social connections http://rdfs.org/sioc/spec/, and music http://musicontology.com/specification/. And they all have rules.
And
Wikidata doesn't. Not like these rules. One thing I've learned from this project is that this lack of prescribed rules is one of Wikidata's founding principles. Its worth it to allow openness. So you _can_ set gender to "Bacon" or put GeoCoordinants on Amber https://www.wikidata.org/wiki/Q1053330. Anyway! I argue that, at least for now, we should ignore many of these standards. We need to think of Wikidata Query Service as a tool to answer questions instead of as a some grand statement about the semantic web. Mapping existing ontologies onto Wikidata is a task for another day.
I feel like these semantic web technologies and BlazeGraph in particular are good fits for this project mostly because the quality of our "but what about X?" questions is very very high. "How much inference should we do instead of query rewriting?" instead of "Can we do inference? Can we do query rewriting?" And "Which standard vocabularies should think about mapping to Wikidata?" Holy cow! In any other system there aren't "standard vocabularies" to even talk about mapping, much less a mechanism for mapping them. Much less two! Its almost an overwhelming wealth and as I elude to above it can be easy to bikeshed.
We've been reasonably careful to reach out to people we know are familiar with this space. We're well aware of projects like the Wikidata Toolkit and its RDF exports; we've been using those for testing. We've talked to so many people about so many things. It's really consumed a lot more time than I'd expected and made the search for the next backend very long. But I feel comfortable that we're in a good place. We don't know all the answers, but we're sure there _are_ answers.
The BlazeGraph upstream has been super active with us. They've spent hours with us over hangouts, had me out to their office (a house an hour and a half from mine) to talk about data modeling, and spent a ton of time commenting on Phabricator tickets. They've offered to donate a formal support agreement as well, and to get together with us about writing any features we might need to add to BlazeGraph. And they've added me as a committer (I told them I had some typos to fix, but I have yet to actually commit the fixes). And their code is well documented.
So by now you've realized I'm a fan. I believe that we should stop work on the spreadsheet and just start building against BlazeGraph, because I think we have phenomenal momentum with upstream and it's a pretty clear winner on the spreadsheet at this point. But there are two other triple stores, which we haven't fully filled out, that might be viable: OpenLink Virtuoso Open Source and Apache Jena. Virtuoso is open core, so I'm really loath to go too deep into it at this point; their HA features are not open source, which implies that we'd have trouble with them as an upstream. Apache Jena just isn't known http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29 to scale to data as large as BlazeGraph and Virtuoso. So I argue that these are the systems we should start our third round of investigation against in the unlikely event that BlazeGraph goes the way of Titan. As it stands now, I think we have a winner.
We created a Phabricator task https://phabricator.wikimedia.org/T90101 with lots of children to run down our remaining questions. The biggest remaining questions revolve around three areas:
1. Operational issues like "how should the cluster be deployed?", "do we use HA at all?", and "how are rolling restarts done in HA?"
2. How should we represent the data in the database? BlazeGraph (and only BlazeGraph) has an extension that *could* help us called RDR. Should we use it?
3. Some folks have identified update rate as a risk. Not upstream, but others familiar with triple stores in general.
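To make question #2 concrete: a qualified Wikidata statement (say, a population value with an "as of" date) can be represented with an intermediate statement node, roughly the shape the Wikidata Toolkit RDF exports take, or with RDR's triple-annotation syntax. All names and values below are illustrative stand-ins, not the actual export vocabulary:

```turtle
@prefix :    <http://example.org/sketch/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Option A: an intermediate statement node. Costs an extra node and
# extra triples per statement, but is plain standard RDF.
:Q64    :populationStatement :stmt1 .
:stmt1  :populationValue     3500000 ;
        :asOf                "2014"^^xsd:gYear .

# Option B: RDR annotates the triple itself (BlazeGraph-specific).
<< :Q64 :population 3500000 >> :asOf "2014"^^xsd:gYear .
```

Option A is what keeps the data portable to other triple stores; Option B is terser and cheaper to store but ties the data format to BlazeGraph. That storage-size difference is one of the working set size vs. CPU time tradeoffs in play.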
Our plan is to work on #2 over the next few weeks because it really informs #1: there are lots of working set size vs. CPU time tradeoffs to investigate. We'll start on #1 shortly as well. #3 is a potential risk area, so we'll be sure to investigate it soon.
I admit I'm not super happy to leave the spreadsheet in its current unfilled-out state, but I'm excited to have something to work with and think it's the right thing to do right now.
So thanks for reading all of this. Please reply with comments.
Thanks again,
Nik
Wikidata-tech mailing list Wikidata-tech@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
On 06.03.2015 15:05, Nikolas Everett wrote: ...
Regarding Markus' points:
...
The obvious question that comes from this point is "why not use Virtuoso? It is exposed publicly all over the place; you can talk to the DBpedia folks, they do it," and this is a very compelling argument.
As long as you are convinced that BlazeGraph can deliver at least medium-level query performance (I guess your evaluation involved performance tests), then I am convinced that it is a good choice for us. Our use case is relatively small in terms of data but large in terms of users. In this setting, high availability is more important than raw query performance. Moreover, it is fair to assume that the query loads that today's public SPARQL endpoints are getting are quite different in nature from what we would have for Wikidata's query service anyway, so there is no guarantee that any system will work for us without adjustments. And of course we may hope that the use of a standard data format and query language would still allow us to consider other systems in the future if necessary, and much of the work done up until then will still be meaningful. For now, the most important thing is to get started :-)
Cheers,
Markus
On Mar 5, 2015 8:50 PM, "Nikolas Everett" neverett@wikimedia.org wrote:
- How should we represent the data in the database? BlazeGraph (and only BlazeGraph) has an extension that *could* help us called RDR. Should we use it?
I don't think RDR is compatible with the existing reification techniques chosen, at least for the Wikidata toolkit RDF exports.
On Fri, Mar 6, 2015 at 11:28 AM, Dimitris Kontokostas jimkont@gmail.com wrote:
On Mar 5, 2015 8:50 PM, "Nikolas Everett" neverett@wikimedia.org wrote:
- How should we represent the data in the database? BlazeGraph (and only BlazeGraph) has an extension that *could* help us called RDR. Should we use it?
I don't think RDR is compatible with the existing reification techniques chosen, at least for the Wikidata toolkit RDF exports.
Yup, mostly around values. We'd have to take that into account. Markus and I have talked about that. Wikidata Toolkit is very complete. I want queries to be relatively simple to write but still give access to the same data that Wikidata Toolkit exports. We're experimenting with lots of things: changing the export, inference, and query rewrites. And we think that some combination of those will be required. Changing the export is both the hardest and easiest of the three to do.
The BlazeGraph developers think of RDR as two semi-independent things:
1. A storage layer optimization that can be performed against regular triples and SPARQL.
2. A syntax extension that looks like << :foo :bar :baz >> :quz :norf and explicitly invokes that optimization.
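In query form, that syntax extension lets you match a triple and a qualifier attached to that triple in a single pattern. A sketch, with hypothetical names (these aren't the real export vocabulary):

```sparql
# Sketch only: find items, a population value, and the qualifier
# annotating that specific statement, in one RDR pattern.
PREFIX : <http://example.org/sketch/>

SELECT ?item ?population ?year
WHERE {
  << ?item :population ?population >> :asOf ?year .
}
```

Without the extension, the same question takes the more verbose intermediate-node pattern, but stays standard SPARQL.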
The syntax extension is one of our options for exposing qualifiers and references in the query language. It's just one of the options.
Some days I wake up and think "We can just use RDR. The lock-in isn't that bad because we could reimplement the syntax elsewhere." Other days I don't.
We're working to experiment with different RDF exports now and we'll get a better feeling about it soon.
I can certainly expand on what I mean by "queries must be relatively simple to write" later.
Nik