Thanks Magnus!  You've been heroic.  We'll target getting a version up and running in labs as soon as we can so people can poke holes in it.

Regarding Markus' points:
BigData == BlazeGraph
-----------------------------------
I agree this is confusing; sorry I didn't mention it.  When we started the evaluation it was BigData.  They must have realized that the name was totally generic and ungooglable.  The rename is pretty far from complete too.  The code is all still in the com.bigdata package....  I'm considering this an interesting quirk.

No Public Endpoint Reports BigData or BlazeGraph
------------------------------------------------------------------------
BlazeGraph isn't ready to be exposed publicly.  It'll take code, probably written by WMF and upstreamed, to do that.  It was never one of BlazeGraph's focuses before but it has the right hooks to make it a reasonable task.  I admit this doesn't bother me as much as it should because we expected to have to do a lot of work in this area anyway.  In fact it looks much easier with BlazeGraph than with Titan/Gremlin which we were so enamoured with the first time around.  The BlazeGraph code is very well documented and upstream is going to support us here as well.  I'm under no illusion that this isn't going to be a pile of work though.

The obvious question that comes from this point is "why not use Virtuoso?  It is exposed publicly all over the place, you can talk to the dbpedia folks, they do it" and this is a very compelling argument (in fact it reminds me I need to send yet more email (never ending....)).  And it's an objection I can't refute.  I can only say that I feel like it's a worthy trade for the upstream support we're getting.  That's just my gut talking and it's not logical.   And I'm certainly willing to be convinced I'm wrong.  But I think the only way to really convince me would be to have the Virtuoso folks contact me and show the same kind of support we're getting from BlazeGraph.

Nik

On Fri, Mar 6, 2015 at 4:05 AM, Magnus Manske <magnusmanske@googlemail.com> wrote:
Yay progress! :-)

I'll try to keep WDQ alive until you have a production version up-and-running. Don't take too long...

On Fri, Mar 6, 2015 at 9:02 AM Markus Krötzsch <markus@semantic-mediawiki.org> wrote:
Hi,

Thanks for all the work. I think this is a sensible decision. What
confused me at first is that I did not know BlazeGraph (and when you
google for it, the first thing is an unrelated sourceforge project). An
important insight for me thus was that "BlazeGraph" is the project that
has up until very recently been called "Bigdata", and as such is not the
new, unknown project that I first thought it was.

It seems clear that there are a few issues to address. In particular,
among hundreds of known public SPARQL services [1], there does not seem
to be one that identifies itself as using BlazeGraph/Bigdata. However,
there is clearly potential here and it would be exciting to see the
project maturing into a robust free RDF store and query engine.

Cheers,

Markus

[1] http://sparqles.okfn.org/discoverability

On 05.03.2015 19:49, Nikolas Everett wrote:
> TL/DR: We've selected BlazeGraph to back the next Wikidata Query Service.
>
> After Titan evaporated about a month ago we went back to the drawing
> board on back ends for a new Wikidata Query Service.  We took four weeks
> (including a planned trip to Berlin) to settle on a backend.  As you can
> see from the spreadsheet
> <https://docs.google.com/a/wikimedia.org/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit#gid=0>
> we've really blown out the number of options.  As you can also see we
> didn't finish filling them all out.  But we've still pretty much settled
> on BlazeGraph <http://www.blazegraph.com/> anyway.  Let me first explain
> what BlazeGraph is and then defend our decision to stop spreadsheet work.
>
> BlazeGraph is a GPLed RDF triple store that natively supports SPARQL
> 1.1, RDFS, some OWL, and some extensions.  Those are all semantic web
> terms and they translate into "it's a graph database with an
> expressive, mostly standardized query language and support for inferring
> stuff as data is added to and removed from the graph".  It also has some
> features that you'd recognize from nice relational databases: join order
> rewriting, smart query planner, hash and nested loop joins,  query
> rewrite rules, group by, order by, and aggregate functions.
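>
> To make that concrete, here's a sketch of the kind of SPARQL 1.1 query
> this buys us.  The predicate names below are placeholders I made up;
> we haven't settled on the actual Wikidata-to-RDF mapping yet:
>
> ```sparql
> # Hypothetical mapping: count cities per country, largest first,
> # exercising GROUP BY, ORDER BY, and an aggregate in one query.
> PREFIX wd: <http://www.wikidata.org/entity/>
> SELECT ?country (COUNT(?city) AS ?cities)
> WHERE {
>   ?city wd:instanceOf wd:City ;
>         wd:country ?country .
> }
> GROUP BY ?country
> ORDER BY DESC(?cities)
> ```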
>
> These are all cool features - really the kind of things that we thought
> we needed, but they come with an "interesting" price.  Semantic Web is a
> very old thing that's had a really odd degree of success.  If you have
> an hour and a half, Jim Hendler can explain
> <https://www.youtube.com/watch?v=oKiXpO2rbJM> it to you.  The upshot is
> that _tons_ of people have _tons_ of opinions.  The W3C standardizes
> RDF, SPARQL, RDFS, OWL, and about a billion other things.  There are
> (mostly non-W3C) standards for talking about people
> <http://xmlns.com/foaf/spec/>, social connections
> <http://rdfs.org/sioc/spec/>, and music
> <http://musicontology.com/specification/>. And they all have rules.  And
> Wikidata doesn't.  Not like these rules.  One thing I've learned from
> this project is that this lack of prescribed rules is one of Wikidata's
> founding principles.  It's worth it to allow openness.  So you _can_ set
> gender to "Bacon" or put geocoordinates on Amber
> <https://www.wikidata.org/wiki/Q1053330>.  Anyway!  I argue that, at
> least for now, we should ignore many of these standards.  We need to
> think of Wikidata Query Service as a tool to answer questions instead of
> as some grand statement about the semantic web.  Mapping existing
> ontologies onto Wikidata is a task for another day.
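>
> For the curious, a vocabulary like FOAF just gives everyone agreed-on
> property names.  A description of a person looks roughly like this
> (the identifiers below are made up for illustration):
>
> ```turtle
> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>
> # <#alice> and <#bob> are placeholder identifiers.
> <#alice> a foaf:Person ;
>     foaf:name "Alice" ;
>     foaf:knows <#bob> .
> ```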
>
> I feel like these semantic web technologies and BlazeGraph in particular
> are good fits for this project mostly because the quality of our "but
> what about X?" questions is very very high.  "How much inference should
> we do instead of query rewriting?" instead of "Can we do inference?  Can
> we do query rewriting?"  And "Which standard vocabularies should we think
> about mapping to Wikidata?"  Holy cow!  In any other system there aren't
> "standard vocabularies" to even talk about mapping, much less a
> mechanism for mapping them.  Much less two!  It's almost an overwhelming
> wealth and as I allude to above it can be easy to bikeshed.
>
> We've been reasonably careful to reach out to people we know are familiar
> with this space.  We're well aware of projects like the Wikidata Toolkit
> and its RDF exports.  We've been using those for testing.  We've talked
> to so many people about so many things.  It's really consumed a lot more
> time than I'd expected and made the search for the next backend very
> long.  But I feel comfortable that we're in a good place.  We don't know
> all the answers but we're sure there _are_ answers.
>
> The BlazeGraph upstream has been super active with us.  They've spent
> hours with us over Hangouts, had me out to their office (a house an hour
> and a half from mine) to talk about data modeling, and spent a ton of time
> commenting on Phabricator tickets.  They've offered to donate a formal
> support agreement as well.  And to get together with us about writing
> any features we might need to add to BlazeGraph.  And they've added me
> as a committer (I told them I had some typos to fix but I have yet to
> actually commit them).  And their code is well documented.
>
> So by now you've realized I'm a fan.  I believe that we should stop on
> the spreadsheet and just start work against BlazeGraph because I think
> we have phenomenal momentum with upstream.  And it's a pretty clear
> winner on the spreadsheet at this point.  But there are two other triple
> stores which we haven't fully filled out that might be viable: OpenLink
> Virtuoso Open Source and Apache Jena.  Virtuoso is open core so I'm
> really loath to go too deep into it at this point.  Their HA features are
> not open source which implies that we'd have trouble with them as an
> upstream.  Apache Jena just isn't known
> <http://www.w3.org/wiki/LargeTripleStores#Jena_TDB_.281.7B.29> to scale
> to data as large as BlazeGraph and Virtuoso.  So I argue that these are
> systems that, in the unlikely event that BlazeGraph goes the way of
> Titan, we should start our third round of investigation against.  As it
> stands now I think we have a winner.
>
> We created a phabricator task <https://phabricator.wikimedia.org/T90101>
> with lots of children to run down our remaining questions.  The biggest
> remaining questions revolve around three areas:
> 1. Operational issues like "how should the cluster be deployed?" "do we
> use HA at all?" "how are rolling restarts done in HA?"
> 2.  How should we represent the data in the database? BlazeGraph (and
> only BlazeGraph) has an extension that we *could* use, called RDR.
> Should we use it?
> 3.  Some folks have identified update rate as a risk.  Not upstream, but
> others familiar with triple stores in general.
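>
> For those who haven't seen RDR, the rough idea is that a whole
> statement can itself be the subject of further statements, which looks
> like a natural fit for Wikidata qualifiers.  This is a sketch only,
> with made-up prefixes and property names, not a settled mapping:
>
> ```turtle
> # "Berlin has population 3,500,000" annotated with "as of 2014".
> # wd:/p: prefixes and property names here are placeholders.
> <<wd:Q64 p:population "3500000">> p:pointInTime "2014" .
> ```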
>
>
> Our plan is to work on #2 over the next few weeks because it really
> informs #1: there are lots of working set size vs. CPU time tradeoffs to
> investigate.  We'll start on #1 shortly as well.  #3 is a potential risk
> area so we'll be sure to investigate it soon.
>
> I admit I'm not super happy to leave the spreadsheet in its current
> unfilled-out state but I'm excited to have something to work
> with and think it's the right thing to do right now.
>
> So thanks for reading all of this.  Please reply with comments.
>
> Thanks again,
>
> Nik
>
>
> _______________________________________________
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>

