Hi Everyone,
I'm at WikiConference NA today, and I was chatting with someone from OCLC (https://en.wikipedia.org/wiki/OCLC), and he mentioned that BlazeGraph can be configured to call out to a full-text search engine. It looks like it only works with Solr out of the box, but the documentation (https://wiki.blazegraph.com/wiki/index.php/ExternalFullTextSearch) mentions that Elasticsearch is a candidate search endpoint.
Obviously it wouldn't be worth doing any real work on investigating this until the BlazeGraph/Amazon situation is clearer, and maybe Stas or others have looked at it in the past and already know why it isn't worth the added complexity, but there are some interesting use cases where combining full text and SPARQL would be useful—for example if you are looking for a person, you know part of their name, and some facts about them. In general, any full-text search with additional structured data constraints.
Anyone already know anything about the capacity of BlazeGraph?
Thanks, —Trey
Trey Jones
Sr. Software Engineer, Search Platform
Wikimedia Foundation
On Thu, Oct 18, 2018 at 4:48 PM Trey Jones tjones@wikimedia.org wrote:
Anyone already know anything about the capacity of BlazeGraph?
It all depends on what you mean by "capacity" and by "blazegraph". If by capacity you mean "do we have enough hardware", the answer is not straightforward.
The cluster serving the public WDQS endpoint (which is probably what "blazegraph" means in this context) has widely varying load patterns, is sometimes overloaded, and is overall difficult to size correctly (especially since we don't have a good definition of what a good SLO would be, see [1]).
The internal WDQS endpoint is in a much better situation, with a more controlled load and a reasonable amount of headroom. I don't have good visibility into the projects that might start using this internal cluster more, so that headroom might be consumed fairly quickly depending on what load we add to the cluster.
Last point: I have no idea what that Blazegraph / Elasticsearch integration looks like, but it sounds like it might be possible to generate arbitrary Elasticsearch queries from SPARQL. If that's the case, we don't want to expose such functionality on the public WDQS endpoint, or at least not with our current production Elasticsearch backend as the target. That being said, it sounds like a very interesting idea!
Have fun!
Guillaume
[1] https://phabricator.wikimedia.org/T199228
_______________________________________________
Discovery mailing list
Discovery@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/discovery
Instead of "the capacity" I meant "this capacity", but should have said "this feature", referring to Elasticsearch integration—though the information on system capacity was still interesting.
On Fri, Oct 19, 2018 at 3:40 PM Trey Jones tjones@wikimedia.org wrote:
Instead of "the capacity" I meant "this capacity", but should have said "this feature", referring to Elasticsearch integration—though the information on system capacity was still interesting.
Isn't that "capability" more than "capacity"? (I'm trying to improve my English here.) Though I knew that it sounded ambiguous!
--
Guillaume Lederrey
Operations Engineer, Search Platform
Wikimedia Foundation
UTC+2 / CEST
On Fri, Oct 19, 2018 at 11:25 AM, Guillaume Lederrey <glederrey@wikimedia.org> wrote:
Isn't that "capability" more than "capacity"? (I'm trying to improve my English here.) Though I knew that it sounded ambiguous!
English isn't content with having too many words; it also has to give many of them too many meanings, especially related shades of meaning that have to be inferred from context and/or by reading the mind of the speaker. So "capacity" can also mean "capability" or "role", and I think I was going for something of a blend of those two, so it was both the perfect word and a poor choice. ;)
Hi,
I remember Stas playing with it a bit, see https://phabricator.wikimedia.org/T141813
Hi!
I'm at WikiConference NA today, and I was chatting with someone from OCLC https://en.wikipedia.org/wiki/OCLC, and he mentioned that BlazeGraph can be configured to call out to a full-text search engine. It looks like it only works with SOLR out of the box, but the documentation https://wiki.blazegraph.com/wiki/index.php/ExternalFullTextSearch mentions that Elasticsearch is a candidate search endpoint.
Technically it is possible, and I looked into it, but given that we have a gateway to the MediaWiki API (which can do essentially the same search), I decided not to pursue this for now. We'd basically have to duplicate the work we've done in MediaWiki to compose proper Elastic queries, parse results, etc., and the best we'd have is the same thing we already have with MediaWiki API search. So I decided not to duplicate efforts for now.
the best we'd have is the same thing we already have with Mediawiki API search.
Ah, so there isn't a way to combine full-text results and SPARQL results? That was the point of my original discussion with the fellow from OCLC, so if that's not possible, then, yeah, there's no point.
On Fri, Oct 19, 2018 at 1:18 PM, Stas Malyshev smalyshev@wikimedia.org wrote:
Technically it is possible, and I looked into it, but given that we have a gateway to the MediaWiki API (which can do essentially the same search), I decided not to pursue this for now. We'd basically have to duplicate the work we've done in MediaWiki to compose proper Elastic queries, parse results, etc., and the best we'd have is the same thing we already have with MediaWiki API search. So I decided not to duplicate efforts for now.
-- Stas Malyshev smalyshev@wikimedia.org
Hi!
On 10/19/18 11:02 AM, Trey Jones wrote:
the best we'd have is the same thing we already have with Mediawiki API search.
Ah, so there isn't a way to combine full-text results and SPARQL results? That was the point of my original discussion with the fellow from OCLC, so if that's not possible, then, yeah, there's no point.
Ah yes, you can combine! Just call the MediaWiki API from inside a SPARQL query and combine it with other clauses: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI
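To make the suggestion above concrete, here is a minimal sketch of the "person whose name I partly know, plus a fact about them" use case from the start of the thread. It follows the EntitySearch pattern described in the MWAPI user manual linked above, and adds one structured constraint (instance of human, wdt:P31 wd:Q5). The helper names (escape_literal, build_person_query, run_query), the User-Agent string, and the choice of constraint are illustrative assumptions, not anything from the WDQS codebase. It also assumes the query runs against the public endpoint, where the wikibase:, mwapi:, bd:, wdt: and wd: prefixes are predefined.

```python
import json
import urllib.parse
import urllib.request

# Public WDQS SPARQL endpoint (assumed target for this sketch).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def escape_literal(text: str) -> str:
    """Minimal escaping for a double-quoted SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_person_query(name_fragment: str, limit: int = 10) -> str:
    """Combine a MediaWiki API entity search with a structured constraint."""
    search = escape_literal(name_fragment)
    return f"""
SELECT ?item ?itemLabel WHERE {{
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:endpoint "www.wikidata.org" ;
                    wikibase:api "EntitySearch" ;
                    mwapi:search "{search}" ;
                    mwapi:language "en" .
    ?item wikibase:apiOutputItem mwapi:item .
  }}
  ?item wdt:P31 wd:Q5 .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()


def run_query(query: str) -> list:
    """Send the query to WDQS and return the JSON result bindings."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    # Hypothetical User-Agent; replace with one identifying your own tool.
    request = urllib.request.Request(
        url, headers={"User-Agent": "mwapi-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    print(build_person_query("Einstein", limit=5))
```

Only the query construction is exercised when the script is run directly; calling run_query(build_person_query("Einstein")) would send it to the live service. Further constraints (occupation, date of birth, and so on) would go next to the wdt:P31 line as ordinary triple patterns.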
On Thu, Oct 25, 2018 at 5:45 AM Kunal Mehta legoktm@member.fsf.org wrote:
Hi Trey,
On 10/18/18 7:47 AM, Trey Jones wrote:
Obviously it wouldn't be worth doing any real work on investigating this until the BlazeGraph/Amazon situation is clearer...
I might have missed something, but what is this situation?
Amazon has acquired Blazegraph. It looks like they don't want to kill it, and the team itself is willing to continue to support Blazegraph. That being said, there has not been much activity in the last 2 years on their GitHub repo [1].
[1] https://github.com/blazegraph/database
-- Legoktm
Kunal, see also https://phabricator.wikimedia.org/T206560
------------------------------------
Erika Bjune
Director of Engineering - Search Platform & Fundraising Tech
Wikimedia Foundation