On 06-08-2016 17:56, Stas Malyshev wrote:
> Hi!
>> On a side note, the results we presented for BlazeGraph could improve
>> dramatically if one could isolate queries that timed out. Once one query
>> in a sequence timed out (we used server-side timeouts), we observed that
>> a run of queries would then time out, possibly a locking problem or
> Could you please give a bit more detail about this failure scenario? Is
> it that several queries are run in parallel and one query, timing out,
> hurts the performance of others? Does it happen even after the long query
> times out? Or was it a sequential run, where after one query timed out,
> the next query had worse performance than the same query when not
> preceded by the timing-out query, i.e. the timeout had a persistent
> effect beyond its initial run?
The latter was the case, yes. We ran the queries in a given batch
sequentially (waiting for one to finish before the next was run) and
when one timed out, the next would almost surely time out and the engine
would not recover.
We tried a few things on this, like waiting an extra 60 seconds before
running the next query, and also changing memory configurations to avoid
GC issues. I believe Daniel was also in contact with the devs.
Ultimately we figured we probably couldn't resolve the issue without
touching the source code, which would obviously not be fair.
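For concreteness, the sequential protocol could be sketched roughly as follows (the names and the `run_query` callable are hypothetical illustrations, not our actual scripts):

```python
import time

def run_batch(queries, run_query, timeout=60.0, cooldown=0.0):
    """Run queries strictly one after another, recording wall-clock times.

    `run_query` is a hypothetical callable that submits one query to the
    engine and raises TimeoutError when the server-side timeout fires.
    `cooldown` models the extra wait between queries that we also tried.
    """
    results = []
    for q in queries:
        start = time.monotonic()
        try:
            run_query(q)
            status = "ok"
        except TimeoutError:
            status = "timeout"
        elapsed = time.monotonic() - start
        results.append((q, status, min(elapsed, timeout)))
        if cooldown:
            time.sleep(cooldown)
    return results
```

The point is that each query only starts once the previous one has finished or timed out, so any slowdown in a later query cannot be blamed on concurrency.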
> BTW, what was the timeout setting in your experiments? I see in the
> article that it says "timeouts are counted as 60 seconds" - does it mean
> that Blazegraph had its internal timeout set to 60 seconds, or that
> the setting was different, but when processing results, the actual run
> time was replaced by 60 seconds?
Yup, the settings are here:
http://users.dcc.uchile.cl/~dhernand/wquery/#configure-blazegraph
My understanding is that with those settings, we set an internal timeout
on BlazeGraph of 60 seconds.
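So when aggregating results, a query that hit the timeout is simply counted as having taken 60 seconds. As a trivial sketch of that accounting (a hypothetical helper, not our actual analysis code):

```python
def mean_runtime(runtimes, timed_out, cap=60.0):
    """Mean runtime over a batch, counting each timed-out query as
    `cap` seconds (and capping any straggler that slightly overshoots)."""
    capped = [cap if hit else min(t, cap)
              for t, hit in zip(runtimes, timed_out)]
    return sum(capped) / len(capped)

# e.g. 10 s, one timeout, and 20 s average to (10 + 60 + 20) / 3 = 30 s
```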
> Also, did you use analytic mode for the queries?
> https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_E…
> https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery
> This is the mode that is turned on automatically for the Wikidata Query
> Service, and it uses AFAIK different memory management, which may
> influence how the cases you had problems with are handled.
This I am not aware of. I would have to ask Daniel to be sure (I know he
spent quite a lot of time playing around with different settings in the
case of BlazeGraph).
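From a quick read of those wiki pages, analytic mode can apparently also be requested per query via an `analytic=true` request parameter, in which case enabling it in a harness would be a one-line change. A sketch, where the helper and the endpoint URL are hypothetical:

```python
from urllib.parse import urlencode

def build_sparql_url(endpoint, query, analytic=False):
    """Build a GET URL for a SPARQL query against a Blazegraph-style
    endpoint; `analytic=true` requests analytic mode per my reading of
    the Blazegraph wiki (the endpoint shown below is hypothetical)."""
    params = {"query": query}
    if analytic:
        params["analytic"] = "true"
    return endpoint + "?" + urlencode(params)

url = build_sparql_url("http://localhost:9999/blazegraph/sparql",
                       "SELECT * WHERE { ?s ?p ?o } LIMIT 10",
                       analytic=True)
```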
> I would appreciate as much detail as you could give on this, as this may
> also be useful for current query engine work. Also, if you're interested
> in the work done on WDQS, our experiences, and the reasons for certain
> decisions and setups, I'd be glad to answer any questions.
I guess to start with you should have a look at the documentation here:
http://users.dcc.uchile.cl/~dhernand/wquery/
If there's some details missing from that, or if you have any further
questions, I can put you in contact with Daniel who did all the scripts,
ran the experiments, was in discussion with the devs, etc. in the
context of BlazeGraph. (I don't think he's on this list.)
I could also ask him to try to create a minimal-ish test case that
reproduces the problem.
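I'd imagine a minimal repro would look something like this: time a cheap probe query in isolation, then again right after a poison query has timed out, and compare the two. A sketch, where `run_query` is again a hypothetical callable:

```python
import time

def probe_after_timeout(run_query, poison, probe, pause=0.0):
    """Time `probe` on its own, then again right after `poison` hits the
    server-side timeout, to check whether the timeout has a persistent
    effect. Returns (baseline_seconds, after_timeout_seconds)."""
    def timed(q):
        start = time.monotonic()
        try:
            run_query(q)
        except TimeoutError:
            pass
        return time.monotonic() - start

    baseline = timed(probe)   # probe against a fresh engine
    timed(poison)             # drive the engine into a timeout
    if pause:
        time.sleep(pause)     # e.g. the extra 60 s wait we tried
    return baseline, timed(probe)
```

If the second number is consistently much larger than the first, that would demonstrate the persistent effect we observed.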
>> resource leak. Also Daniel mentioned that from discussion with the devs,
>> they claim that the current implementation works best on SSD hard
>> drives; our experiments were on a standard SATA drive.
> Yes, we run it on SSDs. Judging from our tests on test servers running
> on virtualized SATA machines, the difference is indeed dramatic (orders
> of magnitude and more for some queries). Then again, this is highly
> unscientific anecdotal evidence; we didn't make anything resembling
> formal benchmarks, since the test hardware is clearly inferior to the
> production one and is meant to be so. But the point is that an SSD is
> likely a must for Blazegraph to work well on this data set. It might
> also improve results for other engines, so I'm not sure how it
> influences the comparison between the engines.
Yes, I think this was the message we got from the mailing lists when we
were trying to troubleshoot these issues: it would be better to use an
SSD. But we did not have one, and of course we didn't want to tailor our
hardware to suit one particular engine.
Unfortunately I think all such empirical experiments are in some sense
anecdotal; even ours. We cannot deduce, for example, what would happen,
relatively speaking, on a machine with an SSD, or more cores, or with
multiple instances. But still, one can learn a lot from good anecdotes.
Cheers,
Aidan