In the enterprise, most folks use either Java Mission Control or the Java
VisualVM profiler. Looking at sleeping threads is often a good place to start,
and taking a thread snapshot or even a heap dump when things are really
grinding would be useful; you can later share those snapshots and heap dumps
with the community or Java profiling experts to analyze.
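If the GUI tools can't be attached to the production JVM, roughly the same
information can be captured programmatically. A minimal sketch, assuming a
HotSpot JVM on Java 8 or later; the class name and the /tmp output path are
only illustrative:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class JvmSnapshot {
    public static void main(String[] args) throws Exception {
        // Roughly equivalent to a thread dump: name and state of every live thread.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.printf("%s : %s%n", info.getThreadName(), info.getThreadState());
        }

        // Heap dump of live objects, ready to share with profiling experts.
        // The output path is a placeholder.
        HotSpotDiagnosticMXBean diag = ManagementFactory.getPlatformMXBean(
                HotSpotDiagnosticMXBean.class);
        diag.dumpHeap("/tmp/blazegraph-heap.hprof", true);
    }
}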
On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <
glederrey(a)wikimedia.org> wrote:
Hello!
Thanks for the suggestions!
On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadguidry(a)gmail.com>
wrote:
> Is the Write Retention Queue adequate?
> Is the branching factor for the lexicon indices too large, resulting
> in a non-linear slowdown in the write rate over time?
> Did you look into Small Slot Optimization?
> Are the Write Cache Buffers adequate?
> Is there a lot of Heap pressure?
> Does the MemoryManager have the maximum amount of RAM it can handle?
> 4TB?
> Is the RWStore handling the recycling well?
> Is the SAIL Buffer Capacity adequate?
> Are you using exact range counts where you could be using fast
> range counts?
>
>
> Start on the hardware side first, however.
> Is the disk activity for writes really low... and CPU very high?
> You have identified a bottleneck in that case; discover WHY that would be
> the case by looking into any of the above.
>
Those sound like good questions, but they're outside my area of expertise. I've
created https://phabricator.wikimedia.org/T238362 to track it, and
I'll see if someone can have a look. I know that we did multiple passes at
tuning Blazegraph properties, with limited success so far.
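For context, the knobs involved look roughly like the fragment below. This is
only an illustrative RWStore.properties sketch with placeholder values, not our
actual configuration; the exact property names and defaults should be checked
against the Blazegraph documentation linked below.

# Illustrative only -- placeholder values, not production settings.
# Store mode and SAIL write buffer (statements buffered per incremental write).
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW
com.bigdata.rdf.sail.bufferCapacity=100000
# B+Tree write retention queue and branching factor, the knobs behind the
# write retention queue and lexicon index questions above.
com.bigdata.btree.writeRetentionQueue.capacity=4000
com.bigdata.btree.BTree.branchingFactor=128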
> and 100+ other things that should be looked at, all of which affect WRITE
> performance during UPDATES.
>
>
> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>
> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>
> I would also suggest you start monitoring some of the internals of
> Blazegraph (Java) while in production with tools such as XRebel or
> AppDynamics.
>
Both XRebel and AppDynamics are proprietary, so there is no way we'll
deploy them in our environment. We are tracking a few JMX-based metrics,
but so far, we don't really know what to look for.
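For what it's worth, the heap pressure question above mostly comes down to
watching heap usage and cumulative GC time. A minimal sketch of reading those
over JMX, assuming remote JMX is enabled on the Blazegraph JVM; the host and
port are placeholders:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapPressureProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; remote JMX must be enabled on the target JVM.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://wdqs-host.example:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();

            // Heap usage: used/max staying high after collections indicates heap pressure.
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            MemoryUsage heap = memory.getHeapMemoryUsage();
            System.out.printf("heap used/max: %d / %d%n", heap.getUsed(), heap.getMax());

            // Cumulative GC counts and times: a steep slope means the JVM is busy collecting.
            for (GarbageCollectorMXBean gc : ManagementFactory.getPlatformMXBeans(
                    conn, GarbageCollectorMXBean.class)) {
                System.out.printf("%s: %d collections, %d ms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }
}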
Thanks!
Guillaume
> Thad
>
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
> glederrey(a)wikimedia.org> wrote:
>
>> Thanks for the feedback!
>>
>> On Thu, Nov 14, 2019 at 11:11 AM <fn(a)imm.dtu.dk> wrote:
>>
>>>
>>> Besides waiting for the new updater, it may be useful to tell us
>>> what we as users can do too. It is unclear to me what the problem is. For
>>> instance, at one point I was worried that the many parallel requests to
>>> the SPARQL endpoint that we make in Scholia are a problem. As far as I
>>> understand, it is not a problem at all. Another issue could be the way
>>> that we use Magnus Manske's Quickstatements and approve bots for
>>> high-frequency editing. Perhaps a better overview of and constraints on
>>> large-scale editing could be discussed?
>>>
>>
>> To be (again) completely honest, we don't entirely understand the
>> issue either. There are clearly multiple related issues. In high level
>> terms, we have at least:
>>
>> * Some part of the update process on Blazegraph is CPU-bound and
>> single-threaded. Even with low query load, if we have a high edit rate,
>> Blazegraph can't keep up and saturates a single CPU (with plenty of
>> available resources on other CPUs). This is a hard issue to fix, requiring
>> either splitting the processing over multiple CPUs or sharding the data over
>> multiple servers, neither of which Blazegraph supports (at least not in our
>> current configuration).
>> * There is a race for resources between edits and queries: a high
>> query load will impact the update rate. This could to some extent be
>> mitigated by reducing the query load: if no one is using the service, it
>> works great! Obviously that's not much of a solution.
>>
>> What you can do (short term):
>>
>> * Keep bots usage well behaved (don't do parallel queries, provide a
>> meaningful user agent, smooth the load over time if possible, ...). As far
>> as I can see, most usage is already well behaved.
>> * Optimize your queries: better queries will use fewer resources,
>> which should help. Time to completion is a good approximation of the
>> resources used. I don't really have any more specific advice; SPARQL is not
>> my area of expertise (see the rough sketch below).
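>>
>> As a rough illustration of "well behaved" (a sketch assuming Java 11+; the
>> user agent string, query, and one-second pause are placeholders): identify
>> yourself, and space requests out rather than firing them in parallel.
>>
>> import java.net.URI;
>> import java.net.URLEncoder;
>> import java.net.http.HttpClient;
>> import java.net.http.HttpRequest;
>> import java.net.http.HttpResponse;
>> import java.nio.charset.StandardCharsets;
>>
>> public class PoliteWdqsClient {
>>     public static void main(String[] args) throws Exception {
>>         HttpClient client = HttpClient.newHttpClient();
>>         String query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10";
>>         URI uri = URI.create("https://query.wikidata.org/sparql?query="
>>                 + URLEncoder.encode(query, StandardCharsets.UTF_8));
>>         HttpRequest request = HttpRequest.newBuilder(uri)
>>                 // A meaningful user agent: tool name, version, contact.
>>                 .header("User-Agent", "ExampleTool/0.1 (someone@example.org)")
>>                 .header("Accept", "application/sparql-results+json")
>>                 .GET()
>>                 .build();
>>         // One request at a time, with a pause between requests.
>>         HttpResponse<String> response =
>>                 client.send(request, HttpResponse.BodyHandlers.ofString());
>>         System.out.println(response.body());
>>         Thread.sleep(1000);
>>     }
>> }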
>>
>> What you can do (longer term):
>>
>> * Help us think out of the box. Can we identify higher level use
>> cases? Could we implement some of our workflows on a higher level API than
>> SPARQL, which might allow for more internal optimizations?
>> * Help us better understand the constraints. Document use cases on
>> [1].
>>
>> Sadly, we don't have the bandwidth right now to engage meaningfully
>> in this conversation. Feel free to send thoughts already, but don't expect
>> any timely response.
>>
>>> Yet another thought is the large discrepancy between the Virginia and
>>> Texas data centers, as I could see on Grafana [1]. As far as I understand,
>>> the hardware (and software) are the same. So why is there this large
>>> difference? Rather than editing or BlazeGraph, could the issue be some
>>> form of network issue?
>>>
>>
>> As pointed out by Lucas, this is expected. Due to how our GeoDNS
>> works, we see more traffic on eqiad than on codfw.
>>
>> Thanks for the help!
>>
>> Guillaume
>>
>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>
>>
>>
>>>
>>>
>>> [1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&…
>>>
>>> /Finn
>>>
>>>
>>>
>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>> > Hello all!
>>> >
>>> > As you've probably noticed, the update lag on the public WDQS endpoint
>>> > [1] is not doing well [2], with lag climbing to > 12h for some servers.
>>> > We are tracking this on phabricator [3]; subscribe to that task if you
>>> > want to stay informed.
>>> >
>>> > To be perfectly honest, we don't have a good short-term solution. The
>>> > graph database that we are using at the moment (Blazegraph [4]) does not
>>> > easily support sharding, so even throwing hardware at the problem isn't
>>> > really an option.
>>> >
>>> > We are working on a few medium-term improvements:
>>> >
>>> > * A dedicated updater service in Blazegraph, which should help increase
>>> > the update throughput [5]. Fingers crossed, this should be ready for
>>> > initial deployment and testing by next week (no promises, we're doing
>>> > the best we can).
>>> > * Some improvement in the parallelism of the updater [6]. This has just
>>> > been identified. While it will probably also provide some improvement in
>>> > throughput, we haven't actually started working on that and we don't
>>> > have any numbers at this point.
>>> >
>>> > Longer term:
>>> >
>>> > We are hiring a new team member to work on WDQS. It will take some time
>>> > to get this person up to speed, but we should have more capacity to
>>> > address the deeper issues of WDQS by January.
>>> >
>>> > The 2 main points we want to address are:
>>> >
>>> > * Finding a triple store that scales better than our current solution.
>>> > * Better understanding what the use cases on WDQS are and seeing if we
>>> > can provide a technical solution that is better suited. Our intuition is
>>> > that some of the use cases that require synchronous (or quasi-synchronous)
>>> > updates would be better implemented outside of a triple store. Honestly,
>>> > we have no idea yet if this makes sense and what those alternate
>>> > solutions might be.
>>> >
>>> > Thanks a lot for your patience during this tough time!
>>> >
>>> > Guillaume
>>> >
>>> >
>>> > [1] https://query.wikidata.org/
>>> > [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&am…
>>> > [3] https://phabricator.wikimedia.org/T238229
>>> > [4] https://blazegraph.com/
>>> > [5] https://phabricator.wikimedia.org/T212826
>>> > [6] https://phabricator.wikimedia.org/T238045
>>> >
>>> > --
>>> > Guillaume Lederrey
>>> > Engineering Manager, Search Platform
>>> > Wikimedia Foundation
>>> > UTC+1 / CET
>>> >
>>>
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET