On Thu, Nov 14, 2019 at 2:39 PM Thad Guidry <thadguidry(a)gmail.com> wrote:
In the enterprise, most folks use either Java Mission Control or the Java
VisualVM profiler. Looking at sleeping threads is often a good place to
start, and taking a thread snapshot or even a heap dump when things are
really grinding would be useful; you can later share those snapshots/heap
dumps with the community or with Java profiling experts to analyze.
https://visualvm.github.io/index.html
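
If you'd rather script it than click through VisualVM, a minimal sketch
using only the standard JDK management beans could dump thread states and a
heap dump like the following. This is just an illustration: the class name
and output file are examples, and it has to run inside the JVM you care
about (or be adapted to attach over JMX).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import com.sun.management.HotSpotDiagnosticMXBean;

public class QuickDump {
    public static void main(String[] args) throws Exception {
        // List every thread and its state; lots of WAITING/TIMED_WAITING
        // threads are the "sleeping" ones worth looking at first.
        for (ThreadInfo ti : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(false, false)) {
            System.out.println(ti.getThreadName() + " -> " + ti.getThreadState());
        }
        // Write a heap dump of live objects that can be shared and opened
        // later in VisualVM or Java Mission Control.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap("blazegraph-heap.hprof", true);
    }
}
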
Thad
https://www.linkedin.com/in/thadguidry/
On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <
glederrey(a)wikimedia.org> wrote:
> Hello!
>
> Thanks for the suggestions!
>
> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadguidry(a)gmail.com>
> wrote:
>
>> Is the Write Retention Queue adequate?
>> Is the branching factor for the lexicon indices too large, resulting
>> in a non-linear slowdown in the write rate over time?
>> Did you look into Small Slot Optimization?
>> Are the Write Cache Buffers adequate?
>> Is there a lot of Heap pressure?
>> Does the MemoryManager have the maximum amount of RAM it can handle?
>> 4TB?
>> Is the RWStore handling the recycling well?
>> Is the SAIL Buffer Capacity adequate?
>> Are you not using exact range counts where you could be using fast
>> range counts?
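>>
>> Several of those knobs end up in the journal properties. A rough sketch of
>> how they might be set when opening the journal follows; the values are
>> purely illustrative and the property names should be double-checked
>> against the Blazegraph wiki before use.
>>
>> import java.util.Properties;
>> import com.bigdata.journal.Journal;
>>
>> public class TuningSketch {
>>     public static void main(String[] args) {
>>         Properties props = new Properties();
>>         // A larger write retention queue trades heap for less index IO
>>         // (illustrative value).
>>         props.setProperty("com.bigdata.btree.writeRetentionQueue.capacity", "8000");
>>         // Default B+Tree branching factor; too large a value can hurt
>>         // write rates over time (illustrative value).
>>         props.setProperty("com.bigdata.btree.BTree.branchingFactor", "128");
>>         // How many statements the SAIL buffers before an incremental write
>>         // (illustrative value).
>>         props.setProperty("com.bigdata.rdf.sail.bufferCapacity", "100000");
>>         // Location of the journal file (example path).
>>         props.setProperty("com.bigdata.journal.AbstractJournal.file",
>>                 "/srv/blazegraph/wikidata.jnl");
>>         Journal journal = new Journal(props);
>>         journal.close();
>>     }
>> }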
>>
>>
>> Start on the hardware side first, however.
>> Is the disk activity for writes really low while the CPU is very high?
>> In that case you have identified a bottleneck; discover WHY that is by
>> looking into any of the above.
>>
>
> Those sound like good questions, but they are outside of my area of
> expertise. I've created https://phabricator.wikimedia.org/T238362 to track
> it, and I'll see if someone can have a look. I know that we did multiple
> passes at tuning Blazegraph properties, with limited success so far.
>
>
>> ...and 100+ other things that should be looked at, all of which affect
>> WRITE performance during UPDATES.
>>
>>
>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>>
>> I would also suggest you start monitoring some of the internals of
>> Blazegraph (Java) while in production with tools such as XRebel or
>> AppDynamics.
>>
>
> Both XRebel and AppDynamics are proprietary, so there is no way that we'll
> deploy them in our environment. We are tracking a few JMX-based metrics,
> but so far, we don't really know what to look for.
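>
> For illustration, the kind of thing that is cheap to poll without any
> proprietary tooling is the stock platform MXBeans; heap usage and GC time
> are usually the first signs of heap pressure. A minimal sketch (nothing
> Blazegraph-specific, class name is just an example):
>
> import java.lang.management.GarbageCollectorMXBean;
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryMXBean;
>
> public class JmxPeek {
>     public static void main(String[] args) {
>         // Current heap usage; a heap that stays near its max suggests
>         // heap pressure.
>         MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
>         System.out.println("heap used: " + mem.getHeapMemoryUsage().getUsed()
>                 + " / max: " + mem.getHeapMemoryUsage().getMax());
>         // Cumulative GC counts and time; rapidly growing GC time also
>         // points at heap pressure.
>         for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
>             System.out.println(gc.getName() + ": " + gc.getCollectionCount()
>                     + " collections, " + gc.getCollectionTime() + " ms");
>         }
>     }
> }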
>
> Thanks!
>
> Guillaume
>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>>
>>
>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
>> glederrey(a)wikimedia.org> wrote:
>>
>>> Thanks for the feedback!
>>>
>>> On Thu, Nov 14, 2019 at 11:11 AM <fn(a)imm.dtu.dk> wrote:
>>>
>>>>
>>>> Besides waiting for the new updater, it may be useful to tell us what
>>>> we as users can do too. It is unclear to me what the problem is. For
>>>> instance, at one point I was worried that the many parallel requests to
>>>> the SPARQL endpoint that we make in Scholia are a problem. As far as I
>>>> understand, they are not a problem at all. Another issue could be the
>>>> way that we use Magnus Manske's Quickstatements and approve bots for
>>>> high-frequency editing. Perhaps a better overview of and constraints on
>>>> large-scale editing could be discussed?
>>>>
>>>
>>> To be (again) completely honest, we don't entirely understand the
>>> issue either. There are clearly multiple related issues. In high-level
>>> terms, we have at least:
>>>
>>> * Some part of the update process on Blazegraph is CPU-bound and
>>> single-threaded. Even with low query load, if we have a high edit rate,
>>> Blazegraph can't keep up and saturates a single CPU (with plenty of
>>> available resources on other CPUs). This is a hard issue to fix, requiring
>>> either splitting the processing over multiple CPUs or sharding the data
>>> over multiple servers, neither of which Blazegraph supports (at least not
>>> in our current configuration).
>>> * There is a race for resources between edits and queries: a high
>>> query load will impact the update rate. This could to some extent be
>>> mitigated by reducing the query load: if no one is using the service, it
>>> works great! Obviously that's not much of a solution.
>>>
>>> What you can do (short term):
>>>
>>> * Keep bot usage well behaved (don't run parallel queries, provide a
>>> meaningful user agent, smooth the load over time if possible, ...); see
>>> the sketch after this list. As far as I can see, most usage is already
>>> well behaved.
>>> * Optimize your queries: better queries will use fewer resources,
>>> which should help. Time to completion is a good approximation of the
>>> resources used. I don't really have any more specific advice; SPARQL is
>>> not my area of expertise.
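>>>
>>> As a sketch of what "well behaved" can look like in practice (the user
>>> agent string and the query are only examples; the endpoint is the public
>>> one at query.wikidata.org):
>>>
>>> import java.net.URI;
>>> import java.net.URLEncoder;
>>> import java.net.http.HttpClient;
>>> import java.net.http.HttpRequest;
>>> import java.net.http.HttpResponse;
>>> import java.nio.charset.StandardCharsets;
>>>
>>> public class PoliteClient {
>>>     public static void main(String[] args) throws Exception {
>>>         HttpClient client = HttpClient.newHttpClient();
>>>         // WDQS predefines the wd:/wdt: prefixes, so no PREFIX lines needed.
>>>         String query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10";
>>>         HttpRequest request = HttpRequest.newBuilder()
>>>                 .uri(URI.create("https://query.wikidata.org/sparql?query="
>>>                         + URLEncoder.encode(query, StandardCharsets.UTF_8)))
>>>                 // A meaningful user agent: tool name, version, contact.
>>>                 .header("User-Agent", "ExampleTool/0.1 (contact@example.org)")
>>>                 .header("Accept", "application/sparql-results+json")
>>>                 .build();
>>>         // One request at a time, with a pause between requests to smooth
>>>         // the load.
>>>         HttpResponse<String> response =
>>>                 client.send(request, HttpResponse.BodyHandlers.ofString());
>>>         System.out.println(response.statusCode());
>>>         Thread.sleep(1000);
>>>     }
>>> }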
>>>
>>> What you can do (longer term):
>>>
>>> * Help us think outside the box. Can we identify higher-level use
>>> cases? Could we implement some of our workflows on a higher-level API than
>>> SPARQL, which might allow for more internal optimizations?
>>> * Help us better understand the constraints. Document use cases on
>>> [1].
>>>
>>> Sadly, we don't have the bandwidth right now to engage meaningfully
>>> in this conversation. Feel free to send thoughts already, but don't
>>> expect any timely response.
>>>
>>>> Yet another thought is the large discrepancy between the Virginia and
>>>> Texas data centers that I could see on Grafana [1]. As far as I
>>>> understand, the hardware (and software) are the same, so why is there
>>>> this large difference? Rather than editing or Blazegraph, could the
>>>> problem be some form of network issue?
>>>>
>>>
>>> As pointed out by Lucas, this is expected. Due to how our GeoDNS
>>> works, we see more traffic on eqiad than on codfw.
>>>
>>> Thanks for the help!
>>>
>>> Guillaume
>>>
>>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>>
>>>
>>>
>>>>
>>>>
>>>> [1]
>>>> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&…
>>>>
>>>> /Finn
>>>>
>>>>
>>>>
>>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>>> > Hello all!
>>>> >
>>>> > As you've probably noticed, the update lag on the public WDQS endpoint
>>>> > [1] is not doing well [2], with lag climbing to > 12h for some servers.
>>>> > We are tracking this on phabricator [3]; subscribe to that task if you
>>>> > want to stay informed.
>>>> >
>>>> > To be perfectly honest, we don't have a good short term solution. The
>>>> > graph database that we are using at the moment (Blazegraph [4]) does
>>>> > not easily support sharding, so even throwing hardware at the problem
>>>> > isn't really an option.
>>>> >
>>>> > We are working on a few medium term improvements:
>>>> >
>>>> > * A dedicated updater service in Blazegraph, which should help
>>>> > increase the update throughput [5]. Fingers crossed, this should be
>>>> > ready for initial deployment and testing by next week (no promises,
>>>> > we're doing the best we can).
>>>> > * Some improvement in the parallelism of the updater [6]. This has
>>>> > just been identified. While it will probably also provide some
>>>> > improvement in throughput, we haven't actually started working on it
>>>> > and we don't have any numbers at this point.
>>>> >
>>>> > Longer term:
>>>> >
>>>> > We are hiring a new team member to work on WDQS. It will take some
>>>> > time to get this person up to speed, but we should have more capacity
>>>> > to address the deeper issues of WDQS by January.
>>>> >
>>>> > The 2 main points we want to address are:
>>>> >
>>>> > * Finding a triple store that scales better than our current solution.
>>>> > * Better understanding the use cases on WDQS and seeing if we can
>>>> > provide a technical solution that is better suited. Our intuition is
>>>> > that some of the use cases that require synchronous (or quasi
>>>> > synchronous) updates would be better implemented outside of a triple
>>>> > store. Honestly, we have no idea yet if this makes sense or what those
>>>> > alternate solutions might be.
>>>> >
>>>> > Thanks a lot for your patience during this tough time!
>>>> >
>>>> > Guillaume
>>>> >
>>>> >
>>>> > [1] https://query.wikidata.org/
>>>> > [2]
>>>> > https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&am…
>>>> > [3] https://phabricator.wikimedia.org/T238229
>>>> > [4] https://blazegraph.com/
>>>> > [5] https://phabricator.wikimedia.org/T212826
>>>> > [6] https://phabricator.wikimedia.org/T238045
>>>> >
>>>> > --
>>>> > Guillaume Lederrey
>>>> > Engineering Manager, Search Platform
>>>> > Wikimedia Foundation
>>>> > UTC+1 / CET
>>>> >
>>>>
>>>
>>>
>>> --
>>> Guillaume Lederrey
>>> Engineering Manager, Search Platform
>>> Wikimedia Foundation
>>> UTC+1 / CET
>
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata