Is the Write Retention Queue adequate?
Is the branching factor for the lexicon indices too large, resulting in a
non-linear slowdown in the write rate over time?
Did you look into Small Slot Optimization?
Are the Write Cache Buffers adequate?
Is there a lot of Heap pressure?
Does the MemoryManager have the maximum amount of RAM it can handle? 4TB?
Is the RWStore handling the recycling well?
Is the SAIL Buffer Capacity adequate?
Are you using exact range counts where you could be using fast range
counts instead?
However, start at the hardware side first.
Is the disk activity for writes really low while the CPU is very high? In
that case you have identified a bottleneck; figure out WHY that is by
looking into any of the above...
and 100+ other things that should be looked at, all of which affect WRITE
performance during UPDATES. A hedged sketch of where several of the knobs
above are configured follows below.
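For what it's worth, several of the knobs above are exposed through the
journal properties used when opening the Blazegraph journal. The sketch
below is only illustrative: the property names are the ones found in
common RWStore.properties templates and the values are examples rather
than recommendations, so check both against the Blazegraph version you
actually run.

import java.util.Properties;

public class JournalTuningSketch {
    public static Properties tuningHints() {
        Properties p = new Properties();
        // Backing journal file and store mode (path is a placeholder).
        p.setProperty("com.bigdata.journal.AbstractJournal.file", "wikidata.jnl");
        p.setProperty("com.bigdata.journal.AbstractJournal.bufferMode", "DiskRW");
        // Write retention queue: how many dirty B+Tree nodes are retained in
        // memory before incremental eviction (bigger = fewer evictions, more heap).
        p.setProperty("com.bigdata.btree.writeRetentionQueue.capacity", "4000");
        // Branching factor of the B+Tree indices (lexicon and statement indices).
        p.setProperty("com.bigdata.btree.BTree.branchingFactor", "128");
        // SAIL buffer capacity: statements buffered before an incremental flush.
        p.setProperty("com.bigdata.rdf.sail.bufferCapacity", "100000");
        // RWStore small slot optimization (property name and value as seen in
        // some templates; check that your version supports it).
        p.setProperty("com.bigdata.rwstore.RWStore.smallSlotType", "1024");
        return p;
    }
}

The same properties normally live in a plain RWStore.properties file;
building them in code here is just a compact way to show the names side
by side.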
I would also suggest you start monitoring some of Blazegraph's internals
(it is a Java application) while in production, with tools such as XRebel
or AppDynamics.
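And if a commercial profiler is not an option, the standard JMX beans
already expose the heap pressure and GC activity mentioned above. A
minimal, generic sketch (not specific to how WDQS is instrumented):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapPressureProbe {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // Current heap usage (getMax() may be -1 if the limit is undefined).
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            // Cumulative GC time across all collectors; a fast-growing number
            // under a heavy update load is a sign of heap pressure.
            long gcTimeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcTimeMs += gc.getCollectionTime();
            }
            System.out.printf("heap used %d / %d MB, cumulative GC time %d ms%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20, gcTimeMs);
            Thread.sleep(10_000); // sample every 10 seconds
        }
    }
}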
Thad
On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <glederrey(a)wikimedia.org>
wrote:
Thanks for the feedback!
On Thu, Nov 14, 2019 at 11:11 AM <fn(a)imm.dtu.dk> wrote:
Besides waiting for the new updater, it may be useful to tell us what we
as users can do too. It is unclear to me what the problem is. For
instance, at one point I was worried that the many parallel requests to
the SPARQL endpoint that we make in Scholia is a problem. As far as I
understand it is not a problem at all. Another issue could be the way
that we use Magnus Manske's Quickstatements and approve bots for high
frequency editing. Perhaps a better overview and constraints on
large-scale editing could be discussed?
To be (again) completely honest, we don't entirely understand the issue
either. There are clearly multiple related issues. In high level terms, we
have at least:
* Some part of the update process on Blazegraph is CPU bound and single
threaded. Even with low query load, if we have a high edit rate, Blazegraph
can't keep up, and saturates a single CPU (with plenty of available
resources on other CPUs). This is a hard issue to fix, requiring either
splitting the processing over multiple CPUs or sharding the data over
multiple servers, neither of which Blazegraph supports (at least not in
our current configuration). A sketch of how to confirm this kind of
single-threaded bottleneck follows after this list.
* There is a race for resources between edits and queries: a high query
load will impact the update rate. This could to some extent be mitigated by
reducing the query load: if no one is using the service, it works great!
Obviously that's not much of a solution.
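As a side note on the first point above, one way to confirm a
single-threaded, CPU-bound stage is to rank threads by consumed CPU time,
either externally with "top -H -p <pid>" plus "jstack <pid>", or from
inside the JVM. A rough sketch of the in-JVM variant (purely illustrative,
not part of our current tooling):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreadProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (long id : threads.getAllThreadIds()) {
            long cpuNanos = threads.getThreadCpuTime(id); // -1 if unsupported or thread gone
            ThreadInfo info = threads.getThreadInfo(id);   // null if the thread has exited
            if (info != null && cpuNanos > 0) {
                // If one thread dominates while others are idle, the stage it
                // runs is the single-threaded bottleneck.
                System.out.printf("%-50s %8d ms CPU%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}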
What you can do (short term):
* Keep bot usage well behaved (don't run parallel queries, provide a
meaningful user agent, smooth the load over time if possible, ...). As far
as I can see, most usage is already well behaved; a minimal example of a
well-behaved request follows after this list.
* Optimize your queries: better queries will use fewer resources, which
should help. Time to completion is a good approximation of the resources
used. I don't really have any more specific advice; SPARQL is not my area
of expertise.
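To make the first point above concrete, here is a minimal sketch of what a
well-behaved request can look like: a descriptive user agent, one query at
a time, and a pause between requests. The user agent string and the query
are placeholders, not an official requirement:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class PoliteWdqsClient {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String runQuery(String sparql) throws Exception {
        String url = "https://query.wikidata.org/sparql?query="
                + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                // Descriptive User-Agent with contact info (placeholder value).
                .header("User-Agent", "ExampleScholiaBot/0.1 (mailto:someone@example.org)")
                .header("Accept", "application/sparql-results+json")
                .GET()
                .build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String[] queries = { "SELECT * WHERE { ?s ?p ?o } LIMIT 1" }; // placeholder query
        for (String q : queries) {
            System.out.println(runQuery(q)); // one query at a time, not in parallel
            Thread.sleep(1_000);             // smooth the load over time
        }
    }
}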
What you can do (longer term):
* Help us think out of the box. Can we identify higher level use cases?
Could we implement some of our workflows on a higher level API than SPARQL,
which might allow for more internal optimizations?
* Help us better understand the constraints. Document use cases on [1].
Sadly, we don't have the bandwidth right now to engage meaningfully in
this conversation. Feel free to send thoughts already, but don't expect any
timely response.
Yet another thought is the large discrepancy between the Virginia and
Texas data centers, as I could see on Grafana [1]. As far as I understand,
the hardware (and software) are the same. So why is there this large
difference? Rather than editing or BlazeGraph, could the issue be some
form of network issue?
As pointed out by Lucas, this is expected. Due to how our GeoDNS works, we
see more traffic on eqiad than on codfw.
Thanks for the help!
Guillaume
[1]
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
[1]
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&…
/Finn
On 14/11/2019 10:50, Guillaume Lederrey wrote:
Hello all!
As you've probably noticed, the update lag on the public WDQS endpoint
[1] is not doing well [2], with lag climbing to > 12h for some servers.
We are tracking this on phabricator [3], subscribe to that task if you
want to stay informed.
To be perfectly honest, we don't have a good short term solution. The
graph database that we are using at the moment (Blazegraph [4]) does not
easily support sharding, so even throwing hardware at the problem isn't
really an option.
We are working on a few medium term improvements:
* A dedicated updater service in Blazegraph, which should help increase
the update throughput [5]. Fingers crossed, this should be ready for
initial deployment and testing by next week (no promise, we're doing the
best we can).
* Some improvement in the parallelism of the updater [6]. This has just
been identified. While it will probably also provide some improvement in
throughput, we haven't actually started working on that and we don't have
any numbers at this point.
Longer term:
We are hiring a new team member to work on WDQS. It will take some time
to get this person up to speed, but we should have more capacity to
address the deeper issues of WDQS by January.
The 2 main points we want to address are:
* Finding a triple store that scales better than our current solution.
* Better understand the use cases on WDQS and see if we can provide a
technical solution that is better suited. Our intuition is that some of
the use cases that require synchronous (or quasi-synchronous) updates
would be better implemented outside of a triple
store. Honestly, we have no idea yet if this makes sense and what those
alternate solutions might be.
Thanks a lot for your patience during this tough time!
Guillaume
[1]
https://query.wikidata.org/
[2]
https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&am…
[3]
https://phabricator.wikimedia.org/T238229
[4]
https://blazegraph.com/
[5]
https://phabricator.wikimedia.org/T212826
[6]
https://phabricator.wikimedia.org/T238045
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
_______________________________________________
Wikidata mailing list
Wikidata(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata