Re: [discovery] [Ops] Using Wikidata Query Service in a production context - Discovery

13 Oct 2016

On Thu, Oct 13, 2016 at 11:36 AM, Jaime Crespo <jcrespo(a)wikimedia.org> wrote:
...
 3) Move the service to labs, not providing any firm guarantee of service
 level?

 Labs is not the place where bad services go to die. Production is the place
 where only very stable services reach so they can be properly managed.

 "WDQS do not go through any critical systems"
 "all direct clients of WDQS are well protected by circuit breakers"

 Why use the production network, then?

 I think there was one exception, which is services that needed a lot of
 resources so they could not run on vms, but don't we have a prototype of
 "labs on real hardware"? 
I'm not sure why WDQS is in the production network (this predates my
joining WMF). It is probably there, as you suggest, for the real
hardware needs. There was also probably a wish to make WDQS a
production-level service with all the availability guarantees that go
with it, even if that goal is probably not achievable with the current
way WDQS works.

...
 Letting users run arbitrary queries is a problem for security, not in the
 common sense (SQL injection), but in terms of exactly the situation that you
 are describing: easily running out of resources (DoS). Even quarry, which I
 have publicly complained about in the past for the reasons you describe, has
 better resource management than wdqs (30-minute execution limit, concurrency
 control, etc.).
(I did not know about quarry, I need to have a look!)

The main power of wdqs is that it allows users to write arbitrary
queries and run them synchronously. With that power comes the ability
to break the service. Removing this ability greatly reduces the value
of wdqs. We can (and should) work on putting in place constraints to
protect the service, but there are limits to what is possible. I'm
pretty sure that whatever we put in place, it will still be possible to
break the service (unless we invest a crazy amount of time, energy,
...). No, I don't know that for sure...
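
To make "constraints to protect the service" a bit more concrete, here is a minimal Python sketch of the two limits Jaime mentions for quarry: a concurrency cap and a per-query time limit. The names QueryGuard and QueryRejected are hypothetical, and this is not how WDQS or quarry actually implement it. It also only illustrates the limit of the approach: the caller stops waiting at the deadline, but the query keeps running (and keeps its slot) until the backend itself gives up, so real enforcement would have to live in the query engine.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


class QueryRejected(Exception):
    """Raised when the guard refuses a query or stops waiting for it."""


class QueryGuard:
    """Hypothetical wrapper: caps concurrent queries and caller wait time."""

    def __init__(self, max_concurrent, timeout_s):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._timeout_s = timeout_s
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)

    def run(self, query_fn, *args):
        # Fail fast instead of queueing when every slot is busy.
        if not self._slots.acquire(blocking=False):
            raise QueryRejected("too many concurrent queries")
        future = self._pool.submit(self._run_and_release, query_fn, args)
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            # The worker thread is still running and still holds its slot;
            # we only stop making the caller wait.
            raise QueryRejected("query exceeded the time limit")

    def _run_and_release(self, query_fn, args):
        try:
            return query_fn(*args)
        finally:
            # The slot is freed only when the query really finishes.
            self._slots.release()
```

A client would call guard.run(run_sparql, query), where run_sparql is whatever function actually talks to the endpoint, and treat QueryRejected as "try again later".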

...
 I do not think maps is a problem, as after all it is mostly static tile
 requests (the worst that could happen is having a lot of requests); the only
 complaint there is that it is constantly creating noise on icinga. But
 running an unstable service (wdqs) on top of another unstable service
 (wikidata data handling) will never be stable. Every time a bot starts
 writing to wikidata 600 times per second, s5 dbs shake (that is why we are
 creating s8) and wdqs goes down. :-)
I don't think this assumption is true. I have some experience running
available services on top of unavailable services. At JOB^1, we
used quite a few external services which were not all that great in
terms of robustness. Payment processors and credit check services are
good examples of external services that are notoriously flaky once you
put some load on them. There are strategies to make that work, and in
the end taking into account that your dependencies can fail is a great
way to build much more robust services. I would even go as far as
saying that making sure that your dependencies fail often is a good way
to ensure that your system is robust. No, wdqs is not robust enough,
but it is something that can (and should) be fixed without changing the
way we integrate.
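
One of those strategies is the circuit breaker mentioned in the original message quoted below. As a minimal sketch in Python (the class and parameter names are hypothetical, this is not any existing WMF code, and it is not thread-safe): after a number of consecutive failures the client stops calling the dependency for a while and fails immediately, then lets a single trial call through to see whether the dependency has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are short-circuited instead of reaching the dependency."""


class CircuitBreaker:
    """Hypothetical single-threaded circuit breaker around a flaky call."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: do not even try, protect both sides.
                raise CircuitOpenError("dependency marked as unavailable")
            # Otherwise half-open: let this one call through as a trial.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                # Open the circuit (or re-open it after a failed trial).
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The caller would catch CircuitOpenError (and the dependency's own errors) and degrade, for example by rendering a map without its data layer, rather than waiting on a service that is known to be down.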

...
 I would suggest using wdqs on labs (or anywhere non-production) with regular
 imports rather than real-time updates. Fewer headaches. I am literally aiming
 for that for labsdbs, too.
In the specific case of wdqs integration with wikidata, I don't think
that the integration pattern itself is wrong (on the fly import of
wikidata to wdqs). It does need some work to improve robustness
(https://phabricator.wikimedia.org/T139445 comes to mind). And it does
fulfil one of the important use cases of wdqs: quite a few wikidata
editors use wdqs to live check edits / imports to wikidata.

I agree with you that in its current state, WDQS is probably closer to
a labs service than to a production service (as far as I understand
the definition of labs and production here). The question I'm trying
to ask is how we start using wdqs in a production context. I fully
understand that there is work to do here. This is not something that
will happen in a few days. But there is value in this idea, so we
should start looking at what path we want to take (or make sure that
there is no path worth taking, this is a perfectly acceptable answer,
as long as we look hard enough first).

My assumption is that it makes more sense to learn how to integrate
low reliability services in a production context than it does to make
sure wdqs becomes highly reliable. We should still work on improving
wdqs reliability, but we should accept that by its nature it will be
less reliable than most of the production services that we have.
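
As one illustration of integrating a low reliability service, here is a minimal Python sketch of the "fully async" idea from my original message quoted below: results are refreshed by a scheduled job, and the user-facing path only ever reads the last good copy. PregeneratedResults and its fetch argument are hypothetical names, and the in-memory dict stands in for whatever storage would really be used.

```python
import time


class PregeneratedResults:
    """Hypothetical store of query results refreshed outside user requests."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # function that queries the unreliable service
        self._clock = clock      # injectable so ages can be tested
        self._store = {}         # key -> (value, refreshed_at)

    def refresh(self, key):
        """Meant to run from a scheduled job. Returns True on success."""
        try:
            value = self._fetch(key)
        except Exception:
            # The service is down or slow: keep serving the last good copy.
            return False
        self._store[key] = (value, self._clock())
        return True

    def get(self, key):
        """User-facing read: never touches the unreliable service.

        Returns (value, age_in_seconds), or None if nothing was ever stored.
        """
        entry = self._store.get(key)
        if entry is None:
            return None
        value, refreshed_at = entry
        return value, self._clock() - refreshed_at
```

The age returned by get() lets the caller decide whether to show a "data may be out of date" hint. Editors who need to test their edits would still need a separate synchronous path.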

By the way, thanks Jaime for the great write-up! It helps me quite a
bit to structure the random thoughts I have between my two ears!

...
 On Tue, Oct 11, 2016 at 10:37 PM, Guillaume Lederrey
 <glederrey(a)wikimedia.org> wrote:

 Hello!

 There is some discussion of starting to use WDQS in conjunction with
 maps and graphs. Here are a few thoughts, just to put them out there
 and to start getting some feedback. This is an attempt to put some
 order in my thoughts, they are not complete yet...

 WDQS exposes a SPARQL endpoint to users. This can be compared to
 giving the ability to our users to write arbitrary SQL queries. This
 is fairly close to the concept of the labs replica databases. Giving
 direct access to a SPARQL endpoint is at the same time a wonderful
 idea (it allows users to use WDQS in ways we would never have imagined)
 and a very scary idea (users can write complex queries which will
 consume all resources on our servers - which does happen from time to
 time).

 At the moment, WDQS is used by researchers, bots and power users. Those
 users understand this constraint well, and the fluctuation of
 performance of WDQS is not a major issue.

 Making WDQS robust enough while letting users run arbitrary queries is
 most probably extremely hard. I think that we should instead
 investigate how to use an unstable service from a stable one.

 Ideas...

 1) We can accept service degradation of specific functionalities. We
 accept that WDQS is down or slow sometimes. In this case, we degrade
 user experience: graphs will not work, maps will not display data
 layers. In terms of implementation, we need to ensure that data flows
 involving WDQS do not go through any critical systems, and that all
 direct clients of WDQS are well protected by circuit breakers.

 2) We want to conserve user experience. We go fully async. Graphs and
 maps are pre-generated and updated regularly outside of user
 interaction. We probably still need synchronous access for editors, to
 allow them to test their edits. Refresh can be relatively low
 frequency (1/day or maybe less). We can probably optimize this based
 on how often a specific graph / map is viewed. I'm not sure how easy
 it would be to scale such an approach...

 3) Something else?

 Time to get some sleep...

   MrG

 --
 Guillaume Lederrey
 Operations Engineer, Discovery
 Wikimedia Foundation
 UTC+2 / CEST

 --
 Jaime Crespo
 <http://wikimedia.org> 

-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST