On Thu, Oct 13, 2016 at 11:36 AM, Jaime Crespo jcrespo@wikimedia.org wrote:
- Move the service to labs, not providing any firm guarantee of service level?
Labs is not the place where bad services go to die. Production is the place where only very stable services reach so they can be properly managed.
"WDQS do not go through any critical systems" "all direct clients of WDQS are well protected by circuit breakers"
Why use the production network, then?
I think there was one exception, which is services that needed a lot of resources so they could not run on VMs, but don't we have a prototype of "labs on real hardware"?
I'm not sure why WDQS is in the production network (this predates me joining WMF). It is probably there, as you suggest, for the real hardware needs. There was also probably a wish to make WDQS a production-level service with all the availability guarantees that go with it, even if that goal is probably not possible with the current way WDQS works.
Letting users run arbitrary queries is a problem for security, not in the common sense (SQL injection), but in terms of exactly the situation that you are describing: easily running out of resources (DoS). Even quarry, which I have publicly complained about in the past, for what you say, has better resource management than wdqs (30-minute execution limit, concurrency control, etc.).
(I did not know about quarry, I need to have a look!)
The main power of wdqs is that it does allow users to write arbitrary queries synchronously. With that power comes the ability to break the service. Removing this ability greatly reduces the value of wdqs. We can (and should) work on putting in place constraints to protect the service, but there are limits to what is possible. I'm pretty sure that whatever we put in place, it will still be possible to break that service (unless we invest a crazy amount of time, energy, ...). No, I don't know for sure...
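As an illustration of the kind of constraint meant here, below is a minimal sketch of concurrency control in Python. Everything in it (`QueryGate`, `TooBusyError`, the limit of 4) is hypothetical and not taken from wdqs or quarry. It only shows the idea of rejecting work when too many queries are already in flight; an execution time limit would have to be enforced by the query engine itself and is not shown.

```python
import threading


class TooBusyError(Exception):
    """Raised when a query is rejected instead of being queued."""


class QueryGate:
    """Hypothetical admission gate: at most max_concurrent queries at once."""

    def __init__(self, max_concurrent=4):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, query_func, *args, **kwargs):
        # Fail fast rather than pile up waiting requests on a busy backend.
        if not self._slots.acquire(blocking=False):
            raise TooBusyError("too many queries in flight")
        try:
            return query_func(*args, **kwargs)
        finally:
            # Always free the slot, even if the query raised.
            self._slots.release()
```

Rejecting immediately, rather than queueing, keeps a burst of expensive queries from turning into an ever-growing backlog, at the cost of some users seeing an error under load.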
I do not think maps is a problem, as after all it is mostly static tile requests (the worst that could happen is having a lot of requests); the only complaint there is that it is constantly creating noise on icinga. But running an unstable service (wdqs) on top of another unstable service (wikidata data handling) will never be stable. Every time a bot starts writing to wikidata 600 times per second, s5 dbs shake (that is why we are creating s8) and wdqs goes down. :-)
I don't think this assumption is true. I have some experience running available services on top of unavailable services. At JOB^1, we did use quite a few external services which were not all that great in terms of robustness. Payment processors and credit check services are good examples of notoriously flaky external services once you put some load on them. There are strategies to make that work, and in the end taking into account that your dependencies can fail is a great way to build much more robust services. I would even go as far as saying that making sure that your dependencies fail often is a good way to ensure that your system is robust. No, wdqs is not robust enough, but it is something that can (and should) be fixed without changing the way we integrate.
I would suggest using wdqs on labs (or anywhere, non-production) with regular imports rather than real-time updates. Fewer headaches. I am literally aiming for that for labsdbs, too.
In the specific case of wdqs integration with wikidata, I don't think that the integration pattern itself is wrong (on the fly import of wikidata to wdqs). It does need some work to improve robustness (https://phabricator.wikimedia.org/T139445 comes to mind). And it does fulfil one of the important use cases of wdqs: quite a few wikidata editors use wdqs to live check edits / imports to wikidata.
I agree with you that in its current state, WDQS is probably closer to a labs service than to a production service (as far as I understand the definition of labs and production here). The question I'm trying to ask is how do we start using wdqs in a production context. I fully understand that there is work to do here. This is not something that will happen in a few days. But there is value in this idea, so we should start looking at what path we want to take (or make sure that there is no path worth taking, this is a perfectly acceptable answer, as long as we look hard enough first).
My assumption is that it makes more sense to learn how to integrate low reliability services in a production context than it does to make sure wdqs becomes highly reliable. We should still work on improving wdqs reliability, but we should accept that by its nature it will be less reliable than most of the production services that we have.
By the way, thanks Jaime for the great writing! It helps me quite a bit to structure the random thoughts I have between my two ears!
On Tue, Oct 11, 2016 at 10:37 PM, Guillaume Lederrey glederrey@wikimedia.org wrote:
Hello!
There is some discussion of starting to use WDQS in conjunction with maps and graphs. Here are a few thoughts, just to put them out there and to start getting some feedback. This is an attempt to put some order in my thoughts, they are not complete yet...
WDQS exposes a SPARQL endpoint to users. This can be compared to giving our users the ability to write arbitrary SQL queries. This is fairly close to the concept of the labs replica databases. Giving direct access to a SPARQL endpoint is at the same time a wonderful idea (it allows users to use WDQS in ways we would never have imagined) and a very scary idea (users can write complex queries which will consume all resources on our servers - which does happen from time to time).
At the moment, WDQS is used by researchers, bots and power users. Those users understand this constraint well, and the fluctuation of performance of WDQS is not a major issue.
Making WDQS robust enough while letting users run arbitrary queries is most probably extremely hard. I think that we should instead investigate how to use an unstable service from a stable one.
Ideas...
- We can accept service degradation of specific functionalities. We accept that WDQS is down, or slow sometimes. In this case, we degrade user experience, graphs will not work, maps will not display data layers. In terms of implementation, we need to ensure that data flows involving WDQS do not go through any critical systems, and that all direct clients of WDQS are well protected by circuit breakers.
- We want to conserve user experience. We go fully async. Graphs and maps are pre-generated and updated regularly outside of user interaction. We probably still need synchronous access for editors, to allow them to test their edits. Refresh can be relatively low frequency (1/day or maybe less). We can probably optimize this based on how often a specific graph / map is viewed. I'm not sure how easy it would be to scale such an approach...
- Something else?
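To make the circuit breaker idea from the first option a bit more concrete, here is a minimal sketch in Python. All names and numbers (`CircuitBreaker`, `render_map`, `fetch_layer`, 3 failures, 30 seconds) are hypothetical and not taken from any existing WDQS client. The assumption is that a map can still be rendered from its base tiles when the data layer is unavailable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the backend."""


class CircuitBreaker:
    """Hypothetical breaker: stop calling a failing backend for a while."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("backend skipped, circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result


def render_map(breaker, fetch_layer):
    """Degrade gracefully: base map without the data layer on any failure."""
    try:
        return {"base": "tiles", "layer": breaker.call(fetch_layer)}
    except Exception:
        return {"base": "tiles", "layer": None}
```

Here `fetch_layer` stands in for whatever function would query WDQS. While the circuit is open, the query service is not contacted at all, so a struggling backend gets time to recover and the caller fails fast instead of waiting on timeouts.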
Time to get some sleep...
MrG
-- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation UTC+2 / CEST
-- Jaime Crespo http://wikimedia.org