Thanks for the hotfix, Aaron, and for the reply to my questions, Faidon. The
multimedia team has allocated time this sprint to come up with short-term and
long-term plans for the general Image Scaler situation, and each subsequent
sprint we'll have time allocated as well to act on that plan and implement
it.
I'd like to have an engineering meeting with both of you and our team
(invite sent), to help us fully understand the various parts involved. At
the moment I think we all (multimedia folks) have a lot of catching up to
do in terms of knowledge of that code; I don't think any of us would have
been capable of writing the hotfix Aaron did. Our team needs to acquire
that knowledge ASAP.
Now, let's not wait for the meeting to start discussing the issue and
proposed solutions. As I've stated earlier in this thread, I would like to
revisit the fact that the thumbnails are being generated in real time,
because it's the bigger problem and Image Scalers will still be subject to
going down if we don't fix that. I suspect that a solution to that broader
problem would solve the GWToolset issue as well.
> Aaron worked on wrapping thumb calls in PoolCounter on the MediaWiki side
Am I correct to assume that the PoolCounter wrapping is used to dedupe
requests from different image scaler servers to generate the same thumbnail?
I've thought about the problem since the incident and I came up with the
following rough idea, that I'd like everyone to criticize/validate against
their knowledge of the existing setup. Note that it's based on a surface
understanding of things; I still don't know the existing code very well,
which is why your knowledge applied to my idea will be very useful.
Instead of each image scaler server generating a thumbnail immediately when
a new size is requested, the following would happen in the script handling
the thumbnail generation request:
- the client's connection would be kept open, potentially for a while, all
the way from the end-user to the thumb script
- a data store (what kind is TBD, whatever is most appropriate, redis might
be a good candidate) would be connected to and a virtual resource request
would be added to a queue. Basically the script would be queueing an "I
need one unit of thumbnail generation" request object
- in a loop, the same data store would be read every X milliseconds to
check the request object's position in the queue
- if the request object's position is below a certain configurable limit,
the actual job starts, scaling the image and storing the thumb
- when the script is done generating the thumbnail, it removes its request
object from the queue, essentially freeing a virtual unit of thumbnail
generation work for another thumb request to use. And of course it returns
the generated thumb as it currently does.
- if the script dies for whatever reason, an expiry setting on the queued
request object would kick in at the data store level, liberating the
virtual unit automatically
- if the client requesting the thumb gives up and closes the connection
because there's a high load and they don't see the thumb appear fast enough
to their taste, the script would keep running, wait for its turn in the
queue and ultimately render the thumb anyway. This is crucial, because at
times of high load when things start taking time, we definitely don't want
users refreshing the page to result in aborting and restarting the exact
same thumb generation work. Every thumb generation request should complete,
even if much later and with no end-user with an open connection to see the
result anymore.
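To make the loop above concrete, here's a rough Python sketch of the queueing logic. It's purely illustrative: `ThumbQueue` is a hypothetical in-memory stand-in for whatever data store we pick (with redis, enqueue/position/release would roughly map to ZADD/ZRANK/ZREM on a sorted set, with the expiry enforced server-side), and `render` stands in for the actual scaling work:

```python
import threading
import time
import uuid


class ThumbQueue:
    """Hypothetical in-memory stand-in for the proposed data store.

    max_active = the configurable limit on concurrent "virtual units"
    of thumbnail generation; ttl = the expiry backstop for requests
    whose script died without cleaning up.
    """

    def __init__(self, max_active, ttl=30.0):
        self.max_active = max_active
        self.ttl = ttl
        self._lock = threading.Lock()
        self._queue = []  # (request_id, enqueue_time), FIFO order

    def enqueue(self):
        """Queue an 'I need one unit of thumbnail generation' request."""
        rid = uuid.uuid4().hex
        with self._lock:
            self._queue.append((rid, time.monotonic()))
        return rid

    def _expire(self):
        # Drop entries whose TTL has elapsed (dead scripts).
        now = time.monotonic()
        self._queue = [(r, t) for r, t in self._queue if now - t < self.ttl]

    def position(self, rid):
        """0-based position in the queue, or None if expired/removed."""
        with self._lock:
            self._expire()
            for i, (r, _) in enumerate(self._queue):
                if r == rid:
                    return i
        return None

    def may_start(self, rid):
        pos = self.position(rid)
        return pos is not None and pos < self.max_active

    def release(self, rid):
        """Free the virtual unit for the next thumb request."""
        with self._lock:
            self._queue = [(r, t) for r, t in self._queue if r != rid]


def generate_thumb(queue, render, poll_ms=50):
    """Queue a virtual unit, poll until it's our turn, render, release."""
    rid = queue.enqueue()
    try:
        while not queue.may_start(rid):
            time.sleep(poll_ms / 1000.0)
        return render()
    finally:
        queue.release(rid)  # TTL expiry covers us if the process dies
```

The key property is in `generate_thumb`: the release happens in a `finally`, with the TTL as a backstop for a script dying outright, so a virtual unit of work is always eventually returned to the pool, and the render runs to completion even if the client hung up long ago.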
Pros:
- The limit on virtual units of work means that the image scaler server
load should never go over a certain point. No more going down depending on
the mix or quantity of thumb requests coming in. The servers would just be
handling as much work as they actually can and no more.
- This would be agnostic of how long a given thumbnail generation takes.
Which means that someone uploading lots of large files that are very
time-consuming to generate thumbs for, as in last weekend's incident,
wouldn't take down image scalers. They'd just slow down thumb generation
across the board.
- The queue size could be configurable. The best strategy would be to start
low (e.g. as many units as there are servers).
- The queue could be smarter than a plain queue and have weight and
client-based priority strategies. For example we could make it so that
people uploading a lot of large images don't hog the queue to themselves. I
don't see that as a requirement to solve the reliability issues, but it
would be nice to have, and such a prioritization would be very relevant to
the GWToolset user behavior. It would basically mean that someone uploading
a lot of large images would only have their own thumbs take longer to
generate than usual, not everyone else's as well.
- We could write "attack bots" that would attempt DoSing the image scalers
in various ways and let us verify that this new system behaves well under
heavy load. We know the weaknesses, we should be testing them and making
sure that they're solved, instead of waiting for someone with bad
intentions, or an accident coming from an unusual usage pattern, to do that
for us.
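On the attack-bot point, even a very simple concurrent hammer would give us a baseline before anything fancier. A hypothetical sketch: in a real test `target` would be an HTTP GET against a thumb URL on a test cluster; here it's just any callable:

```python
import concurrent.futures
import time


def hammer(target, concurrency=50, requests=200):
    """Fire `requests` calls at `target` from `concurrency` workers.

    Returns (completed_ok, elapsed_seconds) so we can watch how the
    scalers degrade under load: with the queueing scheme in place we'd
    expect elapsed time to grow but the completion count to stay full,
    rather than the cluster falling over.
    """
    start = time.monotonic()
    ok = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves order and blocks until all calls finish.
        for result in pool.map(lambda _: target(), range(requests)):
            if result:
                ok += 1
    return ok, time.monotonic() - start
```

Varying `concurrency` and the mix of thumb sizes/file types per run would let us reproduce the known weaknesses (e.g. many large multipage TIFFs at once) deliberately instead of waiting for them to happen in production.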
Cons:
- Under high load, HTTP connections could potentially be held for a while.
Maybe that's not such a big deal, I guess Ops has the answer. I think it's
fair to set a time limit on that too, and still have the thumbnail
generation happen even if we closed the connection to the client. From the
end-user's perspective, the experience would be that they wait a while for
a thumb to appear, the connection dies, but chances are that if they
refresh the page a few seconds or minutes later, the thumb will be there.
- During high load, a DoS attack, or similar, people viewing thumbs at new
sizes will experience long image load times. Considering that most of the
time the thumb will end up appearing before a timeout occurs, I think that's
preferable to the status quo, which is that image scalers would go down
entirely in those situations.
On Tue, Apr 22, 2014 at 1:31 PM, Faidon Liambotis <faidon(a)wikimedia.org> wrote:
Fabrice,
I don't see how a feature release can be of a higher priority than
troubleshooting an outage but regardless:
The outage's symptoms seem to have been alleviated since, but the
Commons/GLAM communities are waiting for a response from us to resume
their work. They've responded to our "pause" request and in turn
requested our feedback at:
https://commons.wikimedia.org/wiki/Commons:Batch_uploading/NYPL_Maps
(see the large red banner with the stop sign at the bottom)
...which is also linked from:
https://commons.wikimedia.org/wiki/User_talk:F%C3%A6#Large_file_uploads
https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_…
Sadly, I don't have much to offer them, as I previously explained. I
certainly wouldn't commit to anything considering your response on the
matter.
Could you communicate your team's priorities to Fæ and the rest of the
Commons/GLAM community directly?
Thanks,
Faidon
On Mon, Apr 21, 2014 at 08:57:39AM -0700, Fabrice Florin wrote:
Dear Faidon, Emmanuel and Giuseppe,
Thanks so much for investigating this issue so quickly and sharing the
likely cause of the problem with us.
This quarter, our team's top priority is to address serious issues related
to Upload Wizard -- and this seems like a good one for us to take on.
However, we are still in the process of releasing Media Viewer, which is
likely to take most of our attention for the next few weeks.
So we may not be able to troubleshoot it right away. But we are filing
tickets about this issue, so we can hit the ground running in early May.
Thanks again for your fine work, as well as for your patience and
understanding.
Fabrice
On Apr 21, 2014, at 3:53 AM, Emmanuel Engelhart <
emmanuel.engelhart(a)wikimedia.ch> wrote:
> On 21.04.2014 12:05, Faidon Liambotis wrote:
>> On Mon, Apr 21, 2014 at 10:56:40AM +0200, Giuseppe Lavagetto wrote:
>>> The problem resolved before I could get to strace the apache
>>> processes, so I don't have more details - Faidon was investigating
>>> as well and may have more info.
>>
>> Indeed, I do: this had nothing to do with TMH. The trigger was Commons
>> User:Fæ uploading hundreds of 100-200MB multipage TIFFs via GWToolset
>> over the course of 4-5 hours (multiple files per minute), and then
>> random users/bots viewing Special:NewFiles, which attempts to display a
>> thumbnail for all of those new files in parallel in realtime, and thus
>> saturating imagescalers' MaxClients setting and basically inadvertently
>> DoSing them.
>>
>> The issue was temporary because of
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=49118 but since the user
>> kept uploading new files, it was recurrent, with different files every
>> time. Essentially, we would keep having short outages every now and
>> then for as long as the upload activity continued.
>>
>> I left a comment over at https://commons.wikimedia.org/wiki/User_talk:Fæ
>> and contacted Commons admins over at #wikimedia-commons, as a courtesy
>> to both before I used my root to elevate my privileges and ban a
>> long-time prominent Wikimedia user as an emergency countermeasure :)
>>
>> It was effective, as Fæ immediately responded and ceased the activity
>> until further discussion; the Commons community was also helpful in the
>> short discussion that followed.
>>
>> Andre also pointed out that Fæ had previously begun the "Images so big
>> they break Commons" thread at the Commons Village Pump:
>> https://commons.wikimedia.org/wiki/Commons:Village_pump#Images_so_big_they_…
>>
>> As for the more permanent solution: there's not much we, as ops, can do
>> about this but say "no, don't upload all these files", which is
>> obviously not a great solution :) The root cause is an architecture
>> issue with how imagescalers behave with regards to resource-intensive
>> jobs coming in a short period of time. Perhaps a combination of
>> poolcounter per file and more capacity (servers) would alleviate the
>> effect, but ideally we should be able to have some grouping &
>> prioritization of imagescaling jobs so that large jobs can't completely
>> saturate and DoS the cluster.
>
> Commons has big difficulties dealing with big TIFF files and this is a
> serious issue, in particular for Wikipedians in Residence. To me it looks
> like using the VipsScaler would help fix the worst ones.
>
> Here is an email I sent to Andre and Greg a few days ago. I am making it
> public with the hope it might help.
>
> ===========
> As a GLAM volunteer and WIR at the Swiss National Library, I encourage
> institutions to upload high quality pictures to increase digital
> sustainability. But, in the worst case (big TIFF files), Commons is not
> able to deal with them and fails to compute the thumbnails.
>
> You have a perfect example of the problem with this recently uploaded
> collection of historical plans of the Zurich main station:
> https://commons.wikimedia.org/wiki/Category:Historical_plans_of_Zurich_Main…
>
> It seems that this problem might be fixed by using the VipsScaler for
> TIFF pictures and Greg has already worked on this and proposed a patch.
> But this patch has been awaiting review for seven months:
> https://bugzilla.wikimedia.org/show_bug.cgi?id=52045
>
> IMO it would be great if you could do something to increase the priority
> and urgency of this ticket. The movement invests considerable resources
> in building successful collaborations with GLAMs and many of them are
> held back by this "silly" bug.
>
> Hope you can help us.
> ===========
>
> --
> Volunteer
> Technology, GLAM, Trainings
> Zurich
> +41 797 670 398
>
> _______________________________________________
> Multimedia mailing list
> Multimedia(a)lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/multimedia
>
> _______________________________
>
> Fabrice Florin
> Product Manager
> Wikimedia Foundation
>
>
http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)
>
>
>
_______________________________________________
Ops mailing list
Ops(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops
_______________________________________________
Multimedia mailing list
Multimedia(a)lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/multimedia